**3. Methods**

Figure 2 shows the flowchart of the proposed method. First, the GF-2 image was preprocessed and split into a training set and a test set. Second, the trained model was used to classify the rural settlements. Finally, accuracy assessment was conducted on the test set. The details of the proposed method are described in the following subsections.

**Figure 2.** Flowchart of the proposed research framework: (**A**) generate data sets, (**B**) model training, and (**C**) accuracy assessment.

#### *3.1. Data Preprocessing*

Because the acquired image was cloud-free and haze-free, no atmospheric correction was required during preprocessing. After orthorectification, the MSS and PAN images were fused using the Gram–Schmidt pan-sharpening method [27]. The fused image (1 m) had a dimension of 29,970 × 34,897 pixels, equivalent to about 700 km². The reference map was generated based on (1) the land-use change survey map and (2) visual interpretation by local experts. Note that the ground truth data in our study were spatially sparse, which is more in line with real-world scenarios, where densely annotated data are rarely available.
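For readers unfamiliar with the fusion step, the NumPy sketch below illustrates the component-substitution idea underlying Gram–Schmidt pan-sharpening. It assumes the MSS bands have already been resampled to the PAN grid, and it is a simplified stand-in for the implementation used in [27], not a reproduction of it.

```python
import numpy as np

def gs_pansharpen(ms: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """Gram-Schmidt-style component substitution (simplified sketch).
    ms:  (bands, H, W) multispectral image resampled to the PAN grid.
    pan: (H, W) panchromatic band at full resolution."""
    sim_pan = ms.mean(axis=0)                 # simulated low-resolution pan band
    detail = pan - sim_pan                    # high-frequency spatial detail to inject
    flat = ms.reshape(ms.shape[0], -1)
    sp = sim_pan.ravel()
    # Injection gain per band: covariance with the simulated pan over its variance
    gains = np.array([np.cov(b, sp)[0, 1] for b in flat]) / sp.var()
    return ms + gains[:, None, None] * detail
```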

#### *3.2. Rural Settlement Detection Using FCN*

#### 3.2.1. Dilated Residual Convolutional Network

The task of automatically extracting settlement information over a large rural region can be formulated as a semantic labeling problem, in which each pixel is assigned a category. In this section, we propose an end-to-end method based on a semantic segmentation scheme to identify rural settlements. Our approach used the ResNet50 architecture as the feature extractor of the FCN-based method. ResNet50 consists of five stages of convolutional layers. In the first stage, a 7 × 7 convolution layer is followed by a max-pooling operation, which outputs features at a quarter of the size of the original image. Each of the remaining four stages contains several bottleneck blocks, each a stack of 1 × 1, 3 × 3, and 1 × 1 convolutional layers. Moreover, two types of shortcut connection are introduced in the blocks to fuse input feature maps with output feature maps, depending on whether the input and output sizes match. More details about ResNet50 can be found in [28].

As the network goes deeper, the resolution of the feature maps decreases while the number of channels increases. For example, the output feature maps of the last stage are reduced to 1/32 of the size of the original input. Compared with the complex background in HSR images, the objects of interest (i.e., rural settlements) are small and sparse. Moreover, the loss of spatial information caused by progressive down-sampling in the network is harmful for identifying small objects. To retain a large receptive field while increasing the spatial resolution of the higher layers, we adopted dilated convolutions in the ResNet. In the last two stages of the original ResNet50, the strided convolution layer that reduces the output resolution at the beginning of each stage was replaced by a convolution layer with a stride of 1 (i.e., no down-sampling). Recent studies [29,30] indicate that this conversion does not affect the receptive field of the first layer of the stage but halves the receptive field of subsequent layers. To restore the receptive field of those layers, convolutions with different dilation factors were adopted: the dilation ratio of the convolutional kernels was set to 2 in the fourth stage and 5 in the fifth stage. Dilated convolutions were thus expected to enlarge the receptive field of the layers while generating features with high spatial resolution. As a result, the output size increases from 1/32 to 1/8 of the input image.
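As a concrete illustration, the PyTorch sketch below shows one way to apply this conversion to torchvision's ResNet50. The paper does not publish code, so the stage names (`layer3`, `layer4`) and the uniform dilation of every 3 × 3 kernel within a stage are our assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

def dilate_stage(stage: nn.Sequential, dilation: int) -> None:
    """Remove down-sampling in a ResNet stage and dilate its 3 x 3 convolutions."""
    for m in stage.modules():
        if isinstance(m, nn.Conv2d):
            if m.stride == (2, 2):
                m.stride = (1, 1)                  # stride 2 -> 1: no down-sampling
            if m.kernel_size == (3, 3):
                m.dilation = (dilation, dilation)
                m.padding = (dilation, dilation)   # keep feature-map size unchanged

backbone = resnet50(pretrained=True)
dilate_stage(backbone.layer3, dilation=2)   # fourth stage: dilation ratio 2
dilate_stage(backbone.layer4, dilation=5)   # fifth stage: dilation ratio 5
# The backbone output stride is now 8 (1/8 of the input) instead of 32.
```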

#### 3.2.2. Multi-Scale Context Subnetwork

Some upgraded low-density houses and high-density buildings may use similar roofing materials. To distinguish between the two categories of rural settlements, their spatial distribution and context must be fully considered. Because rural settlements vary greatly in size, it is necessary to capture information at multiple scales to identify objects in rural residential areas. Instead of feeding multiple rescaled versions of an image into the network to obtain multi-scale context, we introduced a multi-scale spatial context structure to handle the scale variation of rural residential objects. Commonly, the deep layers of a CNN respond to global context, whereas the shallow layers are more likely to be activated by local textures and patterns. Because dilated convolution maintains spatial resolution, the features at three scale levels extracted by the ResNet50 backbone can be utilized at the same time, and our structure further enhances information propagation across layers.

As shown in Figure 3, the output features of the last three stages were filtered by 1 × 1 convolution layers to shrink the channels to 256 and then concatenated. Notably, we appended a 3 × 3 convolution to the merged map to generate the subsequent feature map, which reduces the misalignment that arises when fusing features of different levels. A residual correction was then introduced to alleviate the loss of information during feature fusion. Next, feature selection was conducted with an advanced channel encoding module, the "squeeze-and-excitation" (SE) block [26], which adaptively recalibrates channel-wise feature responses. Once features enter the module, global average pooling generates a vector of channel-wise descriptors of the input features. Two fully connected layers are then applied to the vector to learn the nonlinear interactions among channels, and a sigmoid activation generates a weight vector that scales the class-dependent features. The features refined by this reweighting have more discriminative representations, which helps object identification. Based on this abundant positioning and identification information, the subsequent 3 × 3 convolution layer was expected to produce more accurate features. The refined deep features were then concatenated with the corresponding low-level features (e.g., Res-layer1 in ResNet50) to restore spatial details. After the fusion, we applied another 3 × 3 convolution layer and simple bilinear upsampling to obtain the final segmentation. Table S1 shows the specific design of our segmentation network.
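The PyTorch sketch below approximates the context subnetwork described above. The channel sizes, the placement of the residual correction, the number of classes (two settlement types plus background), and the SE reduction ratio of 16 (as in [26]) are assumptions; the exact wiring is specified in Table S1 rather than reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling + two FC layers + sigmoid reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: channel-wise descriptors
        return x * w.view(b, c, 1, 1)     # excite: recalibrate channel responses

class ContextSubnetwork(nn.Module):
    """Fuse the last three backbone stages, refine with an SE block,
    and merge with low-level features before upsampling to full size."""
    def __init__(self, in_channels=(512, 1024, 2048), low_channels=256, num_classes=3):
        super().__init__()
        self.shrink = nn.ModuleList([nn.Conv2d(c, 256, 1) for c in in_channels])
        self.fuse = nn.Conv2d(3 * 256, 256, 3, padding=1)   # reduces fusion misalignment
        self.se = SEBlock(256)
        self.refine = nn.Conv2d(256, 256, 3, padding=1)
        self.classify = nn.Conv2d(256 + low_channels, num_classes, 3, padding=1)

    def forward(self, low, c3, c4, c5):
        # c3..c5: outputs of the last three (dilated) stages, all at 1/8 resolution
        feats = [conv(f) for conv, f in zip(self.shrink, (c3, c4, c5))]
        x = self.fuse(torch.cat(feats, dim=1))
        x = x + self.se(x)                # residual correction + channel selection
        x = self.refine(x)
        x = F.interpolate(x, size=low.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, low], dim=1)    # restore spatial detail from Res-layer1
        out = self.classify(x)
        return F.interpolate(out, scale_factor=4, mode="bilinear", align_corners=False)
```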

**Figure 3.** Overview of the proposed detection architecture. (**A**) the Dilated-ResNet extracted multi-level features with high spatial resolution; (**B**) the context subnetwork exploited the multi-scale context and mapped features to desired outputs.

#### 3.2.3. Multi-Spectral Images-Based Transfer Learning

CNNs are generally data-driven approaches and are usually trained on large datasets. In practice, a sufficiently large labeled dataset is rare. Instead, it is more practical to use a deep network previously trained on a large dataset (e.g., ImageNet) as an initial model or a feature extractor for the target task. This scheme is known as transfer learning [31]. In brief, the idea of transfer learning is to leverage knowledge from the source domain to boost learning in the target domain, since the features of CNNs are more generic in the early layers. Compared with training from scratch, the cost of fine-tuning a pre-trained network is much lower. Several attempts have been made to improve learning tasks on remote sensing datasets by using transfer learning [32–34].

ResNet50 was originally designed for RGB images [28]. To better adapt it to multispectral remote sensing data with red (R), green (G), blue (B), and near-infrared (NIR) bands, the network was expanded to take advantage of more input bands than RGB. Rather than adding a convolution layer at the beginning of the network [35] or adding a branch to accept multiband inputs [34], we directly modified the original 7 × 7 convolution layer in the first stage of ResNet so that it can receive multispectral images and output 64 feature maps.
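A minimal PyTorch sketch of this modification is shown below; it also applies the first-channel weight copy described in Section 3.3. The use of torchvision's pretrained ResNet50 is our assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(pretrained=True)

# Replace the original 7x7 RGB convolution with a 4-band (R, G, B, NIR) version.
old = model.conv1
model.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    model.conv1.weight[:, :3] = old.weight           # keep pretrained RGB filters
    model.conv1.weight[:, 3:] = old.weight[:, :1]    # copy first-channel weights to NIR
```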

#### *3.3. Method Implementation and Accuracy Assessment*

A total of 7605 tiles of 256 × 256 pixels were cropped from the training area of the preprocessed GF-2 imagery, and about 20% of the image patches were randomly selected as the validation set. Data augmentation consisting of flipping and 90-degree rotations was applied to enlarge the training set. The proposed network was trained on an Nvidia P6000 GPU with 24 GB of memory. The weights of the network were initialized with the pre-trained ResNet50 model, and the weights of the first channel were copied to initialize the newly added channel in the first convolution layer. The adaptive Adam algorithm [36] was employed as the optimizer, with a learning rate of 0.001. A batch size of 8 was used, and the optimizer was run for 30 epochs with an early stopping strategy that terminated training when the monitored quantity (i.e., the validation loss) had stopped improving for 5 epochs. The proposed method was implemented in the PyTorch framework.
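The training procedure can be summarized by the PyTorch sketch below. Here `model`, `train_loader`, `val_loader`, and `evaluate` are hypothetical placeholders; the optimizer, learning rate, batch size, epoch count, and early stopping patience follow the settings stated above.

```python
import torch
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

best_val, patience, stale = float("inf"), 5, 0
for epoch in range(30):
    model.train()
    for images, labels in train_loader:      # batches of 8 augmented 256 x 256 tiles
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)   # hypothetical helper: mean validation loss
    if val_loss < best_val:
        best_val, stale = val_loss, 0
    else:
        stale += 1
        if stale >= patience:                # stop after 5 epochs without improvement
            break
```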

Figure 4 shows the training area and the test area in the experiment. The point test samples were distributed over the entire study area except the training area. To further evaluate area accuracy, we selected a small area within the test area as the polygon test subset, in which rural settlements were densely labeled. The random point generation algorithm in ArcGIS [37] was applied to generate a total of 11,628 sample points, which we then manually annotated based on higher-resolution Google Earth images and visual interpretation. In addition to the two types of settlements of concern, all other objects in the image were assigned to the background category. Table 1 lists the number of test samples.

**Figure 4.** (**a**) The Tongxiang data set used in the experiments; (**b**) example of test samples.

**Table 1.** The number of testing samples.


Following previous studies, the overall accuracy (OA), producer's accuracy (PA), user's accuracy (UA) and Kappa coefficient (Kappa) [38] were used to assess the performance of the methods. The producer's accuracy represents the probability that pixels of a category are correctly classified, whereas the user's accuracy indicates the probability that pixels assigned to a category actually belong to it. The overall accuracy is the percentage of correctly classified pixels. Kappa analysis is a discrete multivariate technique used in accuracy assessment to test whether one error matrix differs significantly from another [39], and the Kappa coefficient computed from an individual error matrix can be regarded as another measure of accuracy.
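For clarity, all four metrics can be computed from the error (confusion) matrix as in the sketch below. The function name is ours, but the formulas are the standard definitions.

```python
import numpy as np

def accuracy_metrics(cm: np.ndarray):
    """OA, per-class PA/UA, and Kappa from a confusion matrix
    (rows = reference classes, columns = classified classes)."""
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)            # producer's accuracy per class
    ua = np.diag(cm) / cm.sum(axis=0)            # user's accuracy per class
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, pa, ua, kappa
```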

### **4. Results and Discussions**
