#### *2.1. Learning and Recovering High-Resolution Details from Low-Resolution Feature Maps*

The representative model in this category is the fully convolutional network (FCN) [35]. In each stage of an FCN, an up-sampling subnetwork acts as a decoder and attempts to recover fine-spatial-resolution details from the coarse-spatial-resolution feature maps [33,34,48]. In SegNet [36], the up-sampling subnetwork is a mirrored, symmetric version of the pooling subnetwork that reuses the indices recorded by the pooling subnetwork. The up-sampling strategy can also be combined with deconvolution, as in DeconvNet [37], where the locations and values of the highest activations are kept by the up-sampling strategy, and the sparseness of the up-sampled output is repaired by the deconvolution layers. In RefineNet [38], instead of using only one feature map from one pooling layer, long-range residual connections combine the information from all pooling layers to refine the high-resolution details. Other asymmetric structures, such as the light up-sampling process [49], the light pooling and heavy up-sampling processes [50], and re-combinator networks [51], were all reported to have good performance in object detection.

#### *2.2. Maintaining High-Resolution Details throughout the Network*

Representative models include convolutional neural fabrics [45], interlinked CNNs [46], and high-resolution networks (HRNet) [47]. In an HRNet, a high-resolution subnetwork was first established as the first stage, then high-to-low resolution subnetworks were added consecutively to form the lower-resolution stages. This structure maintains the high-resolution details throughout the whole process and has achieved state-of-the-art performance in the field of human pose estimation [47]. Fu et al. (2019) and Wu et al. (2018) also improved skip connections by stacking multiple DeconvNets/UNets/Hourglasses with dense connections [52,53].

#### **3. Study Area and Datasets**

In this paper, the intra-class spectral variation of cultivated land is considered from three perspectives: (1) intra-class spectral variation over time, (2) intra-class spectral variation over geo-locations, and (3) intra-class spectral variation over crop types. These three variation factors can be represented by multiple acquisition times (both winter and summer) and different locations within a large area. The study area is located in the Urumqi and Bosten farmlands in Xinjiang, China (Figure 1), which mainly grow cash crops, such as cotton and pears. The crops are planted over large areas with high yields and require a large water supply every year. Extracting the cultivated land of these two regions is therefore of great significance for agricultural and water resource monitoring and for ensuring the food security of Xinjiang and China.

**Figure 1.** The study area.

Landsat 5 thematic mapper (TM) top-of-atmosphere (TOA) reflectance data (USGS Earth Explorer: https://earthexplorer.usgs.gov/) from 2009 to 2011 were collected as the dataset for this study. The TM sensor has seven spectral bands (Table 1), but we only selected the six bands with a resolution of 30 m: B1 (blue), B2 (green), B3 (red), B4 (near-infrared, NIR), B5 (short-wave infrared, SWIR 1), and B7 (short-wave infrared, SWIR 2). The thermal band was not used in this study because it varies across observation dates due to local environmental factors, such as the radiative energy received by the land surface or the wind speed. Only cloud-free images were chosen. The image details are shown in Table 2.


**Table 1.** Parameters of the Landsat 4–5 thematic mapper (TM).


We used the 2010 version of the historical land cover map from the local government to extract the ground truth manually based on the Landsat 5 images at the 30 m scale. The land cover types were assumed to remain consistent from 2009 to 2011, so changes during this period were neglected. The original historical land cover map contained five land cover types (urban area, cultivated land, forest, water, and desert), which we reclassified into only two types (cultivated land and other). The map was then converted from the original polygon format to raster format at the same spatial resolution as the Landsat data. For convenience, we appended the ground truth data (the historical land cover map) to the Landsat dataset as a seventh band. After that, the TM images and the corresponding ground truth were split into 256 × 256-pixel tiles to keep the memory consumption low during training and validation. These tiles were adjacent and non-overlapping.
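For illustration, a minimal NumPy sketch of this tiling step; the scene dimensions and the dropping of partial edge tiles are our assumptions, not details from the paper:

```python
import numpy as np

def split_into_tiles(image, tile_size=256):
    """Split a (bands, H, W) array into adjacent, non-overlapping tiles.

    Edge pixels that do not fill a complete tile are dropped here for
    simplicity; the paper does not state how partial tiles were handled.
    """
    bands, height, width = image.shape
    tiles = []
    for row in range(0, height - tile_size + 1, tile_size):
        for col in range(0, width - tile_size + 1, tile_size):
            tiles.append(image[:, row:row + tile_size, col:col + tile_size])
    return np.stack(tiles)

# Example: six TM bands plus the ground-truth band stacked as band 7.
scene = np.zeros((7, 2048, 2048), dtype=np.float32)  # placeholder scene
tiles = split_into_tiles(scene)                      # -> (64, 7, 256, 256)
```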

To evaluate the effect of different combinations of spectral bands on the performance of cultivated land extraction, we defined three datasets, namely TM-NRG, TM-RGB, and TM-All, with varying numbers of spectral bands. An overview of each dataset is provided in Table 3. To avoid overfitting during training, we randomly selected 4050 tiles (approximately 70%) for training, 867 tiles (approximately 15%) as validation data for adjusting the model hyperparameters during training, and the remaining 868 tiles (approximately 15%) for independent testing. The methods used for comparison (RF, U-Net, and U-Net++) were all trained and tested on the same datasets.
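A hedged sketch of such a random split, using the tile counts reported above; the fixed random seed and the reuse of `tiles` from the previous sketch are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # fixed seed is an assumption
indices = rng.permutation(len(tiles))   # `tiles` from the tiling sketch above

n_train, n_val = 4050, 867              # counts reported in the text
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]    # remaining 868 tiles
```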


**Table 3.** Three different TM datasets used in this study.

#### **4. Methodology**

In this paper, a new end-to-end cultivated land extraction algorithm, the high-resolution U-Net (HRU-Net), was proposed, with the aim of accurately extracting the same land-cover type under different spectra and preserving the image details by improving the skip connection structure and the loss function of the original U-Net. Figure 2 shows an overview of the workflow of this study.

**Figure 2.** Overview of the performance evaluation framework. High-resolution U-Net (HRU-Net).

#### *4.1. The Original U-Net and U-Net*++

Initially, the U-Net was developed for biomedical image segmentation. We chose it as the base network to extract cultivated land as it achieves state-of-the-art performance on benchmark datasets even with limited training data [27,28]. Figure 3a shows the structure of the original U-Net network. It contains two main pathways: the contracting pathway on the left side and the expansive pathway on the right side.

In the contracting path, the input image was first passed through the typical architecture of a convolutional network for feature detection, which repeatedly applies a block of two 3 × 3 convolutions, a rectified linear unit (ReLU), and a 2 × 2 max-pooling operation. To enlarge the receptive field of the convolution kernels and give the network a more global view of the object's features, the pooling operation contracts the feature map to the next lower level. Meanwhile, a skip connection structure attempts to reduce the loss of image details caused by the pooling in the contracting path by passing the feature maps to the expansive path at the same level, as indicated by the gray arrows in Figure 3a.
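As an illustration, a minimal PyTorch sketch of one contracting-path stage; the module names and the use of padded convolutions are our assumptions (the original U-Net uses unpadded 3 × 3 convolutions):

```python
import torch.nn as nn

class DoubleConv(nn.Sequential):
    """Two 3x3 convolutions, each followed by a ReLU, as in the U-Net block."""
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

class Down(nn.Sequential):
    """2x2 max-pooling followed by a DoubleConv: one contracting-path stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))
```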

In the expansive path, the central idea was to combine the low-level feature maps while expanding the image size. First, the low-level feature map was up-sampled by a 2 × 2 transpose convolution. Second, the output was concatenated with the corresponding feature map from the skip connection at the same level. Third, two 3 × 3 convolutions with the rectified linear unit (ReLU) activation function were applied for further feature detection.

At the final layer, to match the number of channels to the number of classes in the final output, a 1 × 1 convolution with the Softmax activation function was used. The output of this network is the predicted probability of each class, *p*(*x*). The final class labels were obtained by selecting the class with the highest probability in the vector *p*(*x*). In this structure, the skip connection is the only path for restoring the high-resolution details at each convolution level.
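Continuing the sketch above, a hedged PyTorch version of one expansive-path stage and the final 1 × 1 Softmax head; `DoubleConv` refers to the block defined in the previous sketch, and all channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Up(nn.Module):
    """One expansive-path stage: 2x2 transpose convolution, concatenation
    with the skip-connection feature map, then two 3x3 convs with ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)  # DoubleConv from the sketch above

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))

# Final layer: 1x1 convolution mapping 64 features (assumed) to 2 classes,
# followed by Softmax over the channel (class) dimension.
head = nn.Sequential(nn.Conv2d(64, 2, kernel_size=1), nn.Softmax(dim=1))
# labels = head(features).argmax(dim=1)  # pick the highest-probability class
```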

**Figure 3.** (**a**) U-Net architecture [28] and (**b**) Simplified U-Net topology diagram from (**a**).

As shown in Figure 3b, in order to emphasize the skip connections between the feature maps at different levels, the structure of the U-Net (Figure 3a) was simplified by replacing each convolution process in Figure 3a with the symbol *Xij*, where *i* is the level index and *j* is the convolution node index at the same level. For example, *X*10 represents the first convolution module at the second level.

Another benefit of the U-Net is that the number of trainable parameters is relatively small. Other networks, such as the FCN and DeconvNet, are more complicated, have more trainable parameters, and require a bigger training set and a longer training time [35,37]. Usually, to reduce the training time, a pre-trained network can be used and only the top layers retrained on a new dataset. However, pre-trained networks are usually trained on natural images with RGB bands. As we hope to take full advantage of the multi-band data of remote-sensing images instead of only the RGB channels, this strategy does not work well when the number of channels differs between the pre-training and the new datasets. For this reason, the U-Net in this study was trained from scratch.

Under the hypothesis that the feature maps from the contracting path (encoder network) can enrich the prior for the expansive path (decoder network), UNet++ was proposed to increase the segmentation accuracy for medical images [54]. In UNet++, a small down-triangle structure was designed as the basic unit. With this unit, UNet++ can be easily extended to different levels depending on the accuracy and performance required by different tasks. The purpose of UNet++ is to achieve high overall segmentation accuracy in medical images in order to improve disease diagnosis. In this paper, we focus on the application of a deep learning model to satellite images, specifically on recovering the edge details of the land cover types that are lost during the "pooling" process. More details of the HRU-Net are described in the next section.

#### *4.2. The High-Resolution U-Net*

Giving the network the ability to learn the high-resolution details of the image is the key to solving the insufficient accuracy of cultivated land extraction caused by the loss of image details. The idea of the U-Net is to learn and recover high-resolution details directly from a low-resolution feature map by simply combining it with the feature map from the skip connection at the same level. In the first step, learning and recovering high-resolution information from a lower-level feature map is extremely difficult, as it requires the recovery of non-existent details. In the second step, simply adding the feature map from the skip connection to a low-level feature map can disturb the concise features learned at the low level. Moreover, the image details carried by the skip connection are limited, as they have already passed through the "pooling" operations of the previous feature detection.

Considering the multi-level structure of the U-Net, the higher the level at which a convolutional node is located, the fewer "pooling" operations are applied to it; as a result, more texture details remain in its feature maps. The key to solving this problem is to find a proper strategy that enriches the feature map details with information from the higher levels while reducing the noise-amplifying effect at the same time. The new structure proposed in this study, the HRU-Net, uses the idea of maintaining high-resolution details throughout the whole process to ensure that multi-resolution descriptions of the image are always present (Figure 4).

In this structure, the image details not only come from the same level but are also enriched from the higher level. To reduce the noise from the higher level and produce deeper semantic features, several convolutional nodes were added along the skip connection path. These new convolutional nodes increase the overall number of parameters, so, to learn the network parameters more efficiently, the idea of deep supervision was adopted to re-design the loss function. The network architecture is illustrated in Figure 4a. Compared to the original U-Net, the HRU-Net keeps the same structure in the contracting and expansive paths, while more skip connections are added between them. The simplified topology diagram of the HRU-Net is shown in Figure 4b, derived from Figure 4a by replacing each convolution process with the symbol *Xij* to clarify the skip connection structure of the HRU-Net.

In the following, we discuss two aspects: (1) how to improve the skip connection structure and (2) how to use the idea of deep supervision to design the loss function.

**Figure 4.** (**a**) The HRU-Net architecture and (**b**) the simplified topology diagram of the HRU-Net.

#### 4.2.1. Improving the Skip Connection Structure

Skip connections were first introduced in the FCN [35]. Since then, this structure has been widely adopted in many models to retain high-resolution details across different levels. In the U-Net, the feature maps in the contracting path are sent directly to the expansive path by skip connections. However, simply copying a feature map from the contracting path and merging it with the feature map from the lower level in the expansive path does not always work, as the details have already been lost before the skip connection. The basic idea for solving this problem is to borrow image details from a higher level to minimize the effect of the "pooling" (the green down-sampling arrows in Figure 4b). Following this idea, the skip connection in the HRU-Net was improved in the following two aspects:

(1) Maintaining resolution details at the same level

First, the HRU-Net maintained feature maps at the same layer by applying a repeated convolution module (shown in blue arrows in Figure 4b). Each module consisted of two 3 × 3 convolutions and a rectified linear unit. Then, it incorporated shallow features into deep features at each layer by a skip connection at the same level to retain details (shown in blue curved arrows in Figure 4).

(2) Fusing multi-scale details across different levels

The HRU-Net converts the high-resolution feature map to the size and number of channels required by the lower level by applying a 3 × 3 convolution with a stride of 2 (green arrows in Figure 4b); then, the HRU-Net combines this down-sampled feature map with the feature map from the previous node at the same level through a concatenation and convolution operation; finally, two 3 × 3 convolutions and a rectified linear unit are applied for further feature detection (blue arrows in Figure 4a,b).
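A minimal PyTorch sketch of this cross-level fusion follows; it reuses the hypothetical `DoubleConv` block from the Section 4.1 sketch, and the channel arguments are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FuseDown(nn.Module):
    """Cross-level fusion: a 3x3 stride-2 convolution shrinks the
    higher-level (higher-resolution) feature map to the lower level's
    size and channel count; the result is concatenated with the
    same-level features and refined by two 3x3 convs with ReLU."""
    def __init__(self, high_ch, same_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(high_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.conv = DoubleConv(same_ch + out_ch, out_ch)  # DoubleConv from Section 4.1's sketch

    def forward(self, same_level, higher_level):
        return self.conv(torch.cat([same_level, self.down(higher_level)], dim=1))
```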

The HRU-Net can be formulated as follows:

$$X\_{ij} = \begin{cases} c\big(d(X\_{(i-1)j})\big) & j=0\\ c\big([X\_{ik}]\_{k=0}^{j-1}\big) & i=0,\ j=1,2,3\\ c\big([X\_{ik}]\_{k=0}^{j-1},\ u(X\_{(i+1)(j-1)})\big) & i=0 \text{ and } j=4\\ c\big([X\_{ik}]\_{k=0}^{j-1},\ d(X\_{(i-1)j})\big) & j>0,\ i>0 \text{ and } i+j<4\\ c\big([X\_{ik}]\_{k=0}^{j-1},\ d(X\_{(i-1)j}),\ u(X\_{(i+1)(j-1)})\big) & j>0,\ i>0 \text{ and } i+j=4 \end{cases} \tag{1}$$

where *Xij* is the output feature map of node (*i*, *j*), with *i* the level index and *j* the convolution node index at the same level. The function *c*(·) represents the convolution operation, *u*(·) denotes an up-sampling operation, *d*(·) is a pooling or down-sampling operation, and [·] is the concatenation operation.
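To make the case analysis of Equation (1) concrete, the following Python sketch dispatches on (*i*, *j*) exactly as the five cases do; the helper names (`node`, `c`, `d`, `u`, `concat`) are ours, not from the paper, and the input node *X*00 is assumed to be computed from the input image elsewhere:

```python
def node(i, j, X, c, d, u, concat):
    """Compute X_ij per Equation (1). X is a 2-D grid of already-computed
    feature maps; c, d, u, and concat stand for the convolution,
    down-sampling, up-sampling, and concatenation operations."""
    same_level = [X[i][k] for k in range(j)]   # [X_ik] for k = 0 .. j-1
    if j == 0:                                 # contracting-path backbone nodes
        return c(d(X[i - 1][j]))
    if i == 0 and j < 4:                       # top-level refinement nodes
        return c(concat(same_level))
    if i == 0 and j == 4:                      # top-level output node
        return c(concat(same_level + [u(X[i + 1][j - 1])]))
    if i + j < 4:                              # interior nodes
        return c(concat(same_level + [d(X[i - 1][j])]))
    return c(concat(same_level + [d(X[i - 1][j]), u(X[i + 1][j - 1])]))  # expansive path
```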


#### 4.2.2. Using the Idea of Deep Supervision to Modify the Loss Function

When designing the input of the loss function, the U-Net only obtains the classification probabilities from *X*04. In contrast, the HRU-Net generates full-resolution feature maps at multiple semantic levels, *X*0*j*, *j* ∈ {1, 2, 3, 4}, which can be used to apply deep supervision. We first obtained the classification probabilities at the different semantic levels from *X*0*j*, *j* ∈ {1, 2, 3, 4}, through 1 × 1 convolutions with the Softmax activation function (marked by red arrows in Figure 4), and then obtained the predicted class probabilities *P*(*x*) by averaging these probabilities:

$$P(\mathbf{x}) = \begin{bmatrix} P\_0(\mathbf{x}) , P\_1(\mathbf{x}) \end{bmatrix}^T \tag{2}$$

where *Pi*(*x*) is the predicted probability of *x* belonging to class *i* (*i* = 0 for cultivated land, and *i* = 1 for non-cultivated land). The class label *y* of a given image can be calculated by obtaining the label from the maximized probability in *P*(*x*):

$$\mathbf{y} = \arg\max(\mathbf{P}(\mathbf{x})).\tag{3}$$

The loss function of the HRU-Net is defined as

$$H(\mathbf{Y}, \overline{\mathbf{Y}}) = -\frac{1}{N} \sum\_{i} \mathbf{Y}\_{i} \log \left( \overline{\mathbf{Y}}\_{i} \right) \tag{4}$$

with $\overline{Y}\_i$ and $Y\_i$ denoting the predicted and the actual probability of class *i*, respectively, and *N* being the batch size.
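As a sketch of how Equations (2)–(4) might be combined in PyTorch, assuming hypothetical per-level 1 × 1 heads (`heads`) and integer class targets; note that `nll_loss` averages over all pixels and batch elements, a slight generalization of the 1/*N* factor in Equation (4):

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(full_res_features, targets, heads):
    """Average the per-level class probabilities from X_0j, j = 1..4,
    then apply the cross-entropy loss of Equation (4).
    `heads` are hypothetical per-level 1x1 convolutions."""
    probs = [F.softmax(head(feat), dim=1)
             for head, feat in zip(heads, full_res_features)]
    p = torch.stack(probs).mean(dim=0)       # averaged P(x), Eq. (2)
    log_p = torch.log(p.clamp_min(1e-8))     # clamp for numerical safety
    return F.nll_loss(log_p, targets)        # cross entropy, Eq. (4)
```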

#### 4.2.3. Assessment

The accuracy evaluation metrics in this paper include (1) the overall accuracy, (2) Cohen's kappa coefficient, and (3) the F1-score. The overall accuracy is defined as the number of correctly classified pixels divided by the total number of pixels. It is simple and intuitive but may fail to assess the performance thoroughly when the number of samples per class varies significantly. Cohen's kappa coefficient is more robust, as it takes into account the possibility of agreement occurring by chance. Let *p*0 be the percentage of pixels correctly classified, and *pe* be the expected probability of agreement when the classifier assigns class labels by chance; Cohen's kappa coefficient is then defined as:

$$K = \frac{p\_0 - p\_e}{1 - p\_e}.\tag{5}$$

Usually, we characterize *K* < 0 as no agreement, [0, 0.20] as poor agreement, [0.20, 0.40] as fair agreement, [0.40, 0.60] as moderate agreement, [0.60, 0.80] as good agreement, and [0.80, 1] as almost perfect agreement. The F1-score is defined as the harmonic mean of the precision rate and recall rate:

$$\text{F1} = \frac{2 \times P \times R}{P + R} \tag{6}$$

where *P* (precision) is the number of true positive results (*TP*) divided by the number of all positive predictions (true positives *TP* plus false positives *FP*), and *R* (recall) is the number of true positive results (*TP*) divided by the number of all relevant samples (true positives *TP* plus false negatives *FN*):

$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$

An F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.
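Equations (5)–(8) translate directly into code. A minimal NumPy sketch follows, assuming binary maps with the positive class encoded as 1; this encoding is our assumption for readability and differs from the class indices of Section 4.2.2, where cultivated land is class 0:

```python
import numpy as np

def evaluate(pred, truth):
    """Overall accuracy, Cohen's kappa, and F1 for binary maps,
    computed from confusion-matrix counts (Equations (5)-(8))."""
    tp = np.sum((pred == 1) & (truth == 1))   # true positives
    tn = np.sum((pred == 0) & (truth == 0))   # true negatives
    fp = np.sum((pred == 1) & (truth == 0))   # false positives
    fn = np.sum((pred == 0) & (truth == 1))   # false negatives
    n = tp + tn + fp + fn

    oa = (tp + tn) / n                        # overall accuracy (p0)
    pe = ((tp + fp) * (tp + fn)               # chance agreement (pe)
          + (tn + fn) * (tn + fp)) / n**2
    kappa = (oa - pe) / (1 - pe)              # Eq. (5)
    precision = tp / (tp + fp)                # Eq. (7)
    recall = tp / (tp + fn)                   # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (6)
    return oa, kappa, f1
```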

#### **5. Results and Discussion**

#### *5.1. The Learning Process of the HRU-Net*

In this study, we hoped to take full advantage of the multi-band data of remote-sensing images instead of only RGB images. Thus, we decided to train all the networks (HRU-Net, U-Net++, the original U-Net, and RF) from scratch. To compare the performance of different numbers of bands, three datasets were prepared (Table 3). The contribution of the near-infrared (NIR) band can be analyzed by comparing the results of the TM-NRG with those of the TM-RGB dataset. Similarly, by comparing the results of the TM-All with those of the TM-NRG dataset, the improvement from the short-wave infrared (SWIR) bands can be investigated.

The HRU-Net, U-Net++, U-Net, and RF were trained and tested on the three datasets (Table 3) separately. In each dataset, all samples were randomly split into three subsets: the training set, used for model training; the validation set, used to calibrate the hyperparameters of the deep learning models; and the testing set, used for an independent assessment of the different models.

All experiments with the HRU-Net, U-Net++, and U-Net were carried out on four TITAN X GPUs. We used PyTorch as the deep-learning framework (https://pytorch.org/). To maximize the GPU memory usage, we set a different batch size for each network (HRU-Net and U-Net++: 24; U-Net: 48), and each network was trained with a different initial learning rate (HRU-Net: 0.0015; U-Net++: 0.002; U-Net: 0.0002). For all three networks, the stochastic gradient descent (SGD) optimizer with a momentum of 0.95 and a weight decay of 10<sup>−4</sup> was adopted, and the learning rate was decayed at every iteration by the factor 0.5 × (1 + cos(π · *iter*/*max_iters*)). The batch-norm parameters were learned with a decay rate of 0.9, and the input crop size for each training image was set to 256 × 256. Figure 5 shows the training history of the HRU-Net, U-Net++, and U-Net. Considering the popularity and the success of the RF in the classification of remote-sensing images, we also trained a traditional RF classifier on the same datasets as a comparison. The Scikit-learn implementation (http://scikit-learn.org, 2018) was adopted for the RF in our experiments, which employs an ensemble of optimized CART decision trees to improve the prediction accuracy while controlling over-fitting [55]. The detailed parameters of the random forest are shown in Table 4.
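For reference, a sketch of the reported optimizer settings in PyTorch; `model` and `max_iters` are placeholder names, and only the hyperparameters (momentum 0.95, weight decay 10<sup>−4</sup>, cosine schedule, HRU-Net initial learning rate 0.0015) come from the text:

```python
import math
import torch
import torch.nn as nn

model = nn.Conv2d(6, 2, kernel_size=1)  # placeholder standing in for the HRU-Net
max_iters = 10000                       # hypothetical total iteration count

optimizer = torch.optim.SGD(model.parameters(), lr=0.0015,  # HRU-Net initial rate
                            momentum=0.95, weight_decay=1e-4)

# Cosine decay: at iteration t, the base lr is scaled by 0.5 * (1 + cos(pi * t / max_iters)).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda t: 0.5 * (1 + math.cos(math.pi * t / max_iters)))

# In the training loop, call optimizer.step() and then scheduler.step() once per iteration.
```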

**Figure 5.** Visualizations of the training history for the HRU-Net, U-Net, and U-Net++ models.


**Table 4.** The parameters used in the random forest algorithm.

The training histories of the HRU-Net, U-Net, and U-Net++ models are visualized in Figure 5. The blue line represents the loss calculated on the training set at each epoch, and the orange line the loss calculated on the validation set. Both loss values are high at the beginning of the training process and decrease as the model improves with each epoch. The main purpose of Figure 5 is to verify that no overfitting occurred during training. As shown in Figure 5, all orange lines converge to a stable value, indicating that no overfitting happened during the training process. In other words, all three models were sufficiently trained and can be compared fairly with each other.

#### *5.2. Comparison of the HRU-Net with U-Net, U-Net*++*, and RF*

We compared the results of the HRU-Net, U-Net, U-Net++, and RF from three aspects: (1) the overall accuracy, (2) the accuracy of the edge details, and (3) the robustness to intra-class variation.
