#### *2.4. Methodology*

The DenseNet architecture feeds the outputs of all previous layers into each layer and combines these earlier features, but the abstract features it extracts are fairly limited. U-net, by contrast, performs deep feature extraction on the basis of the previous layer. Therefore, this study first improves the U-net and DenseNet networks and deeply couples them into the U-net-DenseNet-coupled network (UDN). Then, this network is combined with object-based multiresolution segmentation to construct the OUDN algorithm for intelligent and accurate extraction of urban land use and urban forest resources from VHSR images. The following subsections first give a brief introduction to CNNs and then describe the DL algorithms in detail.

#### 2.4.1. Brief Introduction of CNNs

Convolutional neural networks (CNNs) are the core DL algorithms in computer vision (CV) applications (such as image recognition) because of their ability to obtain hierarchically abstract representations with local operations [63]. This network structure was originally inspired by biological vision mechanisms. CNNs rely on four key ideas that take advantage of the properties of natural signals: local connections, shared weights, pooling, and the use of many layers [24].

As shown in Figure 3, the CNN structure consists of four basic processing layers: the convolution layer (Conv), nonlinear activation layer (such as ReLU), normalization layer (such as batch normalization (BN)), and pooling layer (Pooling) [63,64]. The first few layers are composed of two types of layers: convolutional layers and pooling layers. The units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank, and all units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks, so different features can be learned. The result of this local weighted sum is then passed through a nonlinear activation function such as ReLU, and the outputs are pooled and normalized (for example, with BN). The nonlinear activation and normalization blocks considerably improve model training, so they play a significant role in the CNN architecture. After multiple convolution stages (here, a convolutional layer combined with a pooling layer is referred to as one convolution), the results are flattened and used as the input of the fully connected layer, namely, the artificial neural network (ANN), which produces the final prediction. Specifically, the major operations performed in CNNs can be summarized by Equations (1)–(5):

$$S^{[l]} = \operatorname{pool}_p\left(\phi\left(S^{[l-1]} * W^{[l]} + b^{[l]}\right)\right) \tag{1}$$

$$\phi(Z) = R = \begin{cases} Z, & \text{if } Z \ge 0 \\ 0, & \text{if } Z < 0 \end{cases} \tag{2}$$

$$\mu = \frac{1}{m} \sum_{i=1}^{m} R^{(i)} \tag{3}$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left( R^{(i)} - \mu \right)^2 \tag{4}$$

$$R_{\text{norm}}^{(i)} = \frac{R^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \tag{5}$$

where *S*<sup>[*l*]</sup> indicates the feature map at the *l*th layer [25], *S*<sup>[*l*−1]</sup> denotes the input feature map to the *l*th layer, and *W*<sup>[*l*]</sup> and *b*<sup>[*l*]</sup> represent the weights and biases of the layer, respectively, which convolve the input feature map through the linear convolution ∗. These steps are often followed by a max-pooling operation with a *p* × *p* window (pool<sub>*p*</sub>) to aggregate the statistics of the features within specific regions, which forms the output feature map *S*<sup>[*l*]</sup>. ϕ(*Z*) = *R* is the nonlinear function applied outside the convolution layer to correct the convolution result of each layer, *Z* denotes the result of the convolution operation *S*<sup>[*l*−1]</sup> ∗ *W*<sup>[*l*]</sup> + *b*<sup>[*l*]</sup>, *m* represents the batch size (the number of samples required for a single training iteration), μ represents the mean, σ<sup>2</sup> represents the variance, ε is a small constant that keeps the value stable by preventing √(σ<sup>2</sup> + ε) from being 0, and *R*<sub>norm</sub><sup>(*i*)</sup> is the normalized value.
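As an illustration of Equations (1)–(5), the following minimal sketch applies one shared filter, ReLU, max-pooling, and batch normalization to toy data in plain Python. The function names, the 5 × 5 input, and the 2 × 2 kernel are illustrative assumptions rather than part of this study, and the convolution is written in the sliding-window (cross-correlation) form commonly used in DL frameworks.

```python
import math

def conv2d_valid(image, kernel, bias=0.0):
    """Z = S * W + b for one shared filter (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw)) + bias
             for j in range(out_w)]
            for i in range(out_h)]

def relu(z):
    """Equation (2): keep non-negative values and set the rest to zero."""
    return [[max(0.0, v) for v in row] for row in z]

def max_pool(r, p=2):
    """pool_p in Equation (1): maximum over non-overlapping p x p windows."""
    return [[max(r[i + u][j + v] for u in range(p) for v in range(p))
             for j in range(0, len(r[0]) - p + 1, p)]
            for i in range(0, len(r) - p + 1, p)]

def batch_norm(values, eps=1e-5):
    """Equations (3)-(5) for a batch of m scalar activations."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return [(v - mu) / math.sqrt(var + eps) for v in values]

# Toy 5 x 5 input and 2 x 2 filter (hypothetical values).
image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [1, 3, 2, 0, 1]]
kernel = [[1, 0], [0, -1]]

feature_map = max_pool(relu(conv2d_valid(image, kernel)), p=2)
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

Because the same `kernel` is reused at every position, the sketch also shows the weight sharing described above: the number of weights depends on the filter size, not on the image size.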

**Figure 3.** The classical structure of convolutional neural networks (CNNs). Batch normalization (BN) is a technique for accelerating network training by reducing internal covariate shift.

#### 2.4.2. DL Algorithms

**Improved DenseNet (D):** DenseNet is based on ResNet [65], and its most important characteristic is that each layer of the network takes the feature maps of all previous layers as input. In turn, the feature maps of each layer are used as input by the following layers, so the vanishing-gradient problem can be alleviated and the number of parameters can be reduced. The improved DenseNet network structure in this study is shown in Figure 4. Figure 4a shows the complete structure, which adopts 3 Dense Blocks and 2 Transition layers. Before the first Dense Block, two convolutions are used. In this study, the bottleneck layer (1 × 1 convolution) in the Transition layer is replaced by a 3 × 3 convolution operation, followed by an upsampling layer and finally the prediction result. The specific Dense Block structure is shown in Figure 4b and summarized by Equation (6):

$$\mathbf{X}_{\ell} = H_{\ell}([\mathbf{X}_0, \mathbf{X}_1, \dots, \mathbf{X}_{\ell-1}]), \tag{6}$$

where [**X**<sub>0</sub>, **X**<sub>1</sub>, …, **X**<sub>ℓ−1</sub>] denotes the concatenation of the feature maps of layers 0, 1, …, ℓ−1, and *H*<sub>ℓ</sub>([**X**<sub>0</sub>, **X**<sub>1</sub>, …, **X**<sub>ℓ−1</sub>]) indicates that the ℓth layer takes all feature maps of the previous layers (**X**<sub>0</sub>, **X**<sub>1</sub>, …, **X**<sub>ℓ−1</sub>) as input. In this study, all the convolution operations in the Dense Block use 3 × 3 convolution kernels, and the number of output feature maps (*K*) in each layer is set to 32.
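The channel bookkeeping implied by Equation (6) can be sketched as follows. This is a minimal illustration: only the growth of *K* = 32 feature maps per layer comes from the text, while the 64 input channels and the 4 layers per block are hypothetical values chosen for the example.

```python
def dense_block_channels(in_channels, num_layers, growth_rate=32):
    """Return the input channel count seen by each layer of a dense block
    and the channel count of the concatenated block output."""
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        inputs_per_layer.append(channels)  # layer l sees [X_0, ..., X_{l-1}]
        channels += growth_rate            # its K new maps are concatenated
    return inputs_per_layer, channels

# Hypothetical block: 64 input channels, 4 layers, K = 32 as in the text.
per_layer, out_channels = dense_block_channels(in_channels=64, num_layers=4)
```

Each layer therefore receives 32 more input channels than the one before it, which is how later layers reuse all earlier features.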

**Figure 4.** The improved DenseNet structure composed of three dense blocks: (**a**) the complete structure and (**b**) the Dense Block composed of five feature map layers.

**Improved U-net (U):** U-net is an improved fully convolutional network (FCN) [66]. This network has attracted extensive attention because of its clear structure and excellent performance on small data sets. U-net is divided into a contracting path (to effectively capture contextual information) and an expansive path (to achieve a more precise position for the pixel boundary). Considering the characteristics of urban land use categories and the rich details of WorldView-3 images, the improved structure in this study mainly increases the number of network layers to 11 and adds convolution operations to each layer, thereby obtaining increasingly abstract features. The network is built around convolution filters that produce feature maps at different resolutions, so the structural features of the image can be detected at different scales. More importantly, BN is performed before each convolutional layer and pooling layer; the details are shown in Figure 5.

(1) The left half of the network is the contracting path. The input is a 128 × 128 image, and each layer uses three 3 × 3 convolution operations, each followed by the ReLU activation function; max-pooling with a stride of 2 is then applied for downsampling. In each downsampling stage, the number of feature channels is doubled. Five downsamplings are applied, followed by two 3 × 3 convolutions in the bottom layer of the network architecture. The size of the feature maps is eventually reduced to 4 × 4 pixels, and the number of feature map channels is 1024.

(2) The right half of the network, that is, the expansive path, mainly restores the feature information of the original image. First, a deconvolution kernel with a size of 2 × 2 is used to perform upsampling. In this process, the number of feature map channels is halved, and the feature maps at the symmetrical positions of the downsampling and upsampling paths are merged; then, three 3 × 3 convolution operations are performed on the merged features, and the above operations are repeated until the image is restored to the size of the input image; ultimately, four 3 × 3 convolution operations, one 1 × 1 convolution operation, and a Softmax activation function are used to complete the category prediction of each pixel in the image. The Softmax activation function is defined in Equation (7):

$$p_k(\mathbf{X}) = \frac{\exp(a_k(\mathbf{X}))}{\sum_{k'=1}^{K} \exp(a_{k'}(\mathbf{X}))}, \tag{7}$$

where *a*<sub>*k*</sub>(*X*) represents the activation value of the *k*th channel at the position of pixel *X*, *K* indicates the number of categories, and *p*<sub>*k*</sub>(*X*) is the approximated maximum function. If the *k*th channel has the largest activation value *a*<sub>*k*</sub>(*X*) at that pixel, *p*<sub>*k*</sub>(*X*) is approximately equal to 1; in contrast, *p*<sub>*k*</sub>(*X*) is approximately equal to zero for the other values of *k*.
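Equation (7) can be checked numerically with a short sketch for a single pixel. The three activation values are hypothetical, and subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(activations):
    """Equation (7) for the K channel activations of one pixel."""
    shift = max(activations)  # stability shift; cancels in the ratio
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations a_k(X) for K = 3 categories at one pixel.
probs = softmax([1.0, 2.0, 9.0])
predicted_class = probs.index(max(probs))
```

The channel with the clearly largest activation receives a probability close to 1 and the remaining channels receive probabilities close to 0, which is the behaviour described for *p<sub>k</sub>*(*X*).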

**UDN:** The detailed coupling process of the improved U-net and DenseNet is shown in Figure 6. (a) The first two layers use the same convolutional layer and pooling layer to obtain abstract feature maps; (b) then, the feature maps obtained by the above operations are input into the Combining Block structure to realize the coupling of the convolution results from the two structures. After two convolution operations are performed on the coupling result, max-pooling is used to perform downsampling, followed by two Combining Block operations; (c) after the downsampling, two convolutions are performed on the coupling result to obtain 1024 feature maps of 4 × 4; (d) the smallest feature maps (4 × 4 × 1024) are restored to the size of the original image after 5 upsamplings; (e) finally, the classification result is output based on the front feature maps through the 1 × 1 convolution operations and the Softmax function.
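The feature-map sizes quoted above (a 128 × 128 input reduced to 4 × 4 × 1024 and restored by 5 upsamplings) can be traced with a short sketch. It assumes that every downsampling halves the spatial size and doubles the channel count, as stated for the improved U-net; the 32 channels at the first level are inferred from 1024 / 2<sup>5</sup> and are not given explicitly in the text.

```python
def trace_shapes(size=128, first_channels=32, num_downsamplings=5):
    """Return (spatial size, channels) per level on the way down and back up."""
    down = [(size, first_channels)]
    for _ in range(num_downsamplings):
        s, c = down[-1]
        down.append((s // 2, c * 2))  # stride-2 pooling halves size; channels double
    up = list(reversed(down[:-1]))    # each upsampling doubles size, halves channels
    return down, up

down, up = trace_shapes()
```

Under these assumptions the bottom level is 4 × 4 with 1024 channels, and the fifth upsampling returns the feature maps to the 128 × 128 input size.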

**Figure 5.** The improved U-net structure is composed of eleven convolution layers.

**OUDN:** The boundary information of the categories is the basis for the accurate classification of VHSR images. In this study, the OUDN algorithm combines the category objects obtained by object-based multiresolution segmentation [18] with the classification results of the UDN algorithm to constrain and optimize those results. Four multispectral bands (red, green, blue, and near infrared), together with vegetation indices and texture features, which are useful for differentiating urban land use objects with complex information, are used as multiple input data sources for the image segmentation in eCognition software. Then, all the image objects are transformed into GIS vector polygons with distinctive geometric shapes, which are combined with the classification results of the UDN algorithm. Using the Spatial Analysis Tools of ArcGIS, the number of pixels of each category is counted within each object, and the category with the most pixels is assigned to the object by majority voting. The final classification results of the OUDN algorithm are thereby obtained. The segmentation scale directly affects the boundary accuracy of the categories. Therefore, following the selection method for the optimal segmentation scale [67], this study generates segmentation results at different segmentation scales and sets the final segmentation scale to 50.
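The majority-voting step can be sketched in plain Python as follows. This is only an illustration of the voting rule on raster grids: in the study itself the objects come from eCognition and the statistics from ArcGIS, and the function name, the 3 × 3 maps, and the tie-breaking rule (the first category encountered wins) are assumptions made for the example.

```python
from collections import Counter

def majority_vote(class_map, object_map):
    """Assign to every pixel the most frequent class within its segment."""
    votes = {}
    for class_row, object_row in zip(class_map, object_map):
        for label, obj in zip(class_row, object_row):
            votes.setdefault(obj, Counter())[label] += 1
    winner = {obj: counts.most_common(1)[0][0] for obj, counts in votes.items()}
    return [[winner[obj] for obj in object_row] for object_row in object_map]

# Hypothetical pixel-wise classes and segment ids for a 3 x 3 patch.
class_map = [["Forest", "Forest", "Grassland"],
             ["Forest", "Grassland", "Grassland"],
             ["Water", "Grassland", "Grassland"]]
object_map = [[1, 1, 2],
              [1, 1, 2],
              [1, 2, 2]]

refined = majority_vote(class_map, object_map)
```

In this toy patch, the isolated Grassland and Water pixels inside object 1 are relabelled as Forest, which is how the object constraint removes small fragments from the pixel-wise result.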

**Figure 6.** The network structure of the UDN algorithm, where NF represents the number of convolutional filters. (**a**) The first two layers (Level1 and Level2) including convolutional layers and pooling layers; (**b**) the coupling of U-net and DenseNet algorithms; (**c**) the bottom layer of the network; (**d**) Upsampling layers; (**e**) predicted classification result.

Finally, the template used in this research for training the neural networks with minibatches, based on the above algorithms, is shown in Algorithm 1 [68]. The network uses categorical cross-entropy as the loss function and Adam as the adaptive optimization algorithm. Additionally, the number of iterations is set to 50, and the learning rate (*lr*) is set to 0.0001. In each iteration, *b* images are sampled to compute the gradients, and then the network parameters are updated. The training of the network stops after *K* passes through the data set.
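To make the training template concrete, the following sketch runs a minibatch loop with categorical cross-entropy and the standard Adam update on a toy two-class softmax classifier with one input feature. The model, the data, and the function names are illustrative assumptions; the defaults of 50 passes and *lr* = 0.0001 mirror the values above under the assumption that the 50 iterations correspond to passes through the data set, and the demonstration call uses a larger learning rate so that the effect is visible on eight samples.

```python
import math
import random

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(params, batch, n_classes):
    """Mean categorical cross-entropy of a 1-feature softmax classifier and its gradient."""
    w, b = params[:n_classes], params[n_classes:]
    loss, grad = 0.0, [0.0] * len(params)
    for x, y in batch:
        p = softmax([w[k] * x + b[k] for k in range(n_classes)])
        loss -= math.log(p[y])
        for k in range(n_classes):
            delta = p[k] - (1.0 if k == y else 0.0)  # d(loss)/d(logit_k)
            grad[k] += delta * x
            grad[n_classes + k] += delta
    n = len(batch)
    return loss / n, [g / n for g in grad]

def train(data, n_classes=2, epochs=50, batch_size=4, lr=1e-4,
          beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = random.Random(seed)
    params = [0.0] * (2 * n_classes)               # initialize(net)
    m = [0.0] * len(params)                        # first-moment estimate
    v = [0.0] * len(params)                        # second-moment estimate
    step = 0
    for _ in range(epochs):                        # passes through the data set
        for _ in range(len(data) // batch_size):   # minibatches per pass
            batch = rng.sample(data, batch_size)   # uniformly sample b samples
            _, grad = loss_and_grad(params, batch, n_classes)
            step += 1
            for i, g in enumerate(grad):           # Adam update with bias correction
                m[i] = beta1 * m[i] + (1 - beta1) * g
                v[i] = beta2 * v[i] + (1 - beta2) * g * g
                m_hat = m[i] / (1 - beta1 ** step)
                v_hat = v[i] / (1 - beta2 ** step)
                params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy separable data: (feature, class label).
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]
loss_before, _ = loss_and_grad([0.0] * 4, data, 2)
loss_after, _ = loss_and_grad(train(data, lr=0.05), data, 2)
```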

#### **Algorithm 1 Train a neural network with the minibatch Adam optimization algorithm.**

```
initialize(net)
for epoch = 1, ..., K do
     for batch = 1, ..., #images/b do
          images ← uniformly sample batch_size images
          X, y ← preprocess(images)
          z ← forward(net, X)
          l ← loss(z, y)
          lr, grad ← backward(l)
          update(net, lr, grad)
     end for
end for
```

#### *2.5. Experiment Design*

The flowchart of steps is shown in Figure 7. The WorldView-3 image with 15,872 × 15,872 pixels is first preprocessed by image fusion, radiometric calibration, and atmospheric correction (Figure 7a). From this preprocessed image, a 3968 × 3968 pixel subimage containing various categories is cropped for model prediction, and other representative subimages are cropped as the sample set, which comprises the training set and the validation set for model training; then, labeled maps are made based on the sample set, followed by image cropping (Figure 7b); the cropped original images and the corresponding labeled maps are used to train the DL models (Figure 7c); the image with 3968 × 3968 pixels is classified by the trained model (Figure 7d); finally, the objects from multiresolution segmentation are applied to optimize the classification results of the UDN algorithm to obtain the classification results of the OUDN, followed by detailed comparisons of the results from all algorithms, including U, D, UDN, and OUDN (Figure 7e).
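The cropping step can be illustrated with a small tiling sketch. It assumes that the 3968 × 3968 prediction image is split into non-overlapping blocks of 128 × 128 pixels (the network input size); the cropping scheme actually used is not specified in this section, so the function below is hypothetical.

```python
def tile_origins(image_size, tile_size):
    """Top-left (row, col) of each non-overlapping tile covering a square image."""
    if image_size % tile_size != 0:
        raise ValueError("image size must be a multiple of the tile size")
    steps = range(0, image_size, tile_size)
    return [(r, c) for r in steps for c in steps]

origins = tile_origins(3968, 128)  # 31 tiles per side under this assumption
```

Under this assumption the prediction image is covered exactly by 31 × 31 blocks, with no partial tiles at the edges.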

**Figure 7.** A flowchart of the experimental method in this paper, including five major steps: (**a**) image preprocessing; (**b**) image labeling and image cropping; (**c**) model training; (**d**) model prediction; (**e**) object-based optimization of the UDN results and comparisons of the results from all algorithms.

#### **3. Results and Analysis**

The tests of the proposed OUDN algorithm are presented in this section, and the classification results are compared with those of UDN, improved U-net (U), and improved DenseNet (D). To evaluate the proposed algorithm, the classification results in this study were assessed with the overall accuracy (OA), kappa coefficient (Kappa), producer accuracy (PA), and user accuracy (UA) [69]. The detailed results and analysis of the model training and classification are given below.
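For reference, the four measures can be computed from a confusion matrix as in the following sketch, which uses their standard definitions. The matrix orientation (rows are reference classes, columns are predicted classes) and the 2 × 2 counts are assumptions made for the example and are not results of this study.

```python
def accuracy_metrics(cm):
    """cm[i][j] = number of reference-class-i pixels predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    diag = [cm[i][i] for i in range(n)]
    oa = sum(diag) / total                                   # overall accuracy
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - expected) / (1 - expected)                 # chance-corrected agreement
    pa = [d / r for d, r in zip(diag, row_sums)]             # producer accuracy per class
    ua = [d / c for d, c in zip(diag, col_sums)]             # user accuracy per class
    return oa, kappa, pa, ua

# Hypothetical two-class confusion matrix.
cm = [[45, 5],
      [10, 40]]
oa, kappa, pa, ua = accuracy_metrics(cm)
```

PA relates the correctly classified pixels of a class to its reference total (omission errors), whereas UA relates them to the total predicted as that class (commission errors).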

#### *3.1. Training Results of U, D and UDN Algorithms*

There were a total of 4761 image blocks with 128 × 128 pixels in the sample set. Of these, 3984 blocks were selected for training, and the remaining 777 blocks were used for validation. Then, the cropped original image blocks and the corresponding labeled maps were used to train the minibatch network model according to the template of Algorithm 1. Based on the three feature groups of Spe, Spe-Index, and Spe-Texture, the overall model accuracies, including training accuracy (TA) and validation accuracy (VA), of the U, D, and UDN algorithms are shown in Table 3. In all feature combinations, the UDN algorithm obtained the highest training accuracies (98.1%, 98%, and 98.4%). However, for the U and D algorithms, the training accuracies of the Spe-Texture were the lowest (96.3% and 96%) compared with those of the Spe and Spe-Index. The UDN algorithm achieved the highest model accuracies (TA of 98.4% and VA of 93.8%) based on Spe-Texture.

**Table 3.** The overall training and validation accuracies of the improved U-net (U), improved DenseNet (D), and U-net-DenseNet-coupled network (UDN) algorithms based on the three feature groups of Spe, Spe-Index, and Spe-Texture. The algorithms in this table did not include object-based U-net-DenseNet-coupled network (OUDN), since the OUDN algorithm was based on UDN algorithm to optimize classification results.


#### *3.2. Classification Results*

#### 3.2.1. Classification Results Based on Four Algorithms

The classification accuracies of the U, D, UDN, and OUDN algorithms on the three feature groups of Spe, Spe-Index, and Spe-Texture are presented in Tables 4–6, respectively. In general, among the three feature combinations, the U and D algorithms yielded the lowest OA and Kappa, followed by UDN; in contrast, the OUDN algorithm achieved the highest OA (92.3%, 92.6%, and 93.8%) and Kappa (0.910, 0.914, and 0.928). The average accuracy of the OUDN algorithm was much higher (by approximately 3%) than those of the U and D algorithms. As shown in Table 4, the UDN algorithm obtained better accuracies for Agricultural Land and Grassland than the U and D algorithms. For example, the PA values of Agricultural Land were 89%, 88%, and 90.3% for the U, D, and UDN algorithms, respectively, and the PA values of Grassland were 64.3%, 73%, and 74%, respectively. Compared with the UDN algorithm, the OUDN algorithm obtained better PA values for Agricultural Land, Grassland, Barren Land, and Water. Table 5 shows that the PA of Agricultural Land of the UDN algorithm was 5% and 3.3% higher than those of the U and D algorithms, respectively. In addition, the OUDN algorithm mainly yielded improvements in the PA values of Forest, Built-up, Agricultural Land, Grassland, and Barren Land. As shown in Table 6, the UDN algorithm yielded higher accuracies for Forest, Built-up, Grassland, Barren Land, and Water than the U and D algorithms; in particular, the PA value of Grassland was significantly higher, by 15.6% and 17.3%, respectively. Meanwhile, the OUDN algorithm yielded accuracies superior to those of the UDN algorithm in some categories. In summary, the OUDN algorithm obtained high extraction accuracies for urban land use types, and coupling object-based segmentation effectively addressed the fragmentation problem of classification with high-resolution images, thereby improving the image classification accuracy. Therefore, the OUDN algorithm offered great advantages for urban land-cover classification.

**Table 4.** The classification accuracies of the U, D, UDN, and OUDN algorithms based on the Spe, including the accuracies (user accuracy (UA) and producer accuracy (PA)) of every class, overall accuracy (OA) and kappa coefficient (Kappa).



**Table 5.** The classification accuracies of the U, D, UDN, and OUDN algorithms based on the Spe-Index, including the accuracies (UA and PA) of every class, OA and Kappa.

**Table 6.** The classification accuracies of the U, D, UDN, and OUDN algorithms based on the Spe-Texture, including the accuracies (UA and PA) of every class, OA and Kappa.


The classification maps of the four algorithms based on the Spe, Spe-Index, and Spe-Texture are presented in Figures 8–10, respectively, with the correct or incorrect classification results marked in black or red circles, respectively. In general, the classification results of the UDN and OUDN algorithms were better than those of the other methods, and there was no obvious "salt-and-pepper" effect in the classification results of the four algorithms. However, due to the splicing in the U, D, and UDN algorithms, the ground object boundary exhibited discontinuities, whereas the proposed OUDN algorithm addressed this problem to a certain extent.

**Classification maps of different algorithms based on Spe:** Based on the Spe, the proposed method in this paper better identified the ground classes that are difficult to distinguish, including Built-up, Barren Land, Agricultural Land, and Grassland. However, the recognition effect of the U and D algorithms was undesirable. As shown in Figure 8, the U and D algorithms confused Built-up and Barren Land (red circle (1)), while the UDN and OUDN algorithms correctly distinguished them (black circle (1)); for the U algorithm, Built-up was misclassified as Barren Land (red circle (2)), while the other algorithms accurately identified these classes (black circle (2)); the D algorithm did not identify Barren Land (red circle (3)), whereas the recognition effect of the other methods was favorable (black circle (3)); for the U and D algorithms, Grassland was misclassified as Agricultural Land (red circle (4)), while the other algorithms precisely distinguished them (black circle (4)); all four algorithms mistakenly classified some Agricultural Land as Grassland (red circle (5)).

**Classification maps of different algorithms based on Spe-Index:** Based on the Spe-Index, the proposed method in this paper better recognized Built-up, Barren Land, Agricultural Land, and Grassland. However, the recognition effect of the U and D algorithms was poor. As demonstrated by Figure 9, the U and D algorithms confused Built-up and Barren Land (red circle (1)), whereas the UDN and OUDN algorithms correctly distinguished them (black circle (1)); the U algorithm incorrectly identified Barren Land (red circle (2)), while the classification results of the other algorithms were superior (black circle (2)); the U and D algorithms mistakenly classified Barren Land as Agricultural Land (red circle (3)), whereas the UDN and OUDN algorithms better identified them (black circle (3)); for all four algorithms, some Agricultural Land was misclassified as Grassland (red circle (4)).

**Figure 8.** (**a**) Original image; (**b**) the classification map of the U algorithm based on the Spe; (**c**) the classification map of the D algorithm based on the Spe; (**d**) the classification map of the UDN algorithm based on the Spe; (**e**) the classification map of the OUDN algorithm based on the Spe; and the red and black circles denote incorrect and correct classifications, respectively.

**Figure 9.** (**a**) Original image; (**b**) the classification map of the U algorithm based on the Spe-Index; (**c**) the classification map of the D algorithm based on the Spe-Index; (**d**) the classification map of the UDN algorithm based on the Spe-Index; (**e**) the classification map of the OUDN algorithm based on the Spe-Index; and the red and black circles denote incorrect and correct classifications, respectively.

**Figure 10.** (**a**) Original image; (**b**) the classification map of the U algorithm based on the Spe-Texture; (**c**) the classification map of the D algorithm based on the Spe-Texture; (**d**) the classification map of the UDN algorithm based on the Spe-Texture; (**e**) the classification map of the OUDN algorithm based on the Spe-Texture; and the red and black circles denote incorrect and correct classifications, respectively.

**Classification maps of the different algorithms based on Spe-Texture:** Based on the Spe-Texture, the proposed method in this paper better identified each category, especially Grassland, yielding the best recognition result; in contrast, the recognition effect of the U and D algorithms was worse. As shown in Figure 10, the U and D algorithms incorrectly classified much Barren Land as Agricultural Land (red circle (1)), whereas the UDN and OUDN algorithms identified these types better (black circle (1)); the D algorithm confused Built-up and Barren Land (red circle (2)), while the other algorithms better distinguished them (black circle (2)); the extraction effects for Grassland of the UDN and OUDN algorithms (black circle (3)) were better than those of the U and D algorithms (red circle (3)); all the algorithms mistakenly classified some Agricultural Land as Grassland (red circle (4)).

#### 3.2.2. Extraction Results of Urban Forests

This section focuses on the analysis of urban forest extraction based on the Spe, Spe-Index, and Spe-Texture with the four algorithms. As shown in Tables 4–6, the PA values of urban forest extraction for all algorithms were above 98%, which indicated that the DL algorithms used in this study offered obvious advantages in the extraction of urban forests. Additionally, for the OUDN algorithm, the average PA (99.1%) and UA (89.3%) of urban forest extraction over the three groups of features were better than those of the other algorithms. This demonstrated that the OUDN algorithm produced fewer omission errors (urban forest leakage) and fewer misclassification errors between urban forests and other land use types.

The classification results for urban forests, including scattered trees and street trees, of the different algorithms based on the Spe-Texture are presented in Figure 11. In this study, two representative subregions (subset (1) and subset (2)) were selected for the analysis of the results of the different algorithms, with the correct or incorrect classification results marked in black or blue circles, respectively. In general, the urban forest extraction effect of the OUDN algorithm was the best. According to the classification results of subset (1), the U and D algorithms mistakenly identified some street trees (blue circles), while UDN and OUDN better extracted these trees (black circles). As shown in the results of subset (2), the extraction results for some scattered trees of the U and D algorithms were not acceptable (blue circles); nevertheless, UDN and OUDN accurately distinguished them (black circles). Additionally, the U and D algorithms misclassified some Forest as Grassland and Built-up (blue circles), whereas UDN and OUDN correctly identified the urban forests (black circles).

**Figure 11.** There are two subsets (Subset (**1**) and Subset (**2**)) dominated by urban forests. (**a**) the classification maps of the U algorithm in the subsets; (**b**) the classification maps of the D algorithm in the subsets; (**c**) the classification maps of the UDN algorithm in the subsets; and (**d**) the classification maps of the OUDN algorithm in the subsets.

#### 3.2.3. Result Analysis

According to the classification results of the four algorithms on the Spe, Spe-Index, and Spe-Texture, confusion matrices were constructed, as shown in Figure 12. In general, regardless of the feature combination, the classification accuracies of each algorithm for Forest, Built-up, Water, and Others were relatively high, with recognition accuracies above 95%. In particular, the classification accuracy of Forest was above 98%, whereas the classification accuracies of the other categories varied greatly. As demonstrated by Figure 12, (1) based on the Spe, the extraction accuracies of the OUDN algorithm for Agricultural Land and Grassland were significantly superior to those of the U and D algorithms, whereas the U and D algorithms misclassified Agricultural Land and Grassland as Forest at a higher rate. Compared with the U algorithm, the OUDN algorithm yielded better Grassland classification accuracy (75%, an increase of 11%) while also improving on the extraction accuracy of UDN (74%). The D algorithm misclassified 15% of the Barren Land as Built-up, whereas only 8% was incorrectly predicted by the UDN and OUDN algorithms. Therefore, the OUDN algorithm offered obvious advantages in urban land-cover classification. (2) For the Spe-Index, compared with the U and D algorithms (87% and 89%, respectively), the OUDN algorithm yielded a higher extraction accuracy for Agricultural Land (94%) and improved on the classification accuracy of UDN (92%). The U and D algorithms misclassified 12% of the Barren Land as Built-up, whereas only 11% and 10% were incorrectly predicted by the UDN and OUDN algorithms, so the OUDN algorithm achieved the best classification effect. (3) For the Spe-Texture, the extraction accuracies of the UDN and OUDN algorithms for Grassland were very high (83% and 85%, respectively), and the accuracies were the highest among all the Grassland classification results.
Compared with the classification accuracies of the U and D algorithms (68% and 66%), the accuracies of UDN and OUDN were 15–19% higher. Figure 12 showed that the U and D algorithms misclassified 21% and 25% of the Grassland as Agricultural Land, respectively, whereas the misclassification rates of UDN and OUDN were fairly low (7% and 6%, respectively).
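Percentages such as "15% of the Barren Land misclassified as Built-up" are consistent with a confusion matrix normalized by reference class. The sketch below shows this normalization under that assumption; the pixel counts are hypothetical and are not taken from Figure 12.

```python
def row_percentages(cm):
    """Convert counts to rounded percentages of each reference class (row)."""
    return [[round(100 * v / sum(row)) for v in row] for row in cm]

# Hypothetical counts: rows are reference classes, columns are predictions.
counts = [[170, 20, 10],
          [30, 160, 10],
          [0, 10, 190]]
percent = row_percentages(counts)
```

In a matrix normalized this way, the diagonal entry of a row is the accuracy for that reference class and the off-diagonal entries are its misclassification rates into the other classes.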

**Figure 12.** (**a1**) Confusion matrix of U algorithm based on Spe; (**b1**) Confusion matrix of U algorithm based on Spe-Index; (**c1**) Confusion matrix of U algorithm based on Spe-Texture; (**a2**) Confusion matrix of D algorithm based on Spe; (**b2**) Confusion matrix of D algorithm based on Spe-Index; (**c2**) Confusion matrix of D algorithm based on Spe-Texture; (**a3**) Confusion matrix of UDN algorithm based on Spe; (**b3**) Confusion matrix of UDN algorithm based on Spe-Index; (**c3**) Confusion matrix of UDN algorithm based on Spe-Texture; (**a4**) Confusion matrix of OUDN algorithm based on Spe; (**b4**) Confusion matrix of OUDN algorithm based on Spe-Index; (**c4**) Confusion matrix of OUDN algorithm based on Spe-Texture.

For urban forests, as demonstrated by Figure 12, (1) based on the Spe, the extraction accuracy of urban forests was 99% for each algorithm; however, these algorithms generally misclassified Agricultural Land and Grassland as Forest. Compared with those of the U and D algorithms, OUDN's rates of misclassification of Agricultural Land and Grassland as Forest were the lowest (5% and 4%); (2) based on the Spe-Index, the OUDN algorithm obtained the highest urban forest extraction accuracy (99%) and the lowest rates of Agricultural Land and Grassland misclassified as Forest (4% and 7%); (3) based on the Spe-Texture, the urban forest extraction accuracy of the OUDN algorithm was the highest (approximately 100%).

Through the above analysis, it was concluded that (1) the classification results of the OUDN algorithm were significantly better than those of the other algorithms for easily confused ground categories (such as Agricultural Land, Grassland, and Barren Land); (2) the accuracy of the UDN algorithm was improved through object constraints; (3) especially for Spe-Texture, the OUDN algorithm achieved the highest OA (93.8%), which was 4% and 4.1% higher than those of the U and D algorithms, respectively; (4) the UDN and OUDN algorithms had obvious advantages regarding the accurate extraction of urban forests, and they not only accurately extracted the street trees but also identified the scattered trees ignored by the U and D algorithms.
