**1. Introduction**

Urban land use mapping and the information extraction of urban forest resources are significant yet challenging tasks in the field of remote sensing and have great value for urban environment monitoring, planning, and design [1–3]. In addition, smart cities are now an irreversible trend in urban development worldwide, and urban forests constitute "vital," "green," and indispensable infrastructure in cities. Therefore, the intelligent mapping of urban forest resources from remote sensing data is an essential component of smart city construction.

Over the past few decades, multispectral (such as the Thematic Mapper (TM)) [4–7], hyperspectral, and LiDAR [8–10] techniques have played important roles in the monitoring of urban forest resources. Currently, with the rapid development of modern remote sensing technologies, a very large amount of very-high-spatial-resolution (VHSR) remotely sensed imagery (such as WorldView-3) is commercially available, creating new opportunities for the accurate extraction of urban forests at a very detailed level [11–13]. The application of VHSR images in urban forest resource monitoring has attracted increasing attention because of the rich, fine-grained detail in these images. However, the ground objects in VHSR images are highly complex and easily confused. On the one hand, many land use types (such as Agricultural Land and Grassland) share similar spectral and textural characteristics [14], producing strong homogeneity across different categories [15], that is, the phenomenon of "same spectrum with different objects." On the other hand, rich detailed information gives similar objects (such as buildings composed of different construction materials) strong heterogeneity in their spectral and structural properties [16], resulting in the phenomenon of "same object with different spectra." Traditional statistical classification methods struggle with both problems when extracting urban forests from VHSR remote sensing images. Additionally, urban forests have fragmented distributions, comprising scattered trees, street trees, and urban park forest vegetation, which poses further challenges for urban land use classification and the accurate mapping of urban forests [17].

Object-based classification first aggregates adjacent pixels with similar spectral and texture properties into complementary, non-overlapping objects through image segmentation, converting the processing units from conventional pixels to image objects [18]. This classification method is based on homogeneous objects. In addition to the spectral information of images, it fully exploits spatial features such as geometric shape and texture detail. The essence of object-based classification is to overcome the limitations of traditional pixel-based classification and to reduce both the "same object with different spectra" phenomenon and the "salt-and-pepper" effect caused by fragmented distributions. Therefore, object-based classification methods often yield better results than traditional pixel-based methods [19]. Recently, the combination of object-based analysis and machine learning (ML) has been widely used to detect forest features, such as forest damage, landslides, and insect-infested stands [20–23]. Within ML, deep learning (DL) uses large amounts of data to train models that can simulate and learn high-level features [24], making DL a popular topic in current research on the intelligent extraction of VHSR remote sensing information [15,25,26].

In DL, deep convolutional neural networks (DCNNs) and semantic segmentation algorithms are widely used in the classification of VHSR images, providing algorithmic support for accurate classification and enabling great progress [27–38]. DCNNs are the core algorithms of deep learning [39]. These networks learn abstract features through multiple convolutional layers, conduct network training and learning, and finally classify and predict images. DenseNet is a classic convolutional neural network architecture [40]. It extracts abstract features while combining the feature maps of all previous layers, so it has been widely applied to the classification of remote sensing images [41–44]. However, DenseNet alone suffers from limited extraction of abstract features. Semantic segmentation places higher requirements on the architectural design of convolutional networks: it classifies each pixel in the image into a corresponding category, that is, it achieves pixel-level classification. A representative semantic segmentation network is U-net [45], which couples a downsampling path with an upsampling path. U-net can not only extract deeper features but also achieve accurate classification [46,47]. Therefore, U-net and DenseNet can be integrated to address DenseNet's limited extraction of abstract features, and this combination may enable more accurate extraction from VHSR images.
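DenseNet's defining property is that each layer receives the concatenated feature maps of all preceding layers. The following is a minimal NumPy sketch of that dense connectivity only (not the authors' network): a seeded random projection plus ReLU stands in for a learned convolution, and all names are illustrative.

```python
import numpy as np

def dense_block(x, num_layers=3, growth=4, seed=0):
    """Toy dense block: every layer sees the concatenation of all
    earlier feature maps (DenseNet's key idea). A random linear
    projection + ReLU stands in for a learned convolution."""
    rng = np.random.default_rng(seed)
    feats = [x]
    for _ in range(num_layers):
        concat = np.concatenate(feats, axis=-1)          # reuse all features
        w = rng.standard_normal((concat.shape[-1], growth))
        feats.append(np.maximum(concat @ w, 0.0))        # "conv" + ReLU
    return np.concatenate(feats, axis=-1)

# An 8 x 8 "image" with 4 bands (R, G, B, NIR):
x = np.random.default_rng(1).random((8, 8, 4))
y = dense_block(x)
print(y.shape)  # channels grow to 4 + 3 * 4 = 16
```

The channel count grows linearly with depth (input channels plus `num_layers × growth`), which is exactly the feature-reuse behavior the paragraph above attributes to DenseNet.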

In summary, object-based multiresolution segmentation offers obvious advantages in dealing with the problems of "same object with different spectra" and the "salt-and-pepper" effect caused by fragmented distributions [48–53], and deep learning is an important method for the intelligent mapping of VHSR remote sensing images. Consequently, this research proposes a novel classification method, the object-based U-net-DenseNet-coupled network (OUDN), to realize the intelligent and accurate extraction of urban land use and urban forest resources. This study takes a subregion of the Yuhang District of Hangzhou City as the study area, with WorldView-3 images as the data source. First, the DenseNet and U-net architectures are integrated; then, the network is trained on the labeled data sets, and land use classification results are obtained from the trained model. Finally, the object boundaries derived by object-based multiresolution segmentation are combined with the deep learning classification results, which are optimized with the majority voting method.
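The final optimization step described above (majority voting of per-pixel predictions inside each segmented object) can be sketched as follows. This is a minimal NumPy illustration under the assumption that the segmentation output is an integer object-id map aligned with the per-pixel class map; the function name is illustrative.

```python
import numpy as np

def majority_vote(objects, pixel_classes):
    """Assign every segmented object the majority class of the
    per-pixel predictions it contains.
    objects: 2-D array of object ids from multiresolution segmentation
    pixel_classes: 2-D array of per-pixel class labels (same shape)"""
    out = np.empty_like(pixel_classes)
    for obj_id in np.unique(objects):
        mask = objects == obj_id
        labels, counts = np.unique(pixel_classes[mask], return_counts=True)
        out[mask] = labels[np.argmax(counts)]   # most frequent class wins
    return out

# Two objects; object 1 contains one stray pixel of class 5
# that the vote overwrites with the dominant class 3.
objects = np.array([[0, 0, 1],
                    [0, 1, 1],
                    [0, 1, 1]])
pixels = np.array([[2, 3, 3],
                   [2, 3, 5],
                   [2, 3, 3]])
print(majority_vote(objects, pixels))
```

Because every pixel of an object receives one label, isolated misclassified pixels inside a segment are removed, which is how the object boundaries suppress the "salt-and-pepper" effect.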

#### **2. Materials and Methods**

#### *2.1. Study Area*

In this research, a subregion of the Yuhang District (YH) of Hangzhou, in Zhejiang Province in Southeast China, was chosen as the study area (Figure 1). WorldView-3 images of the study area were captured on 28 October 2018. The images contain four multispectral bands (red, green, blue, and near infrared (NIR)) with a spatial resolution of 2 m and a panchromatic band with a spatial resolution of 0.5 m. According to the USGS land cover classification system [54] and the FROM-GLC10 [55,56], the land use categories were divided into six classes: Forest, Built-up, Agricultural Land, Grassland, Barren Land, and Water. As shown in Figure 1b, due to shadows in the VHSR images, this study added an Others class, comprising the shadows of trees and buildings. The detailed descriptions of each land use class and its corresponding subclasses are listed in Table 1.

**Figure 1.** Location of the study area: (**a**) Zhejiang Province, and the blue polygons represent Hangzhou, (**b**) the subregion of the Yuhang District (YH) of Hangzhou.


**Table 1.** The urban land use classes and the corresponding subclass components.

#### *2.2. Data Processing*

Image preprocessing, including radiometric correction and atmospheric correction, was first performed using ENVI 5.3. Then, label maps of the actual land use categories were produced in eCognition software based on the results of the field survey combined with visual interpretation. Because of GPU memory limits on the size of processed images, as well as to obtain more training images and better extract image features, this study used the overlapping cropping method (Figure 2) to segment the images in the sample set into 4761 sub-image blocks using 128 × 128 pixel windows for the minibatch training of the DL algorithms.
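The overlapping cropping step can be sketched as a sliding window. A minimal NumPy version is given below, assuming (as Figure 2 suggests) a stride of n = 64, i.e. half the 128-pixel window, so adjacent blocks overlap by 50%; the function name and toy image size are illustrative.

```python
import numpy as np

def overlapping_crops(image, window=128, stride=64):
    """Slide a window x window box over the image with the given
    stride (half the window, i.e. 50% overlap) and collect the
    sub-image blocks used for minibatch training."""
    h, w = image.shape[:2]
    blocks = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            blocks.append(image[top:top + window, left:left + window])
    return np.stack(blocks)

img = np.zeros((256, 256, 4))            # 4-band toy image
print(overlapping_crops(img).shape)      # (9, 128, 128, 4)
```

A 256 × 256 image yields 3 × 3 = 9 overlapping blocks, versus only 4 with non-overlapping tiling, which is how the method enlarges the training set.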

**Figure 2.** The overlapping cropping method for training the deep learning (DL) network. The size of the cropping windows is set to 128 × 128 pixels, where n is defined as half of 128.

#### *2.3. Feature Setting*

In this study, the classification features are divided into three groups: (1) the original R, G, B, and NIR bands, namely, the spectral features (Spe); (2) the spectral features combined with the vegetation index features (Spe-Index); and (3) the spectral features combined with the texture features (Spe-Texture). Based on these three groups of features, the performance of the OUDN algorithm in the mapping of urban land use and urban forest information is evaluated. Descriptions of the spectral features, vegetation indices, and texture features are given in Table 2. The texture features, based on the gray-level co-occurrence matrix (GLCM) [57], include the mean, variance, entropy, angular second moment, homogeneity, contrast, dissimilarity, and correlation [58–60], computed with different calculation windows (3 × 3, 5 × 5, 7 × 7, 9 × 9, 11 × 11, and 13 × 13) [61].
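As a minimal sketch of how such features are derived, the snippet below computes a normalized difference vegetation index (NDVI, a standard index likely among those in Table 2) and a GLCM with a few of the listed Haralick statistics for one calculation window. The quantization level (8 gray levels) and horizontal neighbor offset are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index from the NIR and red bands."""
    return (nir - red) / (nir + red + 1e-9)

def glcm(window, levels=8, dx=1, dy=0):
    """Gray-level co-occurrence matrix for one calculation window,
    counting horizontal neighbor pairs, normalized to probabilities.
    Assumes reflectance-like values in [0, 1)."""
    q = (window * levels).astype(int).clip(0, levels - 1)  # quantize
    m = np.zeros((levels, levels))
    h, w = q.shape
    for i in range(h - dy):
        for j in range(w - dx):
            m[q[i, j], q[i + dy, j + dx]] += 1
    return m / m.sum()

def glcm_features(p):
    """A few of the GLCM statistics listed in the text."""
    i, j = np.indices(p.shape)
    return {
        "ASM": np.sum(p ** 2),                      # angular second moment
        "contrast": np.sum(p * (i - j) ** 2),
        "homogeneity": np.sum(p / (1.0 + (i - j) ** 2)),
        "entropy": -np.sum(p[p > 0] * np.log(p[p > 0])),
    }

win = np.random.default_rng(1).random((7, 7))       # one 7 x 7 window
print(glcm_features(glcm(win)))
```

In practice these statistics are computed per pixel over each of the window sizes listed above (3 × 3 through 13 × 13) and stacked with the spectral bands to form the Spe-Texture feature group.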

**Table 2.** All features involved in this study, including the original bands of the WorldView-3 data, vegetation indices, and texture features based on the gray-level co-occurrence matrix (GLCM).

