1. Introduction
Land use classification information offers a significant indication of human activities in an urban environment [1]. This information can provide the basic datasets for change detection [2], landscape pattern analysis [3], and urban heat island studies [4]. With the rapid development of remote sensing technology, remote sensing has become a major means of obtaining land use information, and vast numbers of high-resolution remote sensing images are used to extract spatial information regarding urban land use. However, improvements in spatial resolution increase the internal variability of homogeneous land cover units and decrease the statistical separability of land cover classes in the spectral space, so higher resolution does not necessarily yield better classification [5]. The internal variability of high-resolution images makes land use classification more challenging [6,7].
Among land use classification methods, the pixel-based method applies statistical knowledge to extract features based on the spectral characteristics of ground objects; this method is mainly used for low-resolution remote sensing images (10–30 m) [8,9]. In contrast, high-resolution remote sensing images make land cover image characteristics more complex and diverse. Pixel-based methods cannot take the context and texture around the pixels into consideration, which generates considerable salt-and-pepper noise, and these methods cannot obtain object-level information [10]. Because of the limitations of pixel-based classification methods, object-based methods were proposed [11] and achieved better performance in high-resolution image classification and object identification [12]. The object-based method first groups spatially and spectrally similar pixels at different scale levels into segmented objects that can effectively identify urban land cover classes [13]. Then, the texture and geometric features of the segmented objects are calculated as a rule set to achieve the final image classification. The object-based method effectively avoids the classification errors caused by spectral differences in pixel-based classification and eliminates the influence of salt-and-pepper noise. Therefore, object-based modules have also been adopted by commercial software systems, such as eCognition [14] and ENVI (The Environment for Visualizing Images) [15]. However, this method overlooks semantic functions and spatial configurations [16], depends heavily on the selected land-cover classification system and the accuracy of the land-cover classification [17], and the specific object features used in the rule set are not applicable to diverse ground objects. Thus, advanced methods to automatically extract features from high-resolution remote sensing images are still urgently in demand.
Recently, deep learning has provided an effective way to automatically learn and identify features from large amounts of data, which is the main difference from traditional machine learning methods. With the development of deep learning theory, many deep learning architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [18], have achieved state-of-the-art performance in computer vision [19], object detection [20], and natural language processing.
Jonathan Long et al. [21] used standard convolutional layers to replace the fully connected layers in a CNN and proposed the fully convolutional network (FCN), which promoted the large-scale development of image semantic segmentation. The FCN architecture maintains the two-dimensional structure of the feature maps during upsampling, which is quite different from the standard CNN. Recently, as extensions of the FCN, deep convolutional neural networks (DCNNs) have been rapidly adopted for image classification [22,23,24,25].
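For intuition, the core FCN idea of replacing a fully connected classifier with a 1 × 1 convolution can be sketched as follows. This is a simplified NumPy illustration, not code from the cited papers; the feature map size, the weight matrix, and the six-class setting are arbitrary assumptions for the example.

```python
import numpy as np

def conv1x1(feat, weights):
    """1x1 convolution: a per-pixel linear classifier over channels.
    Replacing a CNN's fully connected classifier with this operation
    keeps the 2-D spatial layout of the feature map, which is the key
    idea behind the FCN."""
    # feat: (C, H, W), weights: (num_classes, C) -> (num_classes, H, W)
    return np.einsum("oc,chw->ohw", weights, feat)

feat = np.random.rand(256, 16, 16)   # hypothetical encoder output
w = np.random.rand(6, 256)           # 6 classes (illustrative only)
scores = conv1x1(feat, w)            # per-pixel class scores, (6, 16, 16)
```

Because the output retains its height and width, it can be upsampled back to the input resolution to produce a dense segmentation map.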
In general, FCNs are composed of an encoder part and a decoder part. The encoder part comprises stacks of “convolutional–pooling” layers, which serve as multi-scale feature extractors aimed at extracting abstract high-level features. The decoder part contains the upsampling and skip layers. Upsampling is used to recover the input feature map resolution by deconvolution or interpolation and to generate segmentation maps with the same resolution as the original image. However, the deconvolution or interpolation operation only generates coarse segmentation maps, which lack an accurate localization of object boundaries and high-frequency details. Therefore, the skip layer is used to fuse the earlier input features with the output features from the upsampling. Nevertheless, there is still a loss of detailed information from the reduced feature resolution. To overcome this, tremendous efforts have been made. Saining Xie et al. [26] developed a new algorithm called holistically nested edge detection (HED) to resolve the challenging ambiguity problem in edge and object boundary detection via multi-scale feature learning. Marmanis et al. [27] combined semantic segmentation with semantically informed edge detection to counter the loss of effective spatial resolution that washes out high-frequency details and blurs object boundaries; this method made the boundaries more explicit and improved the overall accuracy. Gong Cheng et al. [28] proposed discriminative CNNs and trained them by optimizing a new discriminative objective function that imposes a metric learning regularization term on the CNN features to address the problem of within-class diversity and between-class similarity. Wei Liu et al. [29] presented a simple technique that introduced global context into the FCN to enlarge the receptive field and augment the features at each location, achieving state-of-the-art performance for semantic segmentation. Furthermore, Chen et al. [30,31,32] observed that consecutive convolution and pooling operations reduce the feature resolution and cause a loss of detailed spatial information during downsampling. Therefore, they used atrous convolutional layers to replace some standard convolution and pooling layers. Without introducing extra parameters or changing the feature resolution, the atrous convolutional layer can effectively enlarge the receptive field, decrease the loss of spatial information, and improve the classification accuracy. However, consecutive atrous convolutional layers with the same rate still skip portions of the input and miss spatial information, a problem known as the “gridding effect”. Panqu Wang et al. [33] proposed a hybrid dilated convolution (HDC) framework that superposes receptive fields at various atrous rates to counter the gridding effect. However, the HDC framework substantially increases the computing costs at the same time. Guangsheng Chen et al. [34] used the DeepLabv3 architecture and added an augmented atrous spatial pyramid pooling layer and a fully connected (FC) fusion path to tackle the poor classification of small objects and unclear boundaries.
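To illustrate the atrous (dilated) convolution discussed above, the following NumPy sketch shows how dilating a 3 × 3 kernel enlarges the receptive field from k to k + (k − 1)(rate − 1) without adding parameters. This is a simplified single-channel toy with “valid” padding, not the implementation used in the cited works.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """Single-channel 2-D atrous convolution with 'valid' padding.
    The kernel taps are spaced `rate` pixels apart, so a 3x3 kernel
    with rate=2 covers a 5x5 receptive field with the same 9 weights."""
    k = kernel.shape[0]
    eff = k + (k - 1) * (rate - 1)   # effective receptive field size
    h, w = x.shape
    out = np.zeros((h - eff + 1, w - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eff:rate, j:j + eff:rate]  # dilated sampling
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
y1 = atrous_conv2d(x, k, rate=1)   # standard 3x3 conv -> 4x4 output
y2 = atrous_conv2d(x, k, rate=2)   # same 9 weights, 5x5 field -> 2x2 output
```

The dilated sampling step also makes the gridding effect visible: with a single rate, each output only ever sees every `rate`-th pixel, which is why HDC mixes several rates.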
In addition, to handle coordinate transform problems, which are difficult for standard CNNs, Liu et al. [35] proposed the coordconv module. Because the input features lack coordinate information, consecutive convolution operations discard much of the spatial information at multiple levels; for example, boundary information in large-scale feature maps is weakened, and tiny-scale objects may vanish. The coordconv module works by giving the convolution access to its own input coordinates through extra coordinate channels. This module enables the network to effectively learn the spatial information of different ground objects and eliminates the loss of spatial information, especially boundary information. The coordconv module achieved state-of-the-art performance in image classification, object detection, generative modeling, and reinforcement learning.
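The core of the coordconv module, concatenating coordinate channels to the feature map before convolution, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the normalization of coordinates to [−1, 1] is an assumption following common practice.

```python
import numpy as np

def add_coord_channels(feat):
    """Append normalized y/x coordinate channels (coordconv-style).
    `feat` has shape (C, H, W); the result has shape (C + 2, H, W).
    A subsequent convolution can then read each pixel's location
    directly from the two extra channels."""
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h),
                         np.linspace(-1, 1, w), indexing="ij")
    return np.concatenate([feat, ys[None], xs[None]], axis=0)

feat = np.random.rand(64, 32, 32)   # hypothetical feature map
out = add_coord_channels(feat)      # shape (66, 32, 32)
```

The operation is nearly free: it adds only two input channels to the following convolution, yet lets the network condition its filters on position.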
To further reduce the loss of high-frequency details and object boundaries in high-resolution remote sensing images, and inspired by coordinate convolution, we extended the coordconv module into the FC-DenseNet and designed a novel encoder–decoder architecture called the dense-coordconv network (DCCN) to solve complex urban land use classification tasks using high-resolution remote sensing images. The study in [36] demonstrated that DenseNet could alleviate the vanishing gradient problem, encourage feature reuse, and effectively reduce the number of parameters; thus, DenseNet is considered the best framework for semantic segmentation [37]. The FC-DenseNet, an extension of DenseNet that works as an FCN and shares the advantages mentioned above, has achieved excellent performance in image semantic segmentation. Therefore, the FC-DenseNet is used as the overall structure of our network. In addition, to analyze the effect of the coordconv module on the network, this paper discusses the performance of the coordconv module when placed at different feature levels.
The remainder of this paper is arranged as follows. Section 2 introduces the network architecture and the coordconv module. Section 3 presents the experimental results, and Section 4 provides the discussion. Section 5 concludes the paper.
5. Conclusions
In this work, a novel deep convolutional neural network, the DCCN, was proposed for semantic segmentation of high-resolution remote sensing images. The major contributions of our method are the proposed encoder–decoder architecture and the introduction of dense blocks as feature extractors. We also added the coordinate convolution module to the network, which clearly improved the overall accuracy (OA) and F1 score (89.48% and 86.89% on Potsdam; 85.31% and 81.36% on Vaihingen). The proposed DCCN aims to make full use of multilevel features and to eliminate the loss of spatial information. Experiments were carried out on the ISPRS dataset. Six land cover classes were extracted successfully with the proposed DCCN, and the results demonstrated the effectiveness and feasibility of the DCCN in improving the performance of land use classification. The proposed DCCN was compared with other typical networks for semantic segmentation, such as the U-net, SegNet, and DeepLab-V3 models. The experimental results showed that the proposed model performed better than the other networks: the OA and mean F1 score of the proposed model are 1.36% and 1.16% higher, respectively, than those of SegNet on Potsdam, and far exceed those of DeepLab-V3 and U-net. However, the performance was still affected by complex land covers. Our proposed model has the potential to perform even better if widely used techniques, such as attention mechanisms or data preprocessing and postprocessing, are incorporated.