DMAU-Net: An Attention-Based Multiscale Max-Pooling Dense Network for the Semantic Segmentation in VHR Remote-Sensing Images

Yang, Yang; Dong, Junwu; Wang, Yanhui; Yu, Bibo; Yang, Zhigang

doi:10.3390/rs15051328


by Yang Yang ^1,2,3, Junwu Dong ^1,2,3, Yanhui Wang ^1,2,3, Bibo Yu ^1,2,3 and Zhigang Yang ^4,*

¹ College of Resource Environment and Tourism, Capital Normal University, Beijing 100048, China

² 3D Information Collection and Application Key Lab of Education Ministry, Capital Normal University, Beijing 100048, China

³ Beijing State Key Laboratory Incubation Base of Urban Environmental Processes and Digital Simulation, Capital Normal University, Beijing 100048, China

⁴ Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou 510670, China

^* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(5), 1328; https://doi.org/10.3390/rs15051328

Submission received: 10 February 2023 / Revised: 21 February 2023 / Accepted: 24 February 2023 / Published: 27 February 2023

(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Classification II)


Abstract

High-resolution remote-sensing images contain richer feature information, including texture, structure, shape, and other geometric details, while the relationships among target features are more complex. These factors make it difficult for classical convolutional neural networks to obtain ideal results when classifying features in remote-sensing images. To address this issue, we propose an attention-based multiscale max-pooling dense network (DMAU-Net), built on U-Net, for ground object classification. The network integrates a multiscale max-pooling module with dense connections in the encoder to enhance the quality of the feature maps and thus improve the feature-extraction capability of the network. Likewise, in the decoder, we introduce the Efficient Channel Attention (ECA) module, which strengthens effective features and suppresses irrelevant information. To validate the ground object classification performance of the proposed network, we conducted experiments on the Vaihingen and Potsdam datasets provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) and compared DMAU-Net with other mainstream semantic segmentation models. The experimental results show that DMAU-Net effectively improves the accuracy of feature classification in high-resolution remote-sensing images. The feature boundaries obtained by DMAU-Net are clear and the regions are complete, which indicates an improved ability to delineate the edges of features.

Keywords: high-resolution remote-sensing images; ground object classification; dense connections; multiscale maximum pooling; semantic segmentation

1. Introduction

The feature classification of remote-sensing images is an essential part of the remote-sensing interpretation process. It aims to assign a category to each pixel in an image, enabling better understanding and annotation of the image [1,2]. How to classify features in remote-sensing images quickly and efficiently, and thereby enhance the capability to process and apply remote-sensing data, has therefore been a leading research topic [3].

With the development of photogrammetry and sensor technology over the past few years, the means of acquiring very-high-resolution (VHR) remote-sensing images have become increasingly diverse [3,4]. VHR remote-sensing images capture more detailed information, with richer spectral and texture content [5,6], and thus effectively improve feature recognition and scene understanding. They play a vital role in various fields, such as land and resource mapping [7,8], precision agriculture [9,10], urban planning [11,12], and environmental protection [13,14].

Traditional feature classification in remote-sensing images is primarily based on pixel-based and object-oriented methods. The pixel-based methods include unsupervised and supervised classification. Common unsupervised classification methods include K-Means [15], ISODATA [16], and fuzzy C-means [17] classification; these are essentially forms of cluster analysis, which makes it difficult to obtain satisfactory classification results. Classic supervised classification methods include machine-learning-based methods such as support vector machines (SVM) [18], artificial neural networks (ANN) [19], and random forests (RF) [20], which rely heavily on samples. Additionally, pixel-based classification methods depend on spectral features and do not take sufficient account of the structural features behind inter-class similarity and intra-class variability [21]. The object-oriented approach applies image segmentation prior to feature classification. It integrates spectral, shape, size, texture, spatial-relationship, and contextual information, and thus takes fuller advantage of the image detail [22]. However, it is challenging to select a suitable segmentation scale for the object-oriented approach. Recently, with the development of sensors and computer vision, very-high-resolution images and deep-learning techniques have opened up new horizons for accurately classifying features.

The goal of feature classification is to assign each pixel in a remote-sensing image to a predetermined category. In computer vision research, this pixel-level categorization and prediction is known as semantic segmentation [23]. Compared with traditional methods that use unstable features, such as texture, spectrum, geometry, and shadow, deep-feature extraction based on deep learning is more suitable for semantic segmentation [24]. The development of deep-learning technology, represented by convolutional neural networks (CNNs), has dramatically improved the accuracy of semantic segmentation. CNNs use a fully connected layer for classification, and the parameters of the network are learned and updated via backpropagation. They also have a strong nonlinear fitting capability. However, due to the fixed geometric structure and locality of the convolution kernel, they are inherently limited by local receptive fields and short-range context information. The features extracted by CNNs may be local rather than global [25,26]. Some semantic information is therefore likely to be lost, making it difficult for the network to classify fine-resolution images accurately. To address this, fully convolutional networks (FCNs) replace the fully connected layer in CNNs with a deconvolutional or up-sampling layer that can perform pixel-level prediction on inputs of arbitrary size [27]. FCN is an essential breakthrough in semantic segmentation: its end-to-end structure converts the feature map into a classification map of the same size as the input image. Many semantic segmentation models based on improvements to FCN have since emerged, such as SegNet [28], RefineNet [29], and U-Net [30].

U-Net, one of the most classic FCN-based networks, connects low-level detail information to high-level semantic information via skip connections, making more efficient use of the original feature maps in the encoder network [30]. The DeepLab series of models, up to DeepLabv3+ [31], was then proposed, with the key idea of atrous convolution, adopting spatial pyramid pooling (SPP) and incorporating multiscale features. These semantic segmentation models improve the degree of automation of classification. However, unlike natural images, VHR remote-sensing images have distinctive characteristics. (1) Remote-sensing images are usually obtained from a top-down perspective [32], which can lose some of the feature information visible from other viewpoints. (2) Extensive spatial details in VHR images result in intra-class variability and inter-class similarities that make the segmentation of these data more challenging [33,34]. (3) Remote-sensing images often show objects of varying sizes. For example, houses occupy a large area of pixels whereas cars occupy a small area of pixels, which poses a challenge for classifying smaller features [35]. Landscapes often contain a more comprehensive range of scenes, and the contextual information of the scenes is more complex [24,36]. The features in remote-sensing images have irregular scales and characteristics, and their boundaries are difficult to segment. For these reasons, it is usually challenging to achieve satisfactory results when directly applying a classical semantic segmentation network to remote-sensing image feature classification.

The development of deep-learning technology has led some researchers to innovate and enhance the classical semantic segmentation network and apply it to classify multiple types of image features in remote-sensing images, obtaining favorable classification results [37,38,39]. Many researchers have borrowed ideas from U-Net for model improvement. ResUNet-a [38], proposed by Diakogiannis et al., replaces standard convolutions with ResNet units containing multiple parallel atrous convolutions. Also, pyramid scene parsing pooling is included in the middle and end of the network, establishing a conditional relationship between the individual tasks. U-Net 3+ introduces a full-scale skip connection [40], ensuring that each decoder layer incorporates smaller and same-scale feature maps from the encoder while introducing larger-scale feature maps from the decoder, which capture both fine-grained and coarse-grained semantics at full scale. U-Net 3+ enhances the ability to segment the edges of ground objects accurately. In addition to this, many scholars have introduced modules such as atrous spatial pyramid pooling (ASPP) [41], dense connections [42], and attention mechanisms [43] to improve the performance of U-shaped networks.

Recently, innovations in attention mechanisms have greatly advanced deep learning. Inspired by human attention, attention in deep learning can be seen as a weighting factor learned autonomously by the network, which emphasizes regions of interest in a “dynamically weighted” way and suppresses irrelevant background [44,45]. Currently, there are three main types of attention mechanisms: spatial attention, channel attention, and self-attention [44]. As a representative of channel attention, squeeze-and-excitation (SE) attention [43] is an efficient way to model interdependencies between channels, yet it ignores the position dimension, which matters for semantic segmentation. The Convolutional Block Attention Module (CBAM) extends the idea of SE-Net by introducing large-kernel convolution for spatial information encoding and by adopting global average pooling and global maximum pooling operations [46,47]. Although it combines the spatial and channel domains, it cannot capture long-range dependencies because the two dimensions are not correlated. In addition, the Coordinate Attention (CA) mechanism can effectively capture the relationship between location information and channel-wise relationships, obtaining important background information and long-range dependencies of geographic entities, which improves the final segmentation results [48]. Unlike SE attention, which first compresses the channels, Efficient Channel Attention (ECA) strengthens inter-channel dependencies and efficiently implements local cross-channel interaction using 1-D convolution [49]. Thus, the network performance can be effectively improved by integrating an attention module into the semantic segmentation network for remote-sensing images.

According to the above research, we propose an improved network named DMAU-Net, which incorporates the idea of multiscale maximum pooling and dense connectivity based on the attention mechanism for remote-sensing image feature classification. The official dataset provided by ISPRS is taken as the experimental data for feature classification experiments and compared with the mainstream methods to evaluate the performance of this network.

The main contributions of our work are as follows:

(1) Inspired by the idea of dense connections in U-Net++ and U-Net 3+, we propose the dense connections and the multiscale max-pooling module in the encoder part of the U-Net network. This module enables each decoding layer to receive feature maps from the same-scale coding layer and smaller-scale decoding layer at the same time, enabling the low-level semantic information of the sub-layer to be fused with the high-level semantic information, which improves the feature interaction capability of the encoder.
(2) We incorporated the ECA attention-mechanism module in the decoding stage. Inputting multistage features from the encoding stage into the ECA attention mechanism effectively utilizes the detailed information of low-stage features, enhances the learning of multistage features, and improves the classification ability of features and the classification of small objects at the same time.
(3) We tested the performance of DMAU-Net on two well-known VHR remote-sensing datasets, the ISPRS Vaihingen and Potsdam datasets. Our proposed model outperforms the mainstream models in classification, with mIoU reaching 87.85% and 85.68% on the two datasets, respectively.

2. Methods

2.1. Multiscale Max-Pooling Module Based on Dense Connections

This paper uses the translation invariance of maximum pooling to construct the multiscale max-pooling module. Assuming that N channels exist, maximum pooling yields an N-dimensional vector by selecting the maximum pixel value of each channel as its representative. It plays a vital role in neural networks, not only retaining important features but also reducing computation and mitigating overfitting, thus improving the model’s generalization ability [50,51]. In addition, maximum pooling provides translation invariance when mapping features [51], preserving important features and improving the quality of the feature map.

This section adopts a multiscale max-pooling module, in which the convolved feature map is bridged by several parallel pooling branches, each with a different pooling kernel size. The output of each branch is then stacked with feature maps of the same size, enabling subsequent layers to retain more information from the preceding layers. Figure 1 shows the multiscale max-pooling module, with pooling kernel sizes of 16, 8, 4, and 2 in descending order.

Taking advantage of the translation invariance of maximum pooling, this module enhanced the quality of the feature map. Feature maps of different sizes were obtained and fused with encoded feature maps at different encoding stages by maximum pooling modules with different pooling kernels.
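As a minimal illustration of the translation invariance referred to above, the following pure-Python sketch applies non-overlapping max pooling to a small single-channel map. The function name `max_pool2d` and the toy 4 × 4 inputs are assumptions made for this example and are not taken from the authors' implementation.

```python
def max_pool2d(fmap, k):
    """Non-overlapping k x k max pooling on a 2-D list (stride = k)."""
    h, w = len(fmap), len(fmap[0])
    assert h % k == 0 and w % k == 0, "size must be divisible by k"
    return [
        [
            max(fmap[r + dr][c + dc] for dr in range(k) for dc in range(k))
            for c in range(0, w, k)
        ]
        for r in range(0, h, k)
    ]


# Toy feature map with a single strong response at (0, 0).
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

# Same response shifted by one pixel to (1, 1), still inside the first window.
b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

pooled_a = max_pool2d(a, 2)
pooled_b = max_pool2d(b, 2)
print(pooled_a)  # [[9, 0], [0, 0]]
print(pooled_b)  # [[9, 0], [0, 0]]
```

The two pooled maps are identical because the shift stays inside one pooling window; a shift that crosses a window boundary would change the output, so the invariance is local rather than exact.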

To fuse more multiscale features and improve the extraction of target edge features, the DenseNet strategy was applied here by designing a dense cross-layer connection based on U-Net [52]. Convolutional blocks generally pass features from one layer to the next in a simple sequence. In contrast, densely connected networks connect all the layers in the encoder, allowing maximum information interaction between all layers. Figure 2 illustrates how the dense block is linked in DenseNet.

Figure 2 shows the feature maps of the different layers from left to right. The module includes five layers of feature maps in total. Without dense connections, the module has four connections. With dense connections, each subsequent layer interacts directly with all previous layers, and the number of connections is 10: from left to right, the layers lead to 4, 3, 2, and 1 connections.

Dense connections are calculated as follows:

x_{\ell} = H_{\ell}([x_{0}, x_{1}, \dots, x_{\ell - 1}])    (1)

In Equation (1), $[x_{0}, x_{1}, \dots, x_{\ell - 1}]$ is the result of stacking the feature maps from layer 0 to layer $\ell - 1$. Assuming that there are $\ell$ layers of connections between the convolutional layers, the number of connections is $\frac{\ell (\ell + 1)}{2}$. With dense skip connections, the feature maps of every previous layer serve as input to each subsequent layer. This significantly reduces the number of parameters in the network and the learning of redundant information. Unlike ResNet [53], which fuses features by element-wise addition, the dense network fuses them by concatenation. Element-wise addition sums two feature maps of the same dimension at the pixel level; if the feature maps have different dimensions, they must first be converted to the same dimension by a linear transformation. Concatenation, in comparison, stitches the feature vectors together; it is a more general form of feature fusion that improves feature reusability and builds a more efficient feature-extraction network.
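The connection count and the difference between the two fusion styles can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names and the channel widths are hypothetical, and it tracks channel counts rather than real feature maps.

```python
def dense_connection_count(num_layers):
    """Number of direct connections ell * (ell + 1) / 2 for ell layers."""
    return num_layers * (num_layers + 1) // 2


def concat_channels(channel_counts):
    """Concatenation stacks feature maps, so channel counts add up."""
    return sum(channel_counts)


def elementwise_sum_channels(channel_counts):
    """Element-wise summation needs identical channel counts and keeps them."""
    if len(set(channel_counts)) != 1:
        raise ValueError("element-wise sum requires equal channel counts")
    return channel_counts[0]


# Four layers after the input x_0 give five feature maps, as in Figure 2.
print(dense_connection_count(4))  # 10 = 4 + 3 + 2 + 1

# Hypothetical channel widths for x_0, x_1, x_2 feeding the next layer.
print(concat_channels([64, 32, 32]))       # 128
print(elementwise_sum_channels([64, 64]))  # 64
```

Concatenation accepts inputs with different channel counts, whereas the element-wise variant raises an error until the inputs are projected to a common width.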

2.2. An Efficient Channel-Attention Mechanism

SE-Net is a representative channel-attention mechanism. In recent years, more and more researchers have shown that channel attention works well in convolutional blocks [54,55]. However, SE-Net uses two fully connected layers when performing channel attention, which computes the interaction between all channels [49]. The Efficient Channel Attention Network (ECA-Net) instead applies a local cross-channel interaction strategy: channel-wise global average pooling is performed without dimensionality reduction, after which a convolution kernel of adaptive size is applied. Figure 3 illustrates the ECA mechanism in the decoder.

First, the feature map obtained by the multiscale max-pooling module is concatenated with the features of the up-sampling stage and input into the ECA module. Global average pooling (GAP) then produces one descriptor per channel of the input feature map. A one-dimensional convolution followed by a sigmoid function turns these descriptors into channel weights, and the input features are then multiplied channel-wise by these weights. Let the feature map input to the ECA module be $F \in \mathbb{R}^{C \times H \times W}$; GAP transforms it into a $1 \times 1 \times C$ one-dimensional vector:

y_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H \times W} \omega_{c}(i)    (2)

In Equation (2), $y_{c}$ is the descriptor of channel $c$, and the descriptors of all channels form the vector $y \in \mathbb{R}^{C \times 1}$; $\omega_{c}(i)$ represents the local feature at position $i$ on channel $c$; and $H$ and $W$ represent the height and width of the input feature map, respectively. When calculating the weight of channel descriptor $y_{i}$, ECA only considers the interaction between $y_{i}$ and its $k$ neighbors. The weight of channel $i$ can be expressed as:

\omega_{i} = \sigma\left(\sum_{j = 1}^{k} w_{i}^{j} y_{i}^{j}\right), \quad y_{i}^{j} \in \Omega_{i}^{k}    (3)

where $\Omega_{i}^{k}$ indicates the set of $k$ adjacent channels of $y_{i}$ and $\sigma$ is the sigmoid function.

Since the above equation only involves one unknown parameter $k$, it can be further simplified as:

\omega = \sigma(\mathrm{C1D}_{k}(y))    (4)

where $\mathrm{C1D}$ indicates the 1-D convolution; the resulting $\omega$ constitutes the efficient channel attention (ECA) module, which only involves $k$ parameters. The coverage of the interaction (i.e., the size $k$ of the 1-D convolution kernel) is assumed to be proportional to the channel dimension $C$; that is, a mapping $\phi$ is assumed to exist such that

C = \phi(k)    (5)

Because a one-dimensional linear function $\phi(k) = \gamma * k - b$ can describe only a limited range of relationships, and the channel dimension $C$ (number of filters) is usually set to a power of 2, the relationship between the channel dimension $C$ and the 1-D convolution kernel size $k$ is defined in the following form:

C = \phi(k) = 2^{(\gamma * k - b)}    (6)

That is, given the channel dimension $C$, the size $k$ of the 1-D convolution kernel can be determined by:

k = \psi(C) = \left| \frac{\log_{2}(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}    (7)

where $|t|_{odd}$ indicates the odd number nearest to $t$.
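Equation (7) can be evaluated directly. The sketch below assumes one particular rounding convention for the nearest odd number (truncate, then step up to the next odd value if the result is even), and the channel widths in the loop are hypothetical rather than taken from DMAU-Net.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Kernel size k = |log2(C)/gamma + b/gamma|_odd from Equation (7).

    Rounding convention (an assumption): truncate to an integer, then move
    up to the next odd number if the result is even.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


# Hypothetical channel widths, chosen only to exercise the formula.
for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))
```

With `gamma=1` and `b=1` the same function returns 5 for `C = 8`, which is the smallest power of two above 4.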

Because the parameter settings must keep the model complexity low while guaranteeing cross-channel interaction, the parameters in the experiments were chosen as follows. As shown in Figure 3, cross-channel interaction cannot be achieved when $k = 1$, whereas $k = 3$ achieves it with the fewest parameters. If $\gamma$ and $b$ are both set to 1, $k$ must be at least 5 when $C > 4$. Therefore, we followed the parameter-setting principle of Wang et al. [49] and set $\gamma$ and $b$ to 2 and 1, respectively, which limits the complexity of the model when the number of channels is larger. Wang et al. also demonstrated that with $k = 3$ the ECA module achieves accuracy similar to that of channel attention that captures cross-channel interactions without dimensionality reduction [49], while the latter requires far more parameters. In our experiment, the ECA module was used four times, introducing a total of 12 parameters, which achieves cross-channel interaction while greatly reducing the number of parameters compared with the SE module.
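The ECA steps described in this section (global average pooling, a 1-D convolution across neighboring channels, a sigmoid, and channel-wise rescaling) can be sketched without any deep-learning framework. This is a toy version on nested lists under stated assumptions: the function name `eca_forward`, the zero padding at the channel borders, and the fixed example kernel are all illustrative, and in a trained network the kernel weights would be learned.

```python
import math


def eca_forward(feature, kernel):
    """Apply ECA-style channel attention to a C x H x W nested list.

    feature: list of C channels, each an H x W list of floats.
    kernel:  list of k shared 1-D convolution weights (k odd).
    """
    c, k = len(feature), len(kernel)
    pad = k // 2

    # Step 1: global average pooling, one scalar per channel.
    y = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

    # Step 2: 1-D convolution across neighboring channels (zero padding),
    # followed by a sigmoid to obtain one weight per channel.
    weights = []
    for i in range(c):
        s = 0.0
        for j in range(k):
            idx = i + j - pad
            if 0 <= idx < c:
                s += kernel[j] * y[idx]
        weights.append(1.0 / (1.0 + math.exp(-s)))

    # Step 3: rescale every channel by its attention weight.
    out = [[[v * weights[i] for v in row] for row in ch]
           for i, ch in enumerate(feature)]
    return out, weights


# Toy input: 3 channels of size 2 x 2 with constant values 1, 2 and 3.
x = [[[float(v)] * 2 for _ in range(2)] for v in (1, 2, 3)]
out, w = eca_forward(x, kernel=[0.0, 1.0, 0.0])  # k = 3, center tap only
print([round(v, 4) for v in w])
```

With the center-tap kernel each weight is simply the sigmoid of that channel's mean, which makes the result easy to verify by hand; a kernel with nonzero outer taps mixes in the two neighboring channels.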

Figure 4 illustrates the multiscale max-pooling densely connected network with the ECA module (DMAU-Net), which we designed using U-Net as a framework. The skip connection between the encoder and decoder in the U-Net was strengthened to enable a more straightforward presentation of the improved encoder structure. The raw images were input into the network, and four down-sampling operations were performed, yielding five layers of feature maps. The max-pooling module was integrated after the two convolution operations in each layer, with 4, 3, 2, and 1 parallel max-pooling branches from the first to the fourth layer, respectively. To stitch feature maps of the same size, we set up max-pooling kernels of different sizes. For the first layer, the pooling kernel sizes were 16, 8, 4, and 2; for the second layer, 8, 4, and 2; and for the third layer, 4 and 2, so that each pooled map matches the size of a deeper layer. For the fourth layer, only a single max-pooling branch with kernel size 2 was bridged.

After that, dense skip connections were applied. To ensure that the feature maps were matched across all layers, each layer was directly connected to the preceding layers. Concatenation was then used to fuse the feature maps resulting from the second convolution of each layer with the max-pooled results of the other layers. Besides improving the quality of the feature map, this also enriched the global contextual information, integrating shallow features with deeper semantic information and learning more “collective knowledge.” In addition to the encoder improvements, the decoding section on the right side contains the ECA module. Four up-sampling processes were performed in the decoder, and the ECA module was incorporated in each layer.
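The pooling kernel sizes follow from simple size bookkeeping. The sketch below assumes a 512 × 512 input, a halving of the spatial size at each of the four down-sampling steps, and that every pooled map must match the size of some deeper encoder layer so that it can be concatenated there; under this size-matching assumption the third layer needs kernels 4 and 2. The function name `pooling_plan` is hypothetical.

```python
def pooling_plan(input_size=512, levels=5):
    """Map each encoder level to the pooling kernels that reach deeper levels.

    Level i has spatial size input_size / 2**i. A kernel of 2**(j - i)
    brings level i down to the size of level j, so the maps can be stacked.
    """
    sizes = [input_size // 2 ** i for i in range(levels)]
    plan = {}
    for i in range(levels - 1):
        kernels = [2 ** (j - i) for j in range(levels - 1, i, -1)]
        # Each pooled output must match the size of a deeper level.
        assert all(sizes[i] // k in sizes[i + 1:] for k in kernels)
        plan[i + 1] = kernels
    return sizes, plan


sizes, plan = pooling_plan()
print(sizes)  # [512, 256, 128, 64, 32]
print(plan)   # {1: [16, 8, 4, 2], 2: [8, 4, 2], 3: [4, 2], 4: [2]}
```

The number of branches per layer (4, 3, 2, 1) sums to 10, matching the dense connection count for five feature-map levels.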

2.3. Loss Function

In the training phase, selecting the loss function for model training is of great significance. The classic loss functions include the Focal Loss function [56], Dice Loss function [57], Cross Entropy Loss [58], and various combinations of loss functions. The cross-entropy (CE) loss function was selected as the loss function for model learning, which represents the difference between the true value and the predicted value.

The CE loss function can solve multi-classification problems and is one of the most used loss functions in the field of deep learning [58]. This function turns the output of the neural network into a probability distribution, so that the CE loss can calculate the distance between the predicted probability distribution and the probability distribution of actual output [59]. When performing loss calculations, the CE loss function uses an inter-class competition mechanism and is more concerned with the accuracy of correctly labeled prediction results. In the PyTorch framework, the cross-entropy is calculated as follows:

H (p, q) = - \sum_{x} p (x) \log q (x)

(8)

In Equation (8), $q$ is the probability distribution of the actual output, $p$ is the probability distribution of the desired output, and the cross-entropy measures the difference between the two. In this function, the input feature map is first passed through the Softmax function, the logarithm of the result is then taken, and finally the Negative Log-Likelihood (NLL) loss is applied; the result is the CE loss.
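The softmax, logarithm, and negative log-likelihood steps can be reproduced for a single pixel in a few lines. This is a minimal sketch rather than the framework's implementation; the function names and the example scores are assumptions. With a one-hot desired distribution $p$, the sum in Equation (8) reduces to the negative log-probability of the true class.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """CE for one pixel: negative log-probability of the true class."""
    return -log_softmax(logits)[target]


# Hypothetical scores for one pixel over six classes; true class index 2.
logits = [0.5, 1.0, 3.0, -1.0, 0.0, 0.2]
loss = cross_entropy(logits, target=2)
print(round(loss, 4))
```

For a full label map, the per-pixel losses would typically be averaged over all pixels in the batch.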

3. Experiments

3.1. Experimental Dataset

To test the performance of DMAU-Net, we adopted the Vaihingen and Potsdam high-resolution urban remote-sensing image datasets, officially provided by ISPRS [60], to study the classification of specific ground objects in urban scenes. The Vaihingen dataset contains 33 remote-sensing images with a resolution of 0.09 m, 16 of which are finely annotated. The average image size is 2494 × 2046 pixels with red (R), green (G), and near-infrared (NIR) bands, covering a total of approximately 1.38 square kilometers of surface information. The Potsdam dataset contains 38 remote-sensing images with a resolution of 0.05 m, 24 of which are precisely annotated. Each image is approximately 6000 × 6000 pixels with R, G, B, and NIR bands, and the dataset covers about 3.42 square kilometers of surface information in total. Based on the labels of the datasets, the ground cover was divided into six classes: impervious surfaces, car, building, bush (low vegetation), tree, and clutter (background and water). Sample images from both datasets and their corresponding labels are shown in Figure 5.

3.2. Evaluation Metrics

The Intersection-over-Union ratio (IoU) and the mean Intersection-over-Union ratio (mIoU) are used as performance indicators for evaluating classification accuracy. IoU is widely used as a standard metric in computer-vision tasks such as semantic segmentation and object detection. For a particular class, it is the ratio of the intersection to the union of the predicted and true regions, indicating the level of fit between the two. mIoU is the mean of the per-class IoU values and provides an objective evaluation of the overall classification. The equations for IoU and mIoU are shown below.

IoU = \frac{TP}{FN + FP + TP}    (9)

mIoU = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{TP_{i}}{FN_{i} + FP_{i} + TP_{i}}    (10)

In Equation (9), TP, FP, and FN are counted for one feature class, so that IoU relates the correctly predicted pixels of a class to all pixels that are either predicted as or labeled with that class. In Equation (10), the subscript $i$ indexes the class and the sum runs over the $k + 1$ classes. Taking buildings as an example, TP refers to the number of building pixels that were correctly extracted, FP refers to the number of pixels of other objects that were classified as buildings, and FN represents the number of building pixels that were classified as other objects.
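Both metrics can be computed from a confusion matrix. The sketch below uses flattened toy label maps with three classes; the function names and the data are illustrative assumptions, and a class that never occurs in either map yields NaN here rather than being skipped.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def iou_per_class(cm):
    """IoU = TP / (TP + FP + FN) for every class, as in Equation (9)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true c, predicted other
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true other
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious):
    """Unweighted mean over classes, as in Equation (10)."""
    return sum(ious) / len(ious)


# Toy example: 8 pixels, 3 classes (flattened label maps).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
ious = iou_per_class(confusion_matrix(y_true, y_pred, 3))
print(ious)
print(mean_iou(ious))
```

In this toy case class 0 has TP = 2, FP = 1, and FN = 1, giving an IoU of 0.5, while classes 1 and 2 each reach 2/3.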

3.3. Experimental Setting

To enlarge the training set, we performed data augmentation on both the Vaihingen and Potsdam datasets. The images were first clipped and randomly rotated, and noise was then added. Finally, the samples in both datasets were cropped to a size of 512 × 512 pixels. As a result, the Vaihingen dataset provided 806 images for training, 107 images for testing, and 95 images for validation, while the Potsdam dataset provided 4012 training images, 479 testing images, and 525 validation images.

The input image size is 512 × 512 pixels, and the batch size is set to 6. In the training phase, backpropagation is used to update the parameters to reduce the loss and bring the predicted values closer to the true values. The Stochastic Gradient Descent (SGD) optimizer is used to train the network; the momentum is set to 0.9, the weight decay is 0.0001, the number of epochs is set to 200, and the initial learning rate is 0.01. The learning rate is reduced to 0.001 after 100 epochs of training and to 0.0001 for the last 50 epochs. The experiments were run in Python on Windows using the PyTorch deep-learning framework. The hardware configuration is a computer with 32 GB of RAM, an Intel i9 11900K CPU, and an NVIDIA RTX3080Ti GPU with 12 GB of VRAM.
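The learning-rate schedule described above amounts to a three-step piecewise-constant function. The sketch below is one reading of it, assuming zero-based epoch indices and that the steps fall exactly at epochs 100 and 150; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, total_epochs=200):
    """Step schedule: 0.01, then 0.001 after 100 epochs, 0.0001 for the last 50.

    epoch is zero-based; where exactly the steps fall is an assumption.
    """
    if epoch < 100:
        return 0.01
    if epoch < total_epochs - 50:
        return 0.001
    return 0.0001


schedule = [learning_rate(e) for e in range(200)]
print(schedule[0], schedule[100], schedule[150])  # 0.01 0.001 0.0001
```

In a training loop, the returned value would be written into the optimizer's learning rate at the start of each epoch, or an equivalent built-in step scheduler could be used.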

The comparative methods include the classical semantic segmentation networks RefineNet [29] and DeepLabv3+ [61]. RefineNet has an encoder-decoder structure, and DeepLabv3+ applies depth-wise separable convolution to the ASPP and decoder modules. UNet++ [42] incorporates multistage feature maps by introducing convolutional layers with a Dense-like structure. ABCNet [25] does not use the encoder-decoder structure, but it has an attention enhancement module (AEM) and a feature aggregation module (FAM). MACU-Net [62] also uses the U-shaped structure and multiscale skip connections, while adding a channel attention module. These networks are compared with our proposed network in the experiments.

4. Experimental Results and Analysis

4.1. Quantitative Analysis of Experimental Results

To quantitatively evaluate the ground object classification results of the models proposed in this paper, the comparative results of the classification accuracy of each model on different datasets were obtained based on the IoU and mIoU ratios. The evaluation of the classification results for the models based on the Vaihingen and Potsdam datasets are shown in Table 1 and Table 2, respectively.

Tables 1 and 2 show the IoU for each feature category and the mIoU over all classes for the different classification models. The mIoU of the proposed DMAU-Net reached 87.85% and 85.68% on the two datasets, which is about 10 percentage points higher than the DeepLabv3+ network and almost 5 percentage points higher than the U-Net++ network. Compared with ABCNet and MACU-Net, which were proposed in the last two years, DMAU-Net still improves the semantic segmentation accuracy by 0.5–1.5 percentage points. The classification accuracy is higher than that of the other methods, and the classification results are the best in all classes.

The training process of DMAU-Net on the Vaihingen dataset is shown in Figure 6. The two curves represent the loss and the training accuracy, respectively, and the horizontal axis is the number of training epochs. The loss decreases rapidly in the initial epochs and then more gradually, falling from 0.915 at the beginning to 0.064 at the end. In contrast, the accuracy gradually increases until it flattens out, and the mIoU rises from 50.13% to 87.85% during the process.

4.2. Qualitative Analysis of Experimental Results

The prediction results of DMAU-Net proposed in this paper with network models from relevant literature for the Vaihingen and Potsdam datasets are shown in Figure 7.

As shown in Figure 7, the two columns of images on the left are the raw images and labels from the Vaihingen dataset, while Figure 7a–f shows the predicted results of each model, respectively. The two columns on the right are the results from the Potsdam dataset. It can be clearly seen that ABCNet, MACU-Net, and DMAU-Net all obtained better experimental results. It is evident that the ground object classification method of DMAU-Net proposed in this paper provides the best classification results when segmenting the first column of images. For example, DMAU-Net shows better classification results for the low vegetation in the bottom left area. In the other models, the vegetation is disturbed more by other objects, with crossover and overlap, while the prediction result of our algorithm shows clear object boundaries.

Furthermore, the second column of images shows that some car edges overlap with trees, which makes accurate car identification difficult. Even so, the number of cars recognized by DMAU-Net is the same as in the labelled image. This indicates that the proposed model can partly reduce the negative impact of occlusion by other features on the identification of small targets, which makes the network advantageous for small-object recognition. We attribute this mainly to the ECA module introduced in this paper, which improves the detection of ground objects. In addition, taking buildings and vegetation as examples, the DeepLabv3+ and U-Net++ results in the third and fourth columns contain blank areas along the boundaries of ground objects, so the objects are incomplete. In the RefineNet results, the regions are complete, but the edges are sometimes jagged. Compared with ABCNet and MACU-Net, our model obtains results closer to the ground truth for low vegetation and cars.

Figure 8 shows the edge optimization capabilities of the different methods for the target features. The two left columns show the comparison results for the Vaihingen dataset, and the two right columns for the Potsdam dataset. Taking the building in the circle in the first column as an example, the DeepLabv3+ and U-Net++ results contain blank voids in the middle of the blue buildings and confuse buildings with impervious surfaces. In the RefineNet results the area is complete, but the left border is noticeably jagged. DMAU-Net retains more edge detail because it combines multi-stage features in the network. Its building edges are more clearly and coherently contoured, and the buildings are more complete, than in the DeepLabv3+ and RefineNet results. The same behaviour can be seen in the second and third columns. In the red circle of the second column, where the area around the building abounds with trees and low vegetation, DeepLabv3+ classifies the whole area as low vegetation and gives the worst result. RefineNet and U-Net++ can distinguish the low vegetation from the trees, but with poorer segmentation accuracy than DMAU-Net, which comes closest to the ground truth. In the fourth column, DeepLabv3+ loses the most boundary information. Because the ECA module in DMAU-Net strengthens the fusion of multistage features, DMAU-Net obtains better results than U-Net++. Overall, DMAU-Net obtains finer building edges and can accurately classify features with inter-class similarity in remote-sensing images. Its segmentation results are more complete, and its boundary extraction is superior to that of the compared methods.

In summary, DMAU-Net can extract clear and coherent feature edge contours by modifying the U-Net network while maintaining good integrity. This is mainly due to our modification of the encoder network, which improves the feature-extraction capability of the network and obtains higher-quality feature maps. The visualized comparison results show that, compared to other classification methods in the literature, a significant advantage of DMAU-Net is its robustness for detecting features and its accurate, complete, and clear classification maps, which make it a superior method in comparison with other techniques.

5. Conclusions

This paper proposes DMAU-Net, a U-Net-based network for feature classification. The model exploits the invariance of feature selection and improves the quality of the feature maps by introducing parallel integrated max-pooling modules. The densely connected network interconnects all feature layers to maximize the information flow in feature mapping. Its regularization effect also helps to alleviate the vanishing gradient problem. At the same time, the network reinforces the reuse of feature maps, which allows it to fuse richer detailed features, reduces the learning of redundant features, and decreases the number of model parameters. Finally, DMAU-Net was validated on the Vaihingen and Potsdam urban remote-sensing datasets. The qualitative and quantitative analyses show that the network significantly improves the accuracy of ground object classification: it achieves the highest mIoU on both datasets and the highest IoU in nearly every feature class compared with other prevailing ground object classification methods.
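The two ideas summarized above, parallel max-pooling at several scales and dense reuse of earlier feature maps, can be illustrated with a small sketch. This is not the authors' implementation: the kernel sizes (3 and 5), the stride of 1, the border handling, and the helper names `max_pool_same`, `multiscale_max_pool`, and `dense_concat` are all assumptions made for illustration, and the learned convolutions of the real network are omitted.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling that keeps the H x W size.

    Border pixels pool over the part of the k x k window that lies inside the map.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def multiscale_max_pool(channels, kernel_sizes=(3, 5)):
    """Pool every channel at each kernel size and concatenate along the channel axis."""
    pooled = []
    for k in kernel_sizes:
        pooled.extend(max_pool_same(ch, k) for ch in channels)
    return pooled


def dense_concat(previous_outputs):
    """Dense connectivity: the next layer sees all earlier feature maps, concatenated."""
    merged = []
    for feats in previous_outputs:
        merged.extend(feats)
    return merged


if __name__ == "__main__":
    x = [[[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]]                        # one 3 x 3 channel
    pooled = multiscale_max_pool(x)          # two channels: 3 x 3 and 5 x 5 pooling
    dense_input = dense_concat([x, pooled])  # three channels reach the next layer
    print(pooled[0])
    print(len(dense_input))
```

Concatenating rather than summing is what lets later layers reuse earlier feature maps directly, which is the property the conclusions credit for richer detail and fewer redundant features.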

Although the DMAU-Net proposed in this paper improves the accuracy of feature classification to some extent, several issues remain to be addressed in future work. (1) The research in this paper uses optical remote-sensing images as the only data source and does not consider multisource remote-sensing data. Methods that classify objects by fusing multisource data, such as optical images, multispectral images, SAR images, and DEM, still need to be explored. (2) The experiments in this paper use only the Vaihingen and Potsdam datasets officially provided by ISPRS; the network has not been tested on other remote-sensing image datasets. Further experiments will be conducted on a wider range of datasets to test the generalization ability and robustness of the network. (3) Our model is still not accurate enough in the extraction of small features, such as cars. On the one hand, cars are difficult to segment accurately because they occupy only a small number of pixels. On the other hand, some background information may be lost during the dense connection of the multiscale max-pooling. (4) We explain only theoretically how the ECA module differs from the channel attention module in CBAM and why it is superior; further experiments are needed to verify this.
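To make the comparison in point (4) concrete, the sketch below follows the usual description of ECA in the ECA-Net paper [49]: global average pooling per channel, a 1D convolution across neighbouring channels without dimensionality reduction, and a sigmoid gate. The adaptive kernel-size rule with gamma = 2 and b = 1 and the CBAM reduction ratio of 16 are the defaults commonly quoted for those modules and are treated here as assumptions. The function names are illustrative, and the kernel weights would be learned in practice.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size: (log2(C) + b) / gamma, truncated, made odd by adding 1 if even."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_forward(feature_map, weights):
    """ECA-style channel attention on a list of C channels, each an H x W list of floats.

    weights is a 1D kernel of odd length k shared by all channels (learned in practice).
    """
    c = len(feature_map)
    k = len(weights)
    pad = k // 2
    # 1) Global average pooling: one scalar descriptor per channel.
    desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) 1D convolution across neighbouring channels, zero padded, no channel reduction.
    padded = [0.0] * pad + desc + [0.0] * pad
    mixed = [sum(weights[j] * padded[i + j] for j in range(k)) for i in range(c)]
    # 3) Sigmoid gate, then rescale every channel by its gate.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feature_map, gates)]


def cbam_channel_mlp_params(channels, reduction=16):
    """Weight count of the shared two-layer MLP in CBAM channel attention (biases ignored)."""
    return 2 * channels * (channels // reduction)


if __name__ == "__main__":
    for c in (64, 256, 512):
        print(c, eca_kernel_size(c), cbam_channel_mlp_params(c))
```

Under these assumptions, a 256-channel layer needs 5 weights for ECA but 8192 for the CBAM channel MLP, and ECA avoids the channel reduction step. Whether this translates into better segmentation accuracy is exactly what the experiments proposed in point (4) would have to show.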

Author Contributions

Methodology, software, and writing—original draft preparation, Y.Y.; data curation and writing—review and editing, J.D.; experiments design guidance and resources, Z.Y.; supervision and funding acquisition, Y.W.; investigation and visualization, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2018YFB0505400), the National Natural Science Foundation of China (42171224), the Great Wall Scholars Program (CIT&TCD20190328), and the Key Research Projects of National Statistical Science of China (2021LZ23).

Data Availability Statement

Publicly available datasets were analyzed in this study. The Vaihingen and Potsdam datasets in the experiments can be found here: https://www.isprs.org/education/benchmarks/UrbanSemLab/default.aspx (accessed on 24 October 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Su, Y.; Cheng, J.; Bai, H.; Liu, H.; He, C. Semantic Segmentation of Very-High-Resolution Remote Sensing Images via Deep Multi-Feature Learning. Remote Sens. 2022, 14, 533.
2. Zhang, Q.; Yang, G.; Zhang, G. Collaborative Network for Super-Resolution and Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4404512.
3. Wang, L.; Zhang, C.; Li, R.; Duan, C.; Meng, X.; Atkinson, P.M. Scale-Aware Neural Network for Semantic Segmentation of Multi-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 5015.
4. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic Labeling in Very High Resolution Images via a Self-Cascaded Convolutional Neural Network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
5. Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-Scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images. Remote Sens. 2020, 12, 872.
6. Long, T.; Jiao, W.; He, G.; Zhang, Z.; Cheng, B.; Wang, W. A Generic Framework for Image Rectification Using Multiple Types of Feature. ISPRS J. Photogramm. Remote Sens. 2015, 102, 161–171.
7. Shi, Y.; Qi, Z.; Liu, X.; Niu, N.; Zhang, H. Urban Land Use and Land Cover Classification Using Multisource Remote Sensing Images and Social Media Data. Remote Sens. 2019, 11, 2719.
8. Feng, S.; Fan, Y.; Tang, Y.; Cheng, H.; Zhao, C.; Zhu, Y.; Cheng, C. A Change Detection Method Based on Multi-Scale Adaptive Convolution Kernel Network and Multimodal Conditional Random Field for Multi-Temporal Multispectral Images. Remote Sens. 2022, 14, 5368.
9. Griffiths, P.; Nendel, C.; Hostert, P. Intra-Annual Reflectance Composites from Sentinel-2 and Landsat for National-Scale Crop and Land Cover Mapping. Remote Sens. Environ. 2019, 220, 135–151.
10. Taylor, J.R.; Lovell, S.T. Mapping Public and Private Spaces of Urban Agriculture in Chicago through the Analysis of High-Resolution Aerial Images in Google Earth. Landsc. Urban Plan. 2012, 108, 57–70.
11. Matikainen, L.; Karila, K. Segment-Based Land Cover Mapping of a Suburban Area—Comparison of High-Resolution Remotely Sensed Datasets Using Classification Trees and Test Field Points. Remote Sens. 2011, 3, 1777–1804.
12. Benediktsson, J.A.; Chanussot, J.; Moon, W.M. Advances in Very-High-Resolution Remote Sensing. Proc. IEEE 2013, 101, 566–569.
13. Yin, H.; Pflugmacher, D.; Li, A.; Li, Z.; Hostert, P. Land Use and Land Cover Change in Inner Mongolia—Understanding the Effects of China’s Re-Vegetation Programs. Remote Sens. Environ. 2018, 204, 918–930.
14. Samie, A.; Abbas, A.; Azeem, M.M.; Hamid, S.; Iqbal, M.A.; Hasan, S.S.; Deng, X. Examining the Impacts of Future Land Use/Land Cover Changes on Climate in Punjab Province, Pakistan: Implications for Environmental Sustainability and Economic Growth. Environ. Sci. Pollut. Res. 2020, 27, 25415–25433.
15. Sinaga, K.P.; Yang, M.-S. Unsupervised K-Means Clustering Algorithm. IEEE Access 2020, 8, 80716–80727.
16. Bezdek, J.C. A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 1980, PAMI-2, 1–8.
17. Ramze Rezaee, M.; Lelieveldt, B.P.F.; Reiber, J.H.C. A New Cluster Validity Index for the Fuzzy C-Mean. Pattern Recognit. Lett. 1998, 19, 237–246.
18. Camps-Valls, G.; Tuia, D.; Bruzzone, L.; Benediktsson, J.A. Advances in Hyperspectral Image Classification: Earth Monitoring with Statistical Learning Methods. IEEE Signal Process. Mag. 2014, 31, 45–54.
19. Adede, C.; Oboko, R.; Wagacha, P.W.; Atzberger, C. A Mixed Model Approach to Vegetation Condition Prediction Using Artificial Neural Networks (ANN): Case of Kenya’s Operational Drought Monitoring. Remote Sens. 2019, 11, 1099.
20. Pal, M. Random Forest Classifier for Remote Sensing Classification. Int. J. Remote Sens. 2005, 26, 217–222.
21. Li, Y.; Tao, C.; Tan, Y.; Shang, K.; Tian, J. Unsupervised Multilayer Feature Learning for Satellite Image Scene Classification. IEEE Geosci. Remote Sens. Lett. 2016, 13, 157–161.
22. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322.
23. Chen, G.; Tan, X.; Guo, B.; Zhu, K.; Liao, P.; Wang, T.; Wang, Q.; Zhang, X. SDFCNv2: An Improved FCN Framework for Remote Sensing Images Semantic Segmentation. Remote Sens. 2021, 13, 4902.
24. Chen, H.; Cheng, L.; Zhuang, Q.; Zhang, K.; Li, N.; Liu, L.; Duan, Z. Structure-Aware Weakly Supervised Network for Building Extraction From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5412712.
25. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98.
26. Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603018.
27. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
28. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
29. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5168–5177.
30. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
31. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLabv3+: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
32. Zhong, Y.; Fei, F.; Liu, Y.; Zhao, B.; Jiao, H.; Zhang, L. SatCNN: Satellite Image Dataset Classification Using Agile Convolutional Neural Networks. Remote Sens. Lett. 2017, 8, 136–145.
33. Qin, R.; Liu, T. A Review of Landcover Classification with Very-High Resolution Remotely Sensed Optical Images—Analysis Unit, Model Scalability and Transferability. Remote Sens. 2022, 14, 646.
34. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713.
35. Guo, S.; Jin, Q.; Wang, H.; Wang, X.; Wang, Y.; Xiang, S. Learnable Gated Convolutional Neural Network for Semantic Segmentation in Remote-Sensing Images. Remote Sens. 2019, 11, 1922.
36. Ni, W.; Gao, X.; Wang, Y. Single Satellite Image Dehazing via Linear Intensity Transformation and Local Property Analysis. Neurocomputing 2016, 175, 25–39.
37. Mohammadimanesh, F.; Salehi, B.; Mahdianpari, M.; Gill, E.; Molinier, M. A New Fully Convolutional Neural Network for Semantic Segmentation of Polarimetric SAR Imagery in Complex Land Cover Ecosystem. ISPRS J. Photogramm. Remote Sens. 2019, 151, 223–236.
38. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
39. Elamin, A.; El-Rabbany, A. UAV-Based Multi-Sensor Data Fusion for Urban Land Cover Mapping Using a Deep Convolutional Neural Network. Remote Sens. 2022, 14, 4298.
40. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059.
41. Gao, Q.; Almekkawy, M. ASU-Net++: A Nested U-Net with Adaptive Feature Extractions for Liver Tumor Segmentation. Comput. Biol. Med. 2021, 136, 104688.
42. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11.
43. Rundo, L.; Han, C.; Nagano, Y.; Zhang, J.; Hataya, R.; Militello, C.; Tangherloni, A.; Nobile, M.S.; Ferretti, C.; Besozzi, D.; et al. USE-Net: Incorporating Squeeze-and-Excitation Blocks into U-Net for Prostate Zonal Segmentation of Multi-Institutional MRI Datasets. Neurocomputing 2019, 365, 31–43.
44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
45. Han, L.; Zhao, Y.; Lv, H.; Zhang, Y.; Liu, H.; Bi, G. Remote Sensing Image Denoising Based on Deep and Shallow Feature Fusion and Attention Mechanism. Remote Sens. 2022, 14, 1243.
46. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5.
47. Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 905–909.
48. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717.
49. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
50. You, H.; Tian, S.; Yu, L.; Ma, X.; Xing, Y.; Xin, N. A New Multiple Max-Pooling Integration Module and Cross Multiscale Deconvolution Network Based on Image Semantic Segmentation. arXiv 2020, arXiv:2003.11213.
51. You, H.; Yu, L.; Tian, S.; Ma, X.; Xing, Y.; Xin, N.; Cai, W. MC-Net: Multiple Max-Pooling Integration Module and Cross Multi-Scale Deconvolution Network. Knowl.-Based Syst. 2021, 231, 107456.
52. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
54. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62.
55. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention Mechanisms in Computer Vision: A Survey. Comput. Vis. Media 2022, 8, 331–368.
56. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
57. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
58. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31, 1–11.
59. Qu, Z.; Mei, J.; Liu, L.; Zhou, D.Y. Crack detection of concrete pavement with cross-entropy loss function and improved VGG16 network model. IEEE Access 2020, 8, 54564–54573.
60. Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D. ISPRS Semantic Labeling Contest; ISPRS: Leopoldshöhe, Germany, 2014.
61. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
62. Li, R.; Duan, C.; Zheng, S.; Zhang, C.; Atkinson, P.M. MACU-Net for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8007205.

Figure 1. Schematic diagram of the multiscale max-pooling module [51].

Figure 2. Diagram of the dense connection method. A total of four densely connected blocks were passed after the feature map input, and each layer took all preceding feature-maps as input [52].

Figure 3. Schematic representation of the ECA module in the network. The input feature map was convolved twice with a convolution kernel size of 3 × 3, and the feature map of size $H_{1} \times W_{1} \times C_{1}$ was passed through the ECA module [49], and finally the feature map with channel attention was output.

Figure 4. The structure of DMAU-Net.

Figure 5. Raw images and ground truth masks of the Vaihingen and Potsdam dataset. (a) Potsdam dataset; (b) Vaihingen dataset; (c) number of pixels in each class as a percentage of total pixels.

Figure 6. Schematic diagram of the loss and accuracy change during the training.

Figure 7. Classification results of different models based on Vaihingen and Potsdam datasets. (a) DeepLabv3+; (b) RefineNet; (c) U-Net++; (d) ABCNet; (e) MACU-Net; (f) DMAU-Net.

Figure 8. Typical examples that compare the extracted edges of different models. (a) DeepLabv3+; (b) RefineNet; (c) U-Net++; (d) ABCNet; (e) MACU-Net; (f) DMAU-Net.

Table 1. IoU (%) and mIoU (%) for the different feature classification models on the Vaihingen dataset.

Model	Param (M)	Imp. surf.	Building	Tree	Car	Low veg.	mIoU (%)
RefineNet [29]	99	82.14	86.74	77.87	68.69	72.46	74.39
DeepLabv3+ [61]	41.25	79.93	88.78	78.45	68.12	66.24	77.54
U-Net++ [42]	9.05	85.31	88.12	80.82	71.27	75.35	81.24
ABCNet [25]	14.06	86.13	90.25	82.19	76.35	77.21	83.16
MACU-Net [62]	5.15	88.36	89.73	83.96	76.12	80.86	85.27
DMAU-Net	23.54	89.72	92.46	84.36	77.43	81.70	87.85

Table 2. IoU (%) and mIoU (%) for the different feature classification models on the Potsdam dataset.

Model	Param (M)	Imp. surf.	Building	Tree	Car	Low veg.	mIoU (%)
RefineNet [29]	99	81.19	84.68	75.66	70.34	69.67	72.36
DeepLabv3+ [61]	41.25	76.91	86.59	77.64	65.16	62.68	75.87
U-Net++ [42]	9.05	83.25	83.87	78.33	73.27	74.38	80.56
ABCNet [25]	14.06	86.25	92.17	83.16	74.83	76.26	85.17
MACU-Net [62]	5.15	86.64	90.36	80.69	73.37	76.58	84.76
DMAU-Net	23.54	87.72	92.03	83.91	75.46	78.52	85.68
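As a quick arithmetic check on the two tables, the percentage-point gaps between DMAU-Net and each baseline can be recomputed from the reported mIoU columns. The sketch below only copies the mIoU values from Tables 1 and 2; the dictionary layout and the helper name `gaps_to` are illustrative assumptions.

```python
# mIoU (%) values copied from Table 1 (Vaihingen) and Table 2 (Potsdam).
MIOU = {
    "Vaihingen": {
        "RefineNet": 74.39, "DeepLabv3+": 77.54, "U-Net++": 81.24,
        "ABCNet": 83.16, "MACU-Net": 85.27, "DMAU-Net": 87.85,
    },
    "Potsdam": {
        "RefineNet": 72.36, "DeepLabv3+": 75.87, "U-Net++": 80.56,
        "ABCNet": 85.17, "MACU-Net": 84.76, "DMAU-Net": 85.68,
    },
}


def gaps_to(reference, scores):
    """Percentage-point lead of the reference model over every other model."""
    ref = scores[reference]
    return {m: round(ref - s, 2) for m, s in scores.items() if m != reference}


if __name__ == "__main__":
    for dataset, scores in MIOU.items():
        print(dataset, gaps_to("DMAU-Net", scores))
```

On these numbers the lead over DeepLabv3+ is 10.31 points on Vaihingen and 9.81 on Potsdam, and the lead over the two most recent baselines ranges from 0.51 points (ABCNet on Potsdam) to 4.69 points (ABCNet on Vaihingen).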

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
