1. Introduction
Semantic segmentation is an active research area in computer vision and remote sensing image processing. In semantic segmentation applications, land cover is an important indicator: it has practical significance in land planning, mapping analysis [1], surveying, and land classification [2]. The analysis of land resources is receiving increasing attention. Consequently, real-time semantic segmentation of land cover is a crucial step in land resource planning and management [3,4].
Currently, the primary classical semantic segmentation techniques include the threshold approach, the clustering method, the support vector machine, and others. Zhang, Jiang, and Xu (2013) [5] applied support vector machines to shoreline extraction by enhancing geometric edge properties and reducing errors, although this approach is difficult to train on large-scale data. The normalized difference water index (NDWI) was proposed by McFeeters in 1996 [6]. The technique creates a segmentation band by combining the green and near-infrared bands of the image; however, it is quite environment-dependent. On the basis of a digital elevation model, Du et al. (2018) [7] developed a novel method for water body detection and achieved high accuracy, but its outcomes are significantly influenced by the varying infrared images. The active contour model (ACM) [8] recasts image segmentation as the minimization of an energy functional. The model can produce a smooth, continuous contour curve, but it is highly dependent on the initial contour. The traditional methods above can complete segmentation on small samples. For large-sample segmentation, however, they suffer from poor precision, laborious parameter tuning, and inadequate generalization, so they cannot accurately extract the shapes of lakes, rivers, or buildings.
In recent years, deep neural networks have been extensively utilized in image processing and other fields [9,10,11], and, thanks to the rapid development of deep learning, they significantly increase image processing accuracy. Hinton [12] created a deep learning technique that trained deep belief networks with the back-propagation algorithm, considerably enhancing models' capacity for image processing. Early deep learning network models mainly performed image classification: through convolution, pooling, and activation functions such as the sigmoid, the network extracts the feature information of the entire picture, and the output layer then calculates the probability that the picture belongs to each category, taking the highest probability as the classification outcome. The disadvantage is that such models can only classify specific objects and cannot extract position and edge information. Therefore, for land cover segmentation targets with complex environmental backgrounds, traditional convolutional neural networks are unable to deliver effective, high-precision segmentation. To solve these problems, computer vision researchers have proposed a series of efficient models for semantic segmentation. The convolutional neural network (CNN) [13] significantly boosts the efficiency and precision of semantic land cover segmentation compared to conventional algorithms. For pixel-level classification of images, Long, Shelhamer, and Darrell [14] proposed the Fully Convolutional Network (FCN) in 2014. The feature map of the final convolutional layer is upsampled by a deconvolution layer to match the size of the source image, allowing FCN to accept input images of any size, make a prediction for each pixel, and retain the spatial information of the original input; pixel-wise classification on the upsampled feature map considerably enhances the classification accuracy of remote sensing images. In 2015, Badrinarayanan et al. [15] proposed SegNet to address image semantic segmentation for autonomous driving and intelligent robots. SegNet is conceptually similar to FCN, but its encoder and decoder (upsampling) use different methods. SegNet's encoder also uses the first 13 layers of the VGG16 convolutional network, and a decoder layer is matched to each encoder layer. The soft-max classifier receives the output of the final decoder and independently produces class likelihoods for each pixel, but its training is time-consuming and inefficient. In 2015, Ronneberger, Fischer, and Brox [16] proposed UNet, a widely used semantic segmentation model. UNet is a segmentation network for the medical field based on FCN, and it quickly became the baseline for most medical image semantic segmentation tasks. Although medical image signals are complex, the categories are simple and the distribution of human tissue has a certain regularity, so the datasets are relatively simple and the segmentation task comparatively easy. In 2017, Chen, Papandreou, and others [17] proposed DeepLabv3, which revisits dilated convolutions for semantic segmentation. Dilated convolution is a potent tool for managing the resolution of feature responses produced by deep convolutional neural networks and for enlarging the filter's field of view. DeepLabv3 introduces atrous spatial pyramid pooling (ASPP), which samples the input in parallel with atrous convolutions at various sampling rates, equivalent to capturing the image's multi-scale context at various scales. However, as the blocks deepen, the dilation rate of the dilated convolution keeps increasing, and the ability to capture global information becomes weaker and weaker.
Although existing segmentation algorithms perform well in real-time semantic segmentation of land cover, the continuous downsampling of convolutional neural networks loses many semantic features, resulting in inaccurate segmentation and blurred edges [18,19]. Additionally, the contextual semantic information of the image cannot be properly captured, because it is difficult to merge the global information of low-level semantic features with the detailed information of high-level semantic features [20,21]. Because of the complex background and terrain of land cover, traditional convolutional neural networks often suffer from misjudged or undetected objects, resulting in low segmentation accuracy. To address these shortcomings of traditional convolutional neural networks in semantic segmentation tasks, this paper designs a network that considers several elements of convolutional neural network design. The network extracts the feature information of land cover images by downsampling with a modified ResNet50. A multi-scale feature extraction pyramid module aggregates the context information of different regions, improving the ability to obtain global information about land cover. Then, a multi-scale feature fusion module combines low-level detail information and high-level semantic information to improve the segmentation of land cover edges and details, since land cover has complex backgrounds, unpredictable terrain, and intricate edges. Finally, a multi-scale convolutional attention module is proposed to focus on the key target areas of the land cover image, helping the network locate targets of interest more accurately. It pays more attention to the important information of land cover and suppresses interfering and unimportant information, so the segmentation effect is significantly improved and high-precision segmentation can be achieved.
2. Methodology
Models based on the convolutional neural network (CNN) architecture are intensively used in computer vision due to their explosive growth. However, in remote sensing images the land background is complex and diverse, the terrain is unpredictable, and detail and spatial information are rich, so traditional convolutional neural networks fail to achieve high-precision semantic land cover segmentation. To segment land cover images more accurately, multi-scale and contextual semantic information must be used efficiently: the goal is not only to segment the buildings and water bodies in land cover images but, more importantly, their edges and small details. This study adopts ResNet [22] as the foundation network of a multi-scale feature aggregation network. The network is made up of three modules: a multi-scale feature extraction space pyramid module, a multi-scale feature fusion module, and a multi-scale convolutional attention module. In this study, the model is trained on input land cover remote sensing images, and the current model's output image is obtained by a forward propagation computation. The discrepancy between the produced image and the label is calculated with the cross-entropy loss function, and the chain rule is used to propagate the resulting error back through the network. During the back propagation computation, the adaptive moment estimation (Adam) optimizer is used to update the learning rate and the other parameters of the model. The Adam optimizer [23] uses an exponential decay rate with a coefficient of 0.9 to control the first-moment (gradient) estimate and an exponential decay rate with a coefficient of 0.999 to control the contribution of the preceding squared gradients. A variety of learning-rate strategies exist, such as the "fixed", "step", and "poly" strategies; current experiments show that the "poly" strategy performs better in semantic segmentation experiments [24].
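As a concrete illustration, the following minimal PyTorch sketch wires these pieces together: Adam with decay rates (0.9, 0.999), a cross-entropy loss, and a "poly" learning-rate schedule lr = base_lr · (1 − iter/max_iter)^power. The hyperparameter values (base_lr, max_iter, power) are illustrative assumptions, not values reported in this paper.

```python
# Minimal training-setup sketch: Adam (betas 0.9/0.999), cross-entropy loss,
# and a "poly" learning-rate schedule. Hyperparameter values are assumptions.
import torch
import torch.nn as nn

def train(model, loader, base_lr=1e-3, max_iter=40000, power=0.9):
    criterion = nn.CrossEntropyLoss()                    # produced image vs. label
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                                 betas=(0.9, 0.999))     # first/second moment decay
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1.0 - it / max_iter) ** power)  # "poly" strategy
    it = 0
    while it < max_iter:
        for image, label in loader:                      # label: N x H x W class ids
            optimizer.zero_grad()
            loss = criterion(model(image), label)        # forward propagation
            loss.backward()                              # error returned by chain rule
            optimizer.step()                             # Adam parameter update
            scheduler.step()                             # decay the learning rate
            it += 1
            if it >= max_iter:
                break
```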
2.1. Network Structure
This research introduces a new semantic segmentation deep learning network that performs very well in real-time land cover segmentation. Its frame structure is shown in Figure 1. As the foundation network for extracting features, we employ a modified ResNet50. This paper then presents a multi-scale feature extraction space pyramid module, whose goal is to combine the context information from several places and enhance the ability to obtain global information, so as to accurately and effectively identify regions of any scale. Next, a multi-scale feature fusion module is constructed to combine features of various scales, which is crucial for enhancing performance. Compared with high-level features, low-level features have more location and detail information and better resolution, but fewer convolutions, weaker semantics, and more noise. High-level features have weaker resolution and poorer detail perception but stronger semantic information. The multi-scale feature fusion module fully integrates low-level and high-level features. Finally, a multi-scale convolutional attention module is suggested. It can efficiently mitigate the interference of invalid targets, enhance the detection of targets of interest, increase attention to the buildings and water bodies in land cover images, and change the allocation of resources according to the importance of the target, allocating more resources to the object of attention [25]. This increases the model's overall segmentation accuracy and makes the real-time semantic segmentation of land cover more precise.
2.2. Backbone
The extraction of semantic and detail information is a key step in image segmentation, and the selection and optimization of the backbone network directly affect semantic segmentation. Typical backbone networks include VGGNet [26], MobileNet [27], Transformer [28], ResNet, Xception [29], and DenseNet [30]. In the semantic segmentation of land cover, typical backbone networks often downsample too deeply, which leads to heavy loss of global information in the land cover image; as a result, the effective information of small targets in the image decreases sharply. At the same time, deep downsampling multiplies the number of parameters, making the model computationally heavy; conversely, an excessive pursuit of light weight fails to address the essential problems of the downsampling process. For real-time semantic segmentation of land cover, the backbone must extract high-precision semantic and detail information while preventing the gradient explosion caused by increasing depth. Therefore, this paper uses a modified ResNet50 as the backbone network for semantic segmentation of land cover. The standard ResNet50 downsamples 32 times, and such deep downsampling easily loses location and detail information; in addition, its last layer expands to 2048 channels, which makes computation heavy and model complexity high. Thus, this article uses a modified ResNet50 that keeps only the first four stages of ResNet50, up to the 1024-channel layer, and sets the stride of the fourth stage to one, so the network downsamples only eight times. In this way, sufficient location and detail information is retained and the network remains lightweight. This solves the problems of typical backbone networks, which either sample too deeply or pursue light weight too aggressively, losing the key information of land cover. The expression of the residual unit in the ResNet residual block is as follows:
$$x_{l+1} = f\left(x_l + \mathcal{F}\left(x_l, W_{l,1}, W_{l,2}\right)\right)$$

where $x_l$ is the $l$-th residual unit's input vector, $x_{l+1}$ is the $l$-th residual unit's output vector, $f(\cdot)$ is the nonlinear function ReLU, and the weight matrices are represented by $W_{l,1}$ and $W_{l,2}$.
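The following PyTorch sketch shows one way to realize this modified backbone on top of torchvision's ResNet50, under the assumption that "the first four layers" means the stem plus stages layer1–layer3 (1024 output channels) and that the stride-one fourth stage is obtained by dilating layer3; the paper's exact surgery may differ.

```python
# Sketch of the modified ResNet50 backbone: layer4 is dropped (keeping 1024
# channels) and layer3 is dilated so the overall downsampling factor is 8.
import torch.nn as nn
from torchvision.models import resnet50

class ModifiedResNet50(nn.Module):
    def __init__(self):
        super().__init__()
        # The three flags correspond to layer2/layer3/layer4; dilating layer3
        # replaces its stride-2 convolution with stride 1 plus dilation.
        net = resnet50(weights=None,
                       replace_stride_with_dilation=[False, True, False])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1 = net.layer1    # 256 channels,  1/4 resolution
        self.layer2 = net.layer2    # 512 channels,  1/8 resolution
        self.layer3 = net.layer3    # 1024 channels, 1/8 resolution (dilated)
        # net.layer4 (2048 channels, 1/32) is dropped to keep the model light.

    def forward(self, x):
        x = self.stem(x)
        low = self.layer1(x)        # low-level detail features for later fusion
        x = self.layer2(low)
        return low, self.layer3(x)  # high-level 1024-channel features
```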
2.3. Multi-Scale Feature Extraction Space Pyramid Module
Capturing multi-scale context is crucial for semantic segmentation in order to gather comprehensive information. The land cover image has a complex background and crisscrossed rivers and houses. Therefore, a module is needed that aggregates multi-scale context information to solve the problem of context information loss, which can cause blurred edges and undetectable small targets, thereby improving the accuracy of real-time semantic segmentation of land cover. To combine the context information from several locations and enhance the capacity to access global information, this research refers to the pyramid pooling module in PSPNet [31] and to the ASPP module of the classical DeepLabv3Plus network. To preserve image resolution, dilated convolution is employed to expand the receptive field over the land cover image information. Figure 2 illustrates its particular structure.
In land cover segmentation, ordinary convolution only processes a local area, and its receptive field is limited; thus, it is difficult to capture multi-scale context information. First, three 3 × 3 dilated convolutions are used, with dilation rates 6, 12, and 18. The dilation rate specifies how far the convolution is expanded: it enlarges the network's receptive field and its ability to acquire multi-scale context without altering the shape of the convolution kernel and without downsampling. The receptive field growth of a two-dimensional dilated convolution can be expressed as [32]

$$k' = k + (k - 1)(r - 1),$$

where $k$ is the original kernel size, $r$ is the dilation rate, and $k'$ is the equivalent kernel size of the dilated convolution. For example, a 3 × 3 convolution with rate 6 covers the same area as a 13 × 13 kernel.
Then, adaptive average pooling at sizes 1 × 1, 2 × 2, 3 × 3, and 6 × 6 is used, so global information is gained by pooling instead of convolution. This reduces the number of convolution layers on the premise that the backbone is deep enough. Pooling at different sizes divides the feature map into different sub-regions and forms a set of representations at different positions of the land cover image. The pyramid pooling module integrates the features of land cover images at four different scales, fully aggregating the context information and global information of different regions. The CBAM attention module is also added [33]. It can change the allocation of resources according to the importance of the target, so that resources lean toward the building and water objects in the land cover image; this addresses the edge blurring and small-object omission that arise because traditional neural networks do not pay enough attention to important objects. The traditional attention mechanism in deep learning networks is more concerned with analysis of the channel domain, which limits it to relationships between feature map channels. CBAM starts from the two scopes of channel and space: by introducing the two analytical dimensions of channel attention and spatial attention, it realizes a sequential attention structure from channel to space. Spatial attention directs the neural network to focus on the pixel regions that matter for segmentation and to ignore unimportant parts, channel attention processes the distribution of connections across feature map channels, and attention across the two dimensions strengthens the influence of the attention mechanism. This improves the model's representation, effectively reduces the interference of invalid targets such as trees, cars, and seats in the land cover image, raises the segmentation effectiveness for targets of interest, and improves the overall accuracy of real-time semantic segmentation. Finally, the max-pool branch, the 1 × 1 convolution, the three 3 × 3 atrous convolutions, and the four adaptive average pools are combined, and their output is passed through the CBAM module to emphasize important features and suppress unnecessary ones, forming the complete multi-scale feature extraction space pyramid module.
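A minimal PyTorch sketch of this module is given below. The branch widths, the 1 × 1 projection after concatenation, and the CBAM internals (which follow the design in [33]) are assumptions; the dilation rates (6, 12, 18) and pooling sizes (1, 2, 3, 6) are those stated above.

```python
# Sketch of the multi-scale feature extraction space pyramid module (MFEPM):
# 1x1 conv, max-pool branch, three dilated 3x3 convs, four adaptive average
# pools, concatenation, 1x1 projection, then CBAM attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k=1, dilation=1):
    pad = dilation if k == 3 else 0
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=pad,
                                   dilation=dilation, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class CBAM(nn.Module):
    def __init__(self, c, r=16, k=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c // r, c, 1))    # shared MLP
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        # channel attention: shared MLP over avg- and max-pooled descriptors
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # spatial attention: 7x7 conv over stacked channel statistics
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa

class MFEPM(nn.Module):
    def __init__(self, cin=1024, cmid=256):
        super().__init__()
        self.conv1 = conv_bn_relu(cin, cmid)
        self.atrous = nn.ModuleList(
            [conv_bn_relu(cin, cmid, k=3, dilation=d) for d in (6, 12, 18)])
        self.pools = nn.ModuleList(
            [nn.Sequential(nn.AdaptiveAvgPool2d(s), conv_bn_relu(cin, cmid))
             for s in (1, 2, 3, 6)])
        self.maxpool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                     conv_bn_relu(cin, cmid))
        self.project = conv_bn_relu(cmid * 9, cmid)   # 9 branches in total
        self.cbam = CBAM(cmid)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1(x), self.maxpool(x)]
        feats += [branch(x) for branch in self.atrous]
        # pooled branches are upsampled back to the input resolution
        feats += [F.interpolate(p(x), size=(h, w), mode="bilinear",
                                align_corners=False) for p in self.pools]
        return self.cbam(self.project(torch.cat(feats, dim=1)))
```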
2.4. Multi-Scale Feature Fusion Module
In real-time land cover semantic segmentation, the complex background and interlaced housing construction make small buildings and rivers very difficult to identify [34]. Low-level features offer higher resolution and additional location and detail information, whereas high-level features carry stronger semantic information but have limited resolution and poor detail perception. To address the issue of incomplete information at the various scales of land cover images, we fuse multi-scale information: blending features at various scales improves segmentation performance by combining the segmentation outcomes of different layers. In addition, existing networks generally use maximum or average pooling to process channels, which loses the spatial position information of the object. In lightweight networks, model capacity is strictly limited, so the adoption of attention lags behind, mainly because the computational overhead of most attention mechanisms is unaffordable for lightweight models.
As a result, this study develops a multi-scale feature fusion module.
Figure 3 illustrates its particular structure. It uses X Avgpool and Y Avgpool [35] for average pooling along two dimensions. The module incorporates channel attention with location information. In contrast to pure channel attention, where 2D global pooling collapses the input into a single feature vector, the module employs two 1D feature encodings that aggregate features along the two spatial directions. This captures long-range dependencies along one spatial direction while retaining precise position information along the other, which is very important because in land cover segmentation the target position is often detected inaccurately or cannot be detected directly. The produced land cover feature maps are then encoded separately into a pair of direction-aware and position-sensitive feature maps, which can be applied to the input feature map so that the target of interest is represented more accurately. The aim is to encode exact location information and draw attention along the width and height of the land cover image. To obtain the feature maps in the width and height directions, the input feature map is first split into two directions for global average pooling: each channel is encoded separately along the horizontal and vertical directions of a given input X using pooling kernels of spatial extent (H, 1) or (1, W). The output of channel $c$ at height $a$ can thus be written as

$$z_c^h(a) = \frac{1}{W}\sum_{0 \le n < W} x_c(a, n),$$

and the output of channel $c$ at width $b$ can be expressed as

$$z_c^w(b) = \frac{1}{H}\sum_{0 \le m < H} x_c(m, b),$$
where $a$ is the height, $b$ is the width, $c$ is the $c$-th channel, $n$ is the pixel at the $n$-th position of the $c$-th channel at height $a$, and $m$ is the pixel at the $m$-th position of the $c$-th channel at width $b$. Pooling along the two directions also enables the attention module to save exact position information in one spatial direction and long-range dependency along the other. Afterward, the two feature maps are concatenated, and a shared 1 × 1 convolution transformation $F_1$ combines the spatial information along the horizontal and vertical axes to create an intermediate feature map $f \in \mathbb{R}^{C/r \times (H + W)}$, where $r$ is the channel reduction ratio. The expression for $f$ is

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right),$$
where $\delta$ is a nonlinear activation function [36], and the superscripts $h$ and $w$ denote the height and width directions, respectively.
$f$ is then split along the spatial dimension into two distinct tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. Two 1 × 1 convolutions $F_h$ and $F_w$ are used to transform the feature maps $f^h$ and $f^w$ to the same number of channels as the input X. The outcome is as shown below:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right),$$

where $\sigma$ is the sigmoid activation function.
Then, $g^h$ and $g^w$ are expanded and used as the attention weights. The final output expression is as follows:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j).$$
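The sketch below implements exactly these equations in PyTorch, following the coordinate-attention design of [35]; the reduction ratio $r$ and the BN + ReLU after the shared 1 × 1 convolution are assumptions.

```python
# Sketch of the coordinate-attention operation defined by the equations above.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, c, r=32):
        super().__init__()
        cmid = max(8, c // r)
        self.f1 = nn.Sequential(nn.Conv2d(c, cmid, 1), nn.BatchNorm2d(cmid),
                                nn.ReLU(inplace=True))   # shared 1x1 conv F1
        self.f_h = nn.Conv2d(cmid, c, 1)                 # F_h, height direction
        self.f_w = nn.Conv2d(cmid, c, 1)                 # F_w, width direction

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                # X AvgPool: N x C x H x 1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # Y AvgPool: N x C x W x 1
        f = self.f1(torch.cat([z_h, z_w], dim=2))        # N x C/r x (H+W) x 1
        f_h, f_w = torch.split(f, [h, w], dim=2)         # split along space
        g_h = torch.sigmoid(self.f_h(f_h))               # N x C x H x 1
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))  # N x C x 1 x W
        return x * g_h * g_w                             # expanded attention weights
```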
Finally, the low-level features are transformed by a 3 × 3 convolution, batch normalization, and ReLU, and multiplied by the output of the high-level feature extraction branch above. The high-level features then have their channel count changed by a 1 × 1 convolution, and this output is added to the multiplied result above. Together, these elements form the multi-scale feature fusion module. In terms of effectiveness, the multi-scale feature fusion module enhances the pixel and spatial information at the background edges of the land cover image, and capturing the dependence between channels also models location information and long-range dependence well. It weakens or eliminates interference information such as debris, forest shade, and roads, and combines the high-level and low-level features. It can better aggregate features of different scales and cope with objects whose contours and sizes differ, addressing pain points such as severe object interference and loss of multi-scale information in land cover segmentation tasks, thereby improving the accuracy of real-time semantic segmentation of land cover.
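One plausible reading of this fusion is sketched below: the low-level branch (3 × 3 Conv-BN-ReLU) is multiplied by the coordinate-attention-weighted high-level branch, and a 1 × 1-conv channel projection of the high-level features is added to that product. The channel sizes and the bilinear upsampling step are assumptions; CoordAttention is the sketch above.

```python
# Sketch of the multi-scale feature fusion module (MFFM).
import torch.nn as nn
import torch.nn.functional as F

class MFFM(nn.Module):
    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.low = nn.Sequential(
            nn.Conv2d(c_low, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.attn = CoordAttention(c_high)
        self.proj = nn.Conv2d(c_high, c_out, 1)   # change high-level channels

    def forward(self, low, high):
        high = self.attn(high)                    # position-aware weighting
        high = F.interpolate(high, size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        h = self.proj(high)
        return self.low(low) * h + h              # multiply, then add
```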
2.5. Multi-Scale Convolutional Attention Module
In semantic segmentation of land cover, the complex background and staggered houses and rivers make segmentation difficult. It is therefore particularly crucial to concentrate on key targets, effectively decrease the interference of invalid targets, and enhance the module's representation capability [37]. To address the low attention that traditional convolutional neural networks pay to water bodies and buildings, a multi-scale convolutional attention module is proposed; it reduces the interference of invalid targets such as tree shadows, makes the edge segmentation of buildings and water bodies more accurate, and captures the information of small rivers in more detail. Simultaneously, a 3D attention module is used to effectively alleviate the shortcomings of channel attention and spatial attention in segmentation; its specific structure is shown in Figure 4. This method improves the accuracy of land cover semantic segmentation. The specific structure of the whole module is shown in Figure 5.
A crucial component of the multi-scale convolutional attention module is the SimAM module [38], which is distinct from spatial or channel attention modules. Without introducing any additional parameters, the module produces 3D attention weights for the feature map. A 1D channel attention module treats different channels differently and all locations equally [39]; a 2D spatial attention module treats different locations differently and all channels equally [40]. Compared with existing channel and spatial attention modules, SimAM provides three-dimensional attention weights for the feature map within a feature layer. Faced with the complex and diverse backgrounds of land cover, ordinary attention modules often struggle to combine such complex variables, but SimAM maps a full three-dimensional attention weight, which solves the pain point that attention modules perform poorly against complex backgrounds. Its particular structure is displayed in Figure 4.
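For reference, a minimal sketch of the parameter-free SimAM operation of [38] follows: each neuron's weight is derived from an energy function over its deviation from the channel mean, and the regularization constant lambda is the only (assumed) hyperparameter.

```python
# Sketch of SimAM 3D attention: no learnable parameters, per-neuron weights.
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1                  # pixels per channel - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n          # channel variance
        e_inv = d / (4 * (v + self.lam)) + 0.5           # inverse energy per neuron
        return x * torch.sigmoid(e_inv)                  # 3D attention weights
```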
In the design of the multi-scale convolutional attention module, the input is first concatenated with the feature map of the multi-scale feature extraction space pyramid module, so that the input feature map carries more detail and semantic information, and a 1 × 1 convolution then reduces the number of channels to avoid redundant information and wasted computation. The resulting feature maps are average-pooled and max-pooled to produce two 1 × 1 × C descriptors, which are fed to a two-layer neural network (MLP) [41]. ReLU serves as the activation function; the first layer has C/r neurons, where r is the reduction rate, the second layer has C neurons, and the two layers are shared between the descriptors. The channel attention feature map is then produced from the MLP outputs with a sigmoid activation, and an element-wise multiplication is carried out between this output and the input feature map. Finally, the feature map before the max pool and average pool is passed through the SimAM module, and the two outputs are added to obtain the output of the first layer. The second layer of this module takes the first layer's output feature map as its input. First, AdaptiveMaxPool and AdaptiveAvgPool produce two H × W × 1 feature maps, which are concatenated along the channel dimension. A 7 × 7 convolution then reduces the result to one channel, giving a feature map of size H × W × 1. After a sigmoid, this output is multiplied with the layer's input feature map. To create the final generated map, the output feature map of the first layer is processed by the SimAM module, the two outputs are added, and channel concatenation is conducted with the module's input feature map. The multi-scale convolutional attention module can change the allocation of resources according to the importance of the target, so that the neural network focuses more on the pixel areas that strongly affect segmentation and ignores irrelevant areas; concretely, resources lean toward houses and rivers. By taking both 2D attention and 3D attention into account, it solves the problem that ordinary attention modules cannot attend to channel and space in land cover images at the same time, and it enhances the effectiveness of the entire land cover semantic segmentation.
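One possible reading of this two-layer design is sketched below: a channel-attention layer and a spatial-attention layer, each paired with a SimAM branch whose output is added back in, followed by a final channel concatenation with the module input. The reduction rate r, the channel counts, and the exact wiring are assumptions; SimAM is the sketch above.

```python
# Sketch of the multi-scale convolutional attention module (MSCAM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCAM(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.reduce = nn.Conv2d(2 * c, c, 1)   # after concat with pyramid features
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c // r, c, 1))   # shared two-layer MLP
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)        # 7x7 spatial conv
        self.simam = SimAM()

    def forward(self, x, pyramid):
        x = self.reduce(torch.cat([x, pyramid], dim=1))     # 1x1 conv reduction
        # layer 1: channel attention (avg + max descriptors) plus a SimAM branch
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        y = x * ca + self.simam(x)
        # layer 2: spatial attention over channel statistics plus a SimAM branch
        sa = torch.sigmoid(self.spatial(torch.cat(
            [y.mean(dim=1, keepdim=True), y.amax(dim=1, keepdim=True)], dim=1)))
        z = y * sa + self.simam(y)
        return torch.cat([z, x], dim=1)         # concat with the module input
```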
2.6. Visual Contrast of Different Modules
The heat map depicts the model's focus on various areas of the image. After the red zone, the orange and green regions are next in importance, and the blue region typically does not require attention. Note that the heat map is not equivalent to the final segmentation result. Specifically, on the land cover dataset, housing construction and water areas are the areas of greatest concern, shown in red. Areas of secondary interest include building junctions and water edge details, depicted in yellow and green, respectively. Additionally, the model must account for the fact that complex background details such as trees and traffic can affect the segmentation [42]. The corresponding images are displayed in Figure 6.
Figure 6b,c show the visual effect on buildings and water areas without and with MFFM, respectively. By comparison, we find that the model with MFFM identifies targets more clearly and focuses more on the buildings and the water areas. Without MFFM, the segmentation results are more blurred and, as with the other modules, attention to housing construction and the water area is insufficient. The heat maps demonstrate this effect well.
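The paper does not detail how its heat maps are rendered; the sketch below shows one common way to produce such a visualization, an assumption rather than the authors' pipeline: channel-mean feature activations are normalized to [0, 1] and overlaid on the image with a red-to-blue colormap, so red marks the most-attended regions.

```python
# Sketch of a feature-activation heat-map overlay (assumed visualization).
import numpy as np
import matplotlib.pyplot as plt

def show_heatmap(image, features, alpha=0.5):
    """image: H x W x 3 float array in [0, 1]; features: C x h x w activations."""
    act = features.mean(axis=0)                          # collapse channels
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    # nearest-neighbor resize to the image grid (assumes sizes divide evenly)
    ry, rx = image.shape[0] // act.shape[0], image.shape[1] // act.shape[1]
    act = np.kron(act, np.ones((ry, rx)))
    heat = plt.get_cmap("jet")(act)[..., :3]             # red = high attention
    plt.imshow((1 - alpha) * image + alpha * heat)
    plt.axis("off")
    plt.show()
```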
5. Summary
Land cover semantic segmentation, a crucial landmark task for remote sensing images, is a key component of processing high-resolution remote sensing imagery. It is very practical in land resource protection planning, geographical classification, and surveying and mapping analysis. This research proposes a multi-scale feature aggregation network for real-time semantic land cover segmentation in the area of deep learning image segmentation. To address classic convolutional neural networks' poor generalization capacity and low accuracy in semantic land cover segmentation, the network draws on the benefits of convolutional neural networks in deep learning and uses a modified ResNet50 as the backbone to extract image feature information. To combine the context information from many regions, a multi-scale feature extraction space pyramid module is proposed; it can efficiently and reliably categorize regions of any scale, enhancing the ability to gather global information. A multi-scale feature fusion module is also proposed. Low-level features are more detailed and have better resolution, but their semantics are weaker and their noise greater because they pass through fewer convolutions; high-level features have weaker resolution and poorer detail perception but stronger semantic information. Fusing high-level and low-level features helps extract more information from the image. Finally, a multi-scale convolutional attention module is proposed, which pays more attention to the buildings and rivers and reduces the interference of complex backgrounds such as trees and roads. Compared with traditional convolutional neural network models, the model in this paper greatly improves the accuracy of real-time semantic segmentation of land cover and quickly captures details such as small tributaries and the edge contours of houses. The experimental results demonstrate that the mean intersection over union (MIoU) of this method reaches 88.05% on the land cover dataset. The generalization ability of the network is also very strong: it reaches 96.06% on the water segmentation dataset.