1. Introduction
Semantic segmentation occupies an important position in the field of computer vision. Its main task is to assign the correct semantic category label to each pixel in an image. Thanks to deep neural networks, semantic segmentation has been extensively researched and developed in recent years, and it has been applied to many fields, such as robot navigation, autonomous driving [1], medical image analysis, virtual reality, and agricultural model analysis [2,3]. Since the introduction of FCN [4], deep convolutional networks have been the dominant strategy. FCN performs pixel-level classification of images, that is, it classifies each pixel, thereby solving semantic-level image segmentation. FCN can accept input images of any size and uses deconvolution layers to upsample the feature map of the last convolutional layer, restoring it to the same size as the input image so that a prediction can be generated for each pixel while preserving the spatial information of the original input. Improved networks based on fully convolutional networks (FCNs) have achieved good results. However, the traditional model structure provides only a small range of contextual information; the receptive field is small, and this limitation prevents the segmentation accuracy from reaching expectations.
To address the limitations of fully convolutional networks, Chen et al. proposed the atrous spatial pyramid pooling model in the DeepLab series [5], which uses multiscale dilated convolutions to aggregate contextual information. Zhao et al. proposed a pyramid pooling model to capture contextual information [6]. However, atrous convolution only captures surrounding information and cannot effectively model global or dense contextual information, while pyramid-pooling-based methods cannot adaptively gather context information.
To aggregate dense context information at the pixel level, nonlocal networks [7] use a self-attention mechanism that weights the pixel features at each location against the pixel features of the whole image to obtain long-distance dependencies. Although this method achieves good results in visual tasks, it requires a huge attention map to calculate the relationship between each pixel pair. Its complexity of O(N × N) leads to long computation time and large memory occupation, where N is the size of the input feature map or the number of channels. Based on this idea, Fu et al. proposed a dual-attention network with complementary positional attention and channel attention [8], which generates a similarity matrix by calculating the relationship between each pixel in the feature map and every pixel in the whole image. However, not all pixels are correlated. Pixel similarity is usually computed by matrix multiplication, which yields positive correlation weights, yet some targets in real scenes are unrelated or even mutually exclusive, such as cars and sky, or sky and roads. Calculating the similarity between each pixel and all pixels of the whole image is therefore not helpful for segmenting some objects, and its computation and storage complexity is large.
To solve the above problems, we propose a lightweight and effective positional attention module. Our motivation is to replace the traditional single densely connected graph with two sparsely connected graphs: unlike existing networks, which weight each pixel feature against all pixels of the feature map, we only focus on the dependencies between pixels and their neighboring pixels. Specifically, we first attend to the relationship between each pixel in the input feature map and the adjacent pixels in the same column to obtain a column attention map, and assign the column attention weights to the input feature to obtain the column attention feature. Second, we calculate the similarity between each feature pixel and the adjacent pixels in the same row, and assign the row attention weights to the column attention feature, thus forming global positional attention. We name this the Adjacent Position Attention Module (APAM). This strategy greatly reduces model complexity and computation time; the number of attention-map parameters is reduced from (H × W) × (H × W) to approximately (H × W) × (H + W). A simplified implementation sketch is given below.
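To make the computation concrete, the following is a minimal PyTorch sketch of the column-then-row attention idea. For simplicity it attends over the full column and row rather than reproducing the edge-to-middle adjacent pattern described above, and the layer names and channel reduction ratio are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentPositionAttentionSketch(nn.Module):
    """Illustrative sketch of column-then-row positional attention.
    Layer names and the reduction ratio are assumptions for illustration."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Column pass: each pixel attends only to pixels in its own column (<= H weights).
        q_col = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)    # (B*W, H, C')
        k_col = k.permute(0, 3, 1, 2).reshape(b * w, -1, h)    # (B*W, C', H)
        v_col = v.permute(0, 3, 2, 1).reshape(b * w, h, -1)    # (B*W, H, C)
        attn = F.softmax(torch.bmm(q_col, k_col), dim=-1)      # (B*W, H, H)
        out = torch.bmm(attn, v_col).reshape(b, w, h, c).permute(0, 3, 2, 1)

        # Row pass on the column-attended features (<= W weights per pixel).
        q_row = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)    # (B*H, W, C')
        k_row = k.permute(0, 2, 1, 3).reshape(b * h, -1, w)    # (B*H, C', W)
        v_row = out.permute(0, 2, 3, 1).reshape(b * h, w, c)   # (B*H, W, C)
        attn = F.softmax(torch.bmm(q_row, k_row), dim=-1)      # (B*H, W, W)
        out = torch.bmm(attn, v_row).reshape(b, h, w, c).permute(0, 3, 1, 2)

        return self.gamma * out + x  # residual connection
```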
We compared the difference between the nonlocal block and the adjacent position attention block, as shown in Figure 1. When a local feature map is input, an attention map is obtained through matrix transformations and operations, and the weights in the attention map are then assigned to the local feature V, so that each value of the local feature receives a global weight, thereby producing a global context feature output. We improve the module by changing how the attention map is calculated. The number of attention weights for each pixel in Figure 1a is H × W, whereas each pixel in Figure 1b has at most N weights, where N is H or W; this differs from the dense connections adopted by the nonlocal module. As shown in Figure 1b, each position in the feature map is sparsely connected only with positions that are its neighbors in the same column or the same row. Taking the five pixels on the diagonal as an example, the first pixel (red) only attends to three adjacent pixels in the same column (the first column), the second pixel (yellow) attends to four adjacent pixels in that column, and the middle pixel (blue) attends to all pixels in its column; the number of attended pixels increases linearly from the edge to the middle. As a result, the predicted attention map has only about H + W weights per pixel rather than the H × W per pixel of the nonlocal module.
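As a quick back-of-the-envelope check (the 96 × 96 map size below is a hypothetical example, e.g., a 768 × 768 crop at output stride 8):

```python
# Attention-weight counts for one feature map: dense nonlocal vs. column + row.
H, W = 96, 96                              # hypothetical feature map size
nonlocal_weights = (H * W) ** 2            # every pixel attends to every pixel
adjacent_weights = H * W * (H + W)         # <= H (column) + <= W (row) per pixel
print(f"nonlocal: {nonlocal_weights:,}")   # 84,934,656
print(f"adjacent: {adjacent_weights:,}")   # 1,769,472
print(f"reduction: {nonlocal_weights // adjacent_weights}x")  # 48x
```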
In semantic segmentation, low-level features have higher resolution with more positional and detail information, while high-level features have a large receptive field with rich semantic and category information; the advantages of the two should be combined. Earlier methods used skip connections for fusion [4]; Chen et al. [9] merged the features after atrous spatial pyramid pooling with low-level features; U-Net [10] and SegNet [11] use encoder-decoder structures for fusion. These fusion methods are simple, and their effect is limited. Later, DFN [12] and PAN [13] used the global average pooling of high-level features to guide the fusion of low-level features. Global average pooling is suitable for extracting the feature information of large target objects, but small target objects are easily confused with their surroundings. To balance the fusion of high- and low-level features, we propose a cross-dimensional interactive attention mechanism. It captures rich feature representations across dimensions through three branches: the first two branches capture the dependencies between the channel dimension (C) of the input feature and the spatial dimensions (H × W), and the third branch is a spatial attention mechanism. High-level and low-level feature maps are jointly input into the cross-dimensional interactive attention model, so that the detail information of low-level features and the semantic information of high-level features are combined, yielding good semantic segmentation results. A sketch of this mechanism follows.
In summary, our contributions are as follows:
To capture long-distance dependencies, we propose the APAM and combine it with a channel attention module to form a new dual-attention model (NDAM), which is lighter and more effective than DANet [8].
To obtain a better semantic segmentation effect, we designed a cross-dimensional interactive attention feature fusion module (CDIA-FFM) for fusing features from different stages in the decoder.
Combining NDAM and CDIA-FFM, we propose a new network, MANet, for semantic segmentation. It obtains good results on the two benchmarks PASCAL VOC 2012 and Cityscapes.
The rest of the paper is organized as follows. We review related work in
Section 2 and describe the APAM and CDIA-FFM in detail in
Section 3, and then introduce the entire network framework. In
Section 4, we present the ablation experiments, comparison experiments, and an analysis of the experimental results.
Section 5 focuses on conclusions and future work suggestions.
4. Experiments
To verify our proposed method, we conducted evaluation experiments on two authoritative semantic segmentation datasets: the PASCAL VOC 2012 dataset [31] and the urban street-scene dataset Cityscapes [32]. All ablation experiments were performed on the PASCAL VOC 2012 dataset.
4.1. Datasets and Evaluation Criteria
PASCAL VOC 2012: As one of the semantic segmentation benchmark datasets, it is often used in comparative experiments and for evaluating network models. It contains 20 object categories and a background category. The training set contains 1464 images, the validation set 1449 images, and the test set 1456 images.
Cityscapes: The dataset contains 5000 images of street scenes from 50 different cities, with high-quality pixel-level labels for 19 semantic classes plus a background. Each image is 2048 × 1024 pixels. In total, 2975 images are used for training, 500 for validation, and 1525 for testing.
There are three main criteria for measuring the accuracy of image semantic segmentation: pixel accuracy, mean pixel accuracy, and mean IOU (mIOU). Our experiments use the common mIOU criterion, which calculates the IOU for each category separately and then averages over all categories. In semantic segmentation there are two sets, the ground-truth labels and the predicted values; the IOU is the ratio of the intersection to the union of these two sets.
Assuming there are k + 1 classes (including a background), let p_{ij} denote the number of pixels that belong to class i but are predicted to be class j. Then p_{ii} represents the number of correct predictions, and p_{ij} and p_{ji} are interpreted as false positives and false negatives, respectively. The mIOU is computed as

mIOU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}.
4.2. Implementation Details
Training settings: We build our network in the PyTorch 1.6 environment and train it on a server with 2 NVIDIA GeForce GTX 1080Ti GPUs. Mini-batch Stochastic Gradient Descent (SGD) is used for training. We adopt the "poly" learning rate policy, in which the initial learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9. We perform data augmentation by randomly scaling the input images (from 0.5 to 2.0) and randomly flipping them horizontally during training. For Cityscapes, the initial learning rate and weight decay are 0.01 and 0.0001, respectively, and high-resolution patches (768 × 768) are randomly cropped from the resulting images; we train for 60 epochs with a batch size of 4. For PASCAL VOC 2012, the initial learning rate and weight decay are 0.007 and 0.0005, respectively, the crop size is 513 × 513, the batch size is 8, and training is set to 50 epochs.
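The schedule can be sketched as follows (a minimal sketch: the model and iteration count are placeholders, and the SGD momentum of 0.9 is our assumption, not stated in the paper):

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'poly' policy: multiply the base LR by (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

model = torch.nn.Conv2d(3, 21, 1)  # placeholder standing in for the real network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.007,            # PASCAL VOC 2012 setting
                            momentum=0.9,        # our assumption
                            weight_decay=0.0005)

max_iter = 10_000  # placeholder: epochs * iterations per epoch
for it in range(max_iter):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(0.007, it, max_iter)
    # ... forward pass, cross-entropy loss, backward(), optimizer.step() ...
```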
4.3. Results on PASCAL VOC 2012
4.3.1. New Dual-Attention Model
We place NDAM at the end of the ResNet-101 base network to capture the spatial position dependencies of features and the correlations between channels. It significantly improves the segmentation results by modeling rich contextual correlations on local features. To verify the effectiveness of NDAM, we conducted the corresponding comparison and ablation experiments.
For spatial correlation modeling of the acquired features, we conducted five sets of comparative experiments, as shown in Table 1; the effect achieved differs depending on the position attention mechanism used. The first and second sets of experiments compare our proposed adjacent position attention with traditional spatial attention. In the third set, positional attention is obtained by calculating the similarity between every pixel and all pixels of the whole image. In the fourth set, it is obtained by calculating the similarity between each pixel and all pixels in the same column and in the same row in turn. In the fifth set, it is obtained by calculating, in sequence, the similarity between each pixel and its adjacent pixels in the same column and in the same row. The fifth configuration achieves the best effect, with an mIOU of 73.57%.
Table 2 compares the parameter count and computational complexity of NDAM and DANet: NDAM has 1.033 M fewer parameters and 1.036 G fewer FLOPs than DANet. Combining Table 1 and Table 2, under our experimental conditions NDAM performs 0.18% better than DANet while using fewer parameters.
To understand our model more deeply, we visualized the effect of semantic segmentation, as shown in
Figure 6. The first and second columns show the original image and the label image, respectively, and the third and fourth columns show the DANet and NDAM segmentation results. In the first image, our model produces a more complete contour of the rider. In the second image, the shape of the bicycle is more distinct. In the third image, our model produces fewer red mis-segmented pixels. This verifies the effectiveness of NDAM, which makes the boundaries and contours of target objects in the segmentation results more complete. Overall, the NDAM segmentation results are better.
4.3.2. Cross-Dimensional Interactive Attention Feature Fusion Module
The CDIA-FFM can not only capture the dependencies between dimensions but also highlight both low-level and high-level features. We conducted an ablation experiment on the convolution kernel sizes in the cross-dimensional interactive attention feature fusion module, as shown in Table 3. Three convolution kernels are set, one in each of the three fusion modules. The network performs best when the kernel sizes are set to 3, 5, and 7, respectively. The kernel size is chosen according to the size of the feature map in each fusion module and increases with the feature map size.
To obtain a better segmentation effect, we combined the NDAM and CDIA-FFM to form MA-FFNet, as shown in
Table 4. We compare MA-FFNet with several existing semantic segmentation networks: DANet [8], DeepLabv3+ [9], OCRNet [17], EfficientFCN [33], and BiSeNet V2 [34]. Our segmentation result is 3.1% higher than that of the base network and higher than the comparison networks under the same experimental conditions, demonstrating that our network is effective. We randomly visualized several segmentation results; as shown in Figure 7, the segmentation effect of our network is clearly better. Comparing the chairs, bicycles, and sheep legs in the figure confirms the effect of our segmentation network.
4.4. Results on Cityscapes
To further verify the effectiveness of our network, we conducted verification and comparison experiments on the very challenging Cityscapes dataset. As shown in Table 5, we achieved very good results: an mIOU of 72.8%, which is much higher than the results of the comparison experiments and 6.19% higher than the base network.
We visualized the segmentation results on the Cityscapes dataset. As shown in Figure 8, telephone poles and traffic signs are segmented best by our network, which illustrates the effectiveness of the Adjacent Position Attention Module: calculating the similarity between a pixel and its adjacent pixels in the same column allows such long, thin target objects to receive attention. One disadvantage of our network is that it can segment parts of some objects, for example the head of the Mercedes-Benz car, into other targets in the image.
4.5. Comparison with Transformer Methods
With the development of computer vision, vision transformers have been used effectively in semantic segmentation, so we compare our proposed segmentation network with an existing transformer method. As shown in Table 6, the performance of our method is compared with SegFormer [35]. SegFormer has five backbone networks with different parameter counts; MiT-B2, whose parameter count is similar to that of our method, is used as the backbone. Under the same number of training epochs but different training strategies (SegFormer uses its original training strategy in the comparison experiment), our method performs 0.6% higher than SegFormer on the PASCAL VOC 2012 dataset but lower than SegFormer on the Cityscapes dataset. The main reason is that our network is not pretrained; if our method were effectively pretrained, its semantic segmentation performance should be comparable to that of the transformer-based segmentation network.