1. Introduction
Semantic segmentation is the task of predicting the category of each pixel in an image. It is a very popular direction in computer vision and can be applied to many practical tasks, such as city scene understanding [1,2,3,4,5], satellite topographic surveys [6,7], and medical image analysis [8,9,10]. The end-to-end, pixel-to-pixel network structure introduced by Fully Convolutional Networks (FCN) for semantic segmentation [11] performs well in this dense pixel prediction task, and this structural foundation has accelerated the development of semantic segmentation. So far, many excellent networks have been developed [12,13,14,15].
In FCN [11], the multilayer convolution and pooling structure results in a 32-fold reduction of the final feature map relative to the input image. This design loses a great deal of spatial information, resulting in inaccurate predictions, especially at the edge details of the image. Many networks have tried various methods to solve this problem. For example, the atrous convolution applied in DeepLabV3 [16] enlarges the receptive field without reducing the size of the feature map, and the parallel atrous convolution structure (Atrous Spatial Pyramid Pooling, ASPP) [2,16] can improve segmentation results when added to most segmentation networks. Additionally, the encoder–decoder network structure [1,8] is a common countermeasure to the above loss of spatial information. In an encoder–decoder network, the backbone of a classification network is often used as the encoder [11,13,17], which encodes the input image into feature maps with low resolution but rich semantic information. The decoder, usually designed as a series of convolution and up-sampling operations, then restores the low-resolution features to obtain a pixel-level category prediction of the same size as the original image. Because directly up-sampling low-resolution feature maps cannot recover the spatial detail lost in the encoder, decoders often incorporate low-level features into the up-sampling path to capture fine-grained information. A typical structural design is DeepLabV3plus [2].
Carefully analyzing most existing segmentation networks, it is not difficult to find that there are usually two ways to integrate low-level features in decoders: concatenation and element-wise addition. Element-wise addition sums features with the same size and number of channels, implicitly imposing a prior that corresponding channels encode comparable information. The resulting feature can reflect some unique information of the original features, but some information contained in them is inevitably lost in the process. Concatenation splices feature maps of the same size along the channel dimension, after which each channel is processed by its own convolution kernels. Addition is cheaper than concatenation, but, computation aside, concatenation loses no information. Therefore, to obtain better prediction results, concatenation is the more commonly used fusion method in semantic segmentation [2,8,18]. However, this also leads to problems. Decoders usually combine deep features with shallow features, and these two kinds of features carry very different types of information. The convolution output after direct concatenation is not well integrated, resulting in low information utilization. As shown in Figure 1, the lower-level features contain more simple spatial and line information, while the higher-level features contain rich semantic information. The convolution output after concatenation shows that the spatial information in the semantic features has been greatly optimized. However, the overall features are still mixed, so that neither the semantic nor the spatial characteristics are prominent. This inevitably leads to inaccurate, fuzzy predictions in pixel-level segmentation tasks.
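To make the two fusion operations concrete, the following minimal PyTorch sketch contrasts them; the tensor shapes are illustrative assumptions, not values from our network.

```python
import torch

# low and high are feature maps of the same spatial size; for element-wise
# addition they must also share the channel count.
low = torch.randn(1, 256, 64, 64)   # shallow feature: spatial/line details
high = torch.randn(1, 256, 64, 64)  # deep feature: semantic information

fused_add = low + high                     # addition: cheap, but mixes information irreversibly
fused_cat = torch.cat([low, high], dim=1)  # concatenation: lossless, but doubles the channels
print(fused_add.shape, fused_cat.shape)    # (1, 256, 64, 64) and (1, 512, 64, 64)
```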
Inspired by the above observations, we designed an inter-level feature balanced fusion module to fuse inter-level features in a more balanced and efficient manner, addressing the inefficient feature utilization and weak targeting of the conventional fusion methods of concatenation and element-wise addition. The core idea of this module draws on the differences between spatial and semantic features. In the two-level feature fusion process, the correlation between the two levels is computed in the channel dimension after channel-wise concatenation. Spatial weights for the spatial feature and semantic weights for the semantic feature are then assigned per channel using a normalized score, and these weight parameters are continuously updated and optimized during back propagation [20]. Compared with previous feature fusion methods, we explicitly account for the differences in information between the fused features. Our feature balanced fusion module lets the network learn how to integrate the features, guiding the two kinds of information to interact in balance, each playing its role, and ultimately contributing to better segmentation predictions.
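A minimal sketch of how such a module could look, based on this description alone; the exact layer configuration is an assumption, and Section 3.2 gives the actual design.

```python
import torch
import torch.nn as nn

class BalancedFusion(nn.Module):
    """Sketch of an inter-level feature balanced fusion module: concatenate
    the two levels, derive per-channel weights via global average pooling and
    a normalized score, and reweight each branch before fusing."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                            # channel statistics over spatial dims
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),                                       # normalized scores in (0, 1), learned by backprop
        )

    def forward(self, spatial_feat, semantic_feat):
        cat = torch.cat([spatial_feat, semantic_feat], dim=1)
        w = self.score(cat)                                     # (N, 2C, 1, 1) channel weights
        w_sp, w_se = torch.chunk(w, 2, dim=1)                   # spatial vs. semantic weights
        return torch.cat([spatial_feat * w_sp, semantic_feat * w_se], dim=1)

# usage: fuse = BalancedFusion(256); out = fuse(low, high)  # out has 512 channels
```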
Overall, our network is based on the encoder–decoder structure. A deep convolutional backbone extracts rich semantic information, followed by an Atrous Spatial Pyramid Pooling (ASPP) [16] structure that extracts multi-scale features to obtain rich contextual information without reducing the feature resolution. The decoder is designed with skip-connections, which restore the high-level feature resolution while fusing with low-level features, adding spatial information to the semantic information and gradually recovering the segmentation boundaries. To obtain more spatial information, we also designed a shallow spatial stream branch and fused it with the features of the preceding level. At each level of the fusion structure, we applied the feature balanced fusion module to balance the fusion of features from different parts.
Our contributions can be summarized as follows:
An inter-level feature balanced fusion module was designed to solve the feature imbalance caused by traditional concatenation or element-wise addition, making the fusion more balanced and the utilization of features more effective.
A shallow spatial stream with only three convolution layers was designed and added to the network; it is fused with the main semantic features before the decoder outputs the prediction, further enriching the spatial information.
Our IFBFNet achieved a competitive performance of 81.2% mIoU on the Cityscapes dataset using only the finely annotated data for training, significantly improving over the baselines.
4. Experiment
To verify the effectiveness of our proposed modules, we carried out a number of experiments on the Cityscapes dataset [19]. Our network achieved competitive performance and improved greatly over the baseline networks. Additionally, we performed some visual contrast experiments to demonstrate the effectiveness of our module.
The Cityscapes dataset, jointly provided by three German institutions including Daimler, contains stereo vision data from more than 50 cities for urban scene understanding, with 19 classes used for urban scene parsing and pixel-level segmentation. It contains 2975 finely labeled images for training, 500 for validation, and 1525 for testing, plus an additional 20,000 coarsely labeled images for training. It is worth noting that all our experiments were conducted on the finely annotated set only.
4.1. Implementation Details
Our network is implemented in PyTorch. Following the learning-rate settings of previous work [13,16,17], we adopted the poly learning rate policy, where the learning rate of the current iteration is obtained by multiplying the initial learning rate by the factor $(1 - \frac{iter}{iter_{max}})^{power}$. We used a Stochastic Gradient Descent (SGD) optimizer [40] to optimize the network parameters. For the Cityscapes dataset, we set the initial learning rate to 0.01, the weight decay coefficient to 0.0005, and the momentum to 0.9. During training, we set the learning rate of the coarse segmentation output branch and the feature balanced fusion module to 10 times the base rate, and that of the remaining parts to one time. The loss function is shown in Equation (8), and the auxiliary loss weight was set to 0.4 to achieve the best fusion effect. The OHEM [39] loss was used as the category loss of the network to purposefully improve the learning of difficult samples. In training, we replaced all BatchNorm layers with InPlaceABN-Sync [42]. For data augmentation, the input images were randomly cropped to 876 × 876 during training and flipped horizontally. All experiments were performed on two Nvidia GTX 1080Ti GPUs. The total number of training iterations was set to 81k, with the first 1k iterations used as a warmup.
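For clarity, a minimal sketch of the poly policy with the hyperparameters above; the power value of 0.9 is an assumption, being the setting commonly used in the cited works.

```python
# Poly learning-rate policy: lr decays from base_lr to 0 over max_iter steps.
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    return base_lr * (1 - cur_iter / max_iter) ** power

base_lr, max_iter = 0.01, 81000
print(poly_lr(base_lr, 0, max_iter))      # 0.01 at the start
print(poly_lr(base_lr, 40500, max_iter))  # ~0.0054 halfway through
```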
4.2. Experimental Results
Applying our proposed methods and some common training techniques, and following the implementation details described in Section 4.1, our network reached 81.2% mIoU on the Cityscapes validation set after training only on the finely annotated training set. Compared with the base network DeepLabV3plus [2], this is an improvement of nearly 1.3%.
Table 1 details our per-class results alongside other advanced networks. Comparing the per-class IoUs, it is not difficult to find that our results are greatly improved in most classes.
4.3. Ablation Study
In this section, we present a series of comparative experiments performed to demonstrate the validity of our proposed modules.
4.3.1. Baseline Network
Specifically, we set up two baseline networks as the basis of our experiments: one is ResNet101 [27], a very standard feature extraction network, and the other is ResNet101 with ASPP added at the end. We made several experimental comparisons based on these two baselines.
Baseline1: ResNet101-Dilated. To make ResNet101 more suitable for semantic segmentation, we set the dilation of the last two stages of ResNet to [1,2] and their strides to [2,1]. The output stride of the network is thus 16, so the final feature map is 1/16 the input image size. We pre-trained the network on the ImageNet dataset and then fine-tuned the parameters on the Cityscapes dataset; after ImageNet pre-training, subsequent training on the segmentation dataset converged faster.
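For illustration, one way to obtain such a dilated backbone with torchvision's stock ResNet-101 is sketched below; this is an assumption about the implementation, not our actual training code.

```python
import torchvision

# replace_stride_with_dilation applies to (layer2, layer3, layer4); setting
# True only for layer4 converts its stride 2 into dilation 2, yielding an
# output stride of 16 (matching the dilation [1,2] / stride [2,1] setting).
backbone = torchvision.models.resnet101(
    weights="IMAGENET1K_V1",                        # ImageNet pre-training, as in the paper
    replace_stride_with_dilation=[False, False, True],
)
```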
Baseline2: ResNet101+ASPP. As detailed in Section 2.2, ASPP is a module that has achieved tremendous success in semantic segmentation tasks due to its delicate design. To verify that our proposed module can also work with other modules to improve network performance, we also set up this ResNet101 + ASPP baseline. Its structure is composed of a global average pooling branch and three 3 × 3 dilated convolutions (dilation rates of 6, 12, and 18), which extract context information at multiple scales. The features from the four branches are concatenated, and a 1 × 1 convolution then reduces the channel dimension of the concatenated map to 256.
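For reference, the following is a minimal sketch of the four-branch ASPP just described; BatchNorm and activation layers are omitted for brevity, and the exact configuration may differ from our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Global average pooling plus three 3x3 dilated convolutions
    (rates 6, 12, 18); the four branches are concatenated and reduced
    to 256 channels by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (6, 12, 18)
        )
        self.project = nn.Conv2d(4 * out_ch, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats = [pooled] + [b(x) for b in self.branches]
        return self.project(torch.cat(feats, dim=1))
```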
Inter-Level Feature Balanced Fusion Network. The structural design of the inter-level feature balanced fusion network is introduced in detail in Section 3.2. To make the best use of this module where appropriate, we applied it to each of the two inter-level feature fusion processes in the network. Since the inter-level weights are computed with a global average pooling operation, the computational cost of this part is small (relative to [34,35,37]). We use the module once to balance the differences between the features from ASPP and the low-level features from stage one of the backbone network. It is applied a second time to fuse the feature maps of a specially designed spatial branch with those of the preceding level of the backbone.
Inter-Level Feature Balanced Fusion Network with Spatial Stream. The spatial stream was designed to be relatively simple so as to obtain more spatial information from the input image; it consists of only three convolution layers, with the number of intermediate feature map channels mimicking the first two layers of the ResNet structure. We set the number of intermediate channels to 64. However, because it was unclear how much spatial information should be retained, we performed a set of comparative experiments with spatial stream branches of different output channel counts. Based on these results, we finally set the number of spatial stream output channels to 128.
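For concreteness, a minimal sketch of such a three-layer stream is given below; the strides and normalization layers are illustrative assumptions, as the text specifies only the channel counts.

```python
import torch.nn as nn

# Three-layer spatial stream: 64 intermediate channels (mimicking ResNet's
# first layers) and 128 output channels, the best setting found in Table 3.
spatial_stream = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
)
```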
4.3.2. Ablation Study for Inter-Level Feature Balanced Fusion Module
First, we used Baseline1 (the dilated ResNet-50 and ResNet-101) as the baseline networks for the following series of experiments, with the corresponding output directly up-sampled. The results are shown in Table 2, where we compare the two backbone networks of different depths. To improve network performance, we added the ASPP module at the end of the network, which improved the ResNet50 series by 2.80% and the ResNet101 series by 2.46% compared with the baseline network.
To verify the effectiveness of our proposed inter-level feature balanced fusion module, we added inter-level skip-connections to the baseline network ResNet + ASPP. One fusion method for the skip-connection is common concatenation, corresponding to "Skip-Connection" in Table 2; the other is our feature balanced fusion method. From the experimental data in Table 2, adding the skip-connection improved ResNet50 and ResNet101 by 2.06% and 2.40%, respectively. The inter-level feature balanced fusion method further improved over the plain fusion method by 2.43% and 1.19% on ResNet50 and ResNet101, respectively.
4.3.3. Ablation Study for Spatial Stream
In this section, we further analyze the importance of the spatial information stream for enhancing the experimental results. In Table 2, adding the spatial stream improved ResNet50 and ResNet101 by 0.34% and 0.11%, respectively. To obtain the most appropriate spatial stream design, we adjusted the amount of spatial information it carries by varying the number of final output channels of its convolution series (16, 32, 48, 64, 128, 256), keeping all other settings the same. Six sets of comparative tests were performed on the Cityscapes validation set. As shown in Table 3, the best prediction results were obtained with 128 final output channels.
4.3.4. Ablation Study for Improvement Strategies
As in [16,35,37], we also used similar improvement strategies, as shown in Table 4. (1) OHEM (online hard example mining): we focused the training on difficult samples, which is reflected in the loss function in Section 3.4. We classified samples whose predicted probability for the correct class was below the threshold of 0.7 as difficult samples. With OHEM, the performance on the Cityscapes validation set improved by 1.48%. (2) DA (data augmentation with random scaling): during training, we randomly scaled the input images by a factor of 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, or 2.0, which improved the results by 1.07%. (3) MS (multi-scale inference): we averaged the predictions over six scales of the input image (0.5, 0.75, 1.0, 1.25, 1.5, and 1.75 times) to obtain the final output, improving the results by 1.21%. The final performance was 81.2% mIoU on the Cityscapes validation set, about 1.3% higher than the 79.9% of DeepLabV3plus.
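As a concrete illustration, the following is a minimal sketch of an OHEM-style cross-entropy consistent with the description above (threshold 0.7); the exact implementation in [39] and in our training code may differ.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, thresh=0.7, ignore_index=255):
    """Keep only 'hard' pixels, i.e. those whose predicted probability for
    the ground-truth class falls below the threshold."""
    with torch.no_grad():
        prob = F.softmax(logits, dim=1)
        valid = target != ignore_index
        idx = target.clamp(0, logits.shape[1] - 1).unsqueeze(1)
        gt_prob = prob.gather(1, idx).squeeze(1)   # probability of the correct class per pixel
        hard = valid & (gt_prob < thresh)
    loss = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction="none")
    return loss[hard].mean() if hard.any() else loss[valid].mean()
```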
We also added auxiliary supervision with a loss identical in form to Equation (8). Since the weight with which the auxiliary loss is combined with the main loss is uncertain, and a proper setting can lead to better results, we conducted a set of comparative experiments. Keeping all other parameters consistent, we varied the auxiliary loss coefficient from 0.1 to 1.0. The experimental results are shown in Table 5. After comparing the data, we determined that the best performance on the Cityscapes validation set was obtained with the auxiliary loss weight set to 0.4.
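For reference, the combined objective implied by this description plausibly takes the form below, with $\lambda$ the auxiliary weight discussed here (the exact Equation (8) is defined in Section 3.4):

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{main}} + \lambda\,\mathcal{L}_{\mathrm{aux}}, \qquad \lambda = 0.4
```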
4.4. Visualization of Inter-Level Feature Balanced Fusion Module
In the previous parts, we introduced the network structure design and the experimental comparisons. To show the role of our inter-level feature balanced fusion module in the network more vividly, we input an image and visualize the feature maps of different levels as the network processes it. The visualization results are shown in Figure 5. The second row shows the low-level features of the network (in our case, from the first stage of the backbone), which mainly contain shallow spatial information and some line outlines. The third row shows the deep features, which contain rich semantic information, and the fourth row shows the fusion of the two levels obtained by ordinary concatenation. The result of inter-level feature balanced fusion is shown in the last row. The comparison demonstrates that our fusion results better integrate the two parts of features from different levels, carrying clear line contour information while containing rich semantic information.
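For readers who want to reproduce this kind of figure, a minimal sketch using a forward hook to display a feature map as its channel-wise mean; `model` and `image` are assumed to be defined, and `model.layer1` is a placeholder for the stage of interest.

```python
import torch
import matplotlib.pyplot as plt

feats = {}
def hook(module, inp, out):
    feats["map"] = out.detach()   # capture the stage's output feature map

handle = model.layer1.register_forward_hook(hook)  # assumes `model` is defined
with torch.no_grad():
    model(image.unsqueeze(0))                      # `image` is a (3, H, W) tensor
handle.remove()

plt.imshow(feats["map"][0].mean(dim=0).cpu(), cmap="viridis")
plt.axis("off")
plt.show()
```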
4.5. Comparing with the State-of-the-Art
On the Cityscapes test set, we further compared our proposed network with recently published methods. We input the original test set images into the trained model to obtain predictions meeting the requirements of the official test protocol; in this inference process, we applied multi-scale prediction and flipping strategies to further improve performance. We packaged and uploaded the results to the official evaluation server and obtained the corresponding results, shown in Table 6. Compared with other previous methods, our network achieved better performance: the mIoU of our network on the Cityscapes test set exceeded that of the replicated DeepLabV3plus [2] by 1.1%. In the training process, we used only the finely annotated data of the Cityscapes dataset to train our network, which further demonstrates the effectiveness of the proposed method.
In Figure 6, we provide a comparison of the predictions of our network and the baseline network, visualizing some test samples from the Cityscapes validation set for both the baseline network and IFBFNet. Areas where the results differ significantly are marked with yellow lines. From these images, we find that IFBFNet significantly improves the prediction of object boundaries, and small objects, such as power poles, are also predicted well and maintain the correct shape.
5. Conclusions
In this paper, IFBFNet is proposed for scene segmentation, addressing the problem that inter-level feature fusion in semantic segmentation produces mixed and unclear information. Specifically, we adopt an adaptive feature fusion strategy that lets the network itself learn to integrate low-level spatial information and high-level complex semantic information more efficiently. At the same time, we add a shallow spatial information stream to increase the amount of spatial information. A series of ablation experiments and visualizations of intermediate features show that the inter-level feature balanced fusion method achieves a more balanced and clearer feature representation, and that increasing the flow of spatial information yields more accurate segmentation results. Our IFBFNet achieved very competitive results on the challenging Cityscapes dataset, significantly improving over the baselines.
In the future, we will further explore more application scenarios for the inter-level feature balanced module. The module is mainly applied to the adaptive integration of different levels of information within a network structure, and it can be extended to scenarios that must take into account both spatial information and high-level semantic information. Our design fuses two levels; for more complex situations, such as fusing features from three or more levels, the same design can be followed, but corresponding input branches and weight blocks would need to be added. The effectiveness of such a structure also needs to be verified in future work.