1. Introduction
Semantic image segmentation is a fundamental computer vision problem. Its task is to assign a category to each pixel in an image according to the object it belongs to [1]. In the past several years, thanks to large amounts of training images and high-performance GPUs, deep learning techniques, in particular supervised approaches such as deep convolutional neural networks (DCNNs), have achieved remarkable success in various high-level computer vision tasks, such as image classification, object detection and semantic segmentation [2,3,4]. The key advantage of these deep learning techniques is that they learn high-level feature representations in an end-to-end fashion, which are more discriminative than hand-crafted ones. Inspired by the success of deep learning in image classification, researchers have explored the capability of such networks to produce pixel-level annotations and proposed many prominent deep learning networks for semantic segmentation.
Nowadays, most DCNNs for semantic segmentation are based on a common pioneer: the fully convolutional network (FCN) proposed by Long et al. [5]. It transforms well-known DCNNs used for image classification, such as AlexNet [6], VGG [7] and GoogLeNet [8], into fully convolutional ones by replacing the fully connected layers with convolutional layers, so that the network outputs spatial feature maps instead of classification probabilities. These feature maps are then decoded [5] to produce dense pixel-level annotations. FCN is considered a milestone in deep learning techniques for semantic segmentation, since it demonstrates how DCNNs can be trained end-to-end to solve this problem, efficiently learning to produce dense pixel-level predictions for inputs of arbitrary size. It achieved a 20% relative improvement in segmentation accuracy over traditional methods on the PASCAL VOC 2012 dataset [9].
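To make the idea of convolutionalization concrete, the following PyTorch sketch replaces the fully connected classifier head of a VGG-style backbone with equivalent convolutions, so the network outputs a spatial score map instead of a single class vector. The layer sizes are illustrative rather than those of the original FCN, and the upsampling/skip-connection decoding path of FCN is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Illustrative sketch: turn a VGG-16 classifier into a fully convolutional head.
backbone = vgg16(weights=None)
features = backbone.features           # convolutional trunk, kept unchanged

# The original classifier expects a flattened 7x7x512 tensor; an equivalent
# convolutional head uses a 7x7 convolution followed by 1x1 convolutions.
num_classes = 21                       # PASCAL VOC: 20 object classes + background
fcn_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),          # replaces nn.Linear(512*7*7, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),         # replaces nn.Linear(4096, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, num_classes, kernel_size=1),  # per-location class scores
)

x = torch.randn(1, 3, 224, 224)
score_map = fcn_head(features(x))      # coarse spatial score map, here [1, 21, 1, 1]
print(score_map.shape)
```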
The DeepLab series [10,11,12,13] is a successful and popular family of DCNN-based semantic segmentation models. DeepLab v1 [10] introduces atrous convolution [14] into the DCNN to effectively enlarge the receptive field without increasing the number of network parameters. To localize object boundaries, it combines the last layer of the DCNN with a fully connected CRF [15]. Thanks to these two techniques, it reached 71.6% mIOU on the PASCAL VOC 2012 dataset. Building on DeepLab v1, DeepLab v2 [11] further proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. By employing multiple parallel atrous convolutional layers with different dilation rates, ASPP can exploit multi-scale features, thus capturing objects as well as image context at multiple scales. DeepLab v2 combines atrous convolution, ASPP and the fully connected CRF, achieving 79.9% mIOU on the PASCAL VOC 2012 dataset. DeepLab v3 [12] incorporates an improved ASPP, batch normalization and a better way to encode multi-scale context to further improve performance, reaching 85.7% mIOU on the PASCAL VOC 2012 dataset. The improved ASPP concatenates image-level features, a 1×1 convolution and three 3×3 atrous convolutions with different dilation rates, and batch normalization is applied after each of the parallel convolutional layers. The fully connected CRF is abandoned in DeepLab v3. In DeepLab v3+ [13], ASPP and an Encoder-Decoder structure are used, where the Decoder module refines the segmentation results at the pixel level. DeepLab v3+ further explores the Xception model [16] and applies depthwise separable convolution [17] to both ASPP and the Decoder module. On the PASCAL VOC 2012 test set and the Cityscapes dataset, it achieved performances of 89.0% and 82.1% mIOU, respectively.
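As a reading aid, the sketch below shows one possible PyTorch layout of the improved ASPP described above: a 1×1 convolution, three 3×3 atrous convolutions with different dilation rates, and an image-level pooling branch, each followed by batch normalization, with the five branches concatenated and projected. The dilation rates (6, 12, 18) and channel widths are the commonly used settings, not values taken from this paper; eval mode is used only so the single-sample shape check runs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling, roughly as in DeepLab v3 (illustrative sketch)."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(kernel, dilation):
            padding = 0 if kernel == 1 else dilation
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel, padding=padding,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv1x1 = branch(1, 1)                     # 1x1 convolution branch
        self.atrous = nn.ModuleList(branch(3, r) for r in rates)
        self.image_pool = nn.Sequential(                # image-level features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(                   # fuse the five branches
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        feats = [self.conv1x1(x)] + [b(x) for b in self.atrous] + [pooled]
        return self.project(torch.cat(feats, dim=1))

aspp = ASPP().eval()                                    # eval mode for a single-sample demo
with torch.no_grad():
    out = aspp(torch.randn(1, 2048, 33, 33))            # e.g. 513x513 input at output stride 16
print(out.shape)                                        # torch.Size([1, 256, 33, 33])
```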
Superpixels have been commonly used as a preprocessing step for image segmentation since they were first introduced by Ren et al. [18] in 2003, because they reduce the number of inputs for subsequent processing steps and adhere well to object boundaries. A superpixel algorithm groups pixels into perceptually meaningful atomic regions, thus enabling features to be computed on a meaningful image representation. In the past decades, a large number of superpixel algorithms have been proposed. Quick shift [19] is a mode-seeking clustering algorithm with relatively good boundary adherence. It first initializes the segmentation using medoid shift [20], then moves each data point in the feature space to the nearest neighbor that increases the Parzen density estimate [21]. Simple Linear Iterative Clustering (SLIC) [22] adopts a k-means clustering approach with a distance metric that depends on both spatial and intensity differences to efficiently generate superpixels. The Felzenszwalb method [23] is a graph-based approach for image segmentation. It performs an agglomerative clustering of pixels as nodes on a graph, such that each superpixel is the minimum spanning tree of its constituent pixels.
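All three superpixel algorithms mentioned above are available in scikit-image; the snippet below shows how they might be compared on a sample image. The parameter values are illustrative defaults, not the settings used later in this paper.

```python
from skimage import data
from skimage.segmentation import quickshift, slic, felzenszwalb

image = data.astronaut()  # any RGB image as an (H, W, 3) uint8 array

# Quick shift: mode seeking in the joint color-spatial space.
seg_qs = quickshift(image, kernel_size=3, max_dist=6, ratio=0.5)

# SLIC: k-means in the color-spatial space with a compactness trade-off.
seg_slic = slic(image, n_segments=250, compactness=10, start_label=0)

# Felzenszwalb: graph-based agglomerative clustering.
seg_fz = felzenszwalb(image, scale=100, sigma=0.5, min_size=50)

for name, seg in [("quickshift", seg_qs), ("slic", seg_slic), ("felzenszwalb", seg_fz)]:
    print(name, "->", seg.max() + 1, "superpixels")   # each is an (H, W) label map
```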
Although DeepLab v3+ has achieved good performance in semantic image segmentation, it still has some shortcomings. One of the main problems is that its DCNN consists of strided pooling and convolution layers, which enlarge the receptive field and aggregate context information but discard boundary information. However, semantic segmentation needs the exact alignment of class maps and thus needs the boundary information to be preserved. To tackle this challenging problem, this paper presents a novel method to refine the object boundaries of the segmentation results output by DeepLab v3+, which unites the benefits of DeepLab v3+ with those of the superpixel segmentation algorithm quick shift [19]. The main steps are as follows: (i) using DeepLab v3+ to obtain a class-indexed score map of the same size as the input image; (ii) segmenting the input image into superpixels with quick shift; (iii) feeding the outputs of these two modules into a category voting module to refine the object boundaries of the segmentation result. The proposed method improves the semantic segmentation results both qualitatively and quantitatively, especially on object boundaries. Experiments on the PASCAL VOC 2012 dataset verify the effectiveness of the proposed method.
The paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 presents the experimental results of the proposed method on the PASCAL VOC 2012 dataset, and comparisons with other methods. Section 4 discusses and analyses the experimental results. Finally, conclusions are drawn in Section 5.
2. Methodology
In this section, a robust framework is proposed for semantic image segmentation. It obtains a class-indexed score map from DeepLab v3+ and segments the input image into superpixels with quick shift. The object boundaries of the segmentation result are then refined by a class voting module.
2.1. Motivation
(1) It is hard for DCNN-based semantic segmentation methods to produce segmentation results with accurate object boundaries. There are two main reasons. First, GPU memory is limited, so DCNNs have to adopt strided pooling and convolution to reduce the size of the feature maps. Second, it is difficult to assign labels to pixels on object boundaries because the cascaded feature maps generated by DCNNs blur them. In order to precisely segment foreground objects from the background, a DCNN should have the following two properties. First, it should classify object boundaries precisely. Second, for pixels on object boundaries, the class score computed for the target class should be close to the class scores of the other classes.
As shown in Figure 1, the DCNN centers on the red and yellow points of the input image, respectively, to extract features in their respective receptive fields. At the aeroplane's boundaries, the image regions corresponding to the aeroplane pixels and the background pixels overlap greatly, so the features extracted by the DCNN are very similar, which makes it difficult to classify pixels on the aeroplane's boundaries.
The softmax loss function used for semantic segmentation is often simply formulated as
$$L = -\log \frac{e^{s_k}}{\sum_{i=1}^{N} e^{s_i}},$$
where $s_i$ is the class score of class $i$, $N$ is the total number of classes, and $k$ is the target class. In each training iteration, the DCNN minimizes the loss, i.e., maximizes $\frac{e^{s_k}}{\sum_{i=1}^{N} e^{s_i}}$. In order to obtain object boundaries accurately by interpolation, a DCNN should consider not only the class score of the target class but also the class scores of the other classes. The loss function above only tries to maximize the class score of the target class while ignoring the class scores of the other classes, so it is difficult for the DCNN to output a proper score for each class.
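As a small illustration (the scores are invented for a hypothetical boundary pixel), the PyTorch snippet below shows that this loss depends only on the normalized score of the target class: rearranging the scores of the non-target classes leaves the loss unchanged, so the loss itself does not push the network toward a sensible ranking of the other classes.

```python
import torch
import torch.nn.functional as F

# Hypothetical class scores for one boundary pixel (3 classes, target class k = 0).
scores_a = torch.tensor([[2.0, 1.5, -1.0]])
scores_b = torch.tensor([[2.0, -1.0, 1.5]])   # non-target scores swapped
target = torch.tensor([0])

# Cross entropy equals -log softmax(scores)[k]; swapping the non-target scores
# changes which wrong class is closest at the boundary but not the loss value.
print(F.cross_entropy(scores_a, target).item())
print(F.cross_entropy(scores_b, target).item())   # identical to the line above
```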
(2) In an image, a foreground object is often composed of a series of regions within which its color, lightness and texture change little. DCNN-based semantic segmentation methods directly classify every pixel in the image and are unaware that these regions belong to the same object. Figure 2 shows a semantic segmentation result output by DeepLab v3+. In it, the areas labeled in green are segmented separately from the regions they should belong to, i.e., they are incorrectly segmented. In order to tackle this problem, we employ a region-based method to postprocess the semantic segmentation results of DeepLab v3+.
2.2. Main Process
The framework of the proposed method is shown in Figure 3. It consists of the following three modules: (a) DeepLab v3+, (b) superpixel segmentation by quick shift, and (c) Class Voting. In (a), we use DeepLab v3+ to obtain a class-indexed score map for the input image, in which each pixel is marked with the index of its predicted class. In (b), we use the quick shift algorithm to segment the image into superpixels; it outputs the superpixel index of each pixel in the image. Then, the outputs of (a) and (b) are fed into (c) to obtain the refined semantic segmentation result for the input image.
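The following Python sketch outlines how the three modules could be wired together. Here `deeplab_predict` is a placeholder for any function returning per-pixel class indices from DeepLab v3+ (it is not code from this paper), the superpixels come from scikit-image, and the voting step is the one described in Section 2.5.

```python
import numpy as np
from skimage.segmentation import quickshift

def refine_with_superpixels(image, deeplab_predict):
    """Refine a DeepLab v3+ prediction by majority voting inside superpixels."""
    class_map = deeplab_predict(image)                  # (H, W) predicted class indices
    superpixels = quickshift(image, kernel_size=3,      # (H, W) superpixel indices
                             max_dist=6, ratio=0.5)

    refined = np.empty_like(class_map)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        votes = np.bincount(class_map[mask])            # count pixels of each class
        refined[mask] = votes.argmax()                  # assign the majority class
    return refined
```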
2.3. DeepLab v3+ Recap
2.3.1. Architecture
DeepLab v3+ [13] was proposed by Chen et al. in 2018 and has achieved state-of-the-art performance on the PASCAL VOC 2012 dataset. As shown in Figure 4, DeepLab v3+ is a novel Encoder-Decoder architecture that employs DeepLab v3 [12] as the Encoder module together with a simple yet effective Decoder module. It applies ResNet-101 as the backbone and adopts atrous convolution in the deep layers to enlarge the receptive field. On top of ResNet-101, an ASPP module aggregates the multi-scale contextual information. The Decoder concatenates the low-level features from ResNet-101 with the upsampled deep-level multi-scale features extracted by ASPP. Finally, it upsamples the concatenated feature maps to produce the final semantic segmentation result.
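Only the predicted class index of each pixel is needed by the later voting step. As an illustration of how such a class-indexed map can be obtained, the snippet below uses the DeepLab v3 model with a ResNet-101 backbone that ships with torchvision. Note that this is DeepLab v3, not v3+; it is a stand-in to show the interface, not the model trained in this paper, and the image path is hypothetical.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import (deeplabv3_resnet101,
                                              DeepLabV3_ResNet101_Weights)
from PIL import Image

model = deeplabv3_resnet101(weights=DeepLabV3_ResNet101_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")         # hypothetical input image
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"]  # [1, 21, H, W] class scores
class_map = logits.argmax(dim=1).squeeze(0)              # [H, W] class-indexed map
```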
2.3.2. Training Details
We implement DeepLab v3+ with PyTorch and experiment with it on the PASCAL VOC 2012 dataset. The implemented model is trained and evaluated without using multi-scale or left-right flipped inputs [13]. The dataset contains 20 foreground object classes and one background class. It officially consists of 1464 images in the train set and 1449 images in the val set. We also augment the dataset with the additional annotations provided by [24], resulting in a total of 10,582 training images. These 10,582 images are used as the train_aug set to train the model following the training strategy of DeepLab v3+ [13], and the val set is used for evaluation.
DeepLab v3+ preprocesses input images by resizing and cropping them to a fixed size. When computing the mIOU, the annotated object boundaries in the ground truths are not taken into consideration. Thus, mIOU alone cannot be used to evaluate whether the pixels on object boundaries are classified correctly. In order to evaluate the accuracy of the proposed method in localizing object boundaries, we compute the mIOU of the implemented DeepLab v3+ in two steps. First, we follow the same training process as in [13] with a fixed input image size. Second, we recompute the mIOU with the model trained in the first step, but feed it images of arbitrary size without preprocessing; at the same time, we label the object boundaries in the ground truths as background when computing the mIOU. In the first step, the implemented model reaches 78.89% mIOU on the PASCAL VOC 2012 val set, slightly higher than the result (78.85%) reported in [13], which verifies that our implementation is correct. In the second step, the model achieves 68.34% mIOU on the PASCAL VOC 2012 val set, which is used for comparison with the proposed method in Section 3.
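For reference, the evaluation described above boils down to a confusion-matrix mIOU over the 21 classes, where pixels whose ground-truth label is the boundary/ignore value are either excluded or, as in our second step, relabeled as background. A minimal NumPy sketch, assuming the VOC convention that boundary pixels carry the ignore label 255:

```python
import numpy as np

NUM_CLASSES = 21        # 20 foreground classes + background
IGNORE_LABEL = 255      # VOC convention for annotated object boundaries

def mean_iou(preds, gts, relabel_boundary_as_background=False):
    """Compute mIOU over pairs of (prediction, ground truth) label maps."""
    conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        gt = gt.copy()
        if relabel_boundary_as_background:
            gt[gt == IGNORE_LABEL] = 0          # treat boundary pixels as background
        valid = gt != IGNORE_LABEL              # otherwise exclude them entirely
        conf += np.bincount(
            NUM_CLASSES * gt[valid].astype(np.int64) + pred[valid],
            minlength=NUM_CLASSES ** 2,
        ).reshape(NUM_CLASSES, NUM_CLASSES)
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = np.where(union > 0, inter / np.maximum(union, 1), np.nan)
    return float(np.nanmean(iou))               # ignore classes absent from both maps
```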
2.4. Quick Shift
Quick shift [19] is one of the most popular superpixel segmentation algorithms. It is based on an iterative mode-seeking procedure that identifies the modes of a set of data points, where a mode is the densest location of the feature space spanned by all the data points. Given $N$ data points $x_1, x_2, \ldots, x_N \in X \subseteq \mathbb{R}^d$, quick shift first computes the Parzen density estimate [21]
$$P(x_i) = \frac{1}{N} \sum_{j=1}^{N} k\big(d(x_i, x_j)\big),$$
where $k(\cdot)$ is the kernel function, usually an isotropic Gaussian window, and $d(x_i, x_j)$ is the distance between the data points $x_i$ and $x_j$. Then, in order to extend the search path to the next data point, it moves the center of the kernel window from $x_i$ to the nearest neighbor at which the density $P$ increases:
$$y_i = \operatorname*{arg\,min}_{j:\, P(x_j) > P(x_i)} d(x_i, x_j).$$
When all the data points are connected with one another, a threshold on the connecting distance is used to separate the modes, so that different clusters of data points can be identified.
Quick shift can be applied to any feature space, but for the purpose of this paper we restrict it to a 5D feature space consisting of the 3D RGB color information and the 2D location of each pixel. After computing the Parzen density estimate for each image pixel, quick shift builds a tree by connecting each pixel to the nearest pixel with a higher density value and labels each branch with the corresponding distance. This specifies a hierarchical segmentation of the image. Superpixels are then obtained by cutting all branches whose distance is larger than a threshold.
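To make the tree-construction step concrete, here is a compact and deliberately unoptimized (O(N²)) NumPy sketch of the linking rule described above, applied to a handful of 5D feature vectors. It is a didactic illustration rather than a production implementation (scikit-image provides one as `skimage.segmentation.quickshift`), and the parameter values are arbitrary.

```python
import numpy as np

def quickshift_links(features, sigma=1.0, max_dist=3.0):
    """Link each point to its nearest higher-density neighbor (naive O(N^2) sketch)."""
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    density = np.exp(-d2 / (2 * sigma ** 2)).sum(1)                    # Parzen estimate, Gaussian kernel

    parent = np.arange(n)                       # roots point to themselves
    for i in range(n):
        higher = density > density[i]           # candidate parents with higher density
        if higher.any():
            j = np.where(higher)[0][np.argmin(d2[i, higher])]
            if d2[i, j] <= max_dist ** 2:       # cut branches longer than the threshold
                parent[i] = j
    return parent                               # forest: each tree is one mode/superpixel

# Toy 5D features: (r, g, b, x, y) for a few pixels forming two color/location clusters.
feats = np.array([[0.9, 0.1, 0.1, 0, 0], [0.8, 0.2, 0.1, 0, 1],
                  [0.1, 0.1, 0.9, 5, 5], [0.2, 0.1, 0.8, 5, 6]], dtype=float)
print(quickshift_links(feats))
```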
We apply quick shift to partition the input image into superpixels, as shown in
Figure 5.
2.5. Class Voting
DeepLab v3+ outputs a class-indexed score map for the input image, in which each pixel is labeled with the index of its predicted class. At the same time, the superpixels of the image are obtained by the quick shift algorithm, so that each pixel is also labeled with a superpixel index. Then, the number of pixels belonging to each class within each superpixel is counted. Finally, the Class Voting module assigns each superpixel to the class that contains the maximum number of pixels in it. A pseudo-code implementation is shown in Algorithm 1.
Algorithm 1 classVoting()
Input: C: number of classes; S: number of superpixels segmented by quick shift; SP: H × W superpixel index map output by quick shift; CM: H × W class index map output by DeepLab v3+
Output: RM: refined H × W class index map
initialize count[1..S][1..C] ← 0
initialize RM with the same size as CM
for i ← 1 to H do
    for j ← 1 to W do
        count[SP[i][j]][CM[i][j]] ← count[SP[i][j]][CM[i][j]] + 1
    end for
end for
label[s] ← arg max_c count[s][c] for every superpixel s
for i ← 1 to H do
    for j ← 1 to W do
        RM[i][j] ← label[SP[i][j]]
    end for
end for
return RM
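A vectorized NumPy sketch of Algorithm 1 is given below; the variable names (`class_map`, `superpixels`, `num_classes`) are ours, and the function mirrors the count, argmax and relabel steps of the pseudocode.

```python
import numpy as np

def class_voting(class_map, superpixels, num_classes):
    """Assign every superpixel the majority class of its pixels (Algorithm 1)."""
    num_superpixels = superpixels.max() + 1
    # Count, for each superpixel, how many of its pixels DeepLab v3+ assigned to each class.
    counts = np.zeros((num_superpixels, num_classes), dtype=np.int64)
    np.add.at(counts, (superpixels.ravel(), class_map.ravel()), 1)
    # Vote: each superpixel takes the class with the maximum pixel count.
    labels = counts.argmax(axis=1)
    # Relabel every pixel with the class of its superpixel.
    return labels[superpixels]

# Toy example: a 2x3 image with two superpixels and three classes.
cm = np.array([[0, 0, 2], [1, 2, 2]])
sp = np.array([[0, 0, 1], [0, 1, 1]])
print(class_voting(cm, sp, num_classes=3))   # [[0 0 2] [0 2 2]]
```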