1. Introduction
Waste objects are commonly found in both indoor and outdoor environments such as household, office, or road scenes. As such, it is important for a vision-based intelligent robot to localize and interact with them. However, detecting and segmenting waste objects is considerably more challenging than for most other object categories. For example, waste objects may be incomplete, damaged, or both. In many cases, their presence can only be inferred from scene-level context, e.g., by reasoning about their contrast to the background and judging by their intended utilities. Another key challenge to accurately localizing waste objects is the extreme scale variation resulting from variable physical sizes and dynamic perspectives, as shown in Figure 1. Due to the large number of small objects, it is difficult even for humans to accurately delineate waste object boundaries without zooming in to see the appearance details clearly. In the human vision system, however, attention can either be broadened to cover a wide area of the visual field or narrowed to a tiny region, as when we scrutinize a small area for details (e.g., [1,2,3,4]). Presented with an image, we can immediately recognize the meaning of the scene and its global structure, which allows us to easily spot objects of interest. We can then attend to those object regions to perform fine-grained delineation. Inspired by how the human vision system works, we solve the waste object segmentation problem in a similar manner by integrating visual cues from multiple levels of spatial granularity.
The general idea of exploiting objectness has long proven effective for a wide range of vision-based applications [6,7,8,9]. In particular, several works have already demonstrated that objectness reasoning can positively impact semantic segmentation [10,11,12,13]. In this work, however, we propose a simple yet effective strategy for waste object proposal that requires neither pretrained objectness models nor additional object or part annotations. Our primary goal is to address the extreme scale variation, which is much less common for generic objects. In order to obtain accurate and coherent segmentation results, our method performs joint inference at three levels. Firstly, we obtain a coarse segmentation at the scene level to capture the global context and to propose potential object regions. We note that this simple object region proposal strategy captures objectness priors reasonably well in practice. This is followed by an object-level segmentation to recover fine structural details for each object region proposal. In particular, adopting two separate models at the scene and object levels allows us to disentangle the learning of global image contexts from the learning of fine object boundary details. Finally, we perform joint inference to integrate results from both the scene and object levels, while making pixel-level refinements based on color, depth, and spatial affinities. The main steps are summarized and illustrated in Figure 2. Our method obtains significantly superior results, greatly surpassing a number of strong semantic segmentation baselines.
Recent years have witnessed the huge success of deep learning across a wide spectrum of vision-based perception tasks [14,15,16,17]. In this work, we likewise harness the powerful learning capabilities of convolutional neural network (CNN) models to address the waste object segmentation problem. Most state-of-the-art CNN-based segmentation models exploit the spatial-preserving properties of fully convolutional networks [17] to directly learn feature representations that translate into class probability maps, either at the scene level (i.e., semantic segmentation) or the object level (i.e., instance segmentation). A key limitation of applying these general-purpose models directly to waste object segmentation is that they are unable to handle the extreme object scale variation, due to the inherent tension between global semantics and accurate localization under a fixed feature resolution; the resulting segmentation can be inaccurate for the abundant small objects with complex shape details. Based upon this observation, we propose to learn a multi-level model that adaptively zooms into object regions to recover fine structural details, while retaining a scene-level model to capture long-range context and provide object proposals. Furthermore, such a layered model can be jointly reasoned over, together with pixel-level refinements, under a unified Conditional Random Field (CRF) [18] model.
The main contributions of our work are three-fold. Firstly, we propose a deep-learning-based waste object segmentation framework that integrates scene-level and object-level reasoning. In particular, our method does not require additional object-level annotations. By virtue of a simple object region proposal method, we are able to learn separate scene-level and object-level segmentation models that achieve accurate localization while preserving strong global contextual semantics. Secondly, we develop a strategy based on the densely connected CRF [19] to perform joint inference at the scene, object, and pixel levels, producing a highly accurate and coherent final segmentation. In addition to scene- and object-level parsing, our CRF model further refines the segmentation results with appearance, depth, and spatial affinity pairwise terms. Importantly, this CRF model is also amenable to efficient filtering-based inference. Finally, we collected and annotated a new RGBD [20] dataset, MJU-Waste, for waste object segmentation. We believe our dataset is the first public RGBD dataset for this task. Furthermore, we evaluate our method on the TACO dataset [5], another public waste object segmentation benchmark. To the best of our knowledge, our work is among the first in the literature to address waste object segmentation on public datasets. Experiments on both datasets verify that our method can serve as a general framework to improve the performance of a wide range of deep models, such as FCN [17], PSPNet [21], CCNet [22] and DeepLab [23].
We note that the focus of this work is obtaining accurate waste object boundary delineation. Another closely related and also very challenging task is waste object detection and classification. Ultimately, we would like to solve waste instance segmentation with fine-grained class information. However, existing datasets do not provide a large number of object classes with sufficient training data. In addition, differentiating waste instances under a single class label is also challenging. For example, the best Average Precision (AP) obtained in [5] is in the 20s for the TACO-1 classless litter detection task, where the goal is to detect and segment litter items with a single class label. Therefore, in this paper we adopt a research methodology under which we gradually move toward richer models while maintaining a high level of performance. In this regard, we formulate our problem as a two-class (waste vs. background) semantic segmentation problem. This allows us to obtain high quality segmentation results, as we demonstrate in our experiments.
In the remainder of this paper, Section 2 briefly reviews the literature on waste object segmentation and related tasks, as well as recent progress in semantic segmentation. We then describe the details of our method in Section 3. Afterwards, Section 4 presents findings from our experimental evaluation, followed by closing remarks in Section 5.
3. Our Approach
In this section, we formally introduce the waste object segmentation problem and the proposed approach. We begin with the problem definition and notation. Given an input color image and, optionally, an additional depth image, our model outputs a pixelwise labeling map, as shown in Figure 2. Mathematically, denote the input color image as I, the optional depth image as D, and the semantic label set as {1, …, C}, where C is the number of classes. Our goal is to produce a structured semantic labeling X that assigns a label x_p to every pixel p. We note that in deep models, the label x_p at image coordinate p is usually obtained via multi-class softmax scores on a spatially preserving convolutional feature map F, i.e., x_p = argmax_c softmax(F(p))_c. In practice, it is common that the convolutional feature map F is downsampled w.r.t. the original image resolution, but we can always assume that the resolution can be restored with interpolation.
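As a concrete illustration of this labeling step (with hypothetical shapes; the symbols follow the notation above), the softmax-then-argmax readout can be sketched in a few lines of NumPy:

```python
import numpy as np

def label_map(F):
    """Turn a C x H x W feature map F into a pixelwise labeling X.

    Softmax over the class axis followed by argmax; softmax is monotonic,
    so argmax over raw scores gives the same labeling, but we keep it to
    also expose the per-class probabilities.
    """
    F = F - F.max(axis=0, keepdims=True)          # numerical stability
    P = np.exp(F) / np.exp(F).sum(axis=0, keepdims=True)
    X = P.argmax(axis=0)                          # H x W label map
    return X, P

# A toy 2-class, 2x2 feature map: class 0 wins at (0, 0), class 1 elsewhere.
F = np.array([[[5.0, 0.0], [0.0, 0.0]],
              [[0.0, 3.0], [3.0, 3.0]]])
X, P = label_map(F)
```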
3.1. Layered Deep Models
In this work, we apply deep models at both the scene and the object levels. For this purpose, let us define a number of image regions in which we obtain deeply trained feature representations. Firstly, let Ω be the set of all spatial coordinates on the image plane, i.e., the entire image region. This is the region in which we perform scene-level parsing (i.e., coarse segmentation). In addition, we perform object-level parsing (i.e., fine segmentation) on a set of non-overlapping object region proposals. We denote each of these additional regions as R_l, l = 1, …, L. Details on generating these regions are discussed in Section 3.3. We apply our coarse segmentation feature embedding network Φ_s and the fine segmentation feature embedding network Φ_o to the appropriate image regions as follows:

F_s = Φ_s(I),   F_l = Φ_o(crop(I, R_l)),   l = 1, …, L,

where crop(I, R) denotes cropping the region R from image I. Here F_s is the scene-level feature map and F_l are the object-level feature maps, and we note that these feature maps are upsampled where necessary. In addition, the spatial dimensions may be image- and region-specific for both F_s and F_l, which poses a practical problem for batch-based training. To address this issue, during CNN training we resize all image regions to a common shorter side length, followed by randomly cropping a fixed-size patch as part of the data augmentation procedure. We refer the reader to Section 4.2 for details. In Figure 3, the processes shown in blue and yellow illustrate the steps described in this section.
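A minimal sketch of the resize-then-random-crop preprocessing described above (the side lengths below are placeholders, not the values used in the paper, and nearest-neighbor resizing stands in for whatever interpolation the training pipeline uses):

```python
import numpy as np

def resize_shorter_side(img, target):
    """Nearest-neighbor resize of an H x W x C image so the shorter side equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    return img[rows][:, cols]

def random_crop(img, size, rng):
    """Crop a fixed size x size patch at a random location."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

rng = np.random.default_rng(0)
region = np.zeros((300, 480, 3))   # a hypothetical cropped object region
patch = random_crop(resize_shorter_side(region, 256), 224, rng)
```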
3.2. Coherent Segmentation with CRF
Given the layered deep models, we now introduce our graphical model for predicting coherent waste object segmentation results. Specifically, the overall energy function of our CRF model consists of three main components:

E(X) = ψ_s(X) + λ ψ_o(X) + ψ_p(X),

where ψ_s(X) represents the scene-level coarse segmentation potentials, ψ_o(X) denotes the object-level fine segmentation potentials, and ψ_p(X) comprises the pairwise potentials that respect the color, depth, and spatial affinities in the input images. λ is the weight controlling the relative importance of the two unary terms. The graphical representation of our CRF model is shown in Figure 3. We describe the details of these three terms below.
Scene-level unary term. The scene-level coarse segmentation unary term is given by

ψ_s(X) = Σ_{p∈Ω} −log P_s(x_p | I),

where P_s(x_p | I) is a pixelwise softmax on the feature map F_s, as follows:

P_s(x_p | I) = Σ_{c=1}^{C} 1[x_p = c] · exp(F_s(p, c)) / Σ_{c′=1}^{C} exp(F_s(p, c′)),

where 1[·] denotes the indicator function. This term produces a coarse segmentation map based on the long-range contexts from the input image. Importantly, we use the output of this term to generate our object region proposals, as discussed in Section 3.3.
Object-level unary term. The object-level fine segmentation unary term is given by

ψ_o(X) = Σ_{p∈Ω} −log P_o(x_p | I),

where P_o(x_p | I) is defined as:

P_o(x_p | I) = P_l(x_p | I) if p ∈ R_l for some l ∈ {1, …, L}, and P_o(x_p | I) = P_s(x_p | I) otherwise.

The formulation above states that if a pixel location p belongs to one of the L object region proposals, the negative log-probability obtained via fine segmentation is adopted. Otherwise, the object-level unary term falls back to the scene-level unary potentials. Here the probability P_l(x_p | I) given by the fine segmentation model is obtained as follows:

P_l(x_p | I) = Σ_{c=1}^{C} 1[x_p = c] · exp(F_l(τ_l(p), c)) / Σ_{c′=1}^{C} exp(F_l(τ_l(p), c′)),

where τ_l(·) denotes the translation functions (one per spatial axis) that map the image coordinates to those of the l-th object proposal region, and F_l is the output feature embedding from the fine segmentation model for the l-th object proposal region. We note that the object-level unary potentials typically recover finer details along object boundaries than the scene-level unary potentials. In general, it would be too computationally expensive to compute scene-level potentials at a comparable resolution for the entire image. In most cases, computing the object-level unary term on fewer than 3 object region proposals is sufficient; see Section 3.3 for details. Additionally, our object-level potentials are obtained via a separate deep model, which allows us to decouple the learning of long-range contexts from the learning of fine structural details.
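A toy NumPy sketch of this fallback rule: inside a proposal box we use the fine model's probabilities, elsewhere the coarse ones (shapes, box coordinates, and probability values are all hypothetical):

```python
import numpy as np

def object_unary(P_scene, fine_probs, boxes):
    """Object-level unary: -log of fine probabilities inside each proposal
    box, falling back to the coarse (scene-level) probabilities elsewhere.

    P_scene    : C x H x W coarse class probabilities for the whole image.
    fine_probs : list of C x h_l x w_l fine probabilities, one per proposal.
    boxes      : list of (y0, x0, y1, x1) proposal boxes in image coordinates.
    """
    P = P_scene.copy()
    for (y0, x0, y1, x1), Pf in zip(boxes, fine_probs):
        # The translation functions map image coords p to region coords p - (y0, x0).
        P[:, y0:y1, x0:x1] = Pf
    return -np.log(P + 1e-12)

C, H, W = 2, 4, 4
P_scene = np.full((C, H, W), 0.5)                               # ambiguous coarse output
fine = [np.stack([np.full((2, 2), 0.1), np.full((2, 2), 0.9)])]  # confident fine output
U = object_unary(P_scene, fine, [(1, 1, 3, 3)])
```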
Pixel-level pairwise term. Although the object-level unary potentials provide finer segmentation details, accurate boundary details could still be lost for some irregularly shaped waste objects. This poses a practical challenge for detail-preserving global inference.
Following [19], we address this challenge by introducing a pairwise term that is a linear combination of Gaussian kernels in a joint feature space of color, depth, and spatial coordinates. This allows us to produce coherent object segmentation results that respect the appearance, depth, and spatial affinities at the original image resolution. More importantly, this form of the pairwise term allows for efficient global inference [19]. Specifically, our pairwise term ψ_p includes an appearance term ψ_app, a spatial smoothing term ψ_sm and a depth term ψ_dep:

ψ_p(X) = Σ_{p<q} μ(x_p, x_q) [ ψ_app(p, q) + ψ_sm(p, q) + ψ_dep(p, q) ],

where μ(x_p, x_q) = 1[x_p ≠ x_q] is the Potts label compatibility function. The appearance term and the smoothing term follow [19] and take the following form:

ψ_app(p, q) = w_app exp( −‖s_p − s_q‖² / (2θ_α²) − ‖I_p − I_q‖² / (2θ_β²) ),
ψ_sm(p, q) = w_sm exp( −‖s_p − s_q‖² / (2θ_γ²) ),

where I_p and s_p are the image appearance and position features at pixel location p. In addition, when an input depth image D is available, we are able to enforce an additional pairwise term induced by geometric affinities:

ψ_dep(p, q) = w_dep exp( −‖s_p − s_q‖² / (2θ_δ²) − (d_p − d_q)² / (2θ_ε²) ),      (9)

where d_p is the depth reading at pixel location p. We note that in practice, any missing values in D are filled in with a median filter [82] beforehand; see Section 4.1 for details. In addition, Equation (9) can be conveniently added or removed depending on depth data availability. We simply discard Equation (9) when training models for the TACO dataset, which contains color images only.
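For intuition, a single Gaussian pairwise kernel can be evaluated as below (a naive per-pixel-pair sketch for illustration only; the kernel weights and bandwidths are placeholders, and the filtering-based inference of [19] avoids ever materializing these pairwise values):

```python
import numpy as np

def pairwise_kernel(feat_p, feat_q, w, thetas):
    """One Gaussian kernel k(p, q) = w * exp(-sum_i ||f_i(p) - f_i(q)||^2 / (2 theta_i^2)).

    feat_p, feat_q : lists of per-pixel feature vectors (position, color, depth, ...).
    thetas         : one bandwidth per feature group.
    """
    e = 0.0
    for fp, fq, th in zip(feat_p, feat_q, thetas):
        d = np.asarray(fp, float) - np.asarray(fq, float)
        e += d @ d / (2.0 * th ** 2)
    return w * np.exp(-e)

# Appearance kernel between two nearby pixels: position + RGB features.
# Similar colors yield a strong affinity; dissimilar colors a weak one.
k_near = pairwise_kernel([(0, 0), (10, 10, 10)], [(1, 0), (12, 10, 10)],
                         w=1.0, thetas=(60.0, 10.0))
k_far  = pairwise_kernel([(0, 0), (10, 10, 10)], [(1, 0), (200, 10, 10)],
                         w=1.0, thetas=(60.0, 10.0))
```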
3.3. Generating Object Region Proposals
In this work, we follow a simple strategy to generate the object region proposals R_l. In particular, the output of the scene-level coarse segmentation model is a good indication of waste object locations; see Figure 3 for an example. We begin by extracting the connected components of the foreground class labelings from the maximum a posteriori (MAP) estimate of the scene-level unary term. For each connected component, a tight bounding box is extracted. This is followed by extending the box by a fixed margin in all four directions (i.e., N, S, W, E), subject to truncation at the image boundary. Finally, we merge overlapping regions and remove those below or above certain size thresholds (details in Section 4.2) to obtain a concise set of final object region proposals. Example object region proposals obtained using this procedure are shown in Figure 4, and we note that any similar implementation should work satisfactorily.
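The proposal procedure above can be sketched as follows (using `scipy.ndimage` for connected components; the margin and size thresholds are simplified placeholders, and merging of overlapping boxes is left as a comment):

```python
import numpy as np
from scipy import ndimage

def propose_regions(fg_mask, margin=10, min_px=50, max_px=10**6):
    """Boxes (y0, x0, y1, x1) around connected foreground components."""
    labeled, n = ndimage.label(fg_mask)
    H, W = fg_mask.shape
    boxes = []
    for sl in ndimage.find_objects(labeled):
        area = fg_mask[sl].sum()
        if area < min_px or area > max_px:
            continue  # discard proposals that are too small or too big
        y0 = max(sl[0].start - margin, 0)
        x0 = max(sl[1].start - margin, 0)
        y1 = min(sl[0].stop + margin, H)   # truncate at the image boundary
        x1 = min(sl[1].stop + margin, W)
        boxes.append((y0, x0, y1, x1))
    return boxes  # overlapping boxes could additionally be merged here

mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True        # one 20x20 foreground blob
boxes = propose_regions(mask)    # -> [(30, 30, 70, 70)]
```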
Most images from the MJU-Waste dataset contain only one hand-held waste object per image. For DeepLabv3 with a ResNet-50 backbone, for example, only a small percentage of all images from MJU-Waste produce 2 or more object proposals. For the TACO dataset, a larger share of images produce 2 or more object proposals; however, only small percentages of all images produce more than 3 and more than 5 object proposals, respectively.
3.4. Model Inference
Following [19], we use the mean field approximation of the joint probability distribution P(X), computing a factorized distribution Q(X) = Π_p Q_p(x_p) that minimizes the KL-divergence KL(Q ‖ P) [83,84]. For our model, this yields the following message passing-based iterative update equation:

Q_p(x_p = c) ∝ exp( −ψ_u(p, c) − Σ_{c′=1}^{C} μ(c, c′) Σ_{q≠p} k(p, q) Q_q(c′) ),

where ψ_u denotes the combined scene- and object-level unary potentials, k(p, q) is the sum of the Gaussian pairwise kernels, and the input color image I and the depth image D are omitted for notational simplicity. In practice, we use the efficient message passing algorithm proposed in [19]. The number of iterations is set to 10 in all experiments.
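For illustration, a naive (quadratic-time) version of this mean-field update for a Potts CRF can be written as below; the efficient algorithm of [19] replaces the inner matrix product with fast Gaussian filtering. The toy example and its values are hypothetical:

```python
import numpy as np

def mean_field(unary, K, iters=10):
    """Naive mean-field inference for a Potts CRF.

    unary : N x C array of unary potentials (-log probabilities).
    K     : N x N symmetric kernel matrix, K[p, q] = k(p, q), zero diagonal.
    Returns Q, an N x C array of approximate marginals.
    """
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)
    for _ in range(iters):
        msg = K @ Q                       # message passing: sum_q k(p, q) Q_q(c)
        # Potts compatibility: penalize mass the neighbors put on *other* labels.
        pairwise = msg.sum(axis=1, keepdims=True) - msg
        Q = np.exp(-unary - pairwise)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q

# Three pixels, two labels: the middle pixel is ambiguous, its two
# neighbors confidently prefer label 0, and the kernel couples all three.
unary = np.array([[0.1, 3.0], [0.7, 0.7], [0.1, 3.0]])
K = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
Q = mean_field(unary, K, iters=5)
```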
3.5. Model Learning
Let us now discuss details pertaining to the learning of our model. Specifically, we learn the parameters of our model by piecewise training. First, the coarse segmentation feature embedding network Φ_s is trained with the standard cross-entropy (CE) loss on the predicted coarse segmentation. Based on the coarse segmentation of the training images, we extract object region proposals with the method discussed in Section 3.3. This allows us to then train the fine segmentation feature embedding network Φ_o on the cropped object regions in a similar manner. Next, we learn the weight λ and the kernel parameters of our CRF model. We initialize them to the default values used in [19] and then use grid search to finetune their values on a held-out validation set. We note that our model is not overly sensitive to most of these parameters. On each dataset, we use fixed parameter values for all CNN architectures; see Section 4.2 for details.
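The grid search over CRF parameters can be sketched as follows (the parameter names, candidate values, and scoring function are illustrative only, not the ones used in the paper):

```python
import itertools

def grid_search(evaluate, grid):
    """Exhaustively evaluate every parameter combination on a held-out
    validation set and keep the best-scoring one.

    evaluate : callable mapping a parameter dict to a validation score (higher is better).
    grid     : dict of parameter name -> list of candidate values.
    """
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and a toy validation score peaking at w_app=1.0, theta_alpha=60.
grid = {"w_app": [0.5, 1.0, 2.0], "theta_alpha": [30, 60, 90]}
score = lambda p: -(p["w_app"] - 1.0) ** 2 - (p["theta_alpha"] - 60) ** 2
best, best_score = grid_search(score, grid)
```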
4. Experimental Evaluation
In this section, we compare the proposed method with state-of-the-art semantic segmentation baselines. We focus on two challenging scenarios for waste object localization: the hand-held setting (for applications such as service robot interactions or smart trash bins) and waste objects “in the wild”. In our experiments, we found that one of the common challenges for both scenarios is the extreme scale variation causing standard segmentation algorithms to underperform. Our proposed method, however, greatly improves the segmentation performance in these adverse scenarios. Specifically, we evaluate our method on the following two datasets:
MJU-Waste Dataset. In this work, we created a new benchmark for waste object segmentation. The dataset is available from https://github.com/realwecan/mju-waste/. To the best of our knowledge, MJU-Waste is the largest public benchmark available for waste object segmentation, with 1485 images for training, 248 for validation and 742 for testing. For each color image, we provide the co-registered depth image captured using an RGBD camera. We manually labeled each of the images. More details about our dataset are presented in Section 4.1.
TACO Dataset. The Trash Annotations in COntext (TACO) dataset [5] is another public benchmark for waste object segmentation. Images are collected mainly from outdoor environments such as woods, roads and beaches. The dataset is available from http://tacodataset.org/. Individual images in this dataset are under either the CC BY 4.0 license or the ODBL (c) OpenLitterMap & Contributors license; see http://tacodataset.org/ for details. The current version of the dataset contains 1500 images, and a split with 1200 images for training, 150 for validation and 150 for testing is available from the authors. In all experiments that follow, we use this split.
We summarize the key statistics of the two datasets in Table 1. Once again, we emphasize that one key characteristic of waste objects is that the number of objects per class can be highly imbalanced (e.g., in the case of TACO [5]). In order to obtain sufficient data to train a strong segmentation algorithm, we use a single class label for all waste objects, and our problem is therefore a binary pixelwise prediction one (i.e., waste vs. background). For the quantitative evaluation that follows, we report the performance of the baseline methods and the proposed method by four criteria: Intersection over Union (IoU) for the waste object class, mean IoU (mIoU), pixel Precision (Prec) for the waste object class, and mean pixel precision (Mean). Let TP_c, FP_c and FN_c denote the total numbers of true positive, false positive and false negative pixels of the c-th class, respectively. The four criteria are defined as follows:

Intersection over Union (IoU) for the c-th class is the intersection of the prediction and ground-truth regions of the c-th class over the union of them, defined as:

IoU_c = TP_c / (TP_c + FP_c + FN_c).

Mean IoU (mIoU) is the average IoU over all C classes:

mIoU = (1/C) Σ_{c=1}^{C} IoU_c.

Pixel Precision (Prec) for the c-th class is the percentage of correctly classified pixels among all predictions of the c-th class:

Prec_c = TP_c / (TP_c + FP_c).

Mean pixel precision (Mean) is the average class-wise pixel precision:

Mean = (1/C) Σ_{c=1}^{C} Prec_c.
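These four criteria can be computed directly from per-class pixel counts; a minimal sketch (assuming every class appears in the ground truth, so no division by zero occurs):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """IoU / mIoU / Prec / Mean from flat label arrays (class 1 = waste)."""
    ious, precs = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn))
        precs.append(tp / (tp + fp))
    return {"IoU": ious[1], "mIoU": float(np.mean(ious)),
            "Prec": precs[1], "Mean": float(np.mean(precs))}

# Toy 6-pixel example: one false-positive waste pixel at index 4.
pred = np.array([0, 0, 1, 1, 1, 0])
gt   = np.array([0, 0, 1, 1, 0, 0])
m = segmentation_metrics(pred, gt)
```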
We note that the image labelings are typically dominated by the background class, therefore IoU and Prec reported on the waste objects only are more sensitive than mIoU and Mean which consider both waste objects and the background.
4.1. The MJU-Waste Dataset
Before moving on to report our experimental findings, let us more formally introduce the MJU-Waste dataset. We created this dataset by collecting waste items from a university campus, bringing them back to a lab, and then taking pictures of people holding the waste items in their hands. All images in the dataset were captured using a Microsoft Kinect RGBD camera [20]. The current version of our dataset, MJU-Waste V1, contains 2475 co-registered RGB and depth image pairs. Specifically, we randomly split the images into a training set, a validation set and a test set of 1485, 248 and 742 images, respectively.
Due to sensor limitations, the depth frames contain missing values at reflective surfaces, occlusion boundaries, and distant regions. We use a median filter [82] to fill in the missing values in order to obtain high quality depth images. Each image in MJU-Waste is annotated with a pixelwise mask of waste objects. Example color frames, ground-truth annotations, and depth frames are shown in Figure 5. In addition to semantic segmentation ground-truths, object instance masks are also available.
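One simple way to implement this hole filling is to replace only the invalid pixels with a local median (a sketch with a placeholder window size; the exact filter used for MJU-Waste may differ):

```python
import numpy as np
from scipy import ndimage

def fill_depth_holes(depth, size=5):
    """Replace missing (zero) depth readings with the local median of
    the surrounding window. `size` is a placeholder window size.
    """
    filled = depth.astype(float).copy()
    invalid = depth == 0                          # Kinect reports holes as 0
    med = ndimage.median_filter(filled, size=size)
    filled[invalid] = med[invalid]                # touch only the holes
    return filled

depth = np.array([[1.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])
out = fill_depth_holes(depth, size=3)
```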
4.2. Implementation Details
Here we report the key implementation details of our experiments, as follows:
Segmentation networks Φ_s and Φ_o. Following [21,74], we use the polynomial learning rate policy. Training runs for 50 epochs on both datasets with a batch size of 4 images. In all experiments, we use ImageNet-pretrained backbones [85] and a standard SGD optimizer with momentum and weight decay. To avoid overfitting, standard data augmentation techniques are used, including random mirroring, resizing (with a resize factor of up to 2), cropping and random Gaussian blur [21]. The base (and cropped) image sizes during training differ between Φ_s and Φ_o.
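The polynomial learning rate policy referenced above is commonly implemented as lr = base_lr · (1 − iter/max_iter)^power; a sketch with placeholder values (the paper's actual base learning rate and power factor are not reproduced here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power):
    """Polynomial learning rate decay, as popularized by PSPNet/DeepLab."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Hypothetical values for illustration: base_lr = 0.01, power = 0.9.
schedule = [poly_lr(0.01, i, 100, 0.9) for i in range(100)]
```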
Object region proposals. To maintain a concise set of object region proposals, we empirically set a minimum and a maximum number of pixels for an object region proposal. For MJU-Waste, these thresholds are set to 900 and 40,000 pixels, respectively. For TACO, they are set to 25,000 and 250,000, due to the larger image sizes. Object region proposals that are either too small or too large are simply discarded.
CRF parameters. We initialize the CRF parameters with the default values in [19] and follow a simple grid search strategy to find the optimal values for each term. For reference, the CRF parameters used in our experiments are listed in Table 2. We note that our model is fairly robust to the exact values of these parameters, and for each dataset we use the same parameters for all segmentation models.
4.3. Results on the MJU-Waste Dataset
The quantitative performance evaluation results we obtained on the test set of MJU-Waste are summarized in
Table 3. Methods using our proposed multi-level model have “ML” in their names. For this dataset, we report the performance of the following baseline methods:
FCN-8s [17]. FCN is a seminal work in CNN-based semantic segmentation. In particular, FCN transforms fully connected layers into convolutional layers, enabling a classification net to output a probabilistic heatmap of object layouts. In our experiments, we use the network architecture proposed in [17], which adopts a VGG16 [86] backbone. In terms of skip connections, we choose the FCN-8s variant, as it retains more precise location information by fusing features from earlier pooling layers.
PSPNet [21]. PSPNet proposes the pyramid pooling module for multi-scale context aggregation. Specifically, we choose the ResNet-101 [14] backbone variant for a good tradeoff between model complexity and performance. The pyramid pooling module concatenates the features from the last ResNet block with the same features after average pooling at several pyramid scales and upsampling, in order to harvest multi-scale contexts.
CCNet [22]. CCNet presents an attention-based context aggregation method for semantic segmentation. We also choose the ResNet-101 backbone for this method; the overall architecture is therefore similar to PSPNet, except that we use the Recurrent Criss-Cross Attention (RCCA) module for context modeling. Specifically, given the backbone features, the RCCA module obtains a self-attention map to aggregate context information along the horizontal and vertical directions. The resultant features are then concatenated with the input features for downstream segmentation.
DeepLabv3 [23]. DeepLabv3 proposes the Atrous Spatial Pyramid Pooling (ASPP) module for capturing the long-range contexts. Specifically, ASPP proposes the parallel dilated convolutions with varying atrous rates to encode features from different sized receptive fields. The atrous rates used in our experiments are 12, 24 and 36. In addition, we experimented with both ResNet-50 and ResNet-101 backbones on the MJU-Waste dataset to explore the performance impact of different backbone architectures.
We refer interested readers to the public implementation discussed in Section 4.2 for the network details of the above baselines. For each baseline method, we additionally implement our proposed multi-level modules and then present a direct performance comparison in terms of IoU, mIoU, Prec and Mean improvements. We show that our method provides a general framework under which a number of strong semantic segmentation baselines can be further improved. For example, FCN-8s benefits the most from the multi-level approach, partially due to its relatively low baseline performance. Even for the best-performing baseline, DeepLabv3 with a ResNet-101 backbone, our multi-level model further improves the IoU. We note that such a quantitative improvement is also visually significant. In Figure 6, we present qualitative comparisons between FCN-8s, DeepLabv3 and their multi-level counterparts. Our approach clearly helps to remove false positives in some non-object regions and, more importantly, the multi-level models follow object boundaries more precisely.
In
Table 4, we additionally perform ablation studies on the validation set of MJU-Waste. Specifically, we compare the performance of the following variants of our method:
Baseline. DeepLabv3 baseline with a ResNet-50 backbone.
Object only. The above baseline with additional object-level reasoning. This method is implemented by retaining only the two unary terms of Equation (
2). All pixel-level pairwise terms are turned off. This will test if the object-level reasoning will contribute to the baseline performance.
Object and appearance. The baseline with object-level reasoning plus the appearance and the spatial smoothing pairwise terms. The depth pairwise terms are turned off. This will test if the additional pixel affinity information (without depth, however) is useful. It also verifies the efficacy of the depth pairwise terms.
Appearance and depth. The baseline with all pixel-level pairwise terms but without the object-level unary term. This will test if an object-level fine segmentation network is necessary, as well as the performance contribution of the pixel-level pairwise terms alone.
Full model. Our full model with all components proposed in
Section 3.
The results clearly show that the full model performs best, yielding superior performance by all four criteria. This validates that each component proposed in our method positively impacts the final results.
In terms of computational efficiency, we report a breakdown of the average per-image inference time in Table 5. The baseline method corresponds to scene-level inference only; additional object- and pixel-level inference incurs extra computational costs. These runtime statistics are obtained with an i9 desktop CPU and a single RTX 2080Ti GPU. Specifically, the computational costs for object-level inference mainly result from the object region proposals and the forward pass of the object region CNN, whereas the pixel-level inference time mostly results from the iterative mean-field approximation. It should be noted that the inference times reported here are obtained with the public implementations mentioned in Section 4.2, without any specific optimization.
More example results obtained on the test set of MJU-Waste with our full model are shown in
Figure 7. Although the images in MJU-Waste are captured indoors so that the illumination variations are less significant, there are large variations in the clothing colors and, in some cases, the color contrast between the waste objects and the clothes is small. In addition, the orientation of the objects also exhibits large variations. For example, the objects can be held with either one or both hands. During the data collection, we simply ask the participants to hold objects however they like. Despite these challenges, our model is able to reliably recover the fine boundary details in most cases.
4.4. Results on the TACO Dataset
We additionally evaluate the performance of our method on the TACO dataset. TACO contains color images only, so we exclude Equation (9) when training and evaluating models on this dataset. This dataset presents a unique challenge of localizing waste objects "in the wild". In general, TACO differs from MJU-Waste in two important aspects. Firstly, multiple waste objects with extreme scale variation are more common (see Figure 8 and Figure 9 for examples). Secondly, unlike MJU-Waste, the backgrounds are diverse, including road, grassland and beach scenes. Quantitative results obtained on the TACO test set are summarized in Table 6. Specifically, we compare our multi-level model against two baselines: FCN-8s [17] and DeepLabv3 [23]. Again, in both cases our multi-level model improves the baseline performance by a clear margin. Qualitative comparisons of the segmentation results are presented in Figure 8; our multi-level method clearly follows object boundaries more closely. More example segmentation results are presented in Figure 9. We note that the variations in illumination and orientation are generally greater on TACO than on MJU-Waste, since many of the images are captured outdoors. In particular, in some beach images it is very challenging to spot waste objects due to poor illumination and weak color contrast. Furthermore, object scale and orientation vary greatly as a result of different camera perspectives. Again, our model is able to detect and segment waste objects with high accuracy in most images, demonstrating the efficacy of the proposed method.