1. Introduction
Unsupervised learning on images, which enables computers to learn the features of objects from unlabeled images, is an intriguing and challenging problem in computer vision. It has a potential impact on the fundamentals of deep learning techniques, and it may help to reduce the amount of labeled data required in many computer-vision tasks, such as classification, recognition, and instance segmentation. Our motivation for unsupervised learning on images is based on two observations: (1) images are naturally sparse: an image typically consists of multiple objects, each of which can be sparsely represented by a deep neural network, and each class of objects with similar appearances may appear in many images. This sparsity can be exploited to learn the features of the objects in an unsupervised manner. (2) Natural videos possess rich self-supervised information; for example, two objects can be distinguished when one moves relative to the other, even if their shapes change at the same time. In this paper, we explore these two ideas in a single unsupervised framework and propose a new method that trains on videos and performs tasks on single images. In particular, we focus on the task of discovering primary objects in single images, although the method developed in this paper may be applicable to many other computer-vision tasks.
Video understanding has been studied in computer vision for decades. In low-level vision, various methods have been proposed to find correspondences between pixels across video frames, which is known as optical flow estimation [1,2]. Both camera motion and object motion can produce optical flow; therefore, these methods are not aware of the objects in the scene. In high-level vision, object tracking and object discovery in videos have been well studied [3,4,5,6,7,8], especially since the introduction of the unsupervised challenge of the DAVIS dataset [9]. However, "unsupervised" in that challenge only means that supervision is not required at test time; it is still required during training. Despite remarkable performance, these approaches benefit either from object labels [5,7] or from pre-trained models used to generate proposals [3,10]. In this paper, we require that an unsupervised video learning module must not use any manual annotation or models pre-trained on manual annotation.
The task of unsupervised object discovery in videos is strongly related to co-localization [11,12,13,14,15] and co-segmentation [16,17,18,19,20,21,22,23]. The task has been studied for more than a decade in computer vision, with initial works mainly based on local feature matching and the detection of co-occurring patterns [24,25,26,27]. Recent approaches [12,15,18] discovered object tubes by linking candidate detections across frames, with or without refining their locations. Typically, the task of unsupervised learning from image sequences is formulated as an optimization problem for feature matching, conditional random fields, or data clustering. However, this is inherently expensive due to the combinatorial nature of the problem. Besides, performing object discovery in videos or in collections of images at test time is time-consuming. In contrast, our method moves the unsupervised discovery to the training stage; at test time, we apply standard feed-forward processing to detect the object in single test images quickly.
Figure 1 illustrates our unsupervised object discovery (UnsupOD) framework. We formulate the problem of unsupervised learning in videos as an encoder–decoder structure that internally factors a video frame into a foreground, a background, and a mask, without direct supervision for any of these factors. However, without further assumptions, decomposing an image into these three factors is ill-posed. We therefore construct the model based on the following assumptions.
First, foreground objects are more difficult to model than their backgrounds, as their movements and appearance are more complex. Based on this assumption, we build a background model with two constraints. On the one hand, we give the background network a much narrower bottleneck (i.e., the smallest layer in the auto-encoder) than the foreground network. Consequently, a large amount of image information flows into the foreground network, and only a small amount flows into the background network. On the other hand, we add a gradient constraint to the background to keep it as “clean” as possible, which prevents foreground objects from appearing in the background.
Second, we define a good mask as one that satisfies the following criteria: (i) it should capture the object’s shape and appearance as precisely as possible; (ii) it should be as close as possible to a binary image; (iii) it should display smooth contours without holes, i.e., a closed region. We add constraints to the mask model based on these criteria and show that the object masks produced by the mask model have clean, smooth shapes and capture figure-ground contrast and organization well.
We combine these elements in an end-to-end learning formulation, in which all of the components are learned only from raw RGB data. We demonstrate our method on various datasets. The experimental results show that our approach produces high-quality segmentation masks without any manual annotation or pre-trained features, indicating that our unsupervised model learns a strong, high-level semantic feature representation of objects. Moreover, our model is able to discover and segment objects belonging to classes not present in the training dataset, which further verifies the feasibility and generalization capability of our approach.
The main contributions are summarized as follows:
We propose a novel deep network architecture for unsupervised learning, which factors an image into multiple object instances based on the sparsity of images and the inter-frame structure of videos.
We propose a method to discover the primary object in single images through completely unsupervised learning, without any manual annotation or pre-trained features.
Our segmentation quality tends to increase logarithmically with the amount of training data, which suggests that our model can continue to learn and generalize as more data become available. Besides, our model remains very fast at test time, and the experimental results demonstrate that it is at least two orders of magnitude faster than related co-segmentation methods [4,27].
2. Related Work
Unsupervised learning. With the exponential growth of multimedia data in the Internet age, how to effectively utilize big data has attracted increasing attention [28,29], especially in the field of unsupervised learning without manual annotation. Previous works mainly fall into two categories: generative models and self-supervised approaches. The primary objective of generative models is to reconstruct the distribution of the data as faithfully as possible. Classical generative models include Restricted Boltzmann Machines (RBMs) [30], Auto-Encoders (AEs) [31], and Generative Adversarial Networks (GANs) [32]. Self-supervised learning exploits the internal structure of data and formulates predictive tasks to train a model. Specifically, the model must predict an omitted aspect or component of an instance given the rest. To learn a representation of images, such tasks include: predicting context [33], counting objects [34], filling in missing parts of an image [35], recovering colors from grayscale images [36], or even solving a jigsaw puzzle [37]. For videos, self-supervised strategies include leveraging temporal continuity via tracking [38,39], predicting the future [40], or preserving the equivariance of egomotion [41]. Nevertheless, while self-supervised learning may capture relations among parts or aspects of an instance, it is unclear why a particular self-supervised task should help semantic recognition, or which task would be optimal.
Object discovery from unlabeled data. Object discovery from unlabeled data is challenging because it cannot depend on any auxiliary information other than the given unlabeled images. Thus, many methods focus on solving the image co-localization problem [11,12,13,14,21,42,43]. Object discovery, or co-localization, is the process of finding objects of the same class across multiple images or videos. Some earlier image co-localization methods [12,13,14,21] addressed this problem based on low-level features (e.g., SIFT, HOG). Recently, some works [3,42,43,44] learned a common object detector using features from pre-trained CNN models, such as VGG-16 [45], which was used in [3,44]. However, a discovery module that is fully based on unsupervised learning should not use features pre-trained on manually labeled ground truth.
Co-segmentation. The co-segmentation task aims to discover an identical object within a collection of images or videos. The first co-segmentation method was proposed by Rother et al. [46], in which the same object is simultaneously segmented in two different images with histogram matching. Since then, many works have focused on co-segmentation to further improve its performance [17,47,48,49]; these works either operate on a pair of images containing the same object [47,49] or require some form of user interaction [48]. In recent years, researchers have extended co-segmentation techniques in various ways. Joulin et al. [17] combined existing tools for bottom-up image segmentation with kernel methods commonly used in object recognition within a discriminative clustering framework. Inspired by anisotropic heat diffusion, Kim et al. [16] proposed a distributed co-segmentation approach for highly variable, large-scale image collections. Vicente et al. [23] introduced the concept of “objectness” into the co-segmentation framework, showing that requiring the foreground segment to be an object significantly improves co-segmentation results. To match objects among images, Rubio et al. [50] applied region matching to exploit inter-image information by establishing correspondences between the common objects that appear in the scene. Lee et al. [51] proposed the notion of multiple random walkers on a graph and applied it to image co-segmentation using a repulsive restart rule. In general, co-segmentation techniques need a collection of images at test time. In contrast, our model processes each image independently at test time.
Video Foreground/Background Segmentation. Video foreground segmentation is the task of classifying every pixel in a video as foreground or background, separating all of the moving objects from the background. Early approaches [4,52,53,54] relied on heuristics in the optical flow field to identify moving objects, such as the closed motion boundaries in [4]. These initial estimates were then refined using external cues, such as saliency maps [54] or object shape estimates [53]. Another line of work focused on building probabilistic models of moving objects using optical flow orientations [55,56]. These methods are not based on a robust learning framework, and they fail to generalize well to unseen videos. The recent introduction of a standard benchmark, DAVIS 2016 [9], has led to renewed interest. More recent approaches [8,57,58] proposed deep models for directly estimating motion masks. For example, [8,57,59] adopted a learning-based approach and trained Convolutional Neural Networks (CNNs) that take RGB and optical flow as inputs to produce foreground segmentations. However, these approaches rely either on manually annotated mask labels as supervision or on a supervised pre-trained network to generate object proposals. Our method directly estimates masks in a fully unsupervised manner.
3. Our Approach
Our goal is to train an unsupervised learning model on videos; the trained model can then automatically discover the primary foreground object that appears in a single image, estimating both its bounding box and its segmentation mask.
The UnsupOD model is constructed as illustrated in Figure 1. Given an unconstrained collection of videos, our target is to learn a model that receives a video frame as input and produces a decomposition of it into a foreground, a background, and a mask. Video frames are fed to the network sequentially, without shuffling, in order to combine both appearance and motion information. The learning objective is reconstructive, since we have only raw videos to learn from: the model is trained so that the combination of the three factors gives back the input frame.
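To make the data flow concrete, the following is a minimal PyTorch-style sketch of this decomposition; the module names (fg_net, bg_net, mask_net) are hypothetical placeholders rather than the paper's actual implementation, and the binarization step is detailed in Section 3.3:

```python
import torch

def decompose_and_reconstruct(frame, fg_net, bg_net, mask_net):
    f = fg_net(frame)            # foreground appearance
    b = bg_net(frame)            # background appearance (narrow bottleneck)
    m = mask_net(frame)          # soft mask with values in (0, 1)
    m_hat = (m >= 0.5).float()   # hard {0, 1} mask (see Section 3.3)
    recon = m_hat * f + (1.0 - m_hat) * b  # recombine the three factors
    return recon, f, b, m
```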
In order to learn such a decomposition without supervision for any of the components, we add constraints to each of these components separately. The following sections describe how this is done, looking first at the foreground and background model (Section 3.1), then at how the segmentation mask is modeled (Section 3.2), followed by details of the reconstruction process (Section 3.3).
3.1. Foreground and Background Model
It is known that the object of interest usually has more complex and varied movements than its background scene: it has a distinctive appearance, usually causes occlusions, and occupies less space. All of these differences make the foreground more difficult to model than the background. Because the background contains fewer variations than the foreground, we expect it to be better captured by a lower-dimensional subspace of the frames from a given video shot.
Under these observations, our solution for constructing the foreground and background model is two-fold: (1) both networks are built on a fully convolutional auto-encoder architecture without skip connections between the encoder and decoder; (2) the settings for the foreground and background networks are the same, except that the bottleneck (i.e., the smallest layer in the auto-encoder) of the background network is much narrower than that of the foreground network. The former separates the foreground and background, while the latter forces the network to model the background in a lower-dimensional subspace.
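As an illustration, a minimal PyTorch sketch of the two branches is given below; the layer counts, channel widths, and bottleneck sizes are assumptions for exposition, not the paper's actual configuration:

```python
import torch.nn as nn

def conv_autoencoder(bottleneck_ch):
    # Fully convolutional auto-encoder without skip connections; the
    # bottleneck width limits how much information the branch can carry.
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),              # encoder
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, bottleneck_ch, 4, stride=2, padding=1), nn.ReLU(),  # bottleneck
        nn.ConvTranspose2d(bottleneck_ch, 64, 4, stride=2, padding=1), nn.ReLU(),  # decoder
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
    )

fg_net = conv_autoencoder(bottleneck_ch=64)  # wide bottleneck: foreground
bg_net = conv_autoencoder(bottleneck_ch=4)   # much narrower bottleneck: background
```

Holding everything else fixed and shrinking only the background bottleneck is what pushes most of the image information through the foreground branch.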
However, the auto-encoder's powerful capability to model appearance means that a small number of foreground pixels always remains in the background, which leaves the object segmentation mask incomplete. To keep the learned background as clean as possible, we add a gradient regularization loss to the background:

$$\mathcal{L}_{bg} = \lVert \nabla b \rVert_1, \quad (1)$$

where $b$ is the background produced by the background network and $\nabla$ denotes the spatial gradient. Minimizing the background gradient forces the background to be as clean as possible, with no trace of the foreground (see the learned background shown in Figure 1). It is also important to note that minimizing the background gradient results in a blurred and unrealistic background. Fortunately, the foreground and background are only auxiliary tasks; our ultimate goal is to obtain a good object mask.
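A sketch of this regularizer in PyTorch, assuming the L1 form of Equation (1) and a background tensor of shape N×C×H×W:

```python
import torch

def background_gradient_loss(b):
    # L1 penalty on horizontal and vertical finite differences of the
    # predicted background, encouraging a flat, "clean" background.
    dx = (b[:, :, :, 1:] - b[:, :, :, :-1]).abs().mean()
    dy = (b[:, :, 1:, :] - b[:, :, :-1, :]).abs().mean()
    return dx + dy
```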
3.2. Segmentation Mask Model
A good segmentation mask $m$ satisfies the following criteria: (i) it should capture the object's shape and appearance as precisely as possible; (ii) it should be as close as possible to a binary image; (iii) it should display smooth contours without holes, i.e., a closed region.
The first criterion is addressed by employing U-Net [60] as the mask network. U-Net [60] is the most widely used structure in image segmentation, especially in medical image analysis [61,62,63], mainly because its encoder–decoder structure with skip connections allows efficient information flow and helps to recover the full spatial resolution at the network output. After the last convolutional layer and the sigmoid activation function, the segmentation mask $m$ is a two-dimensional output map whose values lie in the range (0, 1). The second criterion is enforced via a binary regularization loss, as expressed in Equation (2):

$$\mathcal{L}_{bin} = \sum_{x,y} \min\big(m(x,y),\, 1 - m(x,y)\big), \quad (2)$$

where $m(x,y)$ is the value of the two-dimensional output map at position $(x,y)$. The third criterion is enforced by a closed regularization loss, as defined in Equation (3):

$$\mathcal{L}_{closed} = \sum_{x,y} \lvert \nabla m(x,y) \rvert. \quad (3)$$
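The two mask regularizers can be sketched in PyTorch as follows, assuming the forms of Equations (2) and (3) above and a mask tensor of shape N×1×H×W (means are used instead of sums, which only rescales the weighting factors):

```python
import torch

def binary_loss(m):
    # Minimal when every mask value is exactly 0 or 1 (Equation (2)).
    return torch.minimum(m, 1.0 - m).mean()

def closed_loss(m):
    # Total-variation-style penalty on the mask (Equation (3)): smooth,
    # closed contours are cheaper than ragged contours with holes.
    dx = (m[:, :, :, 1:] - m[:, :, :, :-1]).abs().mean()
    dy = (m[:, :, 1:, :] - m[:, :, :-1, :]).abs().mean()
    return dx + dy
```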
It is challenging to segment primary objects from complex backgrounds in a fully unsupervised manner. Although the motion of objects in video sequences can provide some clues, camera motion in some shooting scenarios can also create ambiguities. For example, in an airplane take-off or landing scene, the background is moving and varying while the primary object, the airplane, is “still” in the middle of the video frames, as shown in Figure 2. Learning from motion clues alone would make the network mistakenly regard the moving background as the foreground object of interest. In order to overcome this problem, we introduce an area regularization constraint to guide the network, as expressed in Equation (4):

$$\mathcal{L}_{area} = \max\big(0,\, \bar{m} - \lambda\big), \quad (4)$$

where $\bar{m}$ is the mean value of $m$, and $\lambda$ is the proportion factor, $\lambda \in (0, 1)$, which restrains the size of the learned mask and penalizes the network when it regards the background as the foreground object of interest.
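A corresponding sketch of the area regularizer, assuming the hinge form of Equation (4); the value of the proportion factor lam is a placeholder, as the paper's setting is not given here:

```python
import torch

def area_loss(m, lam=0.3):
    # Penalizes the mask only when its mean exceeds the proportion factor,
    # discouraging the network from labeling the moving background as foreground.
    return torch.clamp(m.mean() - lam, min=0.0)
```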
To sum up, the final regularization loss for our mask model, $\mathcal{L}_{mask}$, is therefore:

$$\mathcal{L}_{mask} = \alpha \mathcal{L}_{bin} + \beta \mathcal{L}_{closed} + \gamma \mathcal{L}_{area}, \quad (5)$$

where $\alpha$, $\beta$, and $\gamma$ are the weighting factors for the corresponding regularization losses.
3.3. Image Reconstruction
Because the sigmoid nonlinearity is adopted in the last convolutional layer of the mask model, the values of the output two-dimensional mask $m$ lie in the range (0, 1). We first round $m$ to a binary mask $\hat{m}$, and then the reconstruction $\hat{I}$ is obtained by combining the foreground $f$ and the background $b$ with the binary mask $\hat{m}$:

$$\hat{I} = \hat{m} \odot f + (1 - \hat{m}) \odot b, \quad (6)$$

where $\odot$ denotes element-wise multiplication. After binarization, the value of the two-dimensional map $\hat{m}$ is 0 or 1. A value of 1 indicates that the pixel belongs to the foreground; otherwise, the pixel belongs to the background. The reconstructed image is formed by selecting pixel values from the background and the foreground under the guidance of $\hat{m}$.
The binarization is a rounding function, defined in Equation (7):

$$B(m) = \begin{cases} 1, & m \geq 0.5, \\ 0, & m < 0.5. \end{cases} \quad (7)$$
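A sketch of the binarization and reconstruction steps in PyTorch, with f, b, and m as the foreground, background, and soft mask tensors (this naive version blocks gradients at the rounding step; the straight-through fix follows below):

```python
import torch

def binarize(m):
    # Hard rounding of the soft mask at 0.5 (Equation (7)).
    return (m >= 0.5).float()

def reconstruct(f, b, m):
    # Select each pixel from the foreground or the background,
    # guided by the binary mask (Equation (6)).
    m_hat = binarize(m)
    return m_hat * f + (1.0 - m_hat) * b
```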
However, the gradient of the binary function $B$ is zero almost everywhere, except at $m = 0.5$, where it is infinite. In the back-propagation algorithm, the gradient is computed layer-by-layer with the chain rule in a backward manner. Thus, any layer before the binarization would never be updated during training.
Based on the straight-through estimator on gradients [64] and inspired by [65], we introduce a proxy function $\tilde{B}$ in order to approximate $B$:

$$\tilde{B}(m) = m, \qquad \frac{\partial \tilde{B}(m)}{\partial m} = 1. \quad (8)$$

Because the proxy function is differentiable, its derivative can be used in back-propagation. Importantly, we do not fully replace the binary function with a smooth approximation, but only its derivative, which means that $B$ is still applied as usual in the forward pass.
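In PyTorch, this straight-through behavior can be sketched with the standard detach trick: the forward pass yields the hard mask B(m), while the backward pass sees the identity proxy, so gradients flow to m unchanged:

```python
import torch

def binarize_ste(m):
    m_hat = (m >= 0.5).float()       # hard B(m) used in the forward pass
    return m + (m_hat - m).detach()  # value equals m_hat; d(output)/dm = 1
```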
Subsequently, the optimization loss for the whole model is therefore:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{1} \mathcal{L}_{bg} + \lambda_{2} \mathcal{L}_{mask}, \quad (9)$$

where $\mathcal{L}_{rec}$ is the reconstruction loss between the input frame and $\hat{I}$, $\mathcal{L}_{bg}$ and $\mathcal{L}_{mask}$ are the regularization terms for the background and the mask, respectively, and $\lambda_{1}$ and $\lambda_{2}$ are the corresponding weighting factors for $\mathcal{L}_{bg}$ and $\mathcal{L}_{mask}$.
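Putting the pieces together, a minimal training-loss sketch that reuses the helper functions from the previous snippets; the L1 reconstruction error and all weight values here are assumptions, not the paper's reported settings:

```python
import torch.nn.functional as F

def total_loss(frame, recon, b, m,
               lam1=1.0, lam2=1.0, alpha=1.0, beta=1.0, gamma=1.0):
    rec = F.l1_loss(recon, frame)      # reconstruction term (assumed L1)
    bg = background_gradient_loss(b)   # Equation (1)
    mask = (alpha * binary_loss(m)     # Equation (5): weighted sum of
            + beta * closed_loss(m)    # Equations (2)-(4)
            + gamma * area_loss(m))
    return rec + lam1 * bg + lam2 * mask
```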