1. Introduction
Optical flow estimation refers to the estimation of displacements of intensity patterns in image sequences [1,2]. Generally speaking, the problem can be formulated as a global energy optimization problem of the form
$$E(\mathbf{w}) = E_{D}(\mathbf{w}) + E_{P}(\mathbf{w}),$$
where the data term, $E_{D}$, measures the point-wise similarity between the input images given the estimated optical flow $\mathbf{w}$, and the prior term, $E_{P}$, applies additional constraints for having a specific property of the flow field, for example smoothly varying flow fields. The choice of each term in the global energy functional, as well as the optimization algorithm, varies between different methods for optical flow estimation.
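To make this formulation concrete, the following minimal sketch (Python/NumPy) evaluates such an energy for a given discrete flow field with a simple squared brightness-constancy data term and a quadratic smoothness prior; the function name, the weight `lam` and the bilinear warping via `map_coordinates` are illustrative assumptions of this sketch and not the formulation used later in this paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def flow_energy(I1, I2, u, v, lam=0.1):
    """Sketch of E(w) = E_D(w) + lam * E_P(w) for a discrete flow w = (u, v).

    E_D: sum of squared brightness differences after warping I2 by the flow.
    E_P: quadratic penalty on the spatial gradients of u and v (smoothness prior).
    """
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Warp the second image towards the first using the estimated flow (bilinear).
    I2_warped = map_coordinates(I2, [ys + v, xs + u], order=1, mode='nearest')
    e_data = np.sum((I2_warped - I1) ** 2)

    # Smoothness prior: squared forward differences of both flow components.
    e_prior = (np.sum(np.diff(u, axis=0) ** 2) + np.sum(np.diff(u, axis=1) ** 2) +
               np.sum(np.diff(v, axis=0) ** 2) + np.sum(np.diff(v, axis=1) ** 2))
    return e_data + lam * e_prior
```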
As for the data term in the equation, one of the basic assumptions is that brightness remains constant during the movement of pixels. This assumption is the outcome of further assumptions regarding the reflectance properties of the scene, the illumination and also the process of image formation in the camera [3,4]. These assumptions are not always true, and therefore, various methods have been proposed to tackle the resulting problems. The use of photometric invariant constraints, such as the constancy of the gradient in the work of Brox et al. [5], higher order derivatives proposed in [6] and color models with photometric invariant channels [7,8], has been investigated before. Another problem arises in the presence of motion discontinuities and occlusions in the underlying flow field, which can be remedied by the use of non-quadratic penalty functions for the data and smoothness terms, as proposed by [9,10,11] and many other techniques in recent years.
The use of local/sparse feature detectors/descriptors has been widely investigated in various branches of computer vision, namely wide baseline matching, object and texture recognition, image retrieval, robot localization, video data mining, image mosaicking and recognition of object categories [12,13,14,15,16,17,18,19,20]. In such techniques, the first step involves feature detection (corners, blobs, T-junctions, etc.). This is followed by assigning descriptor vectors to the neighborhood of each feature point and finally matching these descriptors between different scenes. Most descriptors can be divided into three major categories [21]: (1) distribution-based; (2) spatial-frequency; and (3) differential descriptors. In the first class, histograms are used for representing different characteristics of appearance and shape. Spin images [22] and the scale-invariant feature transform (SIFT) [23] are two well-known examples of this class. The second class describes the frequency content of an image. The Fourier transform represents image content by decomposing it into a linear combination of a set of basis functions. To have a more explicit spatial representation in the transformed image, Gabor filters [24] and wavelet transforms [25] are more suitable and are more commonly used in the field of texture classification. In the third class, a set of image derivatives is computed in order to approximate a pixel's neighborhood. Steerable filters [26] and complex filters [27,28] are two examples from this class.
Moving towards more abstract problems, such as action recognition, pose or expression detection, object recognition/categorization, correspondence across different scenes, image retrieval, video classification and many more, local features/descriptors may not be the proper choice. In these cases, dense descriptors are more suitable. Examples of this category can be found in [29,30,31,32,33]. However, there have been limited investigations on the use of feature descriptors for optical flow estimation [34,35,36].
Optical flow methods try to find a point-wise flow field between the pixels of images in a sequence, and therefore, the use of descriptors in a dense manner can provide a good estimation. This is investigated in the work of Liu et al. [34] for the problem of dense correspondence across scenes by the use of dense SIFT descriptors. In their work, SIFT descriptors are computed for each individual pixel in the image, and then, by defining an energy functional constrained by data, small-displacement and smoothness terms, dual-layer loopy belief propagation is utilized for optimization. The proposed technique has been shown to be useful in video retrieval, motion prediction from a single image, image registration and face recognition. However, a more comprehensive evaluation of the method for optical flow estimation is missing. In this work, the main intent is to extend the proposed framework of SIFT-flow [34] to include more feature descriptors, with additional analysis and investigations for optical flow estimation. For this purpose, the use of seven texture descriptors in addition to SIFT is considered, and comprehensive analysis and comparisons on well-known datasets are provided. The feature descriptors used are: the Leung–Malik (LM) filter bank [37], the Gabor filter bank [24], the Schmid filter bank [38], the Root Filter Set (RFS) filters, steerable filters [26], Histogram of Oriented Gradients (HOG) [32] and Speeded Up Robust Features (SURF) [39].
The contributions of the current work can be summarized as follows:
Given the framework proposed by Liu et al. [34], a comprehensive analysis is provided here for the use of the framework for optical flow estimation. This is done by thorough comparisons using the widely-used Middlebury optical flow dataset [3]. This fills a gap in the original paper, as no quantitative comparisons with the state-of-the-art in optical flow estimation are given there.
Aiming at extending the framework to include more dense descriptors, the use of a few other descriptors, namely the Leung–Malik (LM) filter bank, the Gabor filter bank, the Schmid filter bank, Root Filter Set (RFS) filters, steerable filters, Histogram of Oriented Gradients (HOG) and Speeded Up Robust Features (SURF), has been investigated and discussed for optical flow estimation.
To the best of our knowledge, this work is the first to utilize several of the proposed descriptors in a dense manner in the context of optical flow estimation. Therefore, we believe that this work will stimulate more interest in the field of computer vision for devising more rigorous algorithms based on the use of dense descriptors for optical flow estimation and dense correspondence.
Note that all of the images are converted to monochrome in this work, and therefore, color is not utilized for optical flow estimation and dense correspondence. The rest of the paper is organized as follows: Section 2 contains a detailed explanation of the different descriptors. In Section 3, the general formulation of the framework is discussed, and loopy belief propagation as the means of optimization is briefly described. Section 4 contains comprehensive comparisons of the used descriptors for optical flow estimation. It also contains pointers towards possible enhancements and future directions of the research. Finally, Section 5 concludes the paper.
2. Feature Descriptors
Feature detection and feature description are among the core components of many computer vision algorithms. Therefore, a plethora of approaches and techniques has been introduced in the past few decades to address the need for more robust and accurate feature detection/description. Even though there is no universal and exact definition of a feature that is independent of the intended application, methods of feature detection can be categorized into four major categories, as suggested by [40]: edge detectors, corner detectors, blob detectors and region detectors. The process of feature detection is usually followed by feature description, a set of algorithms for describing the neighborhood of the found features. As one can imagine, different approaches can be defined for describing features, which lead to various descriptor vectors. Generally speaking, the methods of feature description can also be classified into four major classes [40]: shape descriptors, color descriptors, texture descriptors and motion descriptors. By detecting the features and defining the descriptors, one can use the new representation of the input images for a wide range of applications, such as wide baseline matching, object and texture recognition, image retrieval, robot localization, video data mining, image mosaicking and recognition of object categories.
However, in some areas of computer vision, optical flow estimation or stereo matching for example, the outcome of an approach is required to be a dense flow field or depth map, respectively. Therefore, the use of sparse features is not suitable. In this case, by eliminating the first step of the process of feature detection/description, descriptors are defined for all of the pixels in the images. Of course, this requires significant changes in the process of feature matching with additional constraints in the optimization process. Here, the use of eight different texture descriptors is investigated and analyzed for optical flow estimation. In the following subsections, a detailed overview is given for each of the descriptors.
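As a sketch of how dense descriptors can be obtained from a filter bank, the snippet below (Python/NumPy; the function name and the boundary handling are assumptions of this sketch) convolves a monochrome image with every filter in a bank and stacks the responses, so that each pixel receives an N-dimensional descriptor vector.

```python
import numpy as np
from scipy.ndimage import convolve

def dense_descriptors(image, filter_bank):
    """Stack filter responses so that pixel (y, x) gets the descriptor responses[y, x, :].

    image       : 2D float array (monochrome image).
    filter_bank : list of 2D kernels (e.g., LM, RFS, Gabor, ... filters).
    """
    responses = [convolve(image, k, mode='nearest') for k in filter_bank]
    return np.stack(responses, axis=-1)   # shape: (H, W, len(filter_bank))
```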
2.1. Gabor Filter Bank
Describing the frequency components of images may also be considered in many applications. While the Fourier transform decomposes an image into its frequency components over a set of basis functions, the spatial relations between the image pixels are not preserved in this representation [21]. Gabor filters, on the other hand, are designed to overcome this problem. The concept of Gabor filters is similar to that of the wavelet transform. The difference is due to the fact that Gabor filters/basis functions are not orthogonal, and this may impose challenges in terms of computational complexity. However, Gabor features are widely used in many computer vision applications, such as texture segmentation and face detection. The general equation for the complex Gabor filter can be defined as follows [24,41]:
$$g(x, y) = s(x, y)\, w_r(x, y),$$
where $s(x, y)$ represents a complex sinusoid defined as:
$$s(x, y) = e^{\, j \left( 2\pi (u_0 x + v_0 y) + P \right)},$$
in which $(u_0, v_0)$ is the spatial frequency and $P$ is the phase of the sinusoid. As is obvious, this function has two components, one real and one imaginary. Therefore, the general Gabor filter is also composed of real and imaginary parts. The second term on the right-hand side, $w_r(x, y)$, represents a Gaussian envelope, which can be defined as follows:
$$w_r(x, y) = K \exp\!\left( -\pi \left( a^2 (x - x_0)_r^2 + b^2 (y - y_0)_r^2 \right) \right).$$
In the above equation, $(x_0, y_0)$ represents the location of the peak, while $a$ and $b$ are scaling parameters, and the $r$ subscript denotes a rotation operation, which is represented by:
$$(x - x_0)_r = (x - x_0)\cos\theta + (y - y_0)\sin\theta, \qquad (y - y_0)_r = -(x - x_0)\sin\theta + (y - y_0)\cos\theta.$$
For this work, the efficient implementation of the Gabor filters in the frequency domain, elaborated in [42], is used. The number of filter frequencies and the number of orientations are set to 10 and eight, respectively, which results in a set of 80 filters. Figure 1b shows a sample set of real-valued Gabor filters.
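Although the implementation used here works in the frequency domain [42], a minimal spatial-domain sketch of a single real-valued Gabor kernel, following the carrier-times-envelope form given above, could look as follows (the kernel size and the default parameter values are arbitrary choices of this sketch).

```python
import numpy as np

def gabor_kernel(size, u0, v0, theta=0.0, a=0.05, b=0.05, phase=0.0):
    """Real part of a complex Gabor filter: cosine carrier times a rotated Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)

    # Rotated coordinates for the Gaussian envelope (peak at the origin).
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-np.pi * (a ** 2 * xr ** 2 + b ** 2 * yr ** 2))

    # Real part of the complex sinusoid with spatial frequency (u0, v0) and phase P.
    carrier = np.cos(2.0 * np.pi * (u0 * x + v0 * y) + phase)

    return envelope * carrier
```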
2.2. Schmid Filters
Schmid filters are a set of Gabor-like filters that combine frequency and scale, with applications in texture classification and object recognition [38]. It should be noted that the basis elements of the Schmid filter bank are isotropic and rotationally invariant. The general form of such filters is as follows:
$$F(r, \sigma, \tau) = F_0(\sigma, \tau) + \cos\!\left(\frac{\pi \tau r}{\sigma}\right) e^{-\frac{r^2}{2\sigma^2}},$$
in which $\tau$ is the number of cycles of the harmonic function within the Gaussian envelope of the filter, which is the same as the one used in Gabor filters. The term $F_0(\sigma, \tau)$ is added to obtain a zero DC component. This way, the filters are robust to illumination changes and invariant to intensity translations. Following the same approach as [38], 13 filters with scales $\sigma$ between two and 10 and $\tau$ between one and four are used for creating dense descriptors. Figure 1c shows the set of Schmid filters used in this study.
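A minimal sketch of one isotropic Schmid kernel following the form above; here the $F_0(\sigma, \tau)$ term is realized simply by subtracting the kernel mean to obtain a zero DC component, which is an assumption of this sketch rather than the exact implementation of [38].

```python
import numpy as np

def schmid_kernel(size, sigma, tau):
    """Isotropic Gabor-like Schmid filter: cos(pi*tau*r/sigma) * exp(-r^2 / (2*sigma^2)),
    shifted to have a zero DC component (zero mean)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    r = np.sqrt(x ** 2 + y ** 2)

    kernel = np.cos(np.pi * tau * r / sigma) * np.exp(-r ** 2 / (2.0 * sigma ** 2))
    return kernel - kernel.mean()   # remove the DC component
```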
2.3. Leung–Malik Filters
The main idea behind the LM filter bank comes from the concept of two-dimensional textons, in which a texture is characterized by its responses to a set of orientation- and spatial-frequency-selective linear filters. This filter bank is a mixture of edge, bar and spot filters at multiple scales and orientations. The original set consists of 48 filters [37]: the first and second derivatives of the Gaussian at six orientations and three scales, which makes a total of 36 filters. It also includes eight Laplacian of Gaussian (LoG) filters, as well as four Gaussian filters. One version of the LM filter bank, named LM Small, is composed of filters that occur at the basic scales $\sigma = \{1, \sqrt{2}, 2, 2\sqrt{2}\}$ and first and second derivative filters that occur at the first three scales with an elongation factor of three ($\sigma_x = \sigma$ and $\sigma_y = 3\sigma$). In this version, the Gaussians occur at the four basic scales, while the eight LoG filters occur at $\sigma$ and $3\sigma$. In LM Large, the filters occur at the basic scales $\sigma = \{\sqrt{2}, 2, 2\sqrt{2}, 4\}$. For the current work, the values of the parameters are set to their default values, which results in a set of 48 filters. Figure 1a shows the set of LM filters used in this study.
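As a sketch of the oriented edge and bar filters that make up most of the LM bank, the following snippet builds a first or second derivative-of-Gaussian kernel with an elongation factor of three, as described above; the kernel size, the axis along which the derivative is taken and the L1 normalization are assumptions of this sketch.

```python
import numpy as np

def lm_derivative_kernel(size, sigma, theta, order=1, elongation=3.0):
    """Oriented derivative-of-Gaussian kernel: an anisotropic Gaussian (sigma_x = sigma,
    sigma_y = elongation * sigma) differentiated 'order' times along the rotated x-axis."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)

    # Rotate coordinates so the derivative is taken along direction theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)

    sx, sy = sigma, elongation * sigma
    g = np.exp(-(xr ** 2 / (2 * sx ** 2) + yr ** 2 / (2 * sy ** 2)))

    if order == 1:                        # edge filter: first derivative along xr
        kernel = -(xr / sx ** 2) * g
    else:                                 # bar filter: second derivative along xr
        kernel = ((xr ** 2 - sx ** 2) / sx ** 4) * g

    return kernel / np.abs(kernel).sum()  # simple L1 normalization (an assumption)
```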
2.4. Root Filter Set Filters
The original Root Filter Set (RFS) consists of 38 filters, very similar to those of the LM filter bank. The filters include a Gaussian and a Laplacian of Gaussian, both with rotational symmetry and $\sigma = 10$. This filter bank also includes edge filters at three scales $(\sigma_x, \sigma_y) = \{(1, 3), (2, 6), (4, 12)\}$ and bar filters at the same three scales. These two sets are oriented and occur at six orientations at each scale. In the literature, RFS is usually used in a maximum-response manner, in which only the maximum response over orientations is considered [43]. This is mainly to achieve rotational invariance. However, in our implementation, the original set is used without considering the maximum response. The parameters of the filter bank are kept at their default values, which results in a set of 38 filters of size $49 \times 49$ pixels with six orientations. Figure 1d shows the set of original RFS filters.
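For completeness, the maximum-response usage mentioned above (as in the maximum-response approach of [43]) amounts to keeping, per pixel, only the strongest response over the six orientations of each oriented filter type and scale; a trivial sketch, assuming the oriented responses have already been computed, is given below. Note again that this work uses the full RFS responses instead.

```python
import numpy as np

def max_response(oriented_responses):
    """Collapse responses of one filter type at one scale over its orientations.

    oriented_responses : array of shape (n_orientations, H, W), e.g., the six
                         oriented edge (or bar) responses of the RFS bank.
    Returns the per-pixel maximum over orientations, giving rotational invariance.
    """
    return np.max(oriented_responses, axis=0)
```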
2.5. Steerable Filters
The idea behind the design of steerable filters comes from this question: what are the conditions for any function $f(x, y)$ to be written as a linear sum of rotated versions of itself [26]? This can be represented as follows:
$$f^{\theta}(x, y) = \sum_{j=1}^{M} k_j(\theta)\, f^{\theta_j}(x, y),$$
where $M$ is the number of required terms and $k_j(\theta)$ is the set of interpolation functions. In polar coordinates, $r = \sqrt{x^2 + y^2}$ and $\phi = \arctan(y/x)$, and considering $f$ as a function that can be expanded in a Fourier series in polar angle $\phi$, we have:
$$f(r, \phi) = \sum_{n=-N}^{N} a_n(r)\, e^{i n \phi}.$$
It has been proven that this holds if and only if the interpolation functions $k_j(\theta)$ are solutions of:
$$\begin{pmatrix} 1 \\ e^{i\theta} \\ \vdots \\ e^{iN\theta} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ e^{i\theta_1} & e^{i\theta_2} & \cdots & e^{i\theta_M} \\ \vdots & \vdots & & \vdots \\ e^{iN\theta_1} & e^{iN\theta_2} & \cdots & e^{iN\theta_M} \end{pmatrix} \begin{pmatrix} k_1(\theta) \\ k_2(\theta) \\ \vdots \\ k_M(\theta) \end{pmatrix}$$
for $n = 0, 1, \ldots, N$ and $j = 1, \ldots, M$. As stated and proven in [26], all functions band-limited in angular frequency are steerable, given enough basis filters. Steerable filters, 2D or 3D, have proven to be very useful in different fields of computer vision over the years [44,45]. Here, steerable filters of up to order five are used for creating dense descriptors. Odd orders are edge detectors, while even orders detect ridges. The number of orientations is set to eight. Figure 1e shows the set of steerable filters used here.
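A classic concrete instance of the steering property is the first derivative of a Gaussian: its response at any orientation θ can be synthesized from only two basis filters, the x- and y-derivatives of the Gaussian, with interpolation functions cos θ and sin θ [26]. A minimal sketch follows (the kernel size and σ are arbitrary choices here).

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_first_derivatives(size, sigma):
    """Basis kernels G1_0 (d/dx of a Gaussian) and G1_90 (d/dy of a Gaussian)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return -(x / sigma ** 2) * g, -(y / sigma ** 2) * g

def steered_response(image, theta, size=11, sigma=2.0):
    """Response to the first-derivative-of-Gaussian filter steered to angle theta:
    R(theta) = cos(theta) * (image * G1_0) + sin(theta) * (image * G1_90)."""
    g1_0, g1_90 = gaussian_first_derivatives(size, sigma)
    r0 = convolve(image, g1_0, mode='nearest')
    r90 = convolve(image, g1_90, mode='nearest')
    return np.cos(theta) * r0 + np.sin(theta) * r90
```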
2.6. Histogram of Oriented Gradients
Histogram of Oriented Gradients (HOG) is based on computing a histogram of gradients in pre-determined directions [32,33]. The main idea comes from the observation that local object appearance and shape can usually be characterized by the distribution of local intensity gradients or edge directions. This can be implemented by dividing the image into small spatial regions and, for each region, accumulating a one-dimensional histogram of gradient directions or edge orientations. To be more precise, at first, the gradient magnitudes in the horizontal and vertical directions are computed, which results in a two-dimensional vector field for the image. In the second step, the magnitude of the gradient is quantized into several orientations. Then, these quantized descriptors are aggregated over blocks of pixels in the image domain. The next step is concatenating the responses of several adjacent pixel blocks, which is followed by normalizing the descriptors and sometimes the use of Principal Component Analysis (PCA) for reducing the dimensionality. Following the work of [33], Haar features are used for calculating the gradient magnitude responses. To achieve the dense descriptors, descriptors are computed over sub-blocks of fixed size. Assuming a sectioning of $5 \times 5$ of the input images with the number of orientations equal to eight, the final dense descriptors are of size 200.
Figure 2 shows a sample response of HOG on the Badafshan (the first author's village in the central part of Iran) image.
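A simplified sketch of the dense HOG construction described above: per-pixel gradient orientations are quantized into eight bins and gradient magnitudes are accumulated over small cells. Block normalization, the Haar-feature gradients of [33] and PCA are omitted, and the cell size is an arbitrary choice of this sketch.

```python
import numpy as np

def dense_hog_cells(image, cell=8, n_bins=8):
    """Accumulate gradient magnitudes into n_bins orientation bins per cell.

    Returns an array of shape (H // cell, W // cell, n_bins); concatenating the
    histograms of neighboring cells (and normalizing) would give the final descriptor.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                # unsigned orientations in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    h, w = image.shape
    hc, wc = h // cell, w // cell
    hist = np.zeros((hc, wc, n_bins))
    for i in range(hc * cell):
        for j in range(wc * cell):
            hist[i // cell, j // cell, bins[i, j]] += mag[i, j]
    return hist
```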
2.7. Speeded Up Robust Features
The Speeded Up Robust Features (SURF) method is designed to be scale and rotation invariant [39]. In SURF, features are detected using a Hessian detector, which in general consists of convolving second order derivative Gaussian masks with the image in the horizontal, vertical and diagonal directions to build the Hessian matrix for each pixel. Taking advantage of an approximation of these second order derivative Gaussian masks, the approximate determinant of the Hessian matrix can be computed very quickly. Moreover, the need for Gaussian smoothing and sub-sampling of the image for building the image pyramid can be eliminated by up-sampling these Haar-wavelet approximations and creating bigger masks for higher scales. Finally, the desired features are extracted by using non-maximum suppression in a $3 \times 3 \times 3$ neighborhood, followed by interpolating the maxima of the determinant of the Hessian matrix in scale and image space with the method proposed in the work of [46].
Since in this work we are only interested in dense descriptors, there is no need for detecting the features within the image. However, the responses to the Haar wavelets are still needed. For defining the descriptor, at first, the orientation at each pixel is assigned. This is done by combining the results of applying the Haar wavelets in a circular neighborhood around the interest point (or around each pixel, for defining dense descriptors). Next, a square region centered around the interest point and oriented along the estimated orientation is created, which is of the size $20s \times 20s$, $s$ being the scale. This region is further divided into smaller $4 \times 4$ sub-regions, each of the size $5s \times 5s$. With $d_x$ and $d_y$ as the horizontal and vertical Haar-wavelet responses, respectively, a four-element vector is defined for each sub-region as $\mathbf{v} = \left(\sum d_x, \sum d_y, \sum |d_x|, \sum |d_y|\right)$, in which the summations are calculated over each sub-region. Having a vector of four elements for each of the 16 sub-regions, a vector of size 64 is created as the descriptor of the pixel. For more detail on the different types of SURF, the reader is referred to [39].
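The per-sub-region aggregation described above can be sketched as follows, assuming the horizontal and vertical Haar-wavelet responses $d_x$ and $d_y$ have already been computed over the oriented window around a pixel and resampled onto a square grid; the grid size and the final normalization are assumptions of this sketch.

```python
import numpy as np

def surf_descriptor(dx, dy, n_sub=4):
    """Build the 64-element SURF-style descriptor from Haar-wavelet responses.

    dx, dy : square arrays of horizontal/vertical Haar responses over the oriented
             window around a pixel (e.g., 20 x 20 samples).
    Each of the n_sub x n_sub sub-regions contributes (sum dx, sum dy, sum |dx|, sum |dy|).
    """
    n = dx.shape[0]
    step = n // n_sub
    desc = []
    for i in range(n_sub):
        for j in range(n_sub):
            sx = dx[i * step:(i + 1) * step, j * step:(j + 1) * step]
            sy = dy[i * step:(i + 1) * step, j * step:(j + 1) * step]
            desc.extend([sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()])
    v = np.asarray(desc)
    return v / (np.linalg.norm(v) + 1e-12)   # normalize to unit length
```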
2.8. Scale-Invariant Feature Transform
The four stages of the Scale-Invariant Feature Transform (SIFT) method are [23]: (1) scale-space extrema detection; (2) keypoint localization; (3) orientation assignment; and (4) keypoint descriptors. For the first step, a Gaussian function is considered as the scale-space kernel based on the work of [47]. By finding the scale-space extrema in the response of the image to Difference-of-Gaussian (DoG) masks, not only is a good approximation of the scale-normalized Laplacian of Gaussian provided, but also, as pointed out in the work of [48], the detected features are more stable. The local extrema of the response of the image to the DoG masks of different scales are found in a $3 \times 3 \times 3$ neighborhood of the interest point. For accurate localization of the keypoints in the set of candidate keypoints, a 3D quadratic function is fitted to the local sample points. By applying a threshold on the value of this fitting function at the extremum, keypoints located in low contrast regions that are highly affected by noise are eliminated. Moreover, thresholding the ratio of principal curvatures can also eliminate poorly-defined feature points near edges. After finalizing the keypoints, orientations can be assigned. This is done using the gradients computed in the first step of the algorithm when computing the DoG responses. Creating a 36-bin histogram of orientations in the keypoint's neighborhood is the next step. Each neighbor contributes to the histogram by a weight computed based on its gradient magnitude and also by a Gaussian-weighted circular window around the keypoint.
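The scale-space extremum test mentioned above, in which a sample is compared against its 26 neighbors in a $3 \times 3 \times 3$ region spanning three adjacent DoG levels, can be sketched as follows (the function name and array layout are assumptions of this sketch).

```python
import numpy as np

def is_scale_space_extremum(dog_below, dog_curr, dog_above, y, x):
    """True if dog_curr[y, x] is a maximum or minimum of its 3x3x3 neighborhood
    spanning the current DoG level and the levels directly below and above."""
    patch = np.stack([lvl[y - 1:y + 2, x - 1:x + 2]
                      for lvl in (dog_below, dog_curr, dog_above)])
    value = dog_curr[y, x]
    return value >= patch.max() or value <= patch.min()
```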
The final step is creating the local image descriptor. Using the location, scale and orientation determined for each keypoint up until now, the local descriptor is created in a manner that makes it invariant to differences in illumination and viewpoint. This is done by combining the gradients at keypoint locations, as computed in the previous steps and weighted by a Gaussian function, over each $4 \times 4$ sub-region in a $16 \times 16$ neighborhood around the keypoint into eight-bin histograms. This results in a $4 \times 4 \times 8 = 128$ element vector for each keypoint. Normalizing the feature vectors to unit length reduces the effect of linear illumination changes. This is usually followed by thresholding the normalized vector and re-normalizing it again to reduce the effects of large gradient magnitudes. Here, the descriptors are computed for all of the image pixels to create a dense descriptor for the image. For more information regarding the details and implementation of SIFT, the reader is referred to [23].
Table 1 summarizes the number of dimensions of the various dense descriptors.
5. Conclusions
The concept of optical flow estimation, as the procedure of finding displacements of intensity patterns in image sequences, has gained much attention in the computer vision community. The general formulation usually involves the minimization of a constrained energy functional for finding the dense flow fields between the images in the sequence. One of the basic assumptions in this formulation concerns the brightness of different regions in the images: it is assumed that the brightness remains constant regardless of the movement of pixels/objects. This assumption, even though not completely true in general, drives many modern optical flow estimation algorithms. However, many different techniques have been proposed in recent years that try to solve the problem in a more general manner. One solution can be sought in formulating the problem by the use of more structurally-representative means: local descriptors. Local descriptors have been used in various branches of computer vision, namely object/texture recognition, image retrieval, etc. However, there has been limited investigation on the use of local descriptors for optical flow estimation [34,35].
In the current work, the use of several dense descriptors is investigated for optical flow estimation. Given the framework provided by Liu et al. [34], the first aim is to provide a more comprehensive comparison with the state-of-the-art in optical flow estimation, and therefore, several tests are carried out using the Middlebury optical flow datasets [3]. Aiming at extending the framework, the use of seven descriptors, namely the Leung–Malik (LM) filter bank, the Gabor filter bank, the Schmid filter bank, Root Filter Set (RFS) filters, steerable filters, Histogram of Oriented Gradients (HOG) and Speeded Up Robust Features (SURF), is introduced and investigated here. The presented results show great potential for using dense local descriptors for the problems of optical flow estimation and stereo matching, both in the computer vision and the medical image processing communities, and can stimulate interest in devising more rigorous algorithms in the field.