Article

Estimating Optical Flow with Streaming Perception and Changing Trend Aiming to Complex Scenarios

1 School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 3907; https://doi.org/10.3390/app13063907
Submission received: 11 January 2023 / Revised: 12 March 2023 / Accepted: 16 March 2023 / Published: 19 March 2023

Abstract
Most recent works on optical flow estimation focus on introducing more effective CNNs or even the Transformer architecture that is currently popular in the computer vision field, but they all still follow the traditional two-frame input structure. In this paper, we argue that optical flow estimation should focus more on the dynamic information in the image sequence. To that end, we present a straightforward but effective streaming framework equipped with a novel prediction module for capturing optical flow trends, a point that has not yet been discussed in previous works. By employing the backbone of the strong baseline to extract a stream of features from three consecutive input frames, this module can predict the current state and update the previous changing trend of optical flow for feature fusion at the next stage. Additionally, we redesign the training loss, providing two new cost functions, in particular one that measures the variations in the optical flow trend. At the forward stage, we deploy a historical buffer to maintain the inference speed so that it is unaffected by the additional input frame. Compared with the strong baseline, our method improves the average end-point error by 21.12% on Sintel Clean and 12.94% on Sintel Final. In addition, it also outperforms existing methods in real-world scenarios with low brightness or fast-moving targets.

1. Introduction

Optical flow estimation still has potential and room for development compared to other, relatively mature vision tasks such as image classification, segmentation, and object detection. Thanks to the high sensitivity of optical flow to pixels in motion, it is very promising in complex application scenarios such as motion capture, pose estimation, video interpolation, and even autonomous driving. Horn and Schunck [1] introduced the concept of optical flow many years ago. Since then, researchers have been working on how to more accurately match features or pixels in consecutive "two-frame" images, from the traditional optical flow computation method proposed by Lucas and Kanade in 1981 to current optical flow estimation algorithms based on deep learning. Although designing algorithms around the definition of optical flow is more in line with traditional theory and common sense, the working mechanism of neural networks is not exactly the same as that of traditional algorithms. The current state-of-the-art methods of optical flow estimation all follow the traditional input structure, which only extracts features from two consecutive frames. Even though they have made significant progress in both the AEPE (Average End-Point Error) metric and time cost, such an input structure ignores the influence of motion continuity, and the results are limited by the loss of this dynamic information.
In summary, some basic qualities of motion, such as continuity and inertia, should not be neglected when estimating the motion state of an object. Humans, on the other hand, are able to observe an object's continuous motion, gather information from its previous motion state, and swiftly build a plausible motion model to predict the object's next move. This is even more evident in high-speed ball games, where well-trained athletes can accurately predict the position of the ball at the next moment from its trajectory and movement trend. Achieving such an effect requires not only the spatial information of the motion but also its temporal information. In order to build a model for capturing these abstract motion features, deep learning once again caught our attention, because it can extract latent features and learn from them autonomously. Moreover, for deep neural networks, more effective auxiliary information usually contributes positively to the forward-backward process. Different pixels (or blocks of pixels) in images from consecutive time nodes may move in completely different states (including speeds, accelerations, etc.) within the same frame. These differences come from multiple sources: different object sizes, moving states, occlusions, and different topological distances. Therefore, based on the existing optical flow estimation architecture, this paper designs a novel dynamic prediction module containing temporal information to predict and update the optical flow trends of all pixels. With this module, optical flow estimation can be more consistent with the moving patterns of objects when processing an image sequence.
The main contributions of this paper are as follows. Compared with the input structure of existing optical flow estimation methods, we propose to use three image frames $(f_{t-2}, f_{t-1}, f_t)$ as input instead of two image frames $(f_{t-1}, f_t)$. We define the features output by the backbone as optical flow frames $\Gamma_{t-1}, \Gamma_t$ and show that the dynamic prediction module, which takes consecutive optical flow frames as input, is beneficial for optical flow estimation, especially in scenarios with fast-moving targets. We demonstrate that adding temporal information, supplemented by a specially designed cost function, to the optical flow estimation process does improve the accuracy. Finally, we introduce a historical buffer into the forward process to make sure our proposed model improves without unduly compromising its speed. Compared with the strong baseline, RAFT (Recurrent All-Pairs Field Transforms), our method improves the AEPE by 21.12% on the Clean benchmark of the Sintel dataset (from 1.61 to 1.27) and also reaches the SOTA level in the other metrics listed in the experiments section of this paper. In Section 5, we demonstrate the advantages of our method over other methods at the real-world application level.

2. Related Works

Before RAFT emerged, the most compact and effective framework for estimating optical flow was PWC-Net [2], which contained five parts: a Feature Pyramid [3], a Warping Layer [4], a Cost Volume Layer [5], an Optical Flow Estimator [6], and a Context Network. Almost all optical flow methods were implemented on this basis until 2020, when Teed and Deng [7] introduced a brand-new architecture consisting of three components (a feature extraction coding layer, a 4D correlation layer, and a recurrent update operator) into the optical flow estimation domain, substantially reduced the AEPE, and became the state-of-the-art. RAFT mainly addresses three drawbacks of previous optical flow estimation frameworks. First, previous frameworks generally adopted a coarse-to-fine design, i.e., estimating optical flow on a low-resolution feature map, then upsampling and adjusting it on a high-resolution one. In contrast, RAFT maintains and updates a single, fixed optical flow field directly at high resolution, reducing the prediction error due to low resolution, the probability of missing small but fast-moving targets, and the number of training iterations. Second, previous frameworks included some form of iterative refinement but did not tie the weights between iterations, which limited the number of iterations. For example, IRR [8] used FlowNetS [9] or PWC-Net as its iterative unit: the former is limited by the size of the network (up to 38 M parameters), so it is only effective for five iterations; the latter is limited by the number of pyramid levels [10,11]. As a new architecture, the update cycle of RAFT is recurrent and lightweight, with a mere 2.7 M parameters used for the update operator, and it can be iterated more than a hundred times. Finally, the fine-tuning module in previous frameworks usually used a plain convolution or correlation layer [12]. In comparison, the update operator designed in RAFT consists of a GRU (Gated Recurrent Unit) [13] structure, which contains some convolutional operations and performs better on 4D multiscale correlation volumes. Therefore, in this paper, the RAFT architecture is chosen as the baseline.
The GMA (Global Motion Aggregation) [14] module addresses the problem of estimating the optical flow of occluded pixels. It is an optimization of the RAFT method, inspired by the Transformer architecture [15], which is based on the Self-Attention mechanism [16,17] and has been sweeping the field of deep learning in recent years. The architecture is known for its remarkable ability to model long-range dependencies, so it is more suitable for global estimation [18] than CNNs, which have local limitations. The GMA module uses the projection of context features as the query and key features in the attention mechanism and the projection of motion features as the value features, thus modeling the motion of pixels that are occluded in the image in a globally aggregated way.
However, the Transformer architecture has the well-known problem that it requires significant additional overhead for training and prediction [10]. Therefore, taking the GMA global estimation module and its aggregation of motion pixels as a reference, we design a dual-stream input structure and a continuous-momentary optical flow feature fusion module with a historical buffer. The details of the structure are presented in the next section.
Streaming perception is mainly used in object detection tasks to detect and track multiple continuously moving targets. It was pioneered by Li, Wang, and Ramanan [19], who introduced the concept and derived the relevant baseline and evaluation metrics; their meta-detector is worth referring to, as it draws on a Kalman filter [20], decision scheduling theory [21], and asynchronous tracking [22,23] to achieve streaming perception. Yang et al. [24] proposed another streaming perception structure based on this concept, which inputs two consecutive image frames to predict the target state in the next frame and uses the ground truth of the next frame to supervise the parameters of the whole network. It is worth mentioning that this structure improves prediction robustness for targets moving at different speeds, preserving the original features extracted from the current image frame while adding the changed features fused across consecutive frames. In this paper, we are inspired by this idea and consider all moving pixels as different moving targets. Since we temporarily disregard the time delay and the prediction of the next frame, we directly input consecutive frames to estimate the optical flow of the current frame and use the ground truth optical flow of the current frame for supervision. In Section 4, we examine whether such a design also has a positive effect on the optical flow estimation of different motion pixels.

3. Method

3.1. Parameters Definition

In this subsection, we explain some symbols that will appear in the figures and equations of this paper.
Image frame: the time node of a certain image in the time-ordered image sequence of the input structure, which corresponds to $I_n$ in the diagram of the network structure, where $n$ is in the range $[0, N]$ and $N$ is the total number of images in an image sequence.
Optical flow frame: a new term proposed specifically to explain the network structure designed in this paper. It represents the time node of a certain feature map in the optical flow sequence (a new sequence generated by pairs of consecutive image frames, also ordered by time) and corresponds to $F_i$ in the diagram of the network structure. The functional mapping between the optical flow frame and the image frame is $F_i = \Theta(I_i, I_{i-1})$, where $i$ is in the range $[1, N]$, because the first optical flow frame can only be generated after the second image frame is input into the network. $\Theta$ indicates the backbone of a certain optical flow estimation method; its input is two consecutive image frames, and its output is one feature map. In this paper, we choose the backbone of the strong baseline, RAFT, as $\Theta$ to extract the optical flow features.
Optical flow value: we set the optical flow value as $f_i$ in the formulas of this paper. $f_i$ is a matrix with dimensions $H \times W \times 2$, and each element of the matrix represents the optical flow value of a pixel in the image. Thus, $f_i^{gt}$ and $f_i^{pred}$ represent the ground truth optical flow value and the estimated optical flow value of the $i$th optical flow frame, respectively.
Optical flow trend: to make our proposed method more sensitive to the optical flow motion, especially at the position of object contours, we introduce the parameter OFT (Optical Flow Trend) to describe the change of the optical flow trend. Its value is obtained from the pixel-by-pixel difference of the optical flow values $f_i, f_{i-1}$ between two consecutive optical flow frames and is represented by the symbol $\Gamma$. Therefore, the OFT of the $i$th optical flow frame is computed as $\Gamma_i = f_i - f_{i-1}$.
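As a minimal illustration of these definitions (the tensor shapes and variable names below are ours, not taken from a released implementation), the OFT is just a pixel-wise difference of two consecutive optical flow values:

```python
import torch

# Each optical flow value f_i is an (H, W, 2) map of per-pixel (dx, dy) displacements.
H, W = 436, 1024                  # e.g., the Sintel image resolution
f_prev = torch.randn(H, W, 2)     # f_{i-1}: optical flow values of the previous flow frame
f_curr = torch.randn(H, W, 2)     # f_i:     optical flow values of the current flow frame

# Optical Flow Trend (OFT) of frame i: Gamma_i = f_i - f_{i-1}, also of shape (H, W, 2).
gamma_i = f_curr - f_prev
```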

3.2. Structure Overview

This section describes the training and inference process of our Streaming Perception and Changing Trend method (referred to as SPCT hereafter).
In the training phase, as shown in Figure 1, the network receives three image frames $(f_{t-2}, f_{t-1}, f_t)$, which are input to the backbone in the groups $(f_{t-2}, f_{t-1})$ and $(f_{t-1}, f_t)$. The backbone has the same structure as in RAFT and employs a share-weight $1 \times 1$ convolution layer to reduce the channels of both output flow features. In the next step, the 2D features output from the two consecutive optical flow frames are encoded into two feature streams, one continuous and the other momentary. These two streams are input into the Continuous-Momentary Feature Fusion (hereinafter C-MFF) module simultaneously, which integrates them into one fused optical flow feature through a simple concatenate operation. The network then decodes the fused feature through the GRU structure to obtain the optical flow estimation at frame $t$, compares it with the ground truth optical flow value as supervision, and finally calculates the losses and updates the parameters of the network by backpropagation.
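A minimal PyTorch sketch of this training-time forward pass is given below. The `backbone` argument, the feature width, and the module names are assumptions standing in for the RAFT encoder and GRU update operator, which are not reproduced here:

```python
import torch
import torch.nn as nn

class SPCTTrainingForward(nn.Module):
    """Illustrative three-frame forward pass: two backbone calls produce the
    optical flow frames, a share-weight 1x1 conv reduces their channels, and
    the two streams are concatenated for the C-MFF stage."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.backbone = backbone                              # Theta(I_{i-1}, I_i) -> flow feature
        self.reduce = nn.Conv2d(feat_dim, feat_dim // 2, 1)   # shared 1x1 channel reduction

    def forward(self, img_tm2, img_tm1, img_t):
        feat_prev = self.reduce(self.backbone(img_tm2, img_tm1))  # flow frame t-1 (continuous stream)
        feat_curr = self.reduce(self.backbone(img_tm1, img_t))    # flow frame t (momentary stream)
        # The fused feature would then be decoded by the GRU update operator
        # into the flow estimate at frame t and supervised by the ground truth.
        return torch.cat([feat_prev, feat_curr], dim=1)
```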
For the forward inference stage, compared to traditional optical flow estimation algorithms that only need to extract the optical flow frames once, our proposed structure requires extracting two optical flow frames from three consecutive image inputs before post-processing. In order not to double the computation of estimating the optical flow, as depicted in Figure 2, a historical buffer is used to store the last extracted optical flow feature, so that only the current optical flow feature needs to be extracted at each step. After this, the network only needs to input the current optical flow feature, along with the one cached in the historical buffer, into the C-MFF module, rather than re-extract the same feature that the previous step has already processed. At the beginning of the image sequence, we directly copy the first optical flow feature $f_0$ as a pseudo-historical buffer to fit the input form $(f_0, f_0, f_1)$.
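A sketch of how such a buffer could be maintained at inference time is shown below; the class and method names are ours, and the backbone is any callable mapping two images to one flow feature:

```python
class HistoricalBuffer:
    """Caches the last extracted optical flow feature so that each new frame
    triggers only one backbone pass, keeping inference cost close to RAFT's."""
    def __init__(self):
        self.prev_feature = None

    def step(self, backbone, img_prev, img_curr):
        curr_feature = backbone(img_prev, img_curr)   # extract only the current flow feature
        if self.prev_feature is None:
            # Start of the sequence: duplicate the first feature as a pseudo history,
            # mirroring the (f_0, f_0, f_1) input form described above.
            self.prev_feature = curr_feature
        continuous, momentary = self.prev_feature, curr_feature
        self.prev_feature = curr_feature              # cache for the next frame
        return continuous, momentary                  # both are fed into the C-MFF module
```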

3.3. Continuous-Momentary Feature Fusion Module

The input of the C-MFF module contains two different data streams: (1) the continuous flow feature stream, which can be considered dynamic information of the optical flow, obtained by concatenating two flow features at three different resolutions (1/2, 1/4, 1/8), all extracted from the share-weight backbone; and (2) the momentary flow feature stream, which is the flow feature output by the backbone at the same time, can be considered static information of the optical flow, and is connected directly to the continuous flow feature through a residual structure. For the fusing strategy in the C-MFF module, we conducted ablation experiments with reference to several approaches tested by Yang [24] (including element-wise addition, concatenation, the Spatial Transformer Network proposed by [25], and the non-local network proposed by [26,27]). We find that, for our proposed structure, direct concatenation achieves the best results and performance. The details of the ablation experiments on the various fusion strategies are presented in Section 4.3.
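A minimal sketch of the concatenation-based fusion with the residual connection described above is given below; the channel count and the 1x1 projection that restores it are our assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class CMFF(nn.Module):
    """Continuous-Momentary Feature Fusion sketch using the concatenation
    strategy favoured by the ablation in Section 4.3."""
    def __init__(self, channels: int = 128):
        super().__init__()
        # Project the concatenated streams back to the original channel count
        # so downstream modules see the expected feature size (assumption).
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, continuous_feat, momentary_feat):
        fused = torch.cat([continuous_feat, momentary_feat], dim=1)  # channel-wise concat
        fused = self.proj(fused)
        return fused + momentary_feat   # residual link keeps the momentary (static) information
```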

3.4. Training Loss

In order to correct the parameters in the network more effectively, this paper designed a combined loss for the proposed structure. It consists of three different types of cost functions, including motion cost, angle cost, and trend cost.
The first part of the loss is called the motion cost. It takes the Smooth L1 Loss [28] between the estimated optical flow and the ground truth optical flow; compared to the L1 Loss, it is smoother as the error tends to zero, and compared to the L2 Loss, it is less sensitive to outliers when the input value is large. The loss function of the first part is defined as:
$$ L_i^{Motion} = \mathrm{SmoothL1}\left( \left\| f_i^{gt} - f_i^{pred} \right\|_1 \right) $$ (1)
$$ \mathrm{SmoothL1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} $$ (2)
where $L_i^{Motion}$ denotes the momentary displacement loss of the $i$th optical flow frame; $f_i^{gt}$ and $f_i^{pred}$ denote the ground truth and the estimation of the $i$th optical flow frame, respectively.
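A compact sketch of this term, reading Equations (1) and (2) as an element-wise Smooth L1 over the flow error with a mean reduction (the reduction choice is our assumption):

```python
import torch
import torch.nn.functional as F

def motion_cost(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Smooth L1 between the estimated and ground-truth flow, averaged over pixels."""
    return F.smooth_l1_loss(flow_pred, flow_gt, beta=1.0)
```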
The second part of the loss, the angle cost, introduces the directional difference between the estimation and the ground truth. The L1 Loss used in RAFT only takes the absolute value of the difference between the optical flow estimation and the ground truth, which ignores the error in the moving direction, or offset angle, of each pixel. To address this drawback, the loss function in this paper additionally introduces the angle cost proposed by Gevorgyan [29] to assist the motion cost:
$$ L_i^{Angle} = \frac{\sum_{j,k} \left( 1 - 2\sin^2\left( \mathrm{yaw} - \frac{\pi}{4} \right) \right)}{j \times k} $$ (3)
where $\mathrm{yaw}$ is the pixel-wise offset angle from the predicted optical flow $f_i^{pred}(j,k)$ to the ground truth optical flow $f_i^{gt}(j,k)$; its calculation formula is given in [29].
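The sketch below implements Equation (3) under the assumption that the yaw of a pixel is the angle between its predicted and ground-truth flow vectors; this reading of [29] and the epsilon guard are ours:

```python
import math
import torch

def angle_cost(flow_pred: torch.Tensor, flow_gt: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Angle cost over (H, W, 2) flow maps, averaged over the j x k pixels."""
    dot = (flow_pred * flow_gt).sum(dim=-1)
    norms = flow_pred.norm(dim=-1) * flow_gt.norm(dim=-1)
    yaw = torch.acos((dot / (norms + eps)).clamp(-1.0, 1.0))       # per-pixel offset angle
    return (1.0 - 2.0 * torch.sin(yaw - math.pi / 4) ** 2).mean()  # Equation (3)
```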
The third part of the loss, the trend cost, is designed specifically for the fused optical flow features, with the purpose of comparing the trend of the estimated optical flow against the trend of the ground truth optical flow so that the former can be regressed towards the latter as much as possible. When calculating the pixel-wise trend of the optical flow along the two image dimensions (height and width), we want to increase the penalty for the case where the trend of the estimation is exactly opposite to the trend of the ground truth; therefore, the cost function of this part is designed as a Differential L2 Loss, defined as:
$$ L_i^{Trend} = \frac{\sum_{j,k} \left( \Gamma_i^{gt} - \Gamma_i^{pred} \right)^2}{j \times k} $$ (4)
where $L_i^{Trend}$ denotes the continuous inter-frame trend loss from frame $i-1$ to frame $i$ of the optical flow, which is calculated from $i = 1$ with its initial value set to 0; $j$ and $k$ represent the number of rows and columns in the image matrix.
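A sketch of Equation (4), with the trends computed from consecutive flow estimates and ground truths as defined in Section 3.1 (the mean reduction over pixels is our reading of the $j \times k$ normalization):

```python
import torch

def trend_cost(f_pred_curr, f_pred_prev, f_gt_curr, f_gt_prev):
    """Differential L2 loss between predicted and ground-truth optical flow trends."""
    gamma_pred = f_pred_curr - f_pred_prev   # Gamma_i^pred = f_i^pred - f_{i-1}^pred
    gamma_gt = f_gt_curr - f_gt_prev         # Gamma_i^gt   = f_i^gt   - f_{i-1}^gt
    return ((gamma_gt - gamma_pred) ** 2).mean()
```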
Building on this, we note that the optical flow trend can differ significantly even between two consecutive moments at positions where a moving object completely leaves the frame (i.e., at the contour of the moving object) or where new moving objects appear (especially at the image boundary). The most important impact on the global error in this case comes from two kinds of trend change: abrupt changes in the optical flow state (from static to dynamic, or vice versa) and reversals of the optical flow's direction. To regulate the sensitivity to the emergence of new moving objects, we introduce the hyperparameters $\delta$ and $\mu$, which establish the weight of the trend loss in the total loss. To find the optimal combination of these two hyperparameters, we use a grid search; the details of this experiment are tabled in Section 4.3.
$$ R_i = \frac{\mathrm{numel}\left( \left| \Gamma_i^{gt} - \Gamma_{i-1}^{gt} \right| \geq \left| \Gamma_i^{gt} \right| \right)}{j \times k} $$ (5)
$$ \omega_i = \begin{cases} 1/\delta, & R_i \geq \mu \\ 1/(1 - R_i), & R_i < \mu \end{cases} $$ (6)
where the function $\mathrm{numel}$ counts the number of elements that satisfy the condition in brackets; $R_i$ denotes the proportion of elements in the whole image whose trend changes abruptly; and $\omega_i$ is the conditional weight, which varies according to $R_i$ and the hyperparameters $\delta$ and $\mu$.
$$ \hat{\omega}_i = \omega_i \cdot \frac{\sum_{i=1}^{N} L_i^{Trend}}{\sum_{i=1}^{N} \omega_i L_i^{Trend}} $$ (7)
It is important to note that applying the weights $\omega_i$ directly to the trend cost of each frame would also change the total trend cost, which may disturb the balance between the losses of positive and negative samples and thus reduce the performance of the whole optical flow estimation network. Referring to the strategy in [24,30], we normalize $\omega_i$ to $\hat{\omega}_i$ in Equation (7) to ensure that the sum of the total trend cost remains constant.
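The following sketch strings Equations (5)-(7) together. The exact abrupt-change test inside $\mathrm{numel}$, the counting of both trend channels (which lets $R_i$ exceed 1 and makes thresholds such as $\mu = 1.6$ meaningful), and the neutral weight for the first frame are our assumptions:

```python
import torch

def normalized_trend_weights(gamma_gt_seq, trend_losses, delta=0.35, mu=1.6):
    """gamma_gt_seq: list of ground-truth trends Gamma_i^gt, each of shape (H, W, 2).
    trend_losses:  list of per-frame scalar trend costs L_i^Trend.
    Returns the normalized weights omega_hat_i of Equation (7)."""
    weights = []
    for i, gamma in enumerate(gamma_gt_seq):
        if i == 0:
            weights.append(torch.tensor(1.0))      # no previous trend available (assumption)
            continue
        h, w = gamma.shape[0], gamma.shape[1]
        # Equation (5): count trend elements that change abruptly, divided by j x k pixels.
        abrupt = (gamma - gamma_gt_seq[i - 1]).abs() >= gamma.abs()
        r_i = abrupt.float().sum() / (h * w)
        # Equation (6): conditional weight controlled by delta and mu.
        w_i = 1.0 / delta if r_i >= mu else 1.0 / (1.0 - r_i)
        weights.append(torch.as_tensor(w_i, dtype=torch.float32))
    weights = torch.stack(weights)
    losses = torch.stack([torch.as_tensor(l, dtype=torch.float32) for l in trend_losses])
    # Equation (7): rescale so the weighted sum of trend costs stays constant.
    return weights * losses.sum() / (weights * losses).sum()
```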
Finally, following the loss design in RAFT [7], we apply a weight factor (controlled by the hyperparameter $\nu$) to each frame in a complete image sequence to prevent the partial losses from earlier frames from having a greater impact on the total loss than those from later frames. By introducing the exponentially increasing weights in Equation (8), the total loss achieves global optimization.
$$ L_{total} = \sum_{i=1}^{N} \nu^{N-i} \left( L_i^{Motion} + L_i^{Angle} + \hat{\omega}_i L_i^{Trend} \right) $$ (8)
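A sketch of the final combination in Equation (8), assuming per-frame scalar losses collected in Python lists and the normalized trend weights from the previous sketch:

```python
import torch

def total_loss(motion_losses, angle_losses, trend_losses, omega_hat, nu=0.8):
    """Exponentially weighted sum over the N optical flow frames; with nu < 1,
    later frames receive larger weights nu**(N - i)."""
    N = len(motion_losses)
    loss = torch.zeros(())
    for i in range(1, N + 1):
        frame_loss = (motion_losses[i - 1] + angle_losses[i - 1]
                      + omega_hat[i - 1] * trend_losses[i - 1])
        loss = loss + (nu ** (N - i)) * frame_loss
    return loss
```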

4. Experiments

4.1. Datasets Issues

Since our method differs from traditional optical flow estimation networks on the input side (ours needs three input frames for training and inference, while others only need two), we are limited in the choice of datasets. Many datasets, such as KITTI, only provide two consecutive frames for each scene, so we use Sintel as the main training and validation dataset. When we need to train or validate on KITTI, we replicate the first frame and use it as a pseudo $t-2$ frame, artificially extending the two given frames to three so that our model can run properly. However, such an approach severely degrades the accuracy of our model when training and validating on KITTI. Therefore, in this section, we do provide comparison results on KITTI, but only for reference purposes. For data in Sintel Clean, Figure 3 shows the visualized optical flow maps output by RAFT (including its variants) and our method, compared with the original image and the ground truth flow map.

4.2. Setups

We evaluate our SPCT model on Sintel [31] and KITTI 2015 [32]. We choose training strategies similar to those in [2,7,8,9,14]: using two NVIDIA RTX 2080Ti GPUs and the PyTorch [34] framework, we pre-trained our SPCT model for 100k iterations on FlyingChairs [9] with a batch size of 12 and for 100k iterations on FlyingThings3D [33] with a batch size of 6. After that, we fine-tuned the model for 50–120k iterations on Sintel, KITTI 2015, and HD1K [35] with a batch size of 6 for the different validation datasets. The hyperparameters of the backbone network were kept the same as in RAFT, and the hyperparameter $\nu$ was set to 0.8.

4.3. Ablations

As mentioned in Section 3.3 and Section 3.4, in this section we use three ablation experiments: to determine the best way of fusing the continuous and momentary optical flow features, to search for the optimal combination of the hyperparameters $\delta$ and $\mu$, and to verify the effectiveness of both the C-MFF module and the total loss that we designed.
Best fusion scheme. During the process of exploring the best fusion strategy for the two types of optical flow features, we compared two conventional algorithms (element-wise addition and concatenation) and two advanced algorithms (the spatial transformer network [25] and the non-local method [26,27]). In Table 1, we can clearly see that only the result after the element-wise addition operation is worse than the baseline; the other three algorithms bring varying degrees of improvement. This is probably because directly adding the two extracted optical flow features element-wise damages the inter-frame information as well as the changing trend, which is contrary to the whole idea of our SPCT. In addition, among the three methods that enhance performance, we obtain the optimal fused flow features through the simplest concatenation operation.
Best hyperparameter scheme. As presented in Equation (6) in Section 3.4, we use two hyperparameters, $\delta$ and $\mu$, as regulators to adjust the sensitivity of the trend cost; this sensitivity is meant to recognize pixels that change their state of motion abruptly. In detail, $\mu$ acts as a threshold on the proportion of such emerging pixels in the whole image, while $\delta$ controls the degree of attention the network pays to them. We keep the resulting weight $1/\delta$ above 1.0 so that the model pays fairly loose attention to unmatchable and mutable pixel pairs and stays focused on the useful motion trend during pixel movement. To find the optimal hyperparameter pair, we performed a grid search, as presented in Table 2, and found that the performance of our model peaked when $\delta$ and $\mu$ were set to 0.35 and 1.6, respectively.
Effect of the proposed module and loss. To validate the effect of the C-MFF module and the proposed loss function, we conduct ablation experiments on the baseline flow estimator. As shown in Table 3, "baseline" denotes our basic flow estimator, RAFT. Compared to this baseline, in the AEPE metric our proposed C-MFF module improves the performance on the Sintel training set by 0.11 (Clean) and 0.09 (Final). On the KITTI 2015 training set, the gain is less significant due to the incompatible input structure. In addition, our three-part loss function also has a positive effect on the performance. When the two are combined, the results are further boosted by roughly 5–13% compared with the baseline on the Sintel training set.

4.4. Quantitative Results on Standard Datasets

We report the full quantitative results in Table 4. To verify our design, we compare our performance with the original RAFT [7] and several variants of it, including RAFT-A [36], GMA (and its extension) [14], CRAFT [37], and FM-RAFT [38]. Although our model has certain advantages in the training-set columns of the table, those numbers can hardly represent the generalizability of the models, since they are evaluated on the training sets; we place them in Table 4 only to follow the conventional reporting format, for reference. The remaining columns of the table contain the results on the Sintel testing sets. The reason we did not evaluate on the KITTI 2015 testing set was explained earlier in this paper: every scene in the KITTI dataset only provides two consecutive image frames, which obviously limits the performance of our structure. Accordingly, we focus on the results evaluated on the Clean version of the Sintel testing set, where our model improves the AEPE metric from 1.61 to 1.27 compared with the original RAFT, achieving the best performance among all models listed in Table 4. On the Final version of the Sintel testing set, our model also improves the AEPE by 12.94% (from 2.86 to 2.49) compared with the original RAFT. This result is not as good as CRAFT and GMA, but it is moderately better than any other RAFT variant. We discuss a possible explanation for this phenomenon in Section 5.

5. Discussion

We have experimentally demonstrated that adding optical flow trend information to correct the estimations is very effective when designing optical flow estimation algorithms for processing image sequences. The idea of this paper is inspired by practices in the multi-object tracking field, where detected objects are usually matched at as many time nodes as possible, or even globally, and key variables such as the motion state and trend of the paired objects are then corrected and updated in a Kalman filter or a similar module. Transferring this concept to the field of optical flow estimation, our method considers each pixel as a target, accepts three consecutive image frames as input, and then matches and tracks pixels to correct and update them according to their optical flow values and trends over several iterations.
However, the optical flow trend proposed in this paper needs to extract optical flow features from three consecutive image frames as inputs, which may result in a loss or abrupt change of trend when a fast-moving target crosses the image boundary, affecting the accuracy of the optical flow estimation within a certain range of that boundary. Secondly, as the experiments in the previous section have shown, the performance of our model on the Sintel Final dataset does not reach the best result among the listed models, as it does on the Sintel Clean dataset. Comparing the Clean and Final datasets (shown in Figure 4), we find that the Final dataset has more artistic rendering and environmental effects, and also adds other processing such as motion blur, which makes the visual effect smoother to human eyes but provides no help in accurately representing the fast-moving target in each frame. These differences diminish the advantage of our method, which incorporates optical flow trend information in the estimation process. Finally, on the positive side, the performance of our SPCT model is expected to be more robust in scenes with a moving perspective (such as a camera mounted on a vehicle or a UAV) and under perturbation (mainly from a poorly fixed camera, such as overhead surveillance cameras, which are susceptible to significant shaking by wind). In such situations, networks with our module need fewer additional artificial compensation algorithms for parameters such as camera shift and exposure rate than those without.
To make our work more engineering- and application-oriented, we used a data-collecting vehicle (deployed for collecting autonomous driving scene data) and an overhead surveillance camera in a parking lot to verify the performance of our method in scenes with fast-moving targets and in low-brightness environments.
For the first case, we captured image sequences from a front camera on our experimental vehicle while driving into real road intersections during a driving scenario dataset-collecting project in Shanghai. As shown in Figure 5, we compared the optical flow estimation results predicted by RAFT, one of its variants (GMA), and our SPCT model, paying particular attention to fast-moving targets in the captured images. In this frame, only our SPCT model predicts the correct optical flow estimation result for the pedestrian (marked with a red dashed box) who suddenly stepped into the lane of the ego vehicle from behind the blue vehicle that obscured them.
The scene in Figure 6 is captured from an overhead monitor set up on a gantry in the parking lot. Camera shaking, caused by windy weather or insufficient fixation of the gantry itself, disturbs the captured consecutive frames. Comparing the estimation results of RAFT, its variant GMA, and our SPCT, we find that RAFT is nearly defeated by the shaking environment. GMA does a better job, but still gets confused by the lane lines and the parking lines. Thanks to streaming perception and the additional changing trend information, our SPCT still performs well on this task.
In a well-lit environment, the enhancement of our model is not particularly outstanding, so we performed the same experiment again at night, when brightness was lacking. From this experiment, we obtained a set of results with a much larger performance gap, as shown in Figure 7. Thanks to the input of three consecutive frames, our SPCT structure takes into account the optical flow at earlier moments and its continuous trend, unlike conventional structures such as RAFT. This makes SPCT more effective in low-brightness environments, with an even more significant performance improvement than in the daytime.

6. Conclusions

The two-frame input strategy has long been considered the default input mode for optical flow estimation. In this paper, we introduce a novel input structure that allows more frames to be processed. Our proposed C-MFF module is inspired by streaming perception, so it can fuse continuous and momentary feature maps and help compute more dynamic motion features. We also design a new loss function for our method, making it more sensitive to angle errors and trend change errors. Our method has been validated by the experiments in Section 4, where it shows a moderately better effect on the Sintel dataset, especially on the Clean version. In Section 5, by applying optical flow estimation models to real-world intersection and parking lot scenarios, we demonstrate that our SPCT is better than the other listed models at predicting fast-moving targets, resisting jitter interference, and remaining robust in low-brightness environments. We also expect further development, such as introducing our method into other optical flow estimation networks and applying them to autonomous driving scenarios, especially when vehicles have insufficient capability for perceiving unexpected events or operate in extreme environments.

Author Contributions

Conceptualization, Z.W. and W.Z.; Methodology, Z.W.; Software, Z.W.; Validation, W.Z. and B.Z.; Formal analysis, Z.W.; Investigation, Z.W. and W.Z.; Resources, W.Z.; Data curation, Z.W.; Writing—original draft, Z.W.; Writing—review & editing, W.Z. and B.Z.; Visualization, Z.W.; Supervision, W.Z. and B.Z.; Project administration, W.Z. and B.Z.; Funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by National Natural Science Foundation of China (No. 51805312, 52172388) and the APC was funded by Zhang W.

Data Availability Statement

The data that support this research are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Horn, B.K.; Schunck, B.G. Determining optical flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef] [Green Version]
  2. Sun, D.; Yang, X.; Liu, M.-Y.; Kautz, J. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
  3. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  4. Lee, K.-Y.; Chung, C.-D.; Chuang, Y.-Y. Scene warping: Layer-based stereoscopic image resizing. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 49–56. [Google Scholar]
  5. Xu, J.; Ranftl, R.; Koltun, V. Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1289–1297. [Google Scholar]
  6. Fleet, D.; Weiss, Y. Optical flow estimation. In Handbook of Mathematical Models in Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 237–257. [Google Scholar]
  7. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 402–419. [Google Scholar]
  8. Hur, J.; Roth, S. Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5754–5763. [Google Scholar]
  9. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  10. Huang, Z.; Shi, X.; Zhang, C.; Wang, Q.; Cheung, K.C.; Qin, H.; Dai, J.; Li, H. FlowFormer: A Transformer Architecture for Optical Flow. arXiv 2022, arXiv:2203.16194. [Google Scholar]
  11. Luc, P.; Couprie, C.; Lecun, Y.; Verbeek, J. Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 584–599. [Google Scholar]
  12. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470. [Google Scholar]
  13. Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Medford, MA, USA, 6–9 August 2017; pp. 1597–1600. [Google Scholar]
  14. Jiang, S.; Campbell, D.; Lu, Y.; Li, H.; Hartley, R. Learning to estimate hidden motions with global motion aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9772–9781. [Google Scholar]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  16. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  17. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  19. Li, M.; Wang, Y.-X.; Ramanan, D. Towards streaming perception. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 473–488. [Google Scholar]
  20. Welch, G.; Bishop, G. An Introduction to the Kalman Filter; University of North Carolina at Chapel Hill: Chapel Hill, NC, USA, 1995. [Google Scholar]
  21. Zilberstein, S. Using anytime algorithms in intelligent systems. AI Mag. 1996, 17, 73. [Google Scholar]
  22. Zilberstein, S.; Mouaddib, A.-I. Optimal scheduling of progressive processing tasks. Int. J. Approx. Reason. 2000, 25, 169–186. [Google Scholar] [CrossRef] [Green Version]
  23. Alzugaray, I.; Chli, M. Asynchronous corner detection and tracking for event cameras in real time. IEEE Robot. Autom. Lett. 2018, 3, 3177–3184. [Google Scholar] [CrossRef] [Green Version]
  24. Yang, J.; Liu, S.; Li, Z.; Li, X.; Sun, J. Real-time Object Detection for Streaming Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 21–24 2022; pp. 5385–5395. [Google Scholar]
  25. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  26. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  27. Sun, D.; Roth, S.; Black, M.J. Secrets of optical flow estimation and their principles. In Proceedings of the 2010 IEEE computer society conference on computer vision and pattern recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2432–2439. [Google Scholar]
  28. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  29. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  30. Wu, S.; Yang, J.; Wang, X.; Li, X. Iou-balanced loss functions for single-stage object detection. Pattern Recognit. Lett. 2022, 156, 96–103. [Google Scholar] [CrossRef]
  31. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 611–625. [Google Scholar]
  32. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
  33. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4040–4048. [Google Scholar]
  34. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  35. Kondermann, D.; Nair, R.; Honauer, K.; Krispin, K.; Andrulis, J.; Brock, A.; Gussefeld, B.; Rahimimoghaddam, M.; Hofmann, S.; Brenner, C. The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 27–30 June 2016; pp. 19–28. [Google Scholar]
  36. Sun, D.; Vlasic, D.; Herrmann, C.; Jampani, V.; Krainin, M.; Chang, H.; Zabih, R.; Freeman, W.T.; Liu, C. Autoflow: Learning a better training set for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 10093–10102. [Google Scholar]
  37. Sui, X.; Li, S.; Geng, X.; Wu, Y.; Xu, X.; Liu, Y.; Goh, R.; Zhu, H. CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 21–24 2022; pp. 17602–17611. [Google Scholar]
  38. Jiang, S.; Lu, Y.; Li, H.; Hartley, R. Learning optical flow from a few matches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 16592–16600. [Google Scholar]
Figure 1. Training pipeline. The whole training architecture consists of the RAFT backbone, a fusion module for continuous-momentary flow features, and a combination of the three types of losses.
Figure 2. Forward inference pipeline. By designing the historical buffer, the computing burden of the forward process in the inference phase is guaranteed not to be excessive compared to RAFT. In the beginning, we copy the feature from the first optical flow frame as the feature from a pseudo-previous frame. After that, the network will cache the last optical flow feature extracted from the backbone into the historical buffer. When the network is processing a new frame, it will obtain the previous feature directly from the buffer rather than computing the previous frames $(f_{t-2}, f_{t-1})$ again, and then combine it with the current extracted feature in the C-MFF module.
Figure 3. The original image and flow maps of ground truth, original RAFT, and its variant GMA(-p) models compared with the flow map of our SPCT model on the Sintel Clean dataset. The green circles indicate partial errors eliminated by our model where the other listed models did not. What can be seen is that our SPCT model has a clear advantage in estimating the optical flow trend at the contour area, especially for images that have fast-moving targets, such as this flying scene. We attribute this advantage to introducing the information from the optical flow trends into our SPCT model.
Figure 4. Images from Sintel Clean (left column), Sintel Final (right column), and their visualized optical flow map estimated by our model. The images from the Sintel Final dataset have more background rendering and motion blur effects compared to the Clean one, resulting in our model learning relatively less or even incorrect information about the optical flow trend.
Figure 5. Applying our SPCT model in the real road scenario. The image sequences are captured from a front camera of our collecting vehicle when driving on an urban road in Shanghai. Comparing the three results, which are separately estimated from RAFT, GMA, and our SPCT models, only SPCT could predict the optical flow value of this pedestrian.
Figure 6. Applying optical flow estimation model for parking lot scenario in the daytime. This scene is collected from an overhead surveillance camera set up on a gantry on our campus. We also use some red dashed boxes to mark the regions of apparently incorrect optical flow estimation. The results presented here demonstrate that our SPCT still has a subtle advantage in this scenario.
Figure 7. Applying optical flow estimation model for parking lot scenario at night. This experiment is almost identical to the previous one in Figure 6 except for the deploying time. The additional low brightness conditions plague the optical flow estimation performances of these models, so there are different levels of noise on the optical flow map. The results presented here once again demonstrate that our SPCT excels in this scenario.
Table 1. Ablation experiments for choosing the best fusion scheme between continuous flow feature and momentary flow feature. We employ a basic RAFT optical flow estimator as the baseline for all experiments. This table shows the results on Sintel and KITTI datasets for all fusion strategies after training on FlyingChairs (C) and FlyingThings (T) datasets.

| Training Data | Method | Sintel (Train) Clean | Sintel (Train) Final | KITTI-15 (Train) AEPE | KITTI-15 (Train) Fl-All (%) |
|---|---|---|---|---|---|
| C + T | Baseline (RAFT) | 1.43 | 2.71 | 5.04 | 17.40 |
| C + T | Element-wise Adding | 1.47 | 2.81 | 5.18 | 17.74 |
| C + T | NL | 1.38 | 2.69 | 4.96 | 17.14 |
| C + T | STN | 1.33 | 2.65 | 4.92 | 16.98 |
| C + T | Concatenation | 1.24 | 2.58 | 4.90 | 16.81 |
Table 2. Grid search of $\delta$ and $\mu$ in Equation (6) for trend cost conditional weights in the total loss function.

| Training Data | $\delta$ | $\mu$ | Sintel (Train) Clean | Sintel (Train) Final | KITTI-15 (Train) AEPE | KITTI-15 (Train) Fl-All (%) |
|---|---|---|---|---|---|---|
| C + T | 0.3 | 1.5 | 1.29 | 2.57 | 5.03 | 17.07 |
| C + T | 0.3 | 1.6 | 1.25 | 2.62 | 4.99 | 17.29 |
| C + T | 0.3 | 1.7 | 1.27 | 2.59 | 4.90 | 17.07 |
| C + T | 0.35 | 1.5 | 1.31 | 2.60 | 4.91 | 17.24 |
| C + T | 0.35 | 1.6 | 1.24 | 2.58 | 4.90 | 16.81 |
| C + T | 0.35 | 1.7 | 1.24 | 2.62 | 4.99 | 16.83 |
| C + T | 0.4 | 1.5 | 1.30 | 2.60 | 5.08 | 17.34 |
| C + T | 0.4 | 1.6 | 1.29 | 2.58 | 4.99 | 17.51 |
| C + T | 0.4 | 1.7 | 1.32 | 2.60 | 5.29 | 17.03 |
Table 3. The effect of the proposed pipeline combined with baseline (RAFT), C-MFF, and our specifically designed loss. Where '*' indicates the module is introduced into the training structure.

| Training Data | Baseline | C-MFF | Our Loss | Sintel (Train) Clean | Sintel (Train) Final | KITTI-15 (Train) AEPE | KITTI-15 (Train) Fl-All |
|---|---|---|---|---|---|---|---|
| C + T | * | | | 1.43 | 2.71 | 5.04 | 17.40 |
| C + T | * | * | | 1.32 | 2.62 | 5.01 | 17.24 |
| C + T | * | | * | 1.39 | 2.65 | 4.95 | 16.96 |
| C + T | * | * | * | 1.24 | 2.58 | 4.90 | 16.81 |
Table 4. Results on Sintel and KITTI 2015 datasets. "C + T" refers to all the models that are pre-trained on the FlyingChairs [9] and FlyingThings [33] datasets. "S/K" refers to methods that are fine-tuned on the Sintel and KITTI datasets. "(+H)" refers to some methods also fine-tuned on the HD1K dataset. "+p" denotes the joint position and content-wise attention model defined in the GMA paper [14]. "RAFT-A" denotes the method proposed by the paper [36]. The evaluation strategy includes the AEPE and Fl-all metrics.

| Training Data | Method | Sintel (Train) Clean | Sintel (Train) Final | KITTI-15 (Train) AEPE | KITTI-15 (Train) Fl-All | Sintel (Test) Clean | Sintel (Test) Final |
|---|---|---|---|---|---|---|---|
| C + T | RAFT | 1.43 | 2.71 | 5.04 | 17.40 | - | - |
| C + T | RAFT-A | 1.95 | 2.57 | 4.23 | - | - | - |
| C + T | GMA | 1.30 | 2.74 | 4.69 | 17.10 | - | - |
| C + T | GMA + p | 1.33 | 2.87 | 4.83 | 16.60 | - | - |
| C + T | CRAFT | 1.27 | 2.79 | 4.88 | 17.50 | - | - |
| C + T | FM-RAFT | 1.29 | 2.95 | 6.80 | 19.30 | - | - |
| C + T | Ours | 1.24 | 2.58 | 4.90 | 16.81 | - | - |
| C + T + S/K (+H) | RAFT | 0.76 | 1.22 | 0.63 | 1.50 | 1.61 | 2.86 |
| C + T + S/K (+H) | RAFT-A | - | - | - | - | 2.01 | 3.14 |
| C + T + S/K (+H) | GMA | 0.62 | 1.06 | 0.57 | 1.20 | 1.39 | 2.47 |
| C + T + S/K (+H) | GMA + p | 0.65 | 1.11 | 0.58 | 1.30 | 1.54 | 2.63 |
| C + T + S/K (+H) | CRAFT | 0.60 | 1.06 | 0.58 | 1.34 | 1.45 | 2.42 |
| C + T + S/K (+H) | FM-RAFT | 0.79 | 1.70 | 0.75 | 2.10 | 1.72 | 3.60 |
| C + T + S/K (+H) | Ours | 0.52 | 1.08 | 0.61 | 1.38 | 1.27 | 2.49 |