Article

Real-Time Optical Flow Estimation Method Based on Cross-Stage Network

1 School of ICT Robotics and Mechanical Engineering, Hankyong National University, Anseong 456-749, Republic of Korea
2 Smart Convergence Technology Research Center, Hankyong National University, Anseong 456-749, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13056; https://doi.org/10.3390/app132413056
Submission received: 10 October 2023 / Revised: 5 November 2023 / Accepted: 27 November 2023 / Published: 7 December 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In this paper, a real-time optical flow estimation method based on a cross-stage network is proposed. The proposed model is designed as a network with an encoder and a decoder. It combines the cross-stage network technique with the network structures of FlowNet2 and RAFT to reduce the number of parameters while improving estimation performance. For real-time optical flow estimation, it is important to maintain accuracy while reducing the number of parameters in the network. In the proposed method, the cross-stage structure is applied so that performance increases while the number of parameters decreases. The proposed model also separates the feature extraction and flow estimation processes to address the bottlenecks in model accuracy and complexity. The Flying Chairs, Flying Things 3D, and KITTI datasets were used to evaluate the performance of the proposed model, and the experimental results show superior performance compared with previous traditional methods.

1. Introduction

Visual motion information is a very important factor in the field of computer vision. Optical flow is a typical way to estimate motion information from consecutive images. Optical flow is a technology that estimates the movement of each pixel in an image sequence and is utilized in a variety of applications, including object tracking, motion recognition, and 3D reconstruction [1,2,3,4].
Optical flow methods can be largely divided into knowledge-driven, data-driven, and hybrid-driven methods [5]. Traditional methods, which were mainly used in the past, take a rule-based approach based on image brightness changes and gradient information. However, these methods often fail or show limited accuracy in complex scenarios [6,7]. With the development of deep learning, research has been actively conducted in recent years to solve the optical flow prediction problem using deep learning methods. The Convolutional Neural Network (CNN) is very effective in image processing and can be leveraged to solve optical flow prediction problems.
A major improvement came from the improved design of similarity and regularization terms, and the advent of deep neural networks has greatly advanced optical flow methods. FlowNet [8] was the first end-to-end convolutional network for optical flow estimation. It consists of two pyramid structures: the first pyramid is used for flow estimation, and the second pyramid is used for regularizing optical flow boundaries. FlowNet 2.0 [9] is an extended version of FlowNet that uses a larger model size and a more complex architecture to improve accuracy. FlowNet2 consists of a large network cascade containing more than 160 M parameters and adopts a stacked architecture with warping operations to achieve performance similar to the latest techniques. SpyNet [10] is a compact CNN model with an architecture similar to FlowNet2; it estimates flow fields using a spatial pyramid structure. A series of works such as PWC-Net [11,12], LiteFlowNet [13,14], and VCN [15] have used coarse-to-fine estimation methodologies. Among them, LiteFlowNet estimates the flow field using cascaded flow inference. At each pyramid level, LiteFlowNet uses the flow estimate from the previous level to warp the second image toward the first image, and the flow field is then estimated by comparing the warped image with the feature map of the first image. LiteFlowNet uses an f-lconv layer at each pyramid level to regularize the flow field, using both direction and brightness information. These models have the problem of missing small, fast-moving objects at the coarse stages. To address this issue, Teed and Deng [16] proposed RAFT (Recurrent All-Pairs Field Transforms), which uses a multi-scale search window and a recurrent refinement approach instead of the coarse-to-fine strategy.
RAFT is a state-of-the-art optical flow estimation algorithm designed to calculate dense optical flow fields. Optical flow refers to the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (e.g., a camera) and the scene. RAFT stands out for its accuracy and its ability to estimate optical flow under various challenging conditions. The success of RAFT largely lies in the large number of iterative refinements it can perform [17]. Therefore, RAFT outperforms PWC-Net but has the disadvantage of slower inference, and many studies are being conducted to address this problem [18,19].
In this paper, we design an optical flow network that serves as the basis for camera pose tracking and propose a network structure with fewer parameters and better performance than conventional models. The proposed technique calculates optical flow from successive input images, and we conduct experiments comparing it with conventional models. This study aims to explore an alternative CNN architecture for accurate and efficient optical flow estimation. Although this work was inspired by the success of FlowNet2 and RAFT, we take a closer look at the key elements that allow us to fully unleash the potential of deep convolutional networks in combination with classical principles. Two general principles guide the improvements over FlowNet2 and SpyNet: pyramidal feature extraction and coarse-to-fine flow estimation in the decoder. The proposed network, named CAFT (Cross-stage All-Pairs Field Transforms), consists of an encoder and a decoder. The encoder maps each given pair of images into a pyramid of multi-scale, high-dimensional features. The decoder estimates the flow field in a coarse-to-fine framework: at each pyramid level, the decoder selects features of the same resolution from the feature pyramid and uses them to infer the flow field. This design yields a light network in the form of a CSPyramid for flow inference. The proposed network separates the feature extraction and flow estimation processes, which helps to better understand the bottlenecks in accuracy and model size. Although CAFT has a simpler architecture than FlowNet2 and RAFT, it demonstrates competitive performance in terms of accuracy and speed. CAFT is suitable for a variety of applications that require real-time optical flow estimation.

2. Related Research

2.1. FlowNet

FlowNet is an algorithm that obtains optical flow through supervised learning. The FlowNet approach suffers from a lack of datasets, which limits optical flow estimation with supervised learning. FlowNet is an end-to-end CNN learning method; for end-to-end learning from two images, a correlation layer is used to perform pixel-by-pixel localization. FlowNet proposes two structures, FlowNetSimple and FlowNetCorr, as shown in Figure 1.
FlowNetSimple takes the two images required for optical flow calculation as a single six-channel input. The subsequent network follows a traditional CNN structure; based on supervised learning, the network decides on its own how to extract the optical flow. FlowNetCorr has a more complex structure without the concatenation step; instead, the input stage is divided so that the two images are received separately. Before the correlation layer, which is characteristic of FlowNetCorr, the two input streams share the same structure. This structure extracts a feature map from each of the two images and combines them into one through the correlation layer.
To improve the performance of FlowNet, FlowNet2 is designed using a training method that can detect both large and small movements well by combining the FlowNetC and FlowNetS networks. There are three main improvements: the scheduling of datasets, the stacking of networks, and an additional sub-network dealing with small displacements. Several experiments were conducted in each area to determine the structure and training method of the model. Figure 2 shows the structure of FlowNet2. FlowNet2 is a variant of FlowNet in which a cascaded network refines the previous flow field to overcome the limitations in accuracy. As a result, the FlowNet2 model consists of more than 160 M parameters, which makes it unsuitable for applications that require real-time estimation.

2.2. Pyramid Structure-Based Optical Flow Network

SpyNet was first designed as a spatial pyramid network for optical flow calculation. The earlier FlowNet concatenates the input images and extracts the optical flow immediately through a single inference. SpyNet, in contrast, performs downscaling and obtains the optical flow at the smallest scale; based on this result, the optical flow at the next scale is calculated, and the process is repeated. Therefore, SpyNet has the advantage of far fewer parameters than previous optical flow networks. However, SpyNet did not show much advantage in accuracy compared with previous methods. Figure 3 shows the structure of SpyNet.
PWC-Net [11] is a small but effective CNN model for optical flow that applies the well-known design principles of pyramid, warping, and cost volume. This method builds a learnable feature pyramid and warps the CNN features of the second image using the current optical flow estimate. A cost volume is then constructed from the features of the first image and the warped features of the second image, and it is processed by a CNN to decode the optical flow. Figure 4 shows the structure of PWC-Net.

2.3. LiteFlowNet

LiteFlowNet is a lightweight version of the FlowNet2.0 architecture, which is designed to predict optical flow vectors between pairs of input images using convolutional neural networks. It is specifically optimized for execution on embedded systems or low-resource environments, offering accurate optical flow prediction while reducing model size and computational requirements. Cascaded flow inference is one of the characteristics used in LiteFlowNet. This approach involves performing multiple consecutive prediction steps to improve the accuracy of optical flow estimation. Typically, two or more prediction stages are employed, where each stage takes the results from the previous stage along with the original images as input. In the first prediction stage, the input image pair is processed to estimate an initial optical flow vector field. Then, the initial result is combined with the original images, and the combined result is passed as input to the second prediction stage. Additional prediction stages can be performed if needed, with each stage utilizing information from previous stages and providing progressively more accurate optical flow results. By incorporating cascaded flow inference, LiteFlowNet leverages multiple prediction steps to refine and enhance its optical flow estimation. This technique allows for a smoother and more consistent output by iteratively improving upon initial predictions using richer contextual information.
In summary, LiteFlowNet builds upon FlowNet2.0 but offers a lighter architecture that enables execution on embedded systems or low-resource computing environments. Cascaded flow inference enhances its performance by employing consecutive prediction stages that iteratively refine and improve optical flow estimation based on previous results and contextual information. Figure 5 shows the structure of LiteFlowNet.

2.4. RAFT

RAFT (Recurrent All-Pairs Field Transforms) is an approach designed to significantly improve optical flow prediction accuracy. By combining a recurrent architecture with correlations computed over all pixel pairs, RAFT achieves impressive results in challenging scenarios characterized by substantial image variations or intricate motion patterns. The core principle of RAFT lies in its recurrent architecture, which enables iterative refinement of the predicted optical flow using contextual information from surrounding pixels at each location in the input images. Moreover, by computing correlations for every possible pair of pixels, RAFT ensures comprehensive interactions across the entire image during training.
To train the RAFT model effectively, a differentiable loss function is employed to minimize the discrepancy between the ground-truth flows and the flows predicted by RAFT through backpropagation. Overall, this approach presents promising advances in accurate optical flow estimation while addressing the complexities found in real-world environments. Figure 6 shows the structure of RAFT.

3. Proposed CAFT Method

3.1. Optical Flow Method Using CSPNet Structure

In general, deep learning networks with excellent performance are composed of many parameters to increase accuracy, which greatly increases the amount of computation. To address this problem, the Cross-Stage Partial Network (CSPNet) [20,21] shown in Figure 7 was proposed. Unlike the general deep learning structure, in which the output of the previous layer is connected to the input of the next layer, CSPNet uses only a part of the output of the previous layer as input to reduce computation and mitigate the gradient vanishing problem, thereby improving performance. Equation (1) represents the outputs of CSPNet and the corresponding weight updates at each layer.
$$
\begin{aligned}
x_k &= W_k * [x_0'', x_1, \ldots, x_{k-1}], & W_k' &= f(W_k, g_0'', g_1, g_2, \ldots, g_{k-1}),\\
x_T &= W_T * [x_0'', x_1, \ldots, x_k], & W_T' &= f(W_T, g_0'', g_1, g_2, \ldots, g_k),\\
x_U &= W_U * [x_0', x_T], & W_U' &= f(W_U, g_0', g_T)
\end{aligned}
\tag{1}
$$
where x_0 = [x_0', x_0''] is the feature map of the input image split into two parts, x_T is the output of the partial dense block, x_U is the output of the partial transition layer, x_k is the output of each layer, and W is the weight at each layer. In Equation (1), f is the weight update function. This paper proposes a new structure that effectively delivers feature map information while further reducing the computational complexity of the CSPNet structure. The proposed structure is shown in Figure 8.
The proposed deep learning model with the CSPNet structure has the following components: an encoder that extracts a feature vector for each pixel, a correlation layer that builds a 4D correlation volume over all pairs of pixels, and a ConvGRU-based update operator that looks up values in the correlation volumes and repeatedly updates a flow field initialized to zero.
The encoder, which extracts features for the entire structure, extracts one feature vector per pixel from both input images and has a cross-stage partial structure. Next, a correlation layer constructs a 4D H × W × H × W correlation volume by taking the inner product of all pairs of feature vectors. The last two dimensions of the 4D volume are pooled at multiple scales to construct a multi-scale volume set. Finally, an update operator repeatedly updates the optical flow by querying values from the set of correlation volumes using the current estimate.
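To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the overall forward pass under simplified assumptions: the encoder and the update operator are stand-in placeholders, not the authors' implementation, and the actual components are described in the following subsections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64        # feature dimension (the paper sets D = 256)
ITERS = 8     # number of update iterations

# Stand-in encoder with an overall stride of 8 (1/8-resolution features)
encoder = nn.Sequential(
    nn.Conv2d(3, D, 7, stride=2, padding=3), nn.ReLU(),
    nn.Conv2d(D, D, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(D, D, 3, stride=2, padding=1))

def all_pairs_corr(f1, f2):
    # 4D all-pairs correlation volume, reshaped to (B, H*W, H, W)
    b, d, h, w = f1.shape
    corr = torch.einsum('bdij,bdkl->bijkl', f1, f2) / d ** 0.5
    return corr.reshape(b, h * w, h, w)

def update(flow, corr, hidden):
    # Placeholder for the ConvGRU update operator: produces a residual flow
    delta = 0.1 * torch.tanh(hidden[:, :2])
    return flow + delta, hidden

img1, img2 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
f1, f2 = encoder(img1), encoder(img2)                 # per-pixel features at 1/8 resolution
corr = all_pairs_corr(f1, f2)                         # visual similarity, computed once
flow = torch.zeros(1, 2, f1.shape[2], f1.shape[3])    # flow field initialized to zero
hidden = torch.tanh(f1)                               # hidden state from the context features
for _ in range(ITERS):                                # iterative refinement: f_{k+1} = f_k + delta_f
    flow, hidden = update(flow, corr, hidden)
full_res_flow = 8 * F.interpolate(flow, scale_factor=8, mode='bilinear', align_corners=True)
print(full_res_flow.shape)                            # torch.Size([1, 2, 64, 64])
```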

3.2. Extracting Features of Images

Features of the input images are extracted using a convolutional network. The feature encoder network is applied to both images I_1 and I_2 and maps the input images to dense feature maps at a lower resolution. The encoder g_θ extracts features at 1/8 resolution.
$$ g_\theta : \mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^{H/8 \times W/8 \times D} \tag{2} $$
In Equation (2), the spatial resolution is reduced and the channel dimension is set to D = 256. The feature encoder consists of a total of six residual blocks: two at 1/2 resolution, two at 1/4 resolution, and two at 1/8 resolution, as shown in Figure 9.
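A sketch of such an encoder is shown below; the exact channel widths and the stem convolution are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.skip = (nn.Conv2d(c_in, c_out, 1, stride=stride)
                     if stride != 1 or c_in != c_out else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + self.skip(x))

class FeatureEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 7, stride=2, padding=3)                          # 1/2 resolution
        self.layer1 = nn.Sequential(ResBlock(64, 64), ResBlock(64, 64))               # two blocks at 1/2
        self.layer2 = nn.Sequential(ResBlock(64, 128, stride=2), ResBlock(128, 128))  # two blocks at 1/4
        self.layer3 = nn.Sequential(ResBlock(128, 192, stride=2), ResBlock(192, 192)) # two blocks at 1/8
        self.head = nn.Conv2d(192, out_dim, 1)                                        # project to D = 256

    def forward(self, x):
        return self.head(self.layer3(self.layer2(self.layer1(self.stem(x)))))

feat = FeatureEncoder()(torch.rand(1, 3, 256, 256))
print(feat.shape)   # torch.Size([1, 256, 32, 32]) -> 1/8 resolution, D = 256
```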
The input feature map is divided into two 32-channel parts (x_0', x_0'') for the cross-stage calculation. The number of convolutions in the cross-stage block is a hyper-parameter determined by the designer or user. In the proposed technique, the purpose of the cross-stage block at the encoder stage is to balance the overall amount of computation against preserving the spatial information of meaningful pixels as much as possible. Therefore, in this paper, the number of convolutions in the cross-stage block of the encoder is fixed at k = 2, and the output of each layer is represented by Equation (3).
$$
\begin{aligned}
x_1 &= W_1 * [x_0''], & W_1' &= f(W_1, g_0'', g_1),\\
x_2 &= W_2 * [x_0'', x_1], & W_2' &= f(W_2, g_0'', g_1, g_2),\\
x_U &= W_U * [x_0'', x_1, x_2, x_0'], & W_U' &= f(W_U, g_0'', g_1, g_2, g_0')
\end{aligned}
\tag{3}
$$
As can be seen from Equation (3), the proposed technique has fewer parameters than CSPNet while maintaining gradient information. Figure 10 shows the update block structure used in optical flow estimation. Additionally, the context network extracts features only from the first input image I_1. The structure of the context network h_θ is the same as that of the feature extraction network. The feature network g_θ and the context network h_θ constitute the first step and are executed only once.
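The cross-stage block of Equation (3) might be sketched as follows; the channel counts (a 64-channel map split into two 32-channel halves) follow the description above, while the kernel sizes and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossStageBlock(nn.Module):
    def __init__(self, channels=64, split=32):
        super().__init__()
        self.split = split
        self.conv1 = nn.Conv2d(split, split, 3, padding=1)             # x1 = W1[x0'']
        self.conv2 = nn.Conv2d(2 * split, split, 3, padding=1)         # x2 = W2[x0'', x1]
        self.transition = nn.Conv2d(3 * split + (channels - split),    # xU = WU[x0'', x1, x2, x0']
                                    channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x0_a, x0_b = x[:, :self.split], x[:, self.split:]              # x0'' (processed), x0' (bypass)
        x1 = self.relu(self.conv1(x0_a))
        x2 = self.relu(self.conv2(torch.cat([x0_a, x1], dim=1)))
        return self.relu(self.transition(torch.cat([x0_a, x1, x2, x0_b], dim=1)))

out = CrossStageBlock()(torch.rand(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```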

3.3. Visual Similarity of Images

Visual similarity is calculated by constructing a full correlation volume between all pairs of feature vectors. Given the image features g_θ(I_1) ∈ ℝ^{H×W×D} and g_θ(I_2) ∈ ℝ^{H×W×D}, the correlation volume is formed from the inner products of all feature vector pairs and can be computed with a single matrix multiplication, as shown in Equation (4).
$$ C(g_\theta(I_1), g_\theta(I_2)) \in \mathbb{R}^{H \times W \times H \times W}, \qquad C_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh} \tag{4} $$
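Equation (4) can be computed with one batched matrix multiplication, as in the following sketch; the normalization by the square root of D follows RAFT and is an assumption here.

```python
import torch

def correlation_volume(g1, g2):
    # g1, g2: (B, D, H, W) feature maps of I1 and I2
    b, d, h, w = g1.shape
    g1 = g1.reshape(b, d, h * w)
    g2 = g2.reshape(b, d, h * w)
    corr = torch.matmul(g1.transpose(1, 2), g2)       # (B, H*W, H*W): C_ijkl = <g1_ij, g2_kl>
    return corr.reshape(b, h, w, h, w) / d ** 0.5     # scaling by sqrt(D), as in RAFT

C = correlation_volume(torch.rand(1, 256, 8, 8), torch.rand(1, 256, 8, 8))
print(C.shape)    # torch.Size([1, 8, 8, 8, 8])
```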

3.4. Structure of Correlation Volume

The correlation volume set consists of four layers, C_1, C_2, C_3, and C_4, obtained by pooling the last two dimensions with kernel sizes 1, 2, 4, and 8. These four kernel sizes are inherited from RAFT. Each volume C_k therefore has dimensions H × W × H/2^k × W/2^k. This set of volumes provides information about both large and small displacements. By maintaining the first two dimensions (the dimensions of I_1), high-resolution information is preserved, which allows the motion of small, fast-moving objects to be recovered.
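A sketch of this pooling step is given below; repeated 2x average pooling of the last two dimensions produces the effective kernel sizes 1, 2, 4, and 8 described above (helper names are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(corr, num_levels=4):
    # corr: (B, H, W, H, W) from correlation_volume()
    b, h, w, _, _ = corr.shape
    corr = corr.reshape(b * h * w, 1, h, w)           # flatten the I1 dimensions
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)                          # (B*H*W, 1, H/2^k, W/2^k)
    return pyramid

pyr = correlation_pyramid(torch.rand(1, 8, 8, 8, 8))
print([tuple(p.shape[-2:]) for p in pyr])             # [(8, 8), (4, 4), (2, 2), (1, 1)]
```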
Next, we define a lookup operator L_C that generates feature maps by indexing into the correlation volume structure. Given the current estimate of the optical flow (f^1, f^2), each pixel x = (u, v) in I_1 is mapped to its estimated correspondence x' = (u + f^1(u), v + f^2(v)) in I_2. The local grid around x' is defined using the L_1 distance, as shown in Equation (5).
$$ N(x')_r = \{\, x' + dx \mid dx \in \mathbb{Z}^2, \ \lVert dx \rVert_1 \le r \,\} \tag{5} $$
This grid is the set of integer offsets within a radius r of x' under the L_1 distance. Indexing is performed on the correlation volume using the local neighborhood N(x')_r. Since N(x')_r is a real-valued grid, bilinear sampling is used. Lookups are performed on all correlation volumes; the correlation volume at level k, C_k, is indexed using the grid N(x'/2^k)_r. A constant radius across levels corresponds to a larger context at the lower levels; at the lowest level, a radius of 4 with k = 4 corresponds to a range of 256 pixels at the original resolution. The values from each level are then concatenated into a single feature map.
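The lookup can be sketched with bilinear sampling (grid_sample) over a window of radius r, as below. This is a simplified single-level illustration, not the authors' implementation; the square window is an approximation of the L1 ball in Equation (5), and the function names are assumptions.

```python
import torch
import torch.nn.functional as F

def lookup(corr_level, flow, radius=4):
    # corr_level: (B*H*W, 1, H2, W2), one level of the pyramid; flow: (B, 2, H, W)
    b, _, h, w = flow.shape
    h2, w2 = corr_level.shape[-2:]
    # target coordinates x' = x + f(x) at this level's resolution
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    coords = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow        # (B, 2, H, W)
    coords = coords.permute(0, 2, 3, 1).reshape(b * h * w, 1, 1, 2)
    # local grid of integer offsets around each target coordinate
    d = torch.arange(-radius, radius + 1).float()
    dy, dx = torch.meshgrid(d, d, indexing='ij')
    delta = torch.stack([dx, dy], dim=-1).reshape(1, 2 * radius + 1, 2 * radius + 1, 2)
    grid = coords + delta                                                     # (B*H*W, 2r+1, 2r+1, 2)
    # normalize to [-1, 1] for grid_sample and sample bilinearly
    grid[..., 0] = 2 * grid[..., 0] / (w2 - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h2 - 1) - 1
    sampled = F.grid_sample(corr_level, grid, align_corners=True)             # (B*H*W, 1, 2r+1, 2r+1)
    return sampled.reshape(b, h, w, -1).permute(0, 3, 1, 2)                   # (B, (2r+1)^2, H, W)

feats = lookup(torch.rand(64, 1, 8, 8), torch.zeros(1, 2, 8, 8))
print(feats.shape)   # torch.Size([1, 81, 8, 8])
```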

3.5. Calculation of High-Resolution Images

The correlation over all corresponding pairs scales as O(N²) in the number of pixels N, but it is computed only once and is constant in the number of iterations M. However, there is an equivalent implementation that scales as O(NM) by using the linearity of the inner product and average pooling. Considering the cost volume at level m, C^m_{ijkl}, and the feature maps g^(1) = g_θ(I_1) and g^(2) = g_θ(I_2), Equation (6) is obtained.
$$ C^m_{ijkl} = \frac{1}{2^{2m}} \sum_{p}^{2^m} \sum_{q}^{2^m} \left\langle g^{(1)}_{i,j},\; g^{(2)}_{2^m k + p,\, 2^m l + q} \right\rangle = \left\langle g^{(1)}_{i,j},\; \frac{1}{2^{2m}} \sum_{p}^{2^m} \sum_{q}^{2^m} g^{(2)}_{2^m k + p,\, 2^m l + q} \right\rangle \tag{6} $$
The above equation is the average of the correlation responses over a 2^m × 2^m grid. This means that the value of C^m_{ijkl} can be computed as the inner product between the feature vector g_θ(I_1)_{ij} and the correspondingly pooled feature of g_θ(I_2). In this alternative implementation, the correlation is not computed in advance; instead, a pooled image feature map is computed in advance. In each iteration, each correlation value is computed only when it is required, that is, when it is looked up. This results in a complexity of O(NM).
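The alternative implementation can be sketched as follows, assuming hypothetical helper names: the pooled feature maps of I_2 are precomputed once, and each correlation value is evaluated only when it is looked up (here a single queried pair per call, for clarity).

```python
import torch
import torch.nn.functional as F

def pooled_features(g2, num_levels=4):
    # g2: (B, D, H, W); returns the feature map of I2 pooled by 2^m for each level m
    return [F.avg_pool2d(g2, 2 ** m, stride=2 ** m) if m > 0 else g2
            for m in range(num_levels)]

def corr_on_demand(g1, g2_pooled_level, i, j, k, l):
    # C^m_{ijkl} = <g1_{ij}, pooled g2 at (k, l)> for one queried pair
    return (g1[:, :, i, j] * g2_pooled_level[:, :, k, l]).sum(dim=1)

g1 = torch.rand(1, 256, 8, 8)
g2_levels = pooled_features(torch.rand(1, 256, 8, 8))
print(corr_on_demand(g1, g2_levels[1], 3, 3, 1, 1))   # correlation value at level m = 1
```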

3.6. Update Process

The update operator estimates a sequence of flow estimates f_1, ..., f_N starting from the initial estimate f_0 = 0. Each iteration produces an update Δf that is applied to the current estimate, that is, f_{k+1} = f_k + Δf. The update operator receives the flow, the correlation features, and a latent hidden state as input, and it outputs the update Δf and an updated hidden state. The architecture of the update operator is designed to mimic an optimization algorithm. For this purpose, weights are tied across depth, and bounded activation functions are used to encourage convergence. The update operator is trained so that the sequence converges to a fixed point f_k → f*.
At initialization, the flow field is set to zero at all locations by default; however, the iterative approach provides the flexibility to experiment with alternatives. When applied to successive images, warm-start initialization is used: the optical flow of the previous frame pair is projected forward to the next frame pair, and the holes created by the projection are filled using nearest-neighbor interpolation.
Given the current flow estimate f_k, correlation features are retrieved from the correlation pyramid. The correlation features are processed by two convolutional layers, and two convolutional layers are also applied to the flow estimate itself to generate flow features. Finally, the output of the context network is injected directly. The input feature map is the concatenation of the correlation, flow, and context features.
A key component of the update operator is a gated activation unit based on the GRU, in which the fully connected layers are replaced with convolutions, as shown in Equation (7).
$$
\begin{aligned}
z_t &= \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z))\\
r_t &= \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r))\\
\tilde{h}_t &= \tanh(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1}, x_t], W_h))\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\tag{7}
$$
where x_t is the concatenation of the previously defined flow, correlation, and context features. We also evaluate the model with separable ConvGRU units, replacing the 3 × 3 convolution with a 1 × 5 convolution followed by a 5 × 1 convolution to enlarge the receptive field without significantly increasing the size of the model.
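A sketch of this convolutional GRU cell is given below; the hidden and input dimensions are assumptions, and the separable variant would replace each 3 × 3 convolution with a 1 × 5 followed by a 5 × 1 convolution.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        c = hidden_dim + input_dim
        self.convz = nn.Conv2d(c, hidden_dim, 3, padding=1)   # update gate z_t
        self.convr = nn.Conv2d(c, hidden_dim, 3, padding=1)   # reset gate r_t
        self.convh = nn.Conv2d(c, hidden_dim, 3, padding=1)   # candidate state h~_t

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        h_tilde = torch.tanh(self.convh(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * h_tilde                      # h_t, Equation (7)

gru = ConvGRU()
h = gru(torch.zeros(1, 128, 8, 8), torch.rand(1, 192, 8, 8))
print(h.shape)   # torch.Size([1, 128, 8, 8])
```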
For optical flow prediction, the hidden state output by the GRU passes through two convolutional layers to predict the flow update Δf. The output flow has 1/8 the resolution of the input image. During training and evaluation, the predicted flow field is upsampled to match the resolution of the ground truth. The network outputs the optical flow at 1/8 resolution, and it is upsampled to full resolution by taking each full-resolution flow value as a convex combination of a 3 × 3 grid of its coarse-resolution neighbors.
Two convolutional layers are used to predict an H/8 × W/8 × (8 × 8 × 9) mask, and a SoftMax is applied over the nine neighbor weights. The final high-resolution flow field, of dimension H × W × 2, is obtained by using this mask to take a weighted combination of the neighbors and then permuting and reshaping the result.
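The convex upsampling step can be sketched as follows; the mask head here is fed by a hypothetical hidden-state tensor, and the layer widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def convex_upsample(flow, mask):
    # flow: (B, 2, H/8, W/8); mask: (B, 8*8*9, H/8, W/8)
    b, _, h, w = flow.shape
    mask = mask.view(b, 1, 9, 8, 8, h, w)
    mask = torch.softmax(mask, dim=2)                    # SoftMax over the 9 neighbour weights
    up = F.unfold(8 * flow, kernel_size=3, padding=1)    # 3x3 coarse neighbourhoods (flow scaled by 8)
    up = up.view(b, 2, 9, 1, 1, h, w)
    up = torch.sum(mask * up, dim=2)                     # convex combination -> (B, 2, 8, 8, H/8, W/8)
    up = up.permute(0, 1, 4, 2, 5, 3)                    # rearrange to (B, 2, H/8, 8, W/8, 8)
    return up.reshape(b, 2, 8 * h, 8 * w)                # final H x W x 2 flow field

hidden = torch.rand(1, 128, 8, 8)                        # hypothetical GRU hidden state
mask_head = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(256, 8 * 8 * 9, 1))  # two convolutions predict the mask
flow = torch.rand(1, 2, 8, 8)
print(convex_upsample(flow, mask_head(hidden)).shape)    # torch.Size([1, 2, 64, 64])
```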

4. Experimental Result

To evaluate the performance of the model, we used the Flying Chairs [8] dataset (dataset C) and the Flying Things 3D [9] dataset (dataset T). Training and evaluation were performed using the Flying Chairs and Flying Things 3D datasets, and the results of fine-tuned models were additionally evaluated on MPI Sintel [22] (dataset S) and KITTI-2015 [23] (dataset K). Figure 11 shows examples of the Flying Chairs and Flying Things datasets, and Figure 12 shows the MPI Sintel dataset. The equipment used for training was an Intel i9 CPU, an RTX 3090 Ti GPU, and 128 GB of RAM.
The MPI Sintel dataset is a cinematic 3D animated dataset created in 2012 by the Max Planck Institute for Computer Science and TU Eindhoven. It contains 1064 rendered stereo images across 23 different scenes. Each image has a resolution of 1024 × 436 pixels with 8-bit channels. The MPI Sintel dataset is used to evaluate various visual computing tasks such as optical flow, depth estimation, object recognition, and 3D reconstruction. It is mainly used to evaluate optical flow and provides a variety of metrics for assessing the performance of each model. The main metric is the end-point error (EPE), which is the average distance between the predicted and the actual optical flow.
The KITTI dataset consists of 200 training and 200 test images each for stereo, flow, and scene flow. The KITTI benchmark uses the metrics F1-epe and F1-all. As mentioned above, EPE represents how far the predicted flow is from the ground truth in pixels, averaged over all pixels in all images; EPE is first calculated independently for each image and then averaged over all images, and in the KITTI dataset this metric is denoted F1-epe. F1-all is a KITTI metric defined by the dataset authors and used on the KITTI leaderboard; it corresponds to the percentage of outlier pixels, that is, pixels whose EPE exceeds both 3 pixels and 5% of the ground-truth flow magnitude, averaged over all images.
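For reference, a minimal sketch of these two metrics (an illustration, not the official KITTI evaluation code) is:

```python
import torch

def epe(flow_pred, flow_gt):
    # mean end-point error: average Euclidean distance between predicted and ground-truth flow
    return torch.norm(flow_pred - flow_gt, dim=1).mean()

def f1_all(flow_pred, flow_gt):
    # percentage of outlier pixels: EPE > 3 px and > 5% of the ground-truth flow magnitude
    err = torch.norm(flow_pred - flow_gt, dim=1)
    mag = torch.norm(flow_gt, dim=1)
    outlier = (err > 3.0) & (err > 0.05 * mag)
    return 100.0 * outlier.float().mean()

pred, gt = torch.rand(1, 2, 4, 4) * 10, torch.rand(1, 2, 4, 4) * 10
print(epe(pred, gt).item(), f1_all(pred, gt).item())
```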
In this paper, the proposed optical flow model, CAFT, was implemented using PyTorch (version 1.8) [24]. All layers are initialized with random weights, and the AdamW [25] optimizer is used during training. We evaluated the performance of the proposed model on the Sintel and KITTI datasets, with fine-tuning performed for each dataset. For comparison with previous studies, we report the results of a model fine-tuned for Sintel and a model fine-tuned for KITTI only. Table 1 shows the number of parameters and the inference speed of the proposed model and of traditional methods.
Table 2 shows the results on the Sintel and KITTI datasets, and Figure 13 plots parameter count, inference time, and training iterations against accuracy. After training on FlyingChairs (dataset C) and FlyingThings (dataset T), generalization performance was tested on Sintel (train), and the proposed CAFT method outperforms all existing models on both the Clean and Final passes.
Table 2 reports the performance comparison based on the average error over the specified optical flow points, expressed as the mean end-point error (MEPE). On the Sintel (train) Clean (C+T) split, the MEPE of the proposed CAFT was 68% better than that of HD3, which performed worst, and 52%, 50%, 45%, 45%, 39%, and 14% better than PWC-Net, LiteFlowNet, LiteFlowNet2, MaskFlowNet, FlowNet2, and RAFT, respectively.
On the Sintel (train) Final (C+T) split, the proposed method performed 76% better than HD3 and 21% better than RAFT. The Sintel Final pass is less refined than the Clean pass and contains effects such as noise and low resolution. The proposed method showed the best performance on the Sintel (train) Final (C+T) data, which shows that it works stably in a variety of environments. For F1-epe on KITTI-15 (train, C+T), the proposed method achieved the best MEPE of 4.87, while HD3 was the worst at 13.17 (63%). PWC-Net and LiteFlowNet showed the same value of 10.39 (53%), and LiteFlowNet2, FlowNet2, and RAFT showed 8.97 (46%), 10.08 (52%), and 5.04 (3%), respectively.
Also, in F1-all, the proposed method was the best at 12.8, while HD3, the worst performer, showed 24.0 (47%) and RAFT showed 17.4 (26%). In the experiments with models trained on the C+T+S+K data, the proposed CAFT showed the best result of 0.62 on Sintel Clean (train); LiteFlowNet2 showed 1.30 (52%), and RAFT showed 0.76 (18%). On Sintel Final (train), the proposed CAFT performed 40% better than LiteFlowNet2 and 20% better than RAFT. For KITTI-15 F1-epe (train) and F1-all, the proposed CAFT showed the best performance with 0.58 and 1.1, respectively. On Sintel Clean and Final (test), it achieved 1.55 and 2.37, about 20% and 25% better than RAFT, respectively. The optical flow results on the tiger and wall scenes of the MPI Sintel Clean and Final datasets are shown in Figure 14 and Figure 15.

5. Discussion

The proposed structure is designed based on the conventional optimization-based approach: the feature encoder extracts pixel-wise features, the correlation layer calculates the visual similarity between pixels, and the update operator mimics the steps of an iterative optimization algorithm. However, unlike traditional approaches, the features and motion priors are not hand-crafted but are learned by the feature encoder and the update operator. The high-resolution upsampling scheme also differs from those used in previous studies [9,10,11,12,13,14,15].
The advantages of the proposed method are as follows. First, by using a single high-resolution flow field, we avoid the difficulty of recovering from errors at coarse resolutions in coarse-to-fine cascades, the tendency to miss small, fast-moving objects, and the large number of training iterations that such cascades generally require.
These results are shown in Table 2 and Figure 13. The Sintel (Final) results in Table 2 confirm that the proposed algorithm performs well even in the presence of low-resolution images or noise. In addition, the training-iteration plot in Figure 13 shows that our method performs well even with a small number of training iterations.
Next, the update operator has a recurrent, lightweight structure. Many recent studies [9,10,11,12,13,28] include some form of iterative refinement but do not share weights across iterations. Previous methods used recurrent units based on FlowNet and PWC-Net; however, the network size limits the number of iterations to fewer than five. In the proposed network structure, the number of parameters is 4.2 M, as shown in Table 1, and the update can be applied more than 100 times.
Finally, the update operator consists of ConvGRU units that look up a four-dimensional multi-scale correlation volume. In contrast, the refinement modules in previous work use only plain convolution or correlation layers.
Given consecutive RGB images I_1 and I_2, a dense displacement field (f^1, f^2) is estimated that maps each pixel (u, v) of I_1 to its corresponding coordinates (u', v') = (u + f^1(u), v + f^2(v)) in I_2. After the features are extracted in the first step and the visual similarity is calculated in the second, the update is performed repeatedly. All of the above steps are differentiable, so the structure can be trained end-to-end.

6. Conclusions

In this paper, we propose a cross-stage-based optical flow estimation network. The proposed CAFT operates on a single high-resolution flow field and has a cross-stage structure that reduces the number of parameters while increasing accuracy. This structure enables fast optical flow calculation with fewer parameters than other traditional optical flow techniques. The proposed method also applies the cross-stage technique to an encoder-decoder network and improves accuracy by connecting part of the input of the previous layer to the output stage. The proposed method has 27.5% fewer parameters than RAFT, and the optical flow calculation speed is improved by more than 70%. To evaluate the performance of the proposed model, the Flying Chairs (dataset C), Flying Things 3D (dataset T), Sintel (Clean, Final), and KITTI datasets were used. The C and T (train) datasets were used for training the proposed network, and the Sintel and KITTI-15 datasets were used for both training and testing.
For the model trained on the C+T data, the proposed CAFT was about 68% better than HD3 on the Sintel Clean (train) data and about 14% better than the previous best method, RAFT.
On Sintel Final (train), it showed 76% better performance than HD3 and 21% better than RAFT. In the KITTI-15 (train) experiment, F1-epe and F1-all were 4.87 and 12.8, respectively, and the performance was superior to the other traditional methods.
For the model trained on the C+T+S+K data, the experimental results on Sintel Clean (test), Sintel Final (test), and KITTI-15 (test) were 1.55, 2.37, and 4.31, respectively, improvements of 20%, 25%, and 15% over RAFT.
In all experimental results, the proposed CAFT was superior to the other methods by up to 77% and by at least 3%. These results show that the proposed method provides fast optical flow calculation and high accuracy with a small number of network parameters.

Author Contributions

Conceptualization, methodology, and software, M.-H.P.; investigation, J.-H.C.; writing and original draft preparation, Y.-T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture, and Forestry (IPET) through the Open Field Smart Agriculture Technology Short-Term Advancement Program, funded by the Ministry of Agriculture, Food, and Rural Affairs (MAFRA) (122032-03-1SB010).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Souhila, K.; Karim, A. Optical flow based robot obstacle avoidance. Int. J. Adv. Robot. Syst. 2007, 4, 2. [Google Scholar] [CrossRef]
  2. Agarwal, A.; Gupta, S.; Singh, D.K. Review of optical flow technique for moving object detection. In Proceedings of the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), Greater Noida, India, 14–17 December 2016; pp. 409–413. [Google Scholar]
  3. Han, X.; Gao, Y.; Lu, Z.; Zhang, Z.; Niu, D. Research on moving object detection algorithm based on improved three frame difference method and optical flow. In Proceedings of the 2015 Fifth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), Qinhuangdao, China, 8–20 September 2015; pp. 580–584. [Google Scholar]
  4. Roxas, M.; Oishi, T. Real-time simultaneous 3D reconstruction and optical flow estimation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 885–893. [Google Scholar]
  5. Zhai, M.; Xiang, X.; Lv, N.; Kong, X. Optical flow and scene flow estimation: A survey. Pattern Recognit. 2021, 114, 107861. [Google Scholar] [CrossRef]
  6. Horn, B.K.P.; Schunck, B.G. Determining Optical Flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef]
  7. Lucas, B.D. An iterative technique of image registration and its application to stereo. In Proceedings of the 7th IJCAI, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  8. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  9. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470. [Google Scholar]
  10. Ranjan, A.; Black, M.J. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4161–4170. [Google Scholar]
  11. Sun, D.; Yang, X.; Liu, M.-Y.; Kautz, J. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
  12. Sun, D.; Yang, X.; Liu, M.-Y.; Kautz, J. Models matter, so does training: An empirical study of cnns for optical flow estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1408–1423. [Google Scholar] [CrossRef] [PubMed]
  13. Hui, T.-W.; Tang, X.; Loy, C.C. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8981–8989. [Google Scholar]
  14. Hui, T.-W.; Tang, X.; Loy, C.C. A lightweight optical flow cnn—Revisiting data fidelity and regularization. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2555–2569. [Google Scholar] [CrossRef] [PubMed]
  15. Yang, G.; Ramanan, D. Volumetric correspondence networks for optical flow. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019/hash/bbf94b34eb32268ada57a3be5062fe7d-Abstract.html (accessed on 26 November 2023).
  16. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 402–419. [Google Scholar]
  17. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Tao, D. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8121–8130. [Google Scholar]
  18. Bai, S.; Geng, Z.; Savani, Y.; Kolter, J.Z. Deep equilibrium optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 620–630. [Google Scholar]
  19. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Yu, F.; Tao, D.; Geiger, A. Unifying flow, stereo and depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 4, 13941–13958. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  21. Park, M.-H.; Cho, J.-H.; Kim, Y.-T. CNN Model with Multilayer ASPP and Two-Step Cross-Stage for Semantic Segmentation. Machines 2023, 11, 126. [Google Scholar] [CrossRef]
  22. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A naturalistic open source movie for optical flow evaluation. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 611–625. [Google Scholar]
  23. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  24. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. In Proceedings of the Thirty-first Conference on Neural Information Processing Systems, Long Beach, CA, USA, 7–9 December 2017. [Google Scholar]
  25. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  26. Yin, Z.; Darrell, T.; Yu, F. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6044–6053. [Google Scholar]
  27. Zhao, S.; Sheng, Y.; Dong, Y.; Chang, E.I.; Xu, Y. Maskflownet: Asymmetric feature matching with learnable occlusion mask. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6278–6287. [Google Scholar]
  28. Hur, J.; Roth, S. Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5754–5763. [Google Scholar]
Figure 1. The structures of FlowNetSimple and FlowNetCorr. 6 and 3 in the red circle indicate the number of channels in the input, and corr means the correlation of the feature map.
Figure 2. The structure of FlowNet2.
Figure 3. The structure of SpyNet.
Figure 4. The structure of PWC-Net.
Figure 5. The structure of LiteFlowNet.
Figure 6. The structure of RAFT.
Figure 7. The structure of Cross-Stage Partial DenseNet.
Figure 8. Network structure for optical flow calculation.
Figure 9. Feature and context encoder structure.
Figure 10. Update block works in optical flow.
Figure 11. Flying Chair and Flying Things dataset.
Figure 12. MPI Sintel dataset.
Figure 13. Comparing parameter counts, inference time, and training iterations vs. accuracy.
Figure 14. Optical flow results on tiger scene of MPI Sintel datasets.
Figure 15. Optical flow results on wall scene of MPI Sintel datasets.
Table 1. The number of parameters and inference times of the proposed model and traditional methods. Param is the number of learnable parameters in each method, and Time(s) is the time taken to infer an optical flow.

Model           Param      Time(s)
LiteFlowNetX    0.9 M      -
LiteFlowNet     5.4 M      0.971
IRR-PWC         6.4 M      0.211
PWC-Net         8.8 M      0.026
FlowNet2        162.0 M    0.124
VCN             6.2 M      0.262
RAFT            5.8 M      0.102
Proposed CAFT   4.2 M      0.059
Table 2. Optical flow dataset results for each model. 'C + T' denotes training only on the FlyingChairs and FlyingThings datasets, and 'C + T + S + K' denotes fine-tuning on the combination of the FlyingChairs, FlyingThings, Sintel, and KITTI training sets.

Training Data  Method              Sintel (Train)           KITTI-15 (Train)          Sintel (Test)            KITTI-15 (Test)
                                   Clean        Final       F1-epe       F1-all       Clean        Final       F1-All
C+T            HD3 [26]            3.84 (68%)   8.77 (76%)  13.17 (63%)  24.0 (47%)   -            -           -
               PWC-Net [11]        2.55 (52%)   3.93 (46%)  10.39 (53%)  28.5 (55%)   -            -           -
               LiteFlowNet [13]    2.48 (50%)   4.04 (47%)  10.39 (53%)  28.5 (55%)   -            -           -
               LiteFlowNet2 [14]   2.24 (45%)   3.78 (44%)  8.97 (46%)   25.9 (51%)   -            -           -
               MaskFlowNet [27]    2.25 (45%)   3.61 (41%)  -            23.1 (45%)   -            -           -
               FlowNet2 [9]        2.02 (39%)   3.54 (40%)  10.08 (52%)  30.0 (57%)   -            -           -
               RAFT [16]           1.43 (14%)   2.71 (21%)  5.04 (3%)    17.4 (26%)   -            -           -
               Proposed CAFT       1.23         2.13        4.87         12.8         -            -           -
C+T+S+K        LiteFlowNet2 [14]   1.30 (52%)   1.62 (40%)  1.47 (61%)   4.8 (77%)    3.48 (55%)   4.69 (49%)  7.74 (44%)
               PWC-Net+ [12]       1.71 (64%)   2.34 (58%)  1.5 (61%)    5.3 (79%)    3.45 (55%)   4.60 (48%)  7.72 (44%)
               MaskFlowNet [27]    -            -           -            -            2.52 (38%)   4.17 (43%)  6.10 (29%)
               VCN [15]            1.66 (63%)   2.24 (56%)  1.16 (50%)   4.1 (73%)    2.81 (45%)   4.40 (46%)  6.30 (32%)
               RAFT [16]           0.76 (18%)   1.22 (20%)  0.63 (8%)    1.5 (27%)    1.94 (20%)   3.18 (25%)  5.10 (15%)
               Proposed CAFT       0.62         0.98        0.58         1.1          1.55         2.37        4.31

