Article

Video Global Motion Compensation Based on Affine Inverse Transform Model

Nan Zhang, Weifeng Liu and Xingyu Xia
1 School of Electrical and Control Engineering, Shaanxi University of Science and Technology, Xi’an 710021, China
2 School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(18), 7750; https://doi.org/10.3390/s23187750
Submission received: 21 July 2023 / Revised: 2 September 2023 / Accepted: 5 September 2023 / Published: 8 September 2023
(This article belongs to the Special Issue Applications of Video Processing and Computer Vision Sensor II)

Abstract
Global motion greatly increases the number of false alarms for object detection in video sequences against dynamic backgrounds. Therefore, before detecting targets in a dynamic background, it is necessary to estimate and compensate for the global motion in order to eliminate its influence. In this paper, we use the SURF (speeded-up robust features) algorithm combined with the MSAC (M-estimate sample consensus) algorithm to process the video. The global motion of a video sequence is estimated from the matched feature point pairs of adjacent frames, yielding the global motion parameters of the video sequence under the dynamic background. On this basis, we propose an inverse transformation model of the affine transformation, which acts on each pair of adjacent frames of the video sequence in turn. The model compensates for the global motion and outputs a globally motion-compensated video sequence from a specific view for object detection. Experimental results show that the proposed algorithm can accurately perform motion compensation on video sequences containing complex global motion, and the compensated video sequences achieve a higher peak signal-to-noise ratio and better visual effects.

1. Introduction

At present, intelligent analysis technology has realized the detection, identification, tracking, and human behavior analysis of moving targets, and it is widely used in the military, intelligent transportation, medicine, and other fields [1,2,3]. Among these tasks, the detection of moving objects is the most fundamental and critical step. In a video sequence, depending on whether the recording device itself moves, object detection is mainly divided into target detection against a static background and target detection against a dynamic background [4,5,6,7].
The motion of the background is usually caused by the change in the position of the recording device, which is called global motion [8,9]. The movement of the foreground is the movement of the moving object relative to the recording device, which is a local movement [10]. In this paper, the dynamic background refers to the global motion caused by the transformation of the camera position during video shooting. The moving target to be detected refers to the local motion caused by the target movement in the video sequence.
Target detection in a static background mainly relies on the frame difference method [11], the background difference method [12], etc. These methods are very effective in static backgrounds and are well established with high accuracy [13,14]. However, global motion makes object detection in dynamic backgrounds more complicated than in static backgrounds. Most target detection and segmentation algorithms are suited to static backgrounds and cannot be effectively applied to dynamic backgrounds [15].
Moving object detection under a dynamic background [16,17] mainly relies on the optical flow method [18,19] and the global motion compensation method [20,21]. The global motion compensation method first estimates the global motion and then analyzes the motion parameters to perform motion compensation. In this way, the problem of object detection in a dynamic background is transformed into object detection in a static background. Global motion estimation first requires the global motion parameters, which are usually obtained by extracting and matching the feature points of two adjacent frames.
The motion vectors between images are obtained first, and the motion parameters are then obtained by fitting a motion model [22]. In practical applications, motion vectors are generally obtained by matching feature points between images. Commonly used feature point matching algorithms include SIFT (scale-invariant feature transform) [23], ORB (oriented FAST and rotated BRIEF) [24], and SURF (speeded-up robust features) [25].
The SIFT algorithm is scale invariant and can detect a large number of key points in an image for fast matching. However, since SIFT does not consider the geometric constraints of the scene, it suffers from a high mismatch rate [26,27]. ORB combines the FAST feature point detector [28] with the BRIEF feature descriptor [29] and further improves and optimizes both of them. SURF improves on the SIFT algorithm: through the combination of the Harris feature [30] and the integral image, the running speed is greatly improved and the mismatch rate of feature points is reduced [31].
Although ORB takes less time than SURF, it provides lower matching rates under rotations and shears of different strengths. Therefore, SURF is considered a suitable compromise between speed and performance. After the feature point matching pairs of the images are obtained by the SURF algorithm, the MSAC (M-estimate sample consensus) algorithm [32] is used to further extract the motion parameters contained in the matched pairs and to fit the affine transformation model [33], completing the global motion estimation.
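As a concrete illustration of this kind of pipeline, the following Python sketch detects and matches SURF feature points between two adjacent frames and then robustly fits a scale-rotation-translation model. It assumes an OpenCV build that includes the non-free xfeatures2d module for SURF, it uses cv2.RANSAC inside estimateAffinePartial2D as a stand-in for MSAC, and the Hessian threshold of 400 is a typical default rather than a value taken from this paper.

```python
import cv2
import numpy as np

def estimate_global_motion(prev_gray, curr_gray):
    """Sketch of the feature-based global motion estimation step.

    Assumes an OpenCV build with the non-free xfeatures2d module (SURF);
    cv2.RANSAC is used here as a stand-in for MSAC.
    """
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)

    # Match descriptors by Euclidean (L2) distance.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly fit a scale + rotation + translation model to the matches.
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                             ransacReprojThreshold=3.0)
    return M, inliers
```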
The traditional global motion compensation method uses two adjacent frames to perform global motion estimation and obtain a motion model. The previous frame is transformed by this model to predict the next frame, producing a compensated frame [34,35]. When a moving object moves slowly, the difference between its motion and the background motion in two consecutive frames is small, so the object is easily mistaken for the background. The frame difference between the latter frame and the compensated frame then yields an incomplete moving object [36].
H. Liu et al. [37] introduce a multiscale motion compensation network (MS-MCN) that works with a pyramid flow decoder to generate multiscale optical flows and perform motion-compensated prediction of the current frame from the previous frame. X. Liu et al. [38] propose a novel dynamic local filter network that performs implicit motion estimation and compensation by employing, via locally connected layers, sample-specific and position-specific dynamic local filters tailored to the target pixels. Son et al. [39] propose a motion compensation network that combines detail features with pre-computed structure features using a structure injection scheme and then uses a feature-matching-based motion compensation module to estimate the motion between the current and previous frames. Although deep-learning-based global motion compensation methods perform well, their high algorithmic complexity makes them costly in both time and space.
In this paper, we build on the traditional idea of global compensation and improve it. After the affine transformation model is obtained by performing global motion estimation on adjacent frames, an inverse transformation model of the affine transformation is established and applied to the latter frame, and compensated images of consistent size are output sequentially in a specific view. The video sequence composed of the compensated images can be processed directly with frame difference operations to extract moving targets.
To address the high false alarm rate of moving target detection caused by global motion, we start from the conversion of the dynamic background into a static background and propose a global motion compensation algorithm based on an affine inverse transform model. By combining the SURF algorithm with MSAC processing, we obtain an affine transformation matrix representing the global motion of the video sequence. On this basis, an inverse transformation model of the affine transformation is proposed and used to compensate for the global motion, realizing the conversion from a dynamic background to a static background in the video sequence. Finally, the moving targets are detected by the frame difference method.

2. Problem Description

The video sequence V under the dynamic background is regarded as an ordered set composed of n frames of images:
$$
\begin{aligned}
V_n(\Theta_n) &= \{\, I_1(\theta_1), I_2(\theta_2), \ldots, I_k(\theta_k), I_{k+1}(\theta_{k+1}), \ldots, I_n(\theta_n) \,\} \\
I_k(\theta_k) &= I_k^{b}(\theta_k^{b}) \cup I_k^{o}(\theta_k^{o}) \\
I_k^{o}(\theta_k^{o}) &= I_k^{o_1}(\theta_k^{o_1}) \cup I_k^{o_2}(\theta_k^{o_2}) \cup \cdots \cup I_k^{o_i}(\theta_k^{o_i}) \cup \cdots \cup I_k^{o_m}(\theta_k^{o_m})
\end{aligned} \tag{1}
$$
In the Equation (1), I k ( θ k ) represents the image of the kth frame; I k b ( θ k b ) represents the set of pixels that make up the background in the image of the kth frame; I k o ( θ k o ) represents the pixel point set of all moving objects in the image of the kth frame. Assume that there are m moving targets in total, among which, I k o i ( θ k o i ) constitutes the pixel point set of the moving target marked as i.
The parameters of 2D image transformations encompass translation, rotation, scaling, shearing, mirroring, and composite transformations. Translation refers to the distance by which an image is moved along its horizontal and vertical directions; it is typically represented by a horizontal translation value $t_x$ and a vertical translation value $t_y$, which can be positive or negative. In rotation, the object is rotated by a particular angle $\theta$ about its origin; the angle is typically measured in degrees (°), where a positive value indicates counterclockwise rotation and a negative value indicates clockwise rotation. Scaling changes the size of an object, either expanding or compressing its dimensions, and is achieved by multiplying the original coordinates of the object by the scaling factors.
Reflection produces the mirror image of the original object; in other words, it is equivalent to a rotation of 180°, and the size of the object does not change. A transformation that slants the shape of an object is called a shear transformation. There are two shear transformations, X-shear and Y-shear: one shifts the X coordinate values and the other shifts the Y coordinate values, so in each case only one coordinate changes while the other keeps its value. However, transformations are usually composite rather than occurring in isolation. Composite transformations can be achieved by multiplying the individual transformation matrices, as shown in Equation (2):
$$[T][X] = [X'], \qquad [T] = [T_1][T_2]\cdots[T_n] \tag{2}$$
where $[T_i]$ denotes the individual transformation matrices for translation, rotation, scaling, mirroring, and shearing, $[X]$ represents the original video frame, and $[X']$ the transformed frame. The application scenario of the proposed algorithm is global motion compensation for videos captured by a camera, where the degree of distortion is not high; hence, we only consider the transformation matrices for translation, rotation, and scaling.
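As a small illustration of how such composite transformations are formed, the following Python sketch builds translation, rotation, and scaling matrices in homogeneous coordinates and multiplies them together; the numeric values are arbitrary examples, not parameters used in this paper.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def rotation(theta_deg):
    t = np.deg2rad(theta_deg)          # positive angle = counterclockwise rotation
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def scaling(sx, sy):
    return np.array([[sx,  0.0, 0.0],
                     [0.0, sy,  0.0],
                     [0.0, 0.0, 1.0]])

# Composite transform [T] = [T1][T2][T3]: scale, then rotate, then translate
# (the matrices act right-to-left on a homogeneous point [u, v, 1]^T).
T = translation(5.0, -2.0) @ rotation(10.0) @ scaling(1.1, 1.1)
point = np.array([100.0, 50.0, 1.0])
print(T @ point)
```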
$$\theta_k^{b} = \left[ a_k, b_k, t_k \right]^{T}, \qquad \theta_k^{o_i} = \theta_k^{b} + \eta_k^{o_i} \tag{3}$$
In Equation (3), $\theta_k^{b}$ represents the motion state of the background pixels of the $k$th frame, that is, the global motion state. The motion parameters $a_k$, $b_k$, and $t_k$ describe the scaling, rotation, and translation states of the background, respectively. $\theta_k^{o_i}$ represents the motion state of the foreground moving object labeled $i$ in the $k$th frame, referred to its centroid; this motion state superimposes the current background motion state $\theta_k^{b}$ and the object's own motion state $\eta_k^{o_i}$.
Because the global motion and the motion of the moving objects are superimposed, it is difficult to distinguish the moving objects from the background, which makes their detection harder. In order to accurately and completely extract the moving targets in the video sequence, the influence of the global motion on the target motion must be removed as much as possible. The global motion compensation problem for video sequences under a dynamic background can then be described as follows: find a suitable algorithm to obtain the parameters $\hat{a}_k$, $\hat{b}_k$, $\hat{t}_k$ of the global motion estimate and, on this basis, seek a suitable global motion compensation method:
$$\hat{\theta}_k^{b} = \left[ \hat{a}_k, \hat{b}_k, \hat{t}_k \right]^{T}, \qquad \theta_k^{o_i\prime} = \theta_k^{b} + \eta_k^{o_i} - \hat{\theta}_k^{b} \tag{4}$$
In Equation (4), $\hat{\theta}_k^{b}$ represents the estimate of the motion state of the background pixels of the $k$th frame, that is, the estimate of the global motion state of the $k$th frame. The motion parameters $\hat{a}_k$, $\hat{b}_k$, and $\hat{t}_k$ describe the scaling, rotation, and translation states of the background in the global motion estimate, respectively. $\theta_k^{o_i\prime}$ represents the motion state of the moving object labeled $i$ in the $k$th frame after compensation.

3. Global Motion Compensation Algorithm Based on Affine Inverse Transform Model

3.1. Feature Point Matching

SURF is a commonly used feature point matching algorithm. When matching two images in the same scene, we first use the Hessian matrix to generate all the points of interest in the image. Then, we extract the feature points after eliminating the unreasonable interest points, and generate their descriptors. The matching pairs of feature points are obtained through the comparison of descriptors.
The Hessian matrix is a square matrix composed of the second-order partial derivatives of a multivariate function, and it describes the gray-level gradient changes in all directions. It generates interest points by collecting all "suspicious" extremum points. Before constructing the Hessian matrix, the image is Gaussian filtered with blur coefficient $\sigma$. Assume that a Hessian matrix $H$ is established at a pixel $P(u, v)$ of the grayscale image corresponding to the $k$th frame:
$$H = \begin{bmatrix} L_{uu}(P, \sigma) & L_{uv}(P, \sigma) \\ L_{uv}(P, \sigma) & L_{vv}(P, \sigma) \end{bmatrix} \tag{5}$$
In Equation (5), $L_{uu}(P, \sigma)$ is the convolution of the image $I_k(P)$ at the pixel point $P$ with the second-order Gaussian derivative $\partial^2 g(\sigma)/\partial u^2$, as shown in Equation (6). $L_{vv}(P, \sigma)$ and $L_{uv}(P, \sigma)$ are defined analogously, as shown in Equations (7) and (8):
$$L_{uu}(P, \sigma) = \frac{\partial^2 g(\sigma)}{\partial u^2} * I_k(P) \tag{6}$$
$$L_{vv}(P, \sigma) = \frac{\partial^2 g(\sigma)}{\partial v^2} * I_k(P) \tag{7}$$
$$L_{uv}(P, \sigma) = \frac{\partial^2 g(\sigma)}{\partial u \, \partial v} * I_k(P) \tag{8}$$
In which:
$$g(\sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(u^2 + v^2)/2\sigma^2} \tag{9}$$
It can be seen that the determinant of the Hessian matrix of each pixel is as follows:
$$\det(H) = L_{uu} L_{vv} - (0.9 L_{uv})^2 \tag{10}$$
Equation (10) is also the discriminant of the Hessian matrix, where 0.9 is a weighting coefficient. If the value of the determinant is nonzero, the pixel is judged to be a possible extremum point, and such a point is called an interest point.
In order to obtain the feature points that can be used for feature matching in $I_k(\theta_k)$, the interest points need to be screened. In the $d \times d \times d$ neighborhood of each interest point, we use a filter of size $d \times d$ to perform non-extremum suppression: each interest point is compared with the $d^3 - 1$ pixels in its scale-space and 2D image-space neighborhood. If it is not a maximum or a minimum, it is eliminated, and the remaining key points are saved as feature points. Figure 1 is a schematic diagram of non-extremum suppression with $d = 3$: the interest point marked × is compared with the surrounding 26 points marked O to eliminate non-extremum points.
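The following Python sketch illustrates these two steps, the det(H) response of Equation (10) and 3 × 3 × 3 non-extremum suppression, using exact Gaussian second derivatives from SciPy. The real SURF detector approximates these derivatives with box filters on an integral image, and the response threshold below is only a placeholder.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def hessian_response(gray, sigma):
    """det(H) response of Equation (10) for a 2-D float image, computed with
    true Gaussian second derivatives (SURF uses box-filter approximations)."""
    Luu = gaussian_filter(gray, sigma, order=(2, 0))
    Lvv = gaussian_filter(gray, sigma, order=(0, 2))
    Luv = gaussian_filter(gray, sigma, order=(1, 1))
    return Luu * Lvv - (0.9 * Luv) ** 2

def scale_space_maxima(gray, sigmas, threshold=1e-4):
    """Stack responses over several scales and keep points that are maxima in
    their 3x3x3 (scale x row x column) neighborhood, as in Figure 1 (d = 3).
    The threshold is a placeholder, not a value from the paper."""
    stack = np.stack([hessian_response(gray, s) for s in sigmas])
    is_max = (stack == maximum_filter(stack, size=3)) & (stack > threshold)
    # Skip the outermost scales, whose neighborhoods are incomplete.
    scales, rows, cols = np.nonzero(is_max[1:-1])
    return list(zip(scales + 1, rows, cols))
```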
As shown in Figure 2, in order to obtain feature point matching pairs, a descriptor must be generated for each feature point: a $g \times g$ grid of blocks is established centered on the feature point, each block containing $h \times h$ pixels, and the grid is rotated to the dominant orientation of the feature point. The Haar wavelet [40] responses are computed for each small block, and the feature vector shown in Equation (11) is used to represent the feature of the block:
$$F = \left[ \sum d_x, \ \sum \lvert d_x \rvert, \ \sum d_y, \ \sum \lvert d_y \rvert \right] \tag{11}$$
where $\sum d_x$ and $\sum d_y$ represent the sums of the Haar wavelet responses in the horizontal and vertical directions relative to the dominant orientation, respectively, and $\sum \lvert d_x \rvert$ and $\sum \lvert d_y \rvert$ represent the sums of the absolute values of those responses. Combining the feature vectors of the $g \times g$ blocks yields the $z = 4 \times g \times g$ dimensional descriptor of the feature point.
According to the above method, the feature point sets of two adjacent frames $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$ in the video sequence $V$ under the dynamic background are established as $\alpha^k = \{\alpha_1^k, \alpha_2^k, \ldots, \alpha_{n_k}^k\}$ and $\alpha^{k+1} = \{\alpha_1^{k+1}, \alpha_2^{k+1}, \ldots, \alpha_{n_{k+1}}^{k+1}\}$, where $n_k$ is the number of feature points of $I_k(\theta_k)$ and $n_{k+1}$ is the number of feature points of $I_{k+1}(\theta_{k+1})$. The descriptor sets corresponding to the feature point sets $\alpha^k$ and $\alpha^{k+1}$ are $R(\alpha^k) = \{R(\alpha_1^k), R(\alpha_2^k), \ldots, R(\alpha_{n_k}^k)\}$ and $R(\alpha^{k+1}) = \{R(\alpha_1^{k+1}), R(\alpha_2^{k+1}), \ldots, R(\alpha_{n_{k+1}}^{k+1})\}$.
All descriptors in the sets $R(\alpha^k)$ and $R(\alpha^{k+1})$ have $z$-dimensional features. By comparing the descriptors in the two sets, the feature points of images $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$ are matched. Euclidean distance is used to measure their similarity: the shorter the Euclidean distance, the better the match between the two feature points. Finally, the best matches are retained as feature point matching pairs:
$$R(\alpha_p^k) = \left( r_{p1}^k, r_{p2}^k, \ldots, r_{pz}^k \right), \quad R(\alpha_p^k) \in R(\alpha^k), \ \alpha_p^k \in \alpha^k, \ p = 1, 2, \ldots, n_k \tag{12}$$
$$R(\alpha_q^{k+1}) = \left( r_{q1}^{k+1}, r_{q2}^{k+1}, \ldots, r_{qz}^{k+1} \right), \quad R(\alpha_q^{k+1}) \in R(\alpha^{k+1}), \ \alpha_q^{k+1} \in \alpha^{k+1}, \ q = 1, 2, \ldots, n_{k+1} \tag{13}$$
$$d\!\left( R(\alpha_p^k), R(\alpha_q^{k+1}) \right) = \left\| R(\alpha_p^k) - R(\alpha_q^{k+1}) \right\|_2 = \sqrt{ \sum_{j=1}^{z} \left( r_{pj}^k - r_{qj}^{k+1} \right)^2 } \tag{14}$$
In Equations (12)–(14), $R(\alpha_p^k)$ represents the descriptor of the $p$th feature point in the descriptor set $R(\alpha^k)$ of image $I_k(\theta_k)$, and $r_{pz}^k$ represents the $z$th dimension of that descriptor; $R(\alpha_q^{k+1})$ represents the descriptor of the $q$th feature point in the descriptor set $R(\alpha^{k+1})$ of image $I_{k+1}(\theta_{k+1})$, and $r_{qz}^{k+1}$ represents the $z$th dimension of that descriptor.
$d(R(\alpha_p^k), R(\alpha_q^{k+1}))$ measures the similarity between the descriptors $R(\alpha_p^k)$ and $R(\alpha_q^{k+1})$. When this value is less than the threshold $\lambda$, the corresponding feature points $\alpha_p^k$ and $\alpha_q^{k+1}$ are called a feature point matching pair.
All feature point matching pairs that meet the above condition together form the set of feature point matching pairs between images $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$:
$$S_f = \left\{ \left( \alpha_p^k, \alpha_q^{k+1} \right) \ \middle|\ \alpha_p^k \in \alpha^k, \ \alpha_q^{k+1} \in \alpha^{k+1}, \ d\!\left( R(\alpha_p^k), R(\alpha_q^{k+1}) \right) < \lambda \right\} \tag{15}$$
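A minimal Python sketch of this nearest-neighbor matching with the threshold λ is shown below. It operates on precomputed descriptor arrays and only illustrates Equations (14) and (15); it is not the exact matching strategy of the SURF implementation used in this paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_descriptors(desc_k, desc_k1, lam):
    """Form the set S_f of Equation (15): for each descriptor of frame k, take
    the nearest descriptor of frame k+1 in Euclidean distance and keep the
    pair only if that distance is below the threshold lambda.

    desc_k  : (n_k, z)   array of z-dimensional descriptors of frame k
    desc_k1 : (n_k1, z)  array of descriptors of frame k+1
    """
    d = cdist(desc_k, desc_k1, metric='euclidean')   # all pairwise distances
    nearest = d.argmin(axis=1)                        # best match for each p
    keep = d[np.arange(len(desc_k)), nearest] < lam   # threshold test of Eq. (15)
    return [(p, q) for p, q in zip(np.nonzero(keep)[0], nearest[keep])]
```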

3.2. Global Motion Estimation

To obtain accurate motion parameters between adjacent frame images I k ( θ k ) and I k + 1 ( θ k + 1 ) in a video sequence V with dynamic backgrounds, and to estimate the background motion, it is necessary to eliminate a portion of feature point matches in S f that do not satisfy the motion transformation. In this paper, we combine the MSAC algorithm to remove outliers as much as possible from the set S f of feature point matches obtained by SURF. MSAC is a variant of RANSAC (Random Sample Consensus) [41] that overcomes the sensitivity to thresholds in RANSAC by modifying the cost function. Additionally, the MSAC algorithm not only considers the number of model data points but also reflects the degree of fit of the model data, making it superior to the RANSAC algorithm overall.
The majority of the feature point matches in the set $S_f$ between images $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$ can be generated by a single model, and at least $n_s$ point pairs ($n_s \le \min(n_k, n_{k+1})$) are available for fitting the model parameters. These parameters are estimated iteratively as follows:
1. Randomly select $n_s$ feature point matching pairs from the set $S_f$ and use them to fit a candidate model $M_k$.
2. For the remaining feature points in S f , calculate the transformation error for each point. If the error exceeds a threshold, mark it as an outlier; otherwise, identify it as an inlier and add it to the set I S for further record.
3. If the cost function C of the current inlier set I S is smaller than the cost function C b e s t of the best inlier set I S b e s t , update I S b e s t = I S .
4. This entire process constitutes one iteration. If the number of iterations exceeds $G$, terminate the process; otherwise, increment the iteration count and repeat the above steps. The value of the iteration count $G$ is determined by Equation (16).
$$G = \frac{\log(1 - w)}{\log\left(1 - \omega^{n_s}\right)} \tag{16}$$
where $w$ represents the probability that at least one of the $G$ random samples of $n_s$ point pairs consists entirely of inliers, typically set to 0.99, and $\omega$ represents the ratio of inliers among the feature point matches. The cost function used during the iteration process is defined by Equation (17).
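As a small worked example of Equation (16), with illustrative values that are not taken from this paper, an inlier ratio of 0.5 and a minimal sample of three point pairs require roughly 35 iterations:

```python
import math

w, omega, n_s = 0.99, 0.5, 3   # illustrative values; three point pairs suffice to fit an affine model
G = math.log(1 - w) / math.log(1 - omega ** n_s)
print(math.ceil(G))            # -> 35
```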
$$C = \sum_{s \in S} L\left( W(s, \varphi) \right) \tag{17}$$
where $W$ represents the error function, $\varphi$ represents the estimated model parameters, $S$ represents the set of matching point pairs, $s$ represents a pair of matching points in the set, and $L$ represents the loss function, defined by Equation (18).
$$L(\gamma) = \begin{cases} e, & \gamma \le T \\ T, & \gamma > T \end{cases} \tag{18}$$
Here, $e$ represents the error computed with the error function $W$, and $T$ is the error threshold used to distinguish inliers. Furthermore, because the foreground objects and the background move at different speeds, and the number of feature points on the foreground objects is much smaller than that on the background, foreground feature points whose transformation errors exceed the threshold are identified as outliers and eliminated during the iterative process of obtaining the final motion model.
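The following Python sketch is a minimal, self-contained version of such an MSAC loop for the six-parameter affine model of Equation (19). The sample size, error threshold, and iteration count are placeholders, and a practical implementation would normally rely on a library routine such as the one used in this paper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares fit of the affine model of Equation (19) from matched
    points src -> dst, each given as an (n, 2) array."""
    n = len(src)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = src          # u' = a1*u + b1*v + t1
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = src          # v' = b2*u + a2*v + t2
    A[1::2, 5] = 1.0
    p, *_ = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)
    return np.array([[p[0], p[1], p[4]],
                     [p[2], p[3], p[5]],
                     [0.0,  0.0,  1.0]])

def msac_affine(src, dst, n_s=3, thresh=3.0, iters=35, seed=0):
    """Minimal MSAC loop: candidate models are ranked by the truncated-loss
    cost of Equations (17) and (18), not only by their inlier counts."""
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_cost, best_inliers = np.inf, None
    for _ in range(iters):
        sample = rng.choice(len(src), size=n_s, replace=False)
        M = fit_affine(src[sample], dst[sample])
        pred = (src_h @ M.T)[:, :2]                 # apply the candidate model
        err = np.linalg.norm(pred - dst, axis=1)    # reprojection error
        cost = np.minimum(err, thresh).sum()        # truncated loss of Eq. (18)
        if cost < best_cost:
            best_cost, best_inliers = cost, err < thresh
    # Refit the model on the best inlier set.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```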
The inlier matches between images $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$ obtained by the MSAC algorithm form a new set of matches $S_i$. In addition, the iterative process yields a global motion estimation model for the motion between $I_k(\theta_k)$ and $I_{k+1}(\theta_{k+1})$, namely the affine transformation matrix shown in Equation (19). Taking any pair of matching points $\alpha_p^k(u_p, v_p)$ and $\alpha_q^{k+1}(u_q, v_q)$ from the set $S_i$, where $\alpha_p^k \in \alpha^k$ and $\alpha_q^{k+1} \in \alpha^{k+1}$, the affine transformation matrix $M_k$ is applied to the feature point $\alpha_p^k$ to obtain the point $\hat{\alpha}_q^{k+1}(\hat{u}_q, \hat{v}_q)$, as shown in Equation (20).
$$M_k = \begin{bmatrix} a_1^k & b_1^k & t_1^k \\ b_2^k & a_2^k & t_2^k \\ 0 & 0 & 1 \end{bmatrix} \tag{19}$$
$$\begin{bmatrix} \hat{u}_q \\ \hat{v}_q \\ 1 \end{bmatrix} = \begin{bmatrix} a_1^k & b_1^k & t_1^k \\ b_2^k & a_2^k & t_2^k \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_p \\ v_p \\ 1 \end{bmatrix} = \begin{bmatrix} a_1^k u_p + b_1^k v_p + t_1^k \\ b_2^k u_p + a_2^k v_p + t_2^k \\ 1 \end{bmatrix} \tag{20}$$
In Equation (20), $(t_1^k, t_2^k)$ represents the translation between the two feature point coordinates, $(a_1^k, a_2^k)$ reflects the corresponding scaling changes, and $(b_1^k, b_2^k)$ reflects the corresponding rotation changes. These six parameters together serve as the estimate $\hat{\theta}_{(k+1)}^{b}$ of the global motion state of the image $I_{k+1}(\theta_{k+1})$. After applying the above method to obtain the affine transformation matrices $M = \{M_1, M_2, \ldots, M_k, \ldots, M_{n-1}\}$ that describe the global motion between adjacent frames of the video sequence $V$ with a dynamic background, the changes in the motion parameters $(a, b, t)$ can be used to estimate the variation of the background motion throughout the entire video sequence.

3.3. Global Motion Compensation

To describe the image, as shown in Figure 3, this paper defines a Cartesian coordinate system u v with the top-left corner of the image as the origin. In this coordinate system, the coordinates ( u , v ) of each pixel represent the column and row numbers of that pixel in the image array, and the value corresponds to the grayscale intensity of the pixel. This u v coordinate system is based on pixel units and serves as the image coordinate system.
However, since the image coordinate system alone cannot represent the physical position of each pixel in the image, this paper also establishes an x y imaging plane coordinate system based on centimeters as the unit. This coordinate system has its origin at the center of the image. Figure 3 illustrates the relationship between the pixel coordinate system ( u v ) and the imaging plane coordinate system ( x y ) .
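A minimal sketch of this pixel-to-imaging-plane conversion is given below; the physical pixel sizes dx and dy (and the assumption that both axes share the image-array orientation) are placeholders, since the sensor geometry is not specified in the paper.

```python
def pixel_to_plane(u, v, width, height, dx=1.0, dy=1.0):
    """Convert pixel coordinates (u, v), with the origin at the top-left
    corner of the image, to imaging-plane coordinates (x, y) centered on the
    image. dx and dy are the assumed physical sizes of one pixel (e.g., in
    centimeters); they are placeholders, not values from the paper."""
    x = (u - width / 2.0) * dx
    y = (v - height / 2.0) * dy
    return x, y
```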
Based on the obtained motion parameters between adjacent frame images, this paper proposes a global motion compensation algorithm based on the affine inverse transformation model. This algorithm compensates for the motion in the video sequence V with dynamic backgrounds on a frame-by-frame basis. To illustrate the process, let us consider the global motion compensation of the k + 1 th frame image I k + 1 ( θ k + 1 ) .
First, the affine transformation matrices M 1 , M 2 , , M k are inverted to obtain the inverse transformation matrices M 1 1 , M 2 1 , , M k 1 , where the affine transformation matrix M k is defined as shown in Equation (19). Next, the inverse transformation matrices are applied to each pixel of the image I k + 1 ( θ k + 1 ) , as shown in Equation (21), resulting in a new frame image I ˜ k + 1 ( θ ˜ k + 1 ) .
u ˜ k + 1 v ˜ k + 1 1 = M 1 1 M 2 1 M k 1 u k + 1 v k + 1 1
in which, ( u k + 1 , v k + 1 ) represents the pixel coordinates of I k + 1 ( θ k + 1 ) , while ( u ˜ k + 1 , v ˜ k + 1 ) represents the pixel coordinates of the resulting image I ˜ k + 1 ( θ ˜ k + 1 ) .
Due to the transformation calculations, the resulting pixel coordinates $(\tilde{u}_{k+1}, \tilde{v}_{k+1})$ of $I_{k+1}(\theta_{k+1})$ change the position of the image $\tilde{I}_{k+1}(\tilde{\theta}_{k+1})$ within its imaging plane. In addition, each frame has a different imaging plane coordinate system, resulting in inconsistent imaging standards across the images. To address this issue, this paper adopts the imaging plane coordinate system $x_1 y_1$ of the first frame $I_1(\theta_1)$ as the imaging plane coordinate system of the output view, and the pixel coordinate range of $I_1(\theta_1)$ as the pixel range of the output view, as shown in Figure 4. The image $\tilde{I}_{k+1}(\tilde{\theta}_{k+1})$, obtained by the inverse affine transformation, is placed in this output view, resulting in the globally motion-compensated image $I_{k+1}^{\prime}(\theta_{k+1}^{\prime})$.
Since the original video sequence $V$ contains global motion, different frames capture scenes that are not completely consistent: certain pixels present in one frame may not appear in subsequent frames. Therefore, if the pixel coordinates of $\tilde{I}_{k+1}(\tilde{\theta}_{k+1})$ in the $x_1 y_1$ coordinate system exceed the pixel range of the output view (the non-shaded region in Figure 4), the scene composed of those pixels does not appear in the first frame, and those pixels are excluded from the output view. The final output image $I_{k+1}^{\prime}(\theta_{k+1}^{\prime})$ only includes the pixels of $\tilde{I}_{k+1}(\tilde{\theta}_{k+1})$ that fall within the pixel range of the output view. As depicted in Figure 4, the pixel values in the shaded region of the output view are taken from the corresponding pixels of $\tilde{I}_{k+1}(\tilde{\theta}_{k+1})$, while the remaining (non-shaded) region of the output view is assigned a pixel value of 0.
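The following Python sketch summarizes the compensation step: the inverses of the frame-to-frame affine matrices are composed as in Equation (21), and each frame is warped into the first frame's view with out-of-view pixels set to 0. OpenCV's warpPerspective is used here as a convenient stand-in; this is an illustration of the procedure, not the paper's exact implementation.

```python
import cv2
import numpy as np

def compensate_sequence(frames, affine_mats):
    """Warp each frame into the view of the first frame by composing the
    inverses of the frame-to-frame affine models (Equation (21)).

    frames      : list of images [I_1, ..., I_n]
    affine_mats : list of 3x3 matrices [M_1, ..., M_{n-1}], where M_k maps
                  points of frame k to frame k+1 as in Equation (19).
    """
    h, w = frames[0].shape[:2]          # the output view keeps I_1's size
    out = [frames[0]]
    cumulative = np.eye(3)
    for k, frame in enumerate(frames[1:]):
        # Accumulate M_1^{-1} M_2^{-1} ... M_k^{-1}
        cumulative = cumulative @ np.linalg.inv(affine_mats[k])
        warped = cv2.warpPerspective(frame, cumulative, (w, h),
                                     flags=cv2.INTER_LINEAR,
                                     borderValue=0)   # out-of-view pixels = 0
        out.append(warped)
    return out
```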
The motion state $\theta_{(k+1)}^{o_i\prime}$ of the moving target labeled $i$ in the compensated $(k+1)$th frame image $I_{k+1}^{\prime}(\theta_{k+1}^{\prime})$ can be represented by Equation (22).
$$\theta_{(k+1)}^{o_i\prime} = \theta_{(k+1)}^{b} + \eta_{(k+1)}^{o_i} - \hat{\theta}_{(k+1)}^{b} \tag{22}$$
In Equation (22), θ ( k + 1 ) b represents the global motion state in the image I k + 1 ( θ k + 1 ) , η ( k + 1 ) o i represents the motion state of the motion target labeled as i in the image I k + 1 ( θ k + 1 ) , and θ ^ ( k + 1 ) b represents the estimation of the global motion state in the image I k + 1 ( θ k + 1 ) .
The global motion compensation based on the inverse transformation model of the affine transformation is applied to $\{I_2(\theta_2), I_3(\theta_3), \ldots, I_n(\theta_n)\}$ in turn. The resulting compensated output images are then concatenated in sequential order to form a new video sequence $V^{\prime}$.
$$V_n^{\prime}(\Theta_n^{\prime}) = \{\, I_1^{\prime}(\theta_1^{\prime}), I_2^{\prime}(\theta_2^{\prime}), \ldots, I_k^{\prime}(\theta_k^{\prime}), I_{k+1}^{\prime}(\theta_{k+1}^{\prime}), \ldots, I_n^{\prime}(\theta_n^{\prime}) \,\} \tag{23}$$
In the sequence $\{I_1^{\prime}(\theta_1^{\prime}), I_2^{\prime}(\theta_2^{\prime}), \ldots, I_k^{\prime}(\theta_k^{\prime}), I_{k+1}^{\prime}(\theta_{k+1}^{\prime}), \ldots, I_n^{\prime}(\theta_n^{\prime})\}$, the image coordinate system and the imaging plane coordinate system are consistent throughout; they are established based on the reference image $I_1(\theta_1)$. The video sequence $V^{\prime}$ mainly captures the motion of the foreground objects, while the background motion is almost negligible.
Adjacent frames of the video sequence $V^{\prime}$ are subtracted by frame differencing, yielding the absolute difference of the grayscale values of the two frames, as shown in Equation (24). By comparing this difference with a threshold, the motion characteristics of the video can be analyzed to determine the presence of moving objects in the image sequence.
$$D_k(\tau, \upsilon) = \left| I_{k+1}^{\prime}(\tau, \upsilon) - I_k^{\prime}(\tau, \upsilon) \right| \tag{24}$$
In Equation (24), $I_{k+1}^{\prime}(\tau, \upsilon)$ and $I_k^{\prime}(\tau, \upsilon)$ represent the grayscale values of the pixel at $(\tau, \upsilon)$ in the adjacent frames $I_k^{\prime}(\theta_k^{\prime})$ and $I_{k+1}^{\prime}(\theta_{k+1}^{\prime})$ of the video sequence $V^{\prime}$.
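A minimal Python sketch of this detection step is given below; the threshold value tau is an illustrative placeholder rather than a value reported in the paper.

```python
import cv2

def detect_motion(prev_comp, curr_comp, tau=25):
    """Frame differencing on two compensated frames (Equation (24)):
    absolute grayscale difference followed by a fixed threshold tau.
    The input frames are assumed to be BGR color images."""
    g1 = cv2.cvtColor(prev_comp, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(curr_comp, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g2, g1)
    _, mask = cv2.threshold(diff, tau, 255, cv2.THRESH_BINARY)
    return mask
```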

3.4. Computational Complexity

In SURF, assuming the image size is $W \times H$ and the sampling step is $s$, the complexity of feature detection is roughly $O((W/s) \times (H/s))$, where $W/s$ and $H/s$ represent the image size after sampling. For each detected interest point, the calculation of the main orientation involves computing Haar wavelet responses within a region and identifying the direction with the highest response. Assuming the complexity of the Haar wavelet response calculation for each interest point is $O(P^2)$, where $P$ represents the size of the computation region, and that there are $N$ detected interest points, the complexity of main orientation calculation is $O(N \times P^2)$. With descriptor dimension $D$, number of wavelet responses $M$, computation region size $P \times P$, and $N$ interest points, the complexity of descriptor generation is roughly $O(N \times D \times M \times P^2)$. In summary, the overall computational complexity of the SURF algorithm can be approximately represented as
O ( S U R F ) = O ( ( W / s ) × ( H / s ) ) + O ( N × P 2 ) + O ( N × D × M × P 2 )
The MSAC algorithm requires multiple iterations to find an appropriate model, where each iteration involves sampling, model estimation, and inlier-outlier classification. Assuming a total of T iterations, each iteration involves O ( P ) operations, where P is the number of data points. Therefore, the total complexity of iterations is O ( T × P ) . In each iteration, MSAC randomly samples a subset of data from the dataset for model estimation. Assuming each sampling involves S data points, and a total of T iterations, the complexity of sampling is O ( T × S ) . During each iteration, MSAC needs to estimate model parameters and compute the fitting error between data points and the model. The complexity of model estimation and evaluation typically depends on the problem’s characteristics and the chosen model. Assuming the complexity of model estimation and evaluation in each iteration is O ( F ) , the total complexity for model estimation and evaluation is O ( T × F ) . To sum up, the overall computational complexity of the MSAC algorithm can be approximately represented as
O ( M S A C ) = O ( T × P ) + O ( T × S ) + O ( T × F )
where $T$ is the number of iterations, $P$ is the number of data points, $S$ is the sampling size, and $F$ is the complexity of model estimation and evaluation.
In the global motion compensation module, the inversion of an affine transformation matrix has a computational complexity that can be considered at a constant level, denoted as O ( 1 ) . Assuming that the computational complexity of matrix multiplication and vector addition is O ( 3 ) , then the computational complexity of motion compensation is O ( 3 ) .
The overall complexity of the algorithm in this paper is
O ( O U R S ) = O ( S U R F ) + O ( M S A C ) + O ( 3 )
Son et al. [39] present an efficient multi-task network (RVDMC) for real-time video deblurring and motion compensation that shares computation and features between tasks by extracting useful details and injecting structural information, enabling state-of-the-art efficiency and flexible quality-speed trade-offs. The complexity of this algorithm can be expressed as follows:
The network architecture of the algorithm is composed of several multi-task units (MTUs). Each MTU consists of three main components: a multi-task detail network $F_n$, a deblurring network $D_n$, and a motion compensation network $M_n$, where $n$ is the index of the multi-task unit. $H$ and $W$ are the image height and width; $C$ is the number of channels in the feature maps; $K$ is the kernel size of the convolutions; $S$ is the stride of the convolutions; $D$ is the size of the matching window for motion estimation; and $N$ is the number of stacked multi-task units. Thus, the complexity of a single stack of $F_n$ is
O ( F n ) = O ( H × W × C 2 × K 2 ) + N × O ( H × W × C 2 × K 2 ) = O ( N × H × W × C 2 × K 2 )
In the motion compensation network M n , only cost volume calculation is considered.
O ( M n ) = O ( H × W × C 2 × K 2 )
For deblurring network D n , there is only a single convolution. The complexity of D n is:
O ( D n ) = O ( H × W × C 2 × K 2 )
The total complexity of the RVDMC algorithm is:
O ( R V D M C ) = O ( N × H × W × C 2 × K 2 )
As can be seen from Equations (27) and (31), the complexity of the proposed algorithm in this paper is much lower than the global motion compensation algorithm based on deep learning.

4. Experiment

In this section, we conducted background motion compensation experiments on two video sequences, V P and V Q , with dynamic backgrounds. The V P sequence consists of 280 frames, while the V Q sequence consists of 210 frames.
$$\begin{aligned} V_P^{\{1:280\}}\left(\theta_P^{\{1:280\}}\right) &= \{\, I_{P1}(\theta_{P1}), I_{P2}(\theta_{P2}), \ldots, I_{P280}(\theta_{P280}) \,\} \\ V_Q^{\{1:210\}}\left(\theta_Q^{\{1:210\}}\right) &= \{\, I_{Q1}(\theta_{Q1}), I_{Q2}(\theta_{Q2}), \ldots, I_{Q210}(\theta_{Q210}) \,\} \end{aligned}$$
In the experiments, the proposed algorithm is used to estimate and compensate for the global motion of the $V_P$ and $V_Q$ video sequences. We then perform moving object detection on these sequences to demonstrate the effectiveness and accuracy of the proposed algorithm in background motion compensation.

4.1. Obtaining Valid Feature Point Matches

The video sequences V P and V Q , chosen for the experiment, have frame sizes of the common display standard 1280 × 720 . To highlight the effects of feature point matching, we selected two frames with a significant time interval for experimentation. Specifically, we chose the first frame and the 200th frame. We applied the SURF feature point extraction and matching techniques to these frames and combined the results with the MSAC algorithm to obtain inlier point matches. The resulting feature points and inlier point matches for V P are shown in Figure 5, while the results for V Q are displayed in Figure 6.
In Figure 5, there are a total of 1103 feature point matches and 619 inlier point matches between the first frame and the 200th frame of V P . In Figure 6, there are 593 feature point matches and 155 inlier point matches between the first frame and the 200th frame of V Q . It can be observed that the number of inlier points is significantly lower compared to the total number of feature points. This indicates that a portion of feature points that are not suitable for fitting the motion model has been eliminated.

4.2. Motion Estimation on Video Sequences

The affine transformation model consists of six motion parameters, as shown in Equation (19). Table 1 presents the values of these six parameters for the two sets of images mentioned in Section 4.1.
According to the affine transformation model, a 1 and a 2 represent the scaling of the recording device, with values close to 1 indicating minimal changes in image size. b 1 and b 2 represent the rotation of the recording device, with small values suggesting that the images have undergone little to no rotation. t 1 and t 2 represent the translation motion of the recording device, with larger values indicating significant translational movement of the device.
For all the consecutive frame images in the video sequences V P and V Q , we followed the aforementioned steps and obtained a total of 279 and 209 sets of parameter values, respectively. These parameter values can be categorized into scaling parameters, rotation parameters, and translation parameters based on their nature. Additionally, corresponding two-dimensional line graphs were created to visualize the dynamic changes of the background motion parameters. The resulting plot depicting the variation of the background motion parameters is shown in Figure 7.
Figure 7a,d show the scaling changes of the two video sequences, Figure 7b,e depict the rotation variations, and Figure 7c,f illustrate the translation changes. Each of the six plots demonstrates a different degree of background motion variation, providing a comprehensive motion estimation analysis for the two video sequences. The motion estimation results reveal that both video sequences, $V_P$ and $V_Q$, exhibit background motion consisting of scaling, rotation, and translation transformations, without following any specific pattern or regularity.

4.3. Global Motion Compensation and Object Detection

After estimating the global motion, we apply the proposed background motion compensation algorithm to the video sequences $V_P$ and $V_Q$ with dynamic backgrounds. We invert the affine transformation matrices between adjacent frames of $V_P$ and $V_Q$ and apply the inverse transformation matrices to the subsequent frames, as shown in Equation (21). Finally, we output the new video sequences $V_P^{\prime}$ and $V_Q^{\prime}$ in the views created from the reference frames $I_{P1}(\theta_{P1})$ and $I_{Q1}(\theta_{Q1})$, respectively.
$$\begin{aligned} V_P^{\prime\,\{1:280\}}\left(\theta_P^{\prime\,\{1:280\}}\right) &= \{\, I_{P1}^{\prime}(\theta_{P1}^{\prime}), I_{P2}^{\prime}(\theta_{P2}^{\prime}), \ldots, I_{P280}^{\prime}(\theta_{P280}^{\prime}) \,\} \\ V_Q^{\prime\,\{1:210\}}\left(\theta_Q^{\prime\,\{1:210\}}\right) &= \{\, I_{Q1}^{\prime}(\theta_{Q1}^{\prime}), I_{Q2}^{\prime}(\theta_{Q2}^{\prime}), \ldots, I_{Q210}^{\prime}(\theta_{Q210}^{\prime}) \,\} \end{aligned}$$
The video sequences $V_P^{\prime}$ and $V_Q^{\prime}$ are the result of global motion compensation using the inverse affine transformation model, which turns them into video sequences with static backgrounds. Figure 8 shows the first frame $I_{P1}(\theta_{P1})$ of $V_P^{\prime}$ and the adjacent frames $I_{P125}^{\prime}(\theta_{P125}^{\prime})$ and $I_{P126}^{\prime}(\theta_{P126}^{\prime})$ after global motion compensation. Figure 9 shows the first frame $I_{Q1}(\theta_{Q1})$ of $V_Q^{\prime}$ and the adjacent frames $I_{Q41}^{\prime}(\theta_{Q41}^{\prime})$ and $I_{Q42}^{\prime}(\theta_{Q42}^{\prime})$ after compensation.
From Figure 8 and Figure 9, it can be observed that due to the presence of global motion in V P and V Q , different frames capture slightly different scenes. After applying the proposed algorithm for global motion compensation, there may be pixels that fall outside the view boundaries or do not satisfy the output view criteria. These pixels result in varying degrees of black borders, as expected in the compensated images.
Based on the obtained video sequences V P and V Q with static backgrounds, motion detection can be performed by computing the frame differences between consecutive frames. To demonstrate the effectiveness of the algorithm proposed in the paper, a comparison is made between the direct frame difference images, the frame difference images obtained using a traditional global motion estimation algorithm with compensation frames, and the frame difference images obtained using the algorithm proposed in the paper. Figure 10 and Figure 11 illustrate this comparison.
Figure 10a,b show the 223rd and 224th frames of the $V_P$ sequence, respectively. Figure 10c shows the frame difference image obtained directly from these two frames. Figure 10d shows the frame difference image obtained using the traditional algorithm with compensation frames. Figure 10e shows the frame difference image obtained using the algorithm proposed in this paper.
Comparing the three frame difference images reveals significant differences in the object detection results. In Figure 10c, the direct frame difference result, the trees and buildings in the background are detected along with the moving objects. In Figure 10d, the frame difference result of the traditional algorithm, some moving objects are detected, but the detection is incomplete. In Figure 10e, the result of the algorithm proposed in this paper, the moving objects are detected and their outlines are clearly visible.
Figure 11a,b show the 134th and 135th frames of the $V_Q$ sequence, respectively. Figure 11c shows the frame difference image obtained directly from these two frames. Figure 11d shows the frame difference image obtained using the traditional algorithm with compensation frames. Figure 11e shows the frame difference image obtained using the algorithm proposed in this paper.
In comparison to the V P sequence, the V Q sequence has more background motion, which leads to stronger interference in the direct frame difference result. In Figure 11c, the objects and the background are merged together, making it difficult to distinguish the moving objects. In Figure 11d, the result obtained using the traditional algorithm, there are more noise points and the edges of the objects are not clear. However, in Figure 11e, the result obtained using the algorithm proposed in the paper, the interference from the global motion is significantly reduced, and the moving objects are clearly detected.
In this paper, we adopt an objective quality metric called Peak Signal to Noise Ratio (PSNR) to demonstrate the effectiveness of the proposed algorithm. A higher PSNR value indicates lower image distortion. We calculate the PSNR values between the grayscale frames of adjacent frames in the dynamic background video sequences V P and V Q . We also calculate the PSNR values between the compensated frames obtained by the traditional global motion estimation algorithm and the original grayscale frames. Furthermore, we calculate the PSNR values between the adjacent frames after applying the proposed global motion compensation algorithm. We compare the PSNR values obtained from these three algorithms to assess the level of image distortion. The results are presented in Figure 12 for visual comparison.
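For reference, the following Python sketch implements the standard PSNR definition for 8-bit frames used in such comparisons; it is generic code (equivalent to OpenCV's cv2.PSNR), not code taken from the paper.

```python
import numpy as np

def psnr(frame_a, frame_b, peak=255.0):
    """PSNR between two frames of the same size (8-bit grayscale assumed)."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    mse = np.mean((a - b) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# The curves of Figure 12 are obtained by evaluating psnr() on consecutive
# frame pairs of the original and of the compensated sequence, e.g.:
#   p3 = [psnr(f0, f1) for f0, f1 in zip(comp_frames[:-1], comp_frames[1:])]
```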
Figure 12 presents the PSNR value variations. p 1 represents the PSNR value changes between the original adjacent frames, p 2 represents the PSNR value changes between the compensated frames obtained by the traditional global motion method and the original grayscale frames, and p 3 represents the PSNR value changes between the adjacent frames obtained by the proposed algorithm.
From the PSNR value variations in Figure 12a,b, it can be observed that the proposed algorithm achieves the highest PSNR values for the adjacent frames in both video sequences V P and V Q . This indicates that the proposed algorithm yields the least distortion in the frames after global motion compensation. Therefore, it confirms the robustness and accuracy of the proposed algorithm.
The video sequence $V_P$ comprises 280 frames, while $V_Q$ consists of 210 frames. Figure 13 illustrates the runtime variation curves for each image in the video sequences $V_P$ and $V_Q$. Excluding the algorithm initialization time, the average processing time of the proposed approach is 0.35 s per image.
Shao et al. [1] designed a lightweight parallel network (HRSiam) with a high spatial resolution to locate the small objects in satellite videos. Son et al. [39] proposed a real-time video deblurring framework (RVDMC) consisting of a lightweight multi-task unit that supports both video deblurring and motion compensation in an efficient way. Table 2 compares the running time of these two deep learning models and our proposed algorithm on one image. The runtime data for the deep learning models listed in Table 2 are based on PyTorch (PT) model executions and do not account for the impact of model acceleration techniques.

5. Conclusions

This paper proposes a global motion compensation algorithm for video object detection based on the affine inverse transformation model. The algorithm utilizes SURF in combination with the MSAC algorithm to obtain global motion parameters and further fits an affine transformation model. It estimates the background motion of the entire video sequence. Based on this, an inverse transformation model of affine transformation between adjacent frames is proposed. It is applied to globally compensate for the motion in the entire video sequence. The compensated frames are then outputted in a view created based on the first frame of the video sequence as a reference, completing the transformation from dynamic background to static background.
The effectiveness of the proposed algorithm is further demonstrated by evaluating the detected motion objects using frame differencing between adjacent frames and comparing the peak signal-to-noise ratio (PSNR) among different algorithms. However, due to the complexity and non-periodicity of background motion, the proposed algorithm may still introduce some noise in the detected objects. Further optimization and improvements are needed to address this issue.
The video processing results can be viewed at https://youtu.be/hMFII2rpc4s (accessed on 5 May 2023) and https://youtu.be/NAPBZkvBahg (accessed on 5 May 2023).

Author Contributions

Conceptualization, W.L., N.Z. and X.X.; methodology, W.L.; software, N.Z. and X.X.; validation, N.Z. and X.X.; formal analysis, N.Z.; investigation, N.Z.; resources, N.Z.; data curation, N.Z.; writing—original draft preparation, N.Z.; writing—review and editing, W.L.; visualization, W.L.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the NSFC (62376147), and Shaanxi province key research and development program (2021GY-087).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data openly available in a public repository. https://youtu.be/hMFII2rpc4s, accessed on 5 May 2023 and https://youtu.be/NAPBZkvBahg, accessed on 5 May 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shao, J.; Du, B.; Wu, C.; Gong, M.; Liu, T. Hrsiam: High-resolution siamese network, towards space-borne satellite video tracking. IEEE Trans. Image Process. 2021, 30, 3056–3068. [Google Scholar] [CrossRef] [PubMed]
  2. Schmid, J.F.; Simon, S.F.; Mester, R. Features for Ground Texture Based Localization—A Survey. arXiv 2020, arXiv:2002.11948. [Google Scholar]
  3. Uzair, M.; Brinkworth, R.S.; Finn, A. Bio-inspired video enhancement for small moving target detection. IEEE Trans. Image Process. 2020, 30, 1232–1244. [Google Scholar] [CrossRef] [PubMed]
  4. Jiang, Z.; Huynh, D.Q. Multiple pedestrian tracking from monocular videos in an interacting multiple model framework. IEEE Trans. Image Process. 2017, 27, 1361–1375. [Google Scholar] [CrossRef] [PubMed]
  5. Jardim, E.; Thomaz, L.A.; da Silva, E.A.; Netto, S.L. Domain-transformable sparse representation for anomaly detection in moving-camera videos. IEEE Trans. Image Process. 2019, 29, 1329–1343. [Google Scholar] [CrossRef] [PubMed]
  6. Feng, Z.; Zhu, X.; Xu, L.; Liu, Y. Research on human target detection and tracking based on artificial intelligence vision. In Proceedings of the 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2021; pp. 1051–1054. [Google Scholar]
  7. Qin, L.; Liu, Z. Body Motion Detection Technology in Video. In Proceedings of the 2021 3rd International Conference on Robotics and Computer Vision (ICRCV), Beijing, China, 6–8 August 2021; pp. 7–11. [Google Scholar]
  8. Kong, K.; Shin, S.; Lee, J.; Song, W.J. How to estimate global motion non-Iteratively from a coarsely sampled motion vector field. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3729–3742. [Google Scholar] [CrossRef]
  9. Mohan, M.M.; Nithin, G.; Rajagopalan, A. Deep dynamic scene deblurring for unconstrained dual-lens cameras. IEEE Trans. Image Process. 2021, 30, 4479–4491. [Google Scholar] [CrossRef]
  10. Zhuo, T.; Cheng, Z.; Zhang, P.; Wong, Y.; Kankanhalli, M. Unsupervised online video object segmentation with motion property understanding. IEEE Trans. Image Process. 2019, 29, 237–249. [Google Scholar] [CrossRef]
  11. Luo, X.; Jia, K.; Liu, P.; Xiong, D.; Tian, X. Improved Three-Frame-Difference Algorithm for Infrared Moving Target. In Proceedings of the 2020 IEEE 5th International Conference on Image, Vision and Computing (ICIVC), Beijing, China, 10–12 July 2020; pp. 108–112. [Google Scholar]
  12. Zhao, C.; Basu, A. Dynamic deep pixel distribution learning for background subtraction. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4192–4206. [Google Scholar] [CrossRef]
  13. Zhang, H.; Liu, Z. Moving target shadow detection based on deep learning in video SAR. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4155–4158. [Google Scholar]
  14. Zhao, Y.; Zhao, J.; Li, J.; Chen, X. RGB-D salient object detection with ubiquitous target awareness. IEEE Trans. Image Process. 2021, 30, 7717–7731. [Google Scholar] [CrossRef]
  15. Wang, Z.; Wang, S.; Zhang, X.; Wang, S.; Ma, S. Three-zone segmentation-based motion compensation for video compression. IEEE Trans. Image Process. 2019, 28, 5091–5104. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, K.; Jiang, X.; Sun, T. Anomaly detection based on stacked sparse coding with intraframe classification strategy. IEEE Trans. Multimed. 2018, 20, 1062–1074. [Google Scholar] [CrossRef]
  17. Liu, H.; Hua, G.; Huang, W. Motion Rectification Network for Unsupervised Learning of Monocular Depth and Camera Motion. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 2805–2809. [Google Scholar]
  18. Matsushita, Y.; Yamaguchi, T.; Harada, H. Object tracking using virtual particles driven by optical flow and Kalman filter. In Proceedings of the 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 15–18 October 2019; pp. 1064–1069. [Google Scholar]
  19. Meng, Z.; Kong, X.; Meng, L.; Tomiyama, H. Lucas-Kanade Optical Flow Based Camera Motion Estimation Approach. In Proceedings of the 2019 International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 6–9 October 2019; pp. 77–78. [Google Scholar]
  20. Golestani, H.B.; Sauer, J.; Rohlfing, C.; Ohm, J.R. 3D Geometry-Based Global Motion Compensation For VVC. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2054–2058. [Google Scholar]
  21. Talukdar, A.K.; Bhuyan, M. A Novel Global Motion Estimation and Compensation Framework in Compressed Domain for Sign Language Videos. In Proceedings of the 2020 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), Chennai, India, 4–6 August 2020; pp. 20–24. [Google Scholar]
  22. Hong-Phuoc, T.; Guan, L. A novel key-point detector based on sparse coding. IEEE Trans. Image Process. 2019, 29, 747–756. [Google Scholar] [CrossRef] [PubMed]
  23. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  24. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  25. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  26. Rodríguez, M.; Facciolo, G.; von Gioi, R.G.; Musé, P.; Morel, J.M.; Delon, J. Sift-aid: Boosting sift with an affine invariant descriptor based on convolutional neural networks. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4225–4229. [Google Scholar]
  27. Medley, D.O.; Santiago, C.; Nascimento, J.C. Deep active shape model for robust object fitting. IEEE Trans. Image Process. 2019, 29, 2380–2394. [Google Scholar] [CrossRef]
  28. Rosten, E.; Porter, R.; Drummond, T. Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 32, 105–119. [Google Scholar] [CrossRef]
  29. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. Brief: Binary robust independent elementary features. In Proceedings of the Computer Vision—ECCV 2010: 11th European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792. [Google Scholar]
  30. Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; Citeseer: Princeton, NJ, USA, 1988; Volume 15, pp. 147–152. [Google Scholar]
  31. Lv, H.; Zhang, H.; Zhao, C.; Liu, C.; Qi, F.; Zhang, Z. An Improved SURF in Image Mosaic Based on Deep Learning. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 223–226. [Google Scholar] [CrossRef]
  32. Torr, P.H.; Murray, D.W. The development and comparison of robust methods for estimating the fundamental matrix. Int. J. Comput. Vis. 1997, 24, 271–300. [Google Scholar] [CrossRef]
  33. Yang, J.; Lu, Z.; Tang, Y.Y.; Yuan, Z.; Chen, Y. Quasi Fourier-Mellin transform for affine invariant features. IEEE Trans. Image Process. 2020, 29, 4114–4129. [Google Scholar] [CrossRef]
  34. Ho, M.M.; Zhou, J.; He, G.; Li, M.; Li, L. SR-CL-DMC: P-frame coding with super-resolution, color learning, and deep motion compensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 124–125. [Google Scholar]
  35. Zhao, L.; Wang, S.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. Enhanced motion-compensated video coding with deep virtual reference frame generation. IEEE Trans. Image Process. 2019, 28, 4832–4844. [Google Scholar] [CrossRef]
  36. Li, B.; Han, J.; Xu, Y.; Rose, K. Optical flow based co-located reference frame for video compression. IEEE Trans. Image Process. 2020, 29, 8303–8315. [Google Scholar] [CrossRef] [PubMed]
  37. Liu, H.; Lu, M.; Ma, Z.; Wang, F.; Xie, Z.; Cao, X.; Wang, Y. Neural video coding using multiscale motion compensation and spatiotemporal context model. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3182–3196. [Google Scholar] [CrossRef]
  38. Liu, X.; Kong, L.; Zhou, Y.; Zhao, J.; Chen, J. End-to-end trainable video super-resolution based on a new mechanism for implicit motion estimation and compensation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 2416–2425. [Google Scholar]
  39. Son, H.; Lee, J.; Cho, S.; Lee, S. Real-Time Video Deblurring via Lightweight Motion Compensation. Comput. Graph. Forum 2022, 41, 177–188. [Google Scholar] [CrossRef]
  40. Zuo, F.; de With, P.H. Fast facial feature extraction using a deformable shape model with haar-wavelet based local texture attributes. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; Volume 3, pp. 1425–1428. [Google Scholar]
  41. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of non-extremum suppression.
Figure 2. Feature point descriptor.
Figure 3. Image coordinate system and imaging plane coordinate system.
Figure 4. Output view.
Figure 5. Matching results of $V_P$ key points and inliers. (a) $V_P$ key point matching; (b) $V_P$ inlier point matching.
Figure 6. Matching results of $V_Q$ key points and inliers. (a) $V_Q$ key point matching; (b) $V_Q$ inlier point matching.
Figure 7. Variation of the background motion parameters. (a) $V_P$ scaling parameters; (b) $V_P$ rotation parameters; (c) $V_P$ translation parameters; (d) $V_Q$ scaling parameters; (e) $V_Q$ rotation parameters; (f) $V_Q$ translation parameters.
Figure 8. Three frames of $V_P$. (a) frame 1; (b) frame 125; (c) frame 126.
Figure 9. Three frames of $V_Q$. (a) frame 1; (b) frame 125; (c) frame 126.
Figure 10. Comparison of $V_P$ sequence image detection results. (a) frame 223; (b) frame 224; (c) direct frame difference; (d) traditional algorithm; (e) our algorithm.
Figure 11. The results of the three algorithms. (a) frame 134; (b) frame 135; (c) direct frame difference; (d) traditional algorithm; (e) our algorithm.
Figure 12. PSNR value change chart. (a) change of PSNR value for video $V_P$; (b) change of PSNR value for video $V_Q$.
Figure 13. Runtime change chart. (a) change of runtime for video $V_P$; (b) change of runtime for video $V_Q$.
Table 1. Six parameter values for affine transformation between images.

Parameters      a_1        a_2        b_1        b_2        t_1         t_2
V_P             0.9953     0.9870     −0.0220    0.0023     −0.6703     54.1088
V_Q             0.9615     0.9935     0.0416     −0.0728    −63.5515    −61.2953
Table 2. The running time per image of the proposed algorithm compared with the deep learning algorithms on the CPU.

Model               HRSiam    RVDMC    Ours
CPU runtime (s)     0.48      2.3      0.35