#### *2.1. Higher Order Motion Prediction Models*

In all existing video encoding standards, translation-based prediction is supported. It can be defined mathematically by the following equation:

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} v\_x \\ v\_y \end{bmatrix} \tag{1}
$$

where (*x*, *y*) are the coordinates of a point in the current picture, (*x'*, *y'*) the coordinates of the corresponding point in the reference picture, and (*vx*, *vy*) the motion vector.

Higher order motion prediction models are models that use more than two parameters to represent motion. While it is possible to define motion models with an arbitrarily large number of parameters, in practice two models have been used the most: the affine motion model, which uses six parameters, defined by Equation (2), and the zoom and rotation model, which uses four parameters, defined by Equation (3).

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} v\_x \\ v\_y \end{bmatrix} \tag{2}
$$

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ -b & a \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} v\_x \\ v\_y \end{bmatrix} \tag{3}
$$

In these equations, *a*, *b*, *c*, and *d* are the affine motion parameters (only *a* and *b* for the four-parameter model), while *vx* and *vy* are the translational motion parameters. Comparing with Equation (1), we can see that the models are very similar, with two or four additional parameters.
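
As an illustration, the following sketch (a minimal Python/NumPy example, not taken from any of the cited implementations) evaluates Equation (2) or (3) for every sample of a block, producing the position-dependent motion vector discussed later in Section 2.2.

```python
import numpy as np

def affine_motion_field(width, height, A, vx, vy):
    """Per-sample displacement for a block, given a 2x2 parameter matrix A
    (Equation (2); for the zoom-and-rotation model of Equation (3),
    A = [[a, b], [-b, a]]) and the translation (vx, vy)."""
    ys, xs = np.mgrid[0:height, 0:width]          # sample coordinates (x, y)
    x_ref = A[0, 0] * xs + A[0, 1] * ys + vx      # x' in the reference picture
    y_ref = A[1, 0] * xs + A[1, 1] * ys + vy      # y' in the reference picture
    return x_ref - xs, y_ref - ys                 # motion vector per sample

# Example: a slight zoom (s = 0.01) combined with a small rotation (r = 0.02)
a, b = 1.01, 0.02
dx, dy = affine_motion_field(16, 16, np.array([[a, b], [-b, a]]), vx=1.5, vy=-0.25)
```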

Tsutake et al. [10] proposed replacing the zoom and rotation model with two 3-parameter models for affine motion compensation, defined as follows:

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1+s & 0 \\ 0 & 1+s \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} v\_x \\ v\_y \end{bmatrix} \tag{4}
$$

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & -r \\ r & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} v\_x \\ v\_y \end{bmatrix} \tag{5}
$$

The two 3-parameter models are simplifications of the 4-parameter zoom and rotation model. The first model, described in Equation (4), sets *b* to 0 and *a* to 1 + *s*, so that a value of 0 for *s* represents a pure translation. The second model, described in Equation (5), sets *a* to 1 and *b* to −*r*, so that a value of 0 for *r* represents a pure translation.

Because the movement is often either a zoom or a rotation rather than a combination of both, one of the two affine parameters is usually much smaller than the other. In this case, reducing the number of parameters reduces the coding cost of the prediction without losing much accuracy.

Using this dual-model option gives good efficiency, but it requires running the parameter estimation process twice.

#### *2.2. Transform Computation*

As the previous equations show, higher-order motion models produce a motion vector that depends on the position within the block. While implementations are very efficient at computing predictions with a constant motion vector (especially for integer motion vectors, which amount to a simple copy), they are not designed for a motion vector that changes from sample to sample.

In the affine motion compensation proposed in JEM [7], this problem is avoided by using constant motion vectors for blocks of 4 × 4 samples. However, this also means it is not true affine motion compensation.
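
The following sketch (illustrative only, not the JEM source code) shows this simplification: a single motion vector is derived at the centre of each 4 × 4 sub-block, so that within a sub-block the prediction reduces to ordinary translational compensation.

```python
import numpy as np

def subblock_motion_vectors(block_w, block_h, A, vx, vy, sub=4):
    """One motion vector per sub x sub sub-block, evaluated at the sub-block
    centre, approximating the position-dependent affine field of Equation (2)."""
    mvs = np.zeros((block_h // sub, block_w // sub, 2))
    for j in range(block_h // sub):
        for i in range(block_w // sub):
            cx = i * sub + (sub - 1) / 2.0       # sub-block centre, x
            cy = j * sub + (sub - 1) / 2.0       # sub-block centre, y
            x_ref = A[0, 0] * cx + A[0, 1] * cy + vx
            y_ref = A[1, 0] * cx + A[1, 1] * cy + vy
            mvs[j, i] = (x_ref - cx, y_ref - cy)  # constant MV for this sub-block
    return mvs
```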

In References [8,9], the authors suggest performing 1/16*th*-sample interpolation using an eight-tap filter. While this interpolation is quite slow, the gradient method converges quickly towards the optimal value, so it does not add too much additional burden to the encoder.

In Reference [10], because the method requires evaluating more transforms, a faster interpolation is used: the quarter-sample interpolation from HEVC, followed by bilinear interpolation between the four surrounding samples. To avoid computing the interpolation many times, the interpolated samples are stored in a buffer for each reference picture.
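
A minimal sketch of this two-stage lookup, assuming the quarter-sample interpolated reference is already stored in a buffer `ref_q` (four times the picture resolution in each dimension); the names and the absence of border handling are simplifications, not taken from Reference [10].

```python
import numpy as np

def sample_reference(ref_q, x, y):
    """Fetch ref(x, y) at a fractional position (in luma-sample units) from a
    quarter-sample buffer by bilinear interpolation of its 4 nearest entries."""
    qx, qy = x * 4.0, y * 4.0                     # position in quarter-sample units
    x0, y0 = int(np.floor(qx)), int(np.floor(qy))
    fx, fy = qx - x0, qy - y0                     # fractional part within the cell
    p00 = ref_q[y0, x0]
    p01 = ref_q[y0, x0 + 1]
    p10 = ref_q[y0 + 1, x0]
    p11 = ref_q[y0 + 1, x0 + 1]
    top = p00 * (1 - fx) + p01 * fx
    bot = p10 * (1 - fx) + p11 * fx
    return top * (1 - fy) + bot * fy
```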

#### *2.3. Gradient-Based Parameter Estimation*

In References [5,7–9], a gradient method is used to estimate the affine motion parameters. It is based on the Newton–Raphson method, an iterative method for finding the root of a function. The general form is given by the following equation:

$$\mathbf{x}\_1 = \mathbf{x}\_0 - \frac{f(\mathbf{x}\_0)}{f'(\mathbf{x}\_0)} \tag{6}$$

It is possible to generalize this equation to multi-dimensional problems. With affine motion compensation, we have the following error function:

$$E = \sum\_{(x, y)} \left(\operatorname{org}(x, y) - \operatorname{ref}(x', y')\right)^2 \tag{7}$$

where *org*(*x*, *y*) refers to the original value of the sample at coordinates (*x*, *y*) in the current picture, and *ref*(*x'*, *y'*) refers to the sample value at coordinates (*x'*, *y'*) in the reference picture.
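
As a rough illustration of one iteration of such a gradient method, the sketch below performs a Gauss–Newton style update of the 4-parameter model of Equation (3), minimizing the error of Equation (7). It is a minimal sketch only, assuming hypothetical `ref_sampler` and `grad_sampler` helpers that return the interpolated reference sample and its spatial gradient at fractional positions; the actual formulations in [5,7–9] differ in their details.

```python
import numpy as np

def gauss_newton_step(org, ref_sampler, grad_sampler, params, coords):
    """One update of (a, b, vx, vy) for the zoom-and-rotation model."""
    a, b, vx, vy = params
    JtJ = np.zeros((4, 4))
    Jtr = np.zeros(4)
    for (x, y) in coords:                      # integer sample positions of the block
        xr = a * x + b * y + vx                # x' in the reference picture
        yr = -b * x + a * y + vy               # y' in the reference picture
        gx, gy = grad_sampler(xr, yr)          # spatial gradient of the reference
        r = org[y, x] - ref_sampler(xr, yr)    # prediction residual at this sample
        J = np.array([gx * x + gy * y,         # d(prediction)/da
                      gx * y - gy * x,         # d(prediction)/db
                      gx,                      # d(prediction)/dvx
                      gy])                     # d(prediction)/dvy
        JtJ += np.outer(J, J)
        Jtr += J * r
    return params + np.linalg.solve(JtJ, Jtr)  # updated (a, b, vx, vy)
```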

#### *2.4. Block-Matching-Based Estimation*

Reference [10] uses a different method from the others to find the affine parameters. Because it uses fewer parameters, the complexity increase is lower. However, even with only three parameters, a search around the neighbors using a standard diamond or square pattern grows from 8 transform computations to 26, and the affine prediction itself is also more costly to compute.

Their idea is to decouple the search for the parameters. As with other methods, they start with an initial estimation based on the classical translation-based motion estimation. Then, they try values across the entire search range with a step size of 4Δ, where Δ represents the quantization step of the affine parameter, and keep the best value found for the next iterations. The first iteration checks the neighbors at a distance of 2Δ, the second at a distance of Δ. This yields the best affine parameter for the given translation parameters. However, the best translation parameters might be different when affine prediction is used, so a second step, the parameter refinement, is performed.
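
This coarse-to-fine search can be sketched as follows (a simplified illustration of the idea in Reference [10], not their implementation; `cost` is an assumed callback returning the rate-distortion cost of a given affine parameter value with the translation parameters held fixed, and `search_range` is expressed in quantized levels):

```python
def coarse_to_fine_search(cost, search_range, delta):
    """Find the best quantized affine parameter for fixed translation parameters.
    First scan the whole range with a step of 4*delta, then refine the best
    candidate by checking its neighbours at 2*delta and finally at delta."""
    # Exhaustive scan of the search range with a coarse step of 4*delta
    candidates = range(-search_range, search_range + 1, 4)
    best = min(candidates, key=lambda q: cost(q * delta))
    # Two refinement passes with steps of 2*delta and delta
    for step in (2, 1):
        neighbours = (best - step, best, best + step)
        best = min(neighbours, key=lambda q: cost(q * delta))
    return best * delta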

The parameter refinement works by alternating translation parameter refinement and affine parameter refinement. In both cases, the encoder looks for the closest neighbors, at a quarter-sample distance for the translation and Δ for the affine parameter. The refinement stops when either a maximum number of iterations is reached or no further improvement is found.
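
A matching sketch of the refinement stage (again illustrative only; `best_translation_neighbour` and `best_affine_neighbour` are hypothetical helpers returning the lowest-cost candidate among the closest neighbours, at quarter-sample distance for the translation and Δ for the affine parameter):

```python
def refine(params, cost, best_translation_neighbour, best_affine_neighbour,
           max_iterations=8):
    """Alternate translation and affine refinement until no candidate improves
    the cost or the iteration budget is exhausted."""
    best_cost = cost(params)
    for _ in range(max_iterations):
        candidate = best_translation_neighbour(params)   # +/- 1/4-sample moves
        candidate = best_affine_neighbour(candidate)     # +/- delta on the affine parameter
        candidate_cost = cost(candidate)
        if candidate_cost >= best_cost:                  # no further improvement
            break
        params, best_cost = candidate, candidate_cost
    return params
```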

#### *2.5. Motion Parameter Prediction and Entropy Coding*

To achieve optimal efficiency when using affine motion prediction, it is important to signal the affine motion parameters with as few bits as possible. Every method uses the same coding as HEVC for the translational parameters, making full use of the motion vector prediction coding.

Reference [9] improves the translational motion vector coding by estimating the change in the translation parameter between blocks. Block-to-block translational shift compensation (BBTSC) corrects the translational shift, allowing merge mode to be used much more often as there is no need to signal the motion vector difference. This results in an improvement of 6% in the tested sequences.

Coding the affine motion parameters is harder, as they are more difficult to predict from neighboring blocks. The first limitation is that not all blocks use affine prediction, so it is often necessary to code the parameters without a prediction. Even when a neighbor does use affine prediction, it may use a different reference picture, and scaling the motion parameters is challenging: simply multiplying every value by the temporal distance ratio does not work. Reference [11] tackles this problem by allowing motion scaling to work on affine parameters. They propose decomposing the transform into separate transforms, for example a rotation and a zoom operation, scaling each matrix appropriately, and then combining them again to obtain the new parameters.
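
One plausible reading of this decomposition, sketched here only as an illustration for the 4-parameter model (Reference [11] may define the scaling differently): split the 2 × 2 matrix of Equation (3) into a zoom factor and a rotation angle, scale each with the temporal distance ratio, and recompose the parameters.

```python
import math

def scale_affine(a, b, ratio):
    """Scale the zoom-and-rotation part of Equation (3) by a temporal distance
    ratio: the rotation angle is scaled linearly and the zoom factor
    exponentially, then the entries (a, b) are rebuilt (assumed behaviour)."""
    zoom = math.hypot(a, b)          # magnitude of the zoom component
    angle = math.atan2(b, a)         # rotation angle
    zoom_s = zoom ** ratio           # ratio successive applications of the zoom
    angle_s = angle * ratio          # ratio successive applications of the rotation
    return zoom_s * math.cos(angle_s), zoom_s * math.sin(angle_s)
```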

For the quantization, the most common choice, used in References [3,8,9], is a quantization step of 1/512. Reference [10] evaluates different quantization step sizes, from 1/16 to 1/512. They find that using such a fine quantization step gives no coding efficiency benefit, and that 1/256 is enough to obtain the best efficiency. As their method is a semi-exhaustive search, reducing the number of possible values is also good for encoding speed. They also choose to limit the maximum quantized parameter to 16, as higher values are seldom used.
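
For illustration, quantizing an affine parameter with the 1/256 step and the range limit of 16 from Reference [10] could look like the following sketch (a hypothetical helper, not their code):

```python
def quantize_affine(value, step=1.0 / 256.0, max_level=16):
    """Quantize an affine parameter (s or r) to an integer level and clamp it."""
    level = int(round(value / step))
    level = max(-max_level, min(max_level, level))   # limit to +/- 16 levels
    return level, level * step                       # (coded level, reconstructed value)
```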

#### *2.6. Optical Flow*

Estimating the movement between two pictures has been a subject of research for a long time, as it has numerous applications. For video coding, it is necessary for finding motion vectors, and is often done through computationally expensive methods that check the error for each possible motion vector, with more recent methods improving the search algorithms to keep the encoding time reasonable. In those cases, only the cost for the whole block is considered, so the movement estimation is often not accurate at a more granular level.

However, in many applications, the movement of each pixel is desired. This is typically referred to as optical flow. One of the most famous and popular methods for estimating optical flow is the Lucas-Kanade method [12]. It is widely used and gives satisfactory results for simple movements. It is quite fast, which is one of the reasons for its popularity. Because it is included in the OpenCV library, it is also very easy to use, whereas many methods do not release their code, which adds the burden of implementation to potential users.
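
For example, the pyramidal Lucas-Kanade implementation in OpenCV can be called as sketched below (file names and parameter values are placeholders):

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Points to track, e.g. good features detected in the previous frame
pts = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)

# Sparse pyramidal Lucas-Kanade: returns the tracked positions and a status flag
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
flow = (next_pts - pts)[status.flatten() == 1]              # per-point motion vectors
```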

A recent application that also shows potential for video coding is frame interpolation: by computing the movement of each pixel between two frames, it is possible to estimate the missing frame with remarkable accuracy, as demonstrated by EpicFlow [13]. To speed up the process, the motion vectors used for encoding the frames can serve as estimators of the motion for a given block, with the optical flow then refined to the pixel level, as proposed in HEVC-Epic [14]. This offers a good increase in speed compared to EpicFlow, but is still very slow, taking several seconds to estimate a single frame.

While the computed interpolated frame could be used in encoding as a new kind of prediction, it would make decoding too slow. Decoding needs to be possible on inexpensive hardware to see any large-scale adoption.

While state-of-the-art optical flow methods achieve impressive accuracy, this comes at the cost of increased computation, and for some methods the required time varies with the content of the picture. When considering hardware implementations and real-time constraints, as is the case in encoding, it is important to ensure that the computations are always bounded, to avoid the need for additional circuitry that would be used only in a few cases. In this paper, the optical flow method from Ce Liu [15] was chosen because its computation cost depends solely on the size of the input picture and on the parameters controlling the number of iterations.

It also offers attractive properties for hardware implementation, as all the operations are highly parallel in nature. While the software implementation is not parallelized, it would be relatively easy to improve its speed.

#### **3. Proposed Method**

#### *3.1. Optical Flow Estimation*

For each picture using inter-picture prediction, optical flow is computed between the current picture and the first picture in the reference picture list. While computing it for every picture in the reference picture list leads to better approximations, the required time is much higher, and the proposed method aims to provide good encoding efficiency with faster encoding than similar methods. For the reference picture, the original picture before encoding is used rather than its reconstruction. This offers two advantages: first, it allows the optical flow to be computed before the picture is encoded, and second, the motion estimation is more accurate and follows the real movement better, especially when the quantization parameter is large and the reconstructed picture is of lower quality.

After obtaining an approximate displacement for each pixel in the current picture, the parameter estimation is performed for each CTU. As in Reference [3], using smaller blocks improves the encoding efficiency only slightly, but would require much more time. The estimation is based on solving the linear equations of the 4-parameter model transform with two points in the block.

As the translation parameters can be more accurately estimated with the standard motion estimation technique, only the parameters *a* and *b* are considered. Using *x* and *y* as the distances between the input points and *x'* and *y'* as the distances between the output points, we can estimate *a* and *b* with the following equation:

$$\begin{aligned} a &= 1 + s = \frac{xx' + yy'}{x^2 + y^2} \\ b &= -r = \frac{x'y - xy'}{x^2 + y^2} \end{aligned} \tag{8}$$

To get good results, the points should be far enough apart, so points around the edge of the current block are used. If the points are too close together, cancellation is likely to occur, as the subpixel motion estimation through optical flow is imprecise. To reduce the risk of bad estimations from outliers, the values of *a* and *b* are estimated for multiple pairs of points, and the median value is retained. When the block lies on the edge of the picture and contains pixels outside the reconstructed picture, we cannot compute optical flow on these samples. This happens when the input size is not a multiple of the largest coding block size. In this case, we use samples that are within the reconstructed picture for the computations.
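
A minimal sketch of this per-CTU estimation, assuming `flow[y, x]` holds the per-pixel optical flow displacement (dx, dy) and `points` is a list of (x, y) positions near the CTU border; the point pairing and border handling are simplified compared to the actual method.

```python
import numpy as np

def estimate_ab(points, flow):
    """Estimate the affine parameters a and b of Equation (3) from optical flow,
    applying Equation (8) to pairs of points and keeping the median values."""
    a_vals, b_vals = [], []
    half = len(points) // 2
    # Pair each point with one roughly on the opposite side of the block
    for (x1, y1), (x2, y2) in zip(points[:half], points[half:]):
        # Distances between the input points and between their mapped positions
        x, y = x2 - x1, y2 - y1
        x_out = x + flow[y2, x2][0] - flow[y1, x1][0]
        y_out = y + flow[y2, x2][1] - flow[y1, x1][1]
        denom = x * x + y * y
        a_vals.append((x * x_out + y * y_out) / denom)   # a = 1 + s
        b_vals.append((x_out * y - x * y_out) / denom)   # b = -r
    return np.median(a_vals), np.median(b_vals)
```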
