Article

Fast Gaussian Filter Approximations Comparison on SIMD Computing Platforms

by Ekaterina O. Rybakova 1,2,*, Elena E. Limonova 1,3 and Dmitry P. Nikolaev 1,3
1 Smart Engines Service LLC, 117312 Moscow, Russia
2 Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119991 Moscow, Russia
3 Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, 119333 Moscow, Russia
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4664; https://doi.org/10.3390/app14114664
Submission received: 18 April 2024 / Revised: 26 May 2024 / Accepted: 27 May 2024 / Published: 29 May 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Gaussian filtering, being a convolution with a Gaussian kernel, is a widespread technique in image analysis and computer vision applications. It is the traditional approach for noise reduction. In some cases, performing the exact convolution can be computationally expensive and time-consuming. To address this problem, approximations of the convolution are often used to achieve a balance between accuracy and computational efficiency, such as running sums, Bell blur, the Deriche approximation, etc. At the same time, modern computing devices support data parallelism (vectorization) via Single Instruction Multiple Data (SIMD) and can process integer numbers faster than floating-point ones. In this paper, we describe several methods for approximating a Gaussian filter, implement their SIMD and quantized versions, and compare them in terms of speed and accuracy. The experiments were performed on central processing units with the x86_64 architecture using the SSE family of SIMD extensions and the ARMv8 architecture using the NEON SIMD extension. All the optimized approximations demonstrated a 10–20× speedup while keeping the approximation error on the order of 1 × 10⁻⁵ or lower. The fastest method is the trivial Stack blur, which has a relatively high error, so we recommend the second-order Vliet–Young–Verbeek filter and the quantized Bell blur and running sums as more accurate and still computationally efficient alternatives.

1. Introduction

Image processing is widely used in various fields, such as computer vision, medical imaging, and remote sensing. It includes various techniques for analyzing digital images to balance color, reduce image noise, and sharpen the image. A fundamental procedure here is Gaussian smoothing, which aims to enhance the image by applying a Gaussian filter. The Gaussian filter is a weighted moving average filter that provides more weight to the central pixel, gradually decreasing the weight as it moves away from the center. It creates a smoothing effect, blurring the image slightly. Gaussian smoothing is essential because of its ability to increase the image quality by reducing the high-frequency noise and removing unwanted artifacts. It can improve the visibility of important image features and make the subsequent image-processing tasks more accurate and reliable [1,2,3,4]. These properties make Gaussian smoothing indispensable for many applications, including neural network training [5].
However, Gaussian filtering comes at a computational cost that is negligible for single-filter applications but may be noticeable for multiple processing, especially on edge devices requiring real-time performance. Most methods (namely Finite Impulse Response filters; see Section 4.1) are based on convolving the image with some kernel, requiring numerous multiplications and additions for each pixel. The computational complexity of such an approach can depend on the size of the convolution kernel and the dimensions of the input image. Larger kernels and high-resolution images often require more processing power and can noticeably slow down the image processing algorithm.
Various optimizations can be employed to mitigate the computational cost. Advancements in hardware, such as the utilization of multi-core architectures [6] or exploiting GPU capabilities [7,8,9], producing high-performance implementations [10], have contributed to reducing the impact of the computational cost and improving the efficiency of Gaussian filtering.
Another approach for acceleration is to approximate the Gaussian filter with a simpler filter that provides a similar smoothing effect. Such methods significantly reduce the computational complexity compared to traditional Gaussian filtering, making it suitable for real-time image processing applications (for example, at the image preprocessing stage in unmanned aerial vehicle position detection [11]). The appeal of these approximations lies in their ability to strike a balance between computational efficiency and accuracy.
Many applications of Gaussian filtering use central processing units (CPUs) for data handling. Modern CPUs provide Single Instruction Multiple Data (SIMD) extensions, which allow them to process multiple data items simultaneously. This is accomplished by packing the data into vectors 128, 256, or 512 bits wide that are processed by specialized instructions (SSE, AVX, AVX-512, and AMX for x86/x64; NEON and SVE for ARM). This approach accelerates the computations by several times, which leads to a significant increase in performance. CPUs and SIMD extensions also process integer data more efficiently, so floating-point values are often quantized to achieve a computation speedup. However, quantization can affect the accuracy of the algorithm and requires careful examination.
The paper overviews several fast and computationally efficient methods that approximate Gaussian filtering and compares their complexity and precision. In the experimental part, we describe the floating-point and integer-quantized implementations of the considered approximations and examine their SIMD versions.
Our contributions are as follows:
  • We provide a compact but sufficient description of the seven methods proposed in the literature for approximating a Gaussian filter.
  • We estimate the accuracy using the same notation for all the methods and present their computational complexity calculated in terms of the amount of additive and multiplicative operations.
  • We compute and provide the parameters for lower-order Deriche and Vliet–Young–Verbeek filters in addition to those that had previously been considered in the original papers, following the authors’ methodologies.
  • We consider quantized versions for the described methods except for the Deriche and Vliet–Young–Verbeek filters and describe the approach to vectorizing all the methods using SIMD extensions.
  • We conduct experiments on CPUs with x86_64 and ARMv8 architectures and measure the processing time for the methods under consideration.
In Section 3, we describe the procedure of Gaussian filtering and its most basic optimizations. In Section 4, we provide a comparative overview of several methods proposed in the literature to approximate the Gaussian filter, and, in Section 5, we compare them in terms of speed and accuracy. Section 6 outlines our implementation of the considered methods. The results of the computational experiments are presented and discussed in Section 7. The paper is concluded in Section 8.

2. Related Work

The optimization of Gaussian filtering algorithms has a long history, starting with the appearance of the earliest techniques for applying blurring in computer graphics. Various approaches have been developed, including algorithmic improvements, numerous filter approximations, and the utilization of specialized hardware to speed up the blurring process.
In the early papers, the researchers proposed various algorithmic optimizations, like approximations to Gaussian filtering, that reduce the computational effort using fewer and simpler arithmetic operations. They considered Finite Impulse Response (FIR) [12,13,14] and Infinite Impulse Response (IIR) [15,16,17] digital filters. These approximations are described in detail below.
Gaussian blur optimizations are still developing and incorporate various programming strategies, such as parallel computing, special hardware use, efficient device-specific algorithms, and use scenarios. Currently, the single-threaded solutions cannot keep up with the performance and speed required for image processing techniques. Orders of magnitude speedup can be achieved with the help of high parallelism. Researchers have suggested parallel solutions for fast blurring on central processing units (CPUs) and graphics processing units (GPUs) [18,19].
Solutions supporting thread-level parallelism and a data-level approach utilizing specialized vector instructions have also been proposed [20,21]. Computer vision libraries are being developed to provide many practical high-performance algorithms for image processing. Some include optimized implementations of Gaussian blur using parallel solutions [22,23].
The data type also affects the efficiency of computation. For this reason, quantization is a widespread technique for reducing the runtime of various image processing methods [24,25,26] by translating the calculations from a floating-point data type to a fixed-point one stored in integers. For instance, regarding digital filtering, the authors in [27] proposed fast and accurate blurring by the Look-Up Table technique and using coefficients rounded to integers. Methods for digital image filtering using the residue number system were described in [28,29], where the idea was to perform all the computations on fixed-point numbers by replacing the division with the multiplication of all the fractions by a power of two and rounding to integers.
All these optimizations led to substantial improvements in the computational performance, making it a valuable tool for real-time image processing, computer vision, and graphics applications.

3. Gaussian Filtering

The Gaussian kernel, hereinafter denoted by g σ , is a continuous function defined by the following equation:
g_\sigma[m, n] = \frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2}\frac{m^2 + n^2}{\sigma^2}},   (1)
where σ is the standard deviation. It is important to note that the Gaussian kernel is always normalized.
The procedure of Gaussian filtering is based on applying the kernel defined in (1) to an image. It works by convolving the image with a Gaussian kernel, determining the weights with which pixels affect the filtering result.

Basic Optimizations

One can see that the values of the Gaussian kernel decrease as the argument moves away from zero, and the kernel tends to zero at infinity. However, an infinite-size mask cannot be used in computations. For very large positive or negative arguments, the values of the Gaussian kernel become so small that they can be ignored, and the computations can be restricted to a neighborhood of zero; such an optimization is called “truncation of the tails”. The three-sigma rule of thumb states that, if the mask is of size K × K, then K should roughly equal 2πσ ≈ 6σ. That captures about 99.73% of the weight of the Gaussian function. The convolution then takes the following form:
y[i, j] = \sum_{m=-M}^{M} \sum_{n=-M}^{M} g_\sigma[m, n]\, x[i - m, j - n],   (2)
where x [ i , j ] denotes the value of ( i , j ) -th pixel of image x, y [ i , j ] is the resulting value of ( i , j ) -th pixel of output image y, and K = 2 M + 1 . The illustration of how discrete convolution works is provided below (Figure 1).
Applying this operation to each pixel of image x, we obtain the values of the output image y, the convolution result. There is a problem related to boundary pixels: the closer a pixel is to the image boundary, the more of the mask lies outside the image. There are two standard solutions to this border issue. The first variant ignores border pixels and performs the convolution only for those pixels for which it is possible. The second one defines the missing neighborhood of a boundary pixel by some extrapolation: it can be set to constant values, mirrored, etc.
The important thing that makes the Gaussian mask attractive is the fact that it is separable. It is illustrated by the following expression:
y[i, j] = \frac{1}{2\pi\sigma^2} \sum_{m=-M}^{M} \sum_{n=-M}^{M} e^{-\frac{1}{2}\frac{m^2 + n^2}{\sigma^2}}\, x[i - m, j - n] = \sum_{m=-M}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\frac{m^2}{\sigma^2}} \cdot \sum_{n=-M}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\frac{n^2}{\sigma^2}}\, x[i - m, j - n] = \sum_{m=-M}^{M} g_\sigma[m] \cdot \sum_{n=-M}^{M} g_\sigma[n]\, x[i - m, j - n].   (3)
So, the convolution with a 2D Gaussian kernel can be decomposed into a pair of convolutions with 1D Gaussian kernels: one vertical and one horizontal. It is important to note that this decomposition is not itself an approximation and still provides an exact result.
The representation provided by (3) makes Gaussian filtering much cheaper computationally. If we do not use the Gaussian kernel’s separability, convolving a 2D image with a 2D kernel requires K² multiplications and K² − 1 additions per pixel. Convolution with a 1D kernel requires K multiplications and (K − 1) additions per pixel, and filtering with a separable 2D kernel only doubles the cost of the 1D case. So, utilizing the separability property, the computational cost is reduced to 2K multiplications and 2(K − 1) additions per pixel; i.e., the kernel separability alone reduces the computational complexity by an order (from quadratic to linear in K).
In the following, we will use the separability property throughout, so all further description is limited to the consideration of one-dimensional filters.
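To make the procedure concrete, below is a minimal C++ sketch of separable Gaussian filtering with a truncated, normalized 1D kernel. The function name, the clamped (edge-replication) border handling, and the row-major layout are our assumptions for illustration, not the implementation used in the experiments.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch: separable Gaussian filtering of a grayscale image stored
// row-major in `src` (height h, width w). Border pixels are clamped
// (edge replication), one of the border strategies mentioned above.
std::vector<float> gaussian_separable(const std::vector<float>& src,
                                      int h, int w, float sigma) {
    const int M = static_cast<int>(std::ceil(3.0f * sigma));  // half-width, K = 2M + 1
    std::vector<float> kernel(2 * M + 1);
    float norm = 0.0f;
    for (int m = -M; m <= M; ++m) {
        kernel[m + M] = std::exp(-0.5f * m * m / (sigma * sigma));
        norm += kernel[m + M];
    }
    for (float& k : kernel) k /= norm;  // normalize the truncated kernel

    auto clamp = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };

    std::vector<float> tmp(src.size()), dst(src.size());
    // Horizontal pass.
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j) {
            float acc = 0.0f;
            for (int m = -M; m <= M; ++m)
                acc += kernel[m + M] * src[i * w + clamp(j - m, 0, w - 1)];
            tmp[i * w + j] = acc;
        }
    // Vertical pass on the result of the horizontal one.
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j) {
            float acc = 0.0f;
            for (int m = -M; m <= M; ++m)
                acc += kernel[m + M] * tmp[clamp(i - m, 0, h - 1) * w + j];
            dst[i * w + j] = acc;
        }
    return dst;
}
```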

4. Approximations

There are two main types of approximations for Gaussian convolution filters: Finite Impulse Response (FIR) filters and Infinite Impulse Response (IIR) filters. Each of these types has distinct characteristics and trade-offs that affect their performance and implementation complexity. FIR filters are often desirable when exact precision is required as they provide a closer approximation to the Gaussian kernel. It makes them well-suited for applications that demand accurate filtering, such as image enhancement and feature extraction. IIR filters are more suitable for applications where computational efficiency is of primary concern. Their recursive nature allows for efficient computation, making IIR filters appealing for real-time applications or scenarios where processing power is limited. They find applications in real-time video processing, audio signal processing, and communication systems.
In this section, we describe several of the best-known FIR and IIR filters proposed over the last few decades.

4.1. FIR Approximations

The Finite Impulse Response (FIR) filters use a finite number of input samples to produce each output sample. In the context of our paper, FIR filters use an approximated version of the truncated Gaussian function as the impulse response and can be expressed in the following form:
y[i] = \sum_{n=i-M}^{i+M} h[i - n]\, x[n], \quad i = 0, \ldots, N,   (4)
where 2 M + 1 is the size of the mask, and N hereinafter is the length of the signal. So, in a general case, exploiting symmetry in the filter h, FIR approximation to Gaussian filtering requires roughly 1 + M multiplications and 2 M additions per output sample. Researchers suggested different methods to reduce the computational complexity by specifying a special kind of h. Below, we describe some of the approximations presented in various sources.

4.1.1. DFT-Based Convolution

The discrete Fourier transform (DFT) is a powerful mathematical tool for computing the convolution of a 1D signal. A signal’s DFT represents it in the frequency domain, where each frequency component is quantified by its magnitude and phase, making it useful for analyzing and manipulating signals.
In mathematical language, the Fourier transform is expressed as follows:
\mathcal{F}(x)[k] = \sum_{n=0}^{N-1} x[n]\, e^{-\frac{2\pi i n k}{N}},   (5)
where N is the size of the input signal. Utilizing the convolution–multiplication property of the DFT, we can compute the convolution via a sequence of discrete Fourier transforms:
y = \mathcal{F}^{-1}\bigl( \mathcal{F}(g_\sigma) \cdot \mathcal{F}(x) \bigr).   (6)
So, to convolve the image with a Gaussian kernel, we can compute the DFT of both signals, perform the element-wise multiplication, and compute the inverse DFT of the result. Note that both signals must be of the same length; if this is not the case, the shorter one must be padded. Also, the DFT (or the inverse DFT) should be normalized.
The computational complexity of the DFT-based convolution (using a fast Fourier transform algorithm [30]) of a 1D signal is typically O ( N log N ) .
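As an illustration of the convolution–multiplication property (6), here is a self-contained C++ sketch that uses a naive O(N²) DFT. It is only a demonstration under the assumption that both inputs are already zero-padded to the same length; the experiments below rely on an FFT from the fftw3 library instead.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) DFT, used here only to demonstrate Equations (5) and (6);
// a practical implementation would use an FFT (the experiments use fftw3).
using cd = std::complex<double>;

std::vector<cd> dft(const std::vector<cd>& x, bool inverse) {
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, sign * 2.0 * pi * double(n) * double(k) / double(N));
    if (inverse)
        for (cd& v : X) v /= double(N);  // normalize the inverse transform
    return X;
}

// Circular convolution of two real signals of equal length N; adequate
// zero-padding is assumed so that the circular wrap-around does not matter.
std::vector<double> convolve_dft(const std::vector<double>& signal,
                                 const std::vector<double>& kernel) {
    const std::size_t N = signal.size();
    std::vector<cd> a(signal.begin(), signal.end()), b(kernel.begin(), kernel.end());
    std::vector<cd> A = dft(a, false), B = dft(b, false), C(N);
    for (std::size_t k = 0; k < N; ++k) C[k] = A[k] * B[k];  // element-wise product
    std::vector<cd> c = dft(C, true);
    std::vector<double> y(N);
    for (std::size_t k = 0; k < N; ++k) y[k] = c[k].real();
    return y;
}
```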

4.1.2. Stack Blur and Bell Blur

The Stack blur algorithm was proposed in [13] to approximate the Gaussian-blurred value of a pixel by a composition of 1D box blurs with increasing radius that simulates the effect of applying the Gaussian kernel. It uses a moving stack of pixels while scanning the image row. The expression of the Stack blur algorithm in one direction is defined as follows:
y_s[i] = \sum_{n=i-r}^{i+r} \bigl( r - |n - i| + 1 \bigr)\, x[n],   (7)
where r = r(σ) = Cσ is the radius of the box blur and C is some constant value. So, the 1D Gaussian kernel (1) is approximated by the piecewise linear function
S_\sigma[m] = r - |m| + 1 = \begin{cases} r - m + 1, & \text{if } m > 0, \\ r + m + 1, & \text{otherwise}. \end{cases}   (8)
The illustration is provided in Figure 2.
The idea of fast computation ensues from the following representations:
y_s[i] = y_s[i-1] - \underbrace{\bigl( x[i-r-1] + x[i-r] + \ldots + x[i-1] \bigr)}_{s_i^{out}} + \underbrace{\bigl( x[i] + x[i+1] + \ldots + x[i+r] \bigr)}_{s_i^{in}},   (9)
y_s[i+1] = \sum_{n=(i-r)+1}^{(i+r)+1} \bigl( r - |n - i - 1| + 1 \bigr)\, x[n] = y_s[i] - \underbrace{\bigl( x[i-r] + \ldots + x[i-1] + x[i] \bigr)}_{s_{i+1}^{out}} + \underbrace{\bigl( x[i+1] + \ldots + x[i+r] + x[i+r+1] \bigr)}_{s_{i+1}^{in}},   (10)
s_{i+1}^{out} = s_i^{out} - x[i-r-1] + x[i], \qquad s_{i+1}^{in} = s_i^{in} - x[i] + x[i+r+1].   (11)
Once we have calculated the value y_s[i] for the i-th element, the value y_s[i+1] for the (i+1)-th element, due to the form of the kernel, can be obtained in just r + 1 subtractions and r + 1 additions. Storing the values of s^{out} and s^{in}, we can obtain the result for the (i+1)-th element with just 3 additions and 3 subtractions. Therefore, not counting the initial element in a row (the value for which can be calculated with 2r + 1 multiplications and 2r additions), computing the Stack blur for a pixel requires 6 additive operations. Note also that the Stack blur result should be normalized by (r + 1)².
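A minimal C++ sketch of the 1D recurrences (9)–(11) is given below. It processes only the interior positions of a row (assuming the caller has padded the row, e.g., by edge replication) and normalizes by (r + 1)²; the function name and types are chosen only for illustration.

```cpp
#include <cstdlib>
#include <vector>

// Sketch of the 1D Stack blur recurrences (9)-(11) for a single row x.
// Only interior positions r + 1 <= i <= n - r - 1 are filled; border
// handling is left to the caller.
std::vector<float> stack_blur_row(const std::vector<int>& x, int r) {
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n, 0.0f);
    if (n < 2 * r + 2) return y;
    const float norm = static_cast<float>((r + 1) * (r + 1));

    const int i0 = r + 1;              // first position with a full neighborhood
    int acc = 0, s_out = 0, s_in = 0;  // unnormalized y_s[i0], s_{i0}^out, s_{i0}^in
    for (int m = -r; m <= r; ++m) acc += (r - std::abs(m) + 1) * x[i0 + m];
    for (int m = -r - 1; m <= -1; ++m) s_out += x[i0 + m];  // x[i0-r-1], ..., x[i0-1]
    for (int m = 0; m <= r; ++m) s_in += x[i0 + m];         // x[i0], ..., x[i0+r]
    y[i0] = acc / norm;

    // Recurrent part: 3 additions and 3 subtractions per output sample.
    for (int i = i0 + 1; i + r < n; ++i) {
        s_out += -x[i - r - 2] + x[i - 1];  // Equation (11) shifted to position i
        s_in  += -x[i - 1] + x[i + r];
        acc   += -s_out + s_in;             // Equation (9)
        y[i] = acc / norm;
    }
    return y;
}
```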
A further extension of the Stack blur idea was described in [14] as Quadratic Stack blur, or Bell blur. It operates as follows:
y_b[i+1] = y_b[i] - y_s[i-r] + y_s[i+r+1] = y_b[i] - \bigl( y_s[i-r-1] - s_{i-r}^{out} + s_{i-r}^{in} \bigr) + \bigl( y_s[i+r] - s_{i+r+1}^{out} + s_{i+r+1}^{in} \bigr).   (12)
Computing y_s[i-r] and y_s[i+r+1] according to (9) and (10) requires 6 subtractions and 6 additions in total. Computing the whole result for the (i+1)-th element requires 1 more subtraction and 1 more addition, so the total cost is 14 additive operations per element (except for initialization). The result should be normalized by (2r + 1)(r + 1)².
The general expression for Bell blur is more complicated:
y_b[i] = \sum_{n=i-2r}^{i+2r} \Biggl( \sum_{m=-r}^{r-|i-n|} \bigl( r - |m| + 1 \bigr) \Biggr) x[n],   (13)
where r = r(σ) = Cσ is the radius of the Stack blur and 2r is the radius of the Bell blur. The kernel (1) is approximated by a piecewise constant function with integer values
B_\sigma[n] = \sum_{m=-r}^{r-|n|} \bigl( r - |m| + 1 \bigr) = (r + 1)\bigl( 2r - |n| + 1 \bigr) - \sum_{m=-r}^{r-|n|} |m|.   (14)
The illustration of this approximation is also provided in Figure 2.
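Since, by (12), the Bell blur is a sliding box sum of radius r over the Stack blur result, a minimal C++ sketch under that reading can reuse the unnormalized Stack blur sums from the sketch above; the array name ys, the long long type, and the simplified border handling are assumptions for illustration.

```cpp
#include <vector>

// Sketch of the Bell blur via Equation (12): a sliding box sum of radius r
// over the unnormalized Stack blur sums ys (the acc values before division),
// normalized by (2r + 1)(r + 1)^2. Border positions are left as zero.
std::vector<float> bell_blur_row(const std::vector<long long>& ys, int r) {
    const int n = static_cast<int>(ys.size());
    std::vector<float> yb(n, 0.0f);
    if (n < 2 * r + 1) return yb;
    const float norm = static_cast<float>(2 * r + 1) * (r + 1) * (r + 1);

    long long acc = 0;
    for (int j = 0; j <= 2 * r; ++j) acc += ys[j];  // window centered at i = r
    yb[r] = acc / norm;
    for (int i = r + 1; i + r < n; ++i) {
        acc += ys[i + r] - ys[i - r - 1];           // one addition and one subtraction
        yb[i] = acc / norm;
    }
    return yb;
}
```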

4.1.3. Running Sums

The running sums method, proposed by Elboher and Werman in [12], is the approximation of Gaussian kernel based on the idea of computing a running sum of pixel values within a rectangular region. As in the case of Bell blur, running sums is a piecewise constant approximation but already with real rather than integer values. The fast computation uses the concept of integral images and cumulates the sum of pixel values along the image rows and columns.
The authors defined k constant functions with values c_i \in \mathbb{R}, i = 1, \ldots, k, and divided the segment [-3σ; 3σ] into nested segments defined by the partition indices [-p_i, p_i], i = 1, \ldots, k, so the expression of the proposed approximation is
y_{rs}[i] = \sum_{n=i-p_k}^{i+p_k} R[i - n]\, x[n],   (15)
where
R[n] = \begin{cases} c_1, & \text{if } n \in [-p_1; p_1], \\ c_2, & \text{if } n \in [-p_2; -p_1) \cup (p_1; p_2], \\ \;\vdots \\ c_k, & \text{if } n \in [-p_k; -p_{k-1}) \cup (p_{k-1}; p_k]. \end{cases}   (16)
The illustration is provided in Figure 3.
An approach called integral images is used for fast computation of (15). It is proposed to compute and store the cumulative sums preliminarily
I[i] = \sum_{n=0}^{i} x[n]   (17)
for all pixels in an image row so that
\sum_{n=i-p_k}^{i+p_k} x[n] = I[i + p_k] - I[i - p_k - 1].   (18)
Let us set c_{k+1} = 0 and denote w_i = c_i - c_{i+1} \in \mathbb{R}, i = 1, \ldots, k. Then, we can rewrite expression (15) as follows:
y_{rs}[i] = \sum_{n=1}^{k} w_n \bigl( I[i + p_n] - I[i - p_n - 1] \bigr),   (19)
which means that the sum is accumulated by areas of horizontal rectangles, not vertical, as in (15).
For a given k, the weights w_1, \ldots, w_k and partition indices p_1, \ldots, p_k are chosen to optimize the approximation of the Gaussian kernel. The authors provided the following values for the case of σ_0 = 100/π and k ∈ {3, 4, 5}:
k = 3: p_i = \{23, 46, 76\}, \quad w_i = \{0.9495, 0.5502, 0.1618\};
k = 4: p_i = \{19, 37, 56, 82\}, \quad w_i = \{0.9649, 0.6700, 0.3376, 0.0976\};
k = 5: p_i = \{16, 30, 44, 61, 85\}, \quad w_i = \{0.9738, 0.7596, 0.5031, 0.2534, 0.0739\};   (20)
and described how we can scale them for other σ values [12]:
p_i(\sigma) = \frac{\sigma}{\sigma_0} \cdot p_i, \qquad w_i(\sigma) = \frac{p_i}{2 p_i(\sigma) + 1} \cdot w_i.   (21)
The number of arithmetic operations depends on k. The method generally requires 2 k additions and k multiplications per image pixel. Note also that it requires extra memory. For storing the cumulative sum of a single row or column, the method requires O ( max { h , w } ) additional space over the input and the output images, where h and w are linear sizes of the image (height and width).
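A minimal C++ sketch of Equation (19) for one row is shown below. It assumes the caller passes partition indices and weights already rescaled for the target σ per (21), with any overall kernel normalization folded into the weights, and it skips border positions for brevity.

```cpp
#include <vector>

// Sketch of the running sums approximation (19) for one row:
// y[i] = sum_j w[j] * (I[i + p[j]] - I[i - p[j] - 1]),
// where I is the cumulative sum of Equation (17). The vectors p and w are
// assumed to be already rescaled for the target sigma.
std::vector<float> running_sums_row(const std::vector<unsigned char>& x,
                                    const std::vector<int>& p,
                                    const std::vector<float>& w) {
    const int n = static_cast<int>(x.size());
    const int k = static_cast<int>(p.size());
    std::vector<float> y(n, 0.0f);
    if (k == 0 || n == 0) return y;

    std::vector<unsigned int> I(n);               // cumulative sums, Equation (17)
    unsigned int run = 0;
    for (int i = 0; i < n; ++i) { run += x[i]; I[i] = run; }

    const int pk = p[k - 1];                      // the outermost partition index
    for (int i = pk + 1; i + pk < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)               // k multiplications, 2k additions
            acc += w[j] * static_cast<float>(I[i + p[j]] - I[i - p[j] - 1]);
        y[i] = acc;
    }
    return y;
}
```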

4.1.4. Piecewise Parabolic Approximation

This approach is based on the piecewise polynomial kernel approximation. The convolution can be decomposed into a sum of convolutions, each of which is different from zero only in a part of the window. Such a convolution is equivalent to a convolution with a reduced shifted window whose center no longer coincides with the target pixel. This method will be abbreviated below as PcParab.
Each polynomial can be represented as a sum of terms; i.e., the problem is reduced to calculating convolutions with a constant, x, x², etc. The important point is that the product of the number of segments in the piecewise approximation and the number of terms is not large. For approximating the Gaussian kernel, the idea is to divide the segment [-3σ; 3σ] into three non-intersecting sub-segments p_k, k = 1, 2, 3, such that ∪_k p_k = [-3σ; 3σ], and to approximate the part of the Gaussian kernel on each sub-segment by a parabola. The parabola can be described by a quadratic equation of the form y = ax² + bx + c with real coefficients a, b, and c. By choosing three different points on each sub-segment p_k, we can solve a system of three equations
a_k x_i^2 + b_k x_i + c_k = g_\sigma[x_i], \quad i = 1, 2, 3, \quad x_i \in p_k,   (22)
to determine the coefficients a k , b k , c k of the corresponding parabola. An example of the obtained approximation is provided in Figure 4.
In our experiments, we divided the segment at the Gaussian inflection points ± σ and took the ends and the middle of the sub-segment to construct the parabola. Alternatively, partition points can be found using the least squares method.
To compute the convolutions with the individual terms quickly, we use the following technique. Convolution (4) with the constant function h[n] ≡ 1 is defined as
y_C[i] = \sum_{j=i-M}^{i+M} x[j], \quad i = M, \ldots, N - M - 1.   (23)
Hence, we have
y_C[i+1] = y_C[i] - x[i-M] + x[i+M+1].   (24)
First, the value for the left-most pixel i = M is computed according to (23), and then, given (24), the value for each i = M + 1, \ldots, N - M - 1 is obtained with 1 addition and 1 subtraction.
For the convolution with h [ n ] = n , we can rewrite (4) as follows:
y_F[i+1] = y_F[i] - M \cdot \bigl( x[i-M] + x[i+M+1] \bigr) + \underbrace{\sum_{j=i-M+1}^{i+M} x[j]}_{y_C[i] - x[i-M]},   (25)
so the result is
y_F[i+1] = y_F[i] + y_C[i] - (M+1) \cdot x[i-M] - M \cdot x[i+M+1]   (26)
and can be easily obtained from the already computed values of y_C. All convolutions with n are computed in this way. Furthermore, to convolve with h[n] = n², we use the following representation:
y_{F^2}[i+1] = y_{F^2}[i] - M^2 \cdot \bigl( x[i-M] - x[i+M+1] \bigr) + 2 \underbrace{\sum_{j=i-M+1}^{i+M} (i - j)\, x[j]}_{y_F[i] - M \cdot x[i-M]} + \underbrace{\sum_{j=i-M+1}^{i+M} x[j]}_{y_C[i] - x[i-M]},   (27)
so the final expression comes out as follows:
y_{F^2}[i+1] = y_{F^2}[i] + 2 \cdot y_F[i] + y_C[i] - (M+1)^2 \cdot x[i-M] + M^2 \cdot x[i+M+1].   (28)
The full result of the convolution with a parabola h[n] = a_k n² + b_k n + c_k therefore becomes
y[i] = \sum_{j=i-M}^{i+M} \bigl( a_k (i-j)^2 + b_k (i-j) + c_k \bigr) x[j] = a_k \sum_{j=i-M}^{i+M} (i-j)^2\, x[j] + b_k \sum_{j=i-M}^{i+M} (i-j)\, x[j] + c_k \sum_{j=i-M}^{i+M} x[j] = a_k\, y_{F^2}[i] + b_k\, y_F[i] + c_k\, y_C[i],   (29)
which can be computed in 11 additions and 8 multiplications according to (24), (26), and (28). On each of three sub-segments p k , k = 1 , 2 , 3 , we approximate the Gaussian kernel by a fragment of a parabola, so each sub-segment corresponds to a limited range j in (29). Initialization requires 3 K additions and 2 K multiplications in total, where K = 2 M + 1 is the window size.
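As a simplified illustration of this machinery, the C++ sketch below convolves a row with a single quadratic kernel over the whole window [-M, M] using the running accumulators of (24), (26), and (28). The full PcParab method applies the same recurrences separately on each of the three sub-segments with its own parabola coefficients; the function name and border handling here are our own simplifications.

```cpp
#include <vector>

// Simplified sketch: 1D convolution with a single quadratic kernel
// h[m] = a*m^2 + b*m + c over the window [-M, M], computed with the running
// accumulators y_C, y_F and y_F2 of Equations (24), (26) and (28).
std::vector<double> quadratic_window_row(const std::vector<double>& x, int M,
                                         double a, double b, double c) {
    const int n = static_cast<int>(x.size());
    std::vector<double> y(n, 0.0);
    if (n < 2 * M + 2) return y;

    // Initialization at i = M by direct summation.
    double yC = 0.0, yF = 0.0, yF2 = 0.0;
    for (int j = -M; j <= M; ++j) {
        yC  += x[M + j];
        yF  += -j * x[M + j];                        // (i - j') with i = M
        yF2 += static_cast<double>(j) * j * x[M + j];
    }
    y[M] = a * yF2 + b * yF + c * yC;                // Equation (29)

    for (int i = M; i + M + 1 < n; ++i) {
        const double x_out = x[i - M], x_in = x[i + M + 1];
        // The order matters: (28) and (26) use the accumulators of step i.
        const double yF2_next = yF2 + 2.0 * yF + yC
                                - double(M + 1) * (M + 1) * x_out
                                + double(M) * M * x_in;               // (28)
        const double yF_next = yF + yC - (M + 1) * x_out - M * x_in;  // (26)
        const double yC_next = yC - x_out + x_in;                     // (24)
        yC = yC_next; yF = yF_next; yF2 = yF2_next;
        y[i + 1] = a * yF2 + b * yF + c * yC;
    }
    return y;
}
```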

4.2. IIR Approximations

The Infinite Impulse Response (IIR) filters use current input and previous input and output samples to compute each output sample. In the case of approximating Gaussian convolution, IIR filters employ recursive expressions that allow for iterative computations.
IIR filter recursively solves a sequence of difference equations like the following one:
y[n] = b_0 x[n] + b_1 x[n-1] + \ldots + b_P x[n-P] - a_1 y[n-1] - \ldots - a_Q y[n-Q],   (30)
where the filter coefficients b i , i = 0 , , P , and a j , j = 1 , , Q are some functions of the Gaussian half-width σ . The cost of such an approximation is independent of σ . By considering the z-transform of the IIR filter and combining causal and anti-causal parts, researchers constructed recursive systems with symmetric impulse responses that closely approximate the Gaussian filtering.
Several state-of-the-art forms of IIR approximations are described below.

4.2.1. Deriche Form

This method was described in 1993 by R. Deriche [15] and improved in 2006 by Farnebäck and Westin [16]. Here, we only focus on Deriche’s original approach. In his work, the recursive Gaussian filter of the fourth order was constructed as a sum of causal and anti-causal systems:
H(z) = H_+(z) + H_-(z),   (31)
where the causal part H_+ and the anti-causal part H_- are as follows:
H_+(z) = \frac{b_0^+ + b_1^+ z^{-1} + b_2^+ z^{-2} + b_3^+ z^{-3}}{1 + a_1 z^{-1} + a_2 z^{-2} + a_3 z^{-3} + a_4 z^{-4}}, \qquad H_-(z) = \frac{b_1^- z + b_2^- z^2 + b_3^- z^3 + b_4^- z^4}{1 + a_1 z + a_2 z^2 + a_3 z^3 + a_4 z^4}.   (32)
To implement the composite system H(z), one needs to apply both the causal and anti-causal filters to an input sequence x[n] and accumulate the results in an output sequence y[n].
To obtain the coefficients a_1, a_2, a_3, and a_4, Deriche noted that they depend only on the locations of the filter poles and are the same for both the causal part H_+ and the anti-causal part H_-. The coefficients b_0^+, b_1^+, b_2^+, and b_3^+ in the causal system H_+ depend on the locations of the filter zeros. The coefficients b_1^-, b_2^-, b_3^-, and b_4^- in H_-(z) are computed from the others because the composite filter H(z) must be symmetric. Thus, H(z) = H(z^{-1}) is satisfied, and we have the following expressions:
b_1^- = b_1^+ - b_0^+ a_1, \quad b_2^- = b_2^+ - b_0^+ a_2, \quad b_3^- = b_3^+ - b_0^+ a_3, \quad b_4^- = -b_0^+ a_4.   (33)
With these relations, the eight coefficients of the causal system H + completely determine the symmetric impulse response h [ n ] of the composite system H ( z ) . Those eight coefficients, in turn, depend on the poles and zeros of the causal system H + .
Deriche computed poles and zeros to minimize a normalized mean square error
\varepsilon^2 = \frac{\sum_{n=0}^{S} \bigl( g_{\sigma_0}[n] - h[n] \bigr)^2}{\sum_{n=0}^{S} g_{\sigma_0}[n]^2}   (34)
between the sampled Gaussian function with half-width σ_0 and the impulse response h, which was taken in the form
h[n] = \sum_{i=0}^{m} \alpha_i\, e^{-\lambda_i \frac{n}{\sigma}}.   (35)
The author chose S = 10σ_0 with σ_0 = 100. The solution to this non-linear least-squares minimization problem needs to be computed only once for σ_0 and, due to the scaling properties, can be extended to other σ. Finally, to deal with normalized filters for Gaussian smoothing, Deriche suggested scaling the operators by a normalization factor C such that
C \sum_{n=-\infty}^{+\infty} h[n] = 1,   (36)
which provides
C = \frac{1}{H_+(1) + H_-(1)}.   (37)
This paper only includes the first-order and the second-order Deriche-style filters in the comparison as the most computationally efficient options. They both come from the sum
y[k] = C \bigl( y^+[k] + y^-[k] \bigr).   (38)
The first-order filter has the following expressions for y^+[k] and y^-[k]:
y^+[k] = b_0^+ x[k] - a_1^+ y^+[k-1], \qquad y^-[k] = b_1^- x[k+1] - a_1^- y^-[k+1],   (39)
and approximates the causal part of the Gaussian kernel by a simple exponential function
h_1(x) = \alpha\, e^{-\lambda \frac{x}{\sigma}}, \quad \alpha, \lambda \in \mathbb{R}.   (40)
Minimizing the mean square error (34), we obtained the following values for α and λ :
α = 1.25841931 , λ = 0.92261977 ,
by which the coefficients of filter and normalization factor are expressed as follows:
b_0^+ = \alpha, \quad a_1^+ = -e^{-\frac{\lambda}{\sigma}}, \quad b_1^- = -a_1^+ b_0^+, \quad a_1^- = a_1^+, \quad C = \frac{1 + a_1^+}{b_0^+ + b_1^-}.   (42)
An example of the approximation with the first-order Deriche-style filter is provided in Figure 5. The expressions corresponding to (39)–(42) for the second-order filter are provided below:
y^+[k] = b_0^+ x[k] + b_1^+ x[k-1] - a_1^+ y^+[k-1] - a_2^+ y^+[k-2], \qquad y^-[k] = b_1^- x[k+1] + b_2^- x[k+2] - a_1^- y^-[k+1] - a_2^- y^-[k+2],   (43)
h_2(x) = \Bigl( \gamma_1 \cos\frac{\omega x}{\sigma} + \gamma_2 \sin\frac{\omega x}{\sigma} \Bigr) e^{-\frac{b x}{\sigma}}, \quad \gamma_1, \gamma_2, \omega, b \in \mathbb{R},   (44)
\gamma_1 = 0.9629, \quad \gamma_2 = 1.942, \quad \omega = 0.8448, \quad b = 1.26,   (45)
b_0^+ = \gamma_1, \quad b_1^+ = \Bigl( -\gamma_1 \cos\frac{\omega}{\sigma} + \gamma_2 \sin\frac{\omega}{\sigma} \Bigr) e^{-\frac{b}{\sigma}}, \quad a_1^+ = -2 \cos\frac{\omega}{\sigma} \cdot e^{-\frac{b}{\sigma}}, \quad a_2^+ = e^{-\frac{2b}{\sigma}}, \quad b_1^- = b_1^+ - a_1^+ b_0^+, \quad b_2^- = -a_2^+ b_0^+, \quad a_1^- = a_1^+, \quad a_2^- = a_2^+, \quad C = \frac{1 + a_1^+ + a_2^+}{b_0^+ + b_1^+ + b_1^- + b_2^-}.   (46)
Note that the values of γ_1, γ_2, ω, and b have already been reported in [15], and our result of minimizing the error of the form (34) exactly agrees with them.
These low-order filters require a small fixed number of arithmetic operations. Applying the first-order Deriche-style filter requires 3 additions and 4 multiplications per item, and the second-order filter requires 7 additions and 8 multiplications.
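To make the recursion concrete, here is a minimal C++ sketch of the first-order Deriche-style smoother for one row, following (38), (39), and (42) with the constants from (41). Samples outside the row are treated as zero, and the function name and types are only illustrative.

```cpp
#include <cmath>
#include <vector>

// Sketch of the first-order Deriche-style smoother: a causal pass
// left-to-right, an anti-causal pass right-to-left, and their sum scaled by
// the normalization factor C. Boundary samples are treated as zero.
std::vector<float> deriche1_row(const std::vector<float>& x, float sigma) {
    const float alpha = 1.25841931f, lambda = 0.92261977f;  // values from (41)
    const float e = std::exp(-lambda / sigma);
    const float b0 = alpha;                   // b_0^+
    const float b1m = alpha * e;              // b_1^-
    const float C = (1.0f - e) / (b0 + b1m);  // normalization, Equation (37)

    const int n = static_cast<int>(x.size());
    std::vector<float> yp(n, 0.0f), ym(n, 0.0f), y(n);

    // Causal pass: y+[k] = b0 x[k] + e * y+[k-1].
    for (int k = 0; k < n; ++k)
        yp[k] = b0 * x[k] + (k > 0 ? e * yp[k - 1] : 0.0f);

    // Anti-causal pass: y-[k] = b1m x[k+1] + e * y-[k+1].
    for (int k = n - 1; k >= 0; --k)
        ym[k] = (k + 1 < n ? b1m * x[k + 1] + e * ym[k + 1] : 0.0f);

    for (int k = 0; k < n; ++k)
        y[k] = C * (yp[k] + ym[k]);
    return y;
}
```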

4.2.2. Vliet–Young–Verbeek Form

Another IIR filter was proposed in 1998 by van Vliet, Young, and Verbeek [17] and will be abbreviated in the paper as the VYV filter. While Deriche’s filter consists of the sum of the causal and anti-causal parts, the VYV filter is the product
H(z) = H_+(z) \cdot H_-(z), \qquad H_+(z) = \prod_{i=1}^{p} \frac{d_i - 1}{d_i - z^{-1}}, \qquad H_-(z) = (-1)^p \prod_{i=1}^{p} \frac{d_i - 1}{z - d_i},   (47)
where p is the order of the filter and d_i \in \mathbb{C} are the poles of H(z). For p = 3, the causal and anti-causal systems take the following form:
H_+(z) = \frac{\alpha}{1 + b_1 z^{-1} + b_2 z^{-2} + b_3 z^{-3}}, \qquad H_-(z) = \frac{\alpha}{1 + b_1 z + b_2 z^2 + b_3 z^3}   (48)
with α, b_i \in \mathbb{R}, so the expressions for the third-order filter are as follows:
y^+[k] = \alpha x[k] - b_1 y^+[k-1] - b_2 y^+[k-2] - b_3 y^+[k-3], \qquad y^-[k] = \alpha y^+[k] - b_1 y^-[k+1] - b_2 y^-[k+2] - b_3 y^-[k+3].   (49)
To implement the composite system H(z), the authors applied the causal filter to an input sequence x[k] to obtain an intermediate output sequence y^+[k]. Then, they applied the anti-causal filter to that sequence to obtain the final output y^-[k]. As in Deriche’s method, the coefficients b_1, b_2, b_3 depend only on the locations of the filter poles d_1, d_2 \in \mathbb{C} and d_3 \in \mathbb{R}, and these coefficients are the same for both the causal part H_+ and the anti-causal part H_-. They can be derived by minimizing the root mean square error
L_2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigl| H(\omega; \sigma_0) - G(\omega; \sigma_0) \bigr|^2 d\omega = \sum_{n=-\infty}^{+\infty} \bigl( g_\sigma[n] - h[n] \bigr)^2.   (50)
The authors obtained the following poles for σ 0 = 2 :
d_1 = 1.41656 + 1.00832\,i, \quad d_2 = \overline{d_1}, \quad d_3 = 1.86548065,   (51)
and the filter coefficients are related to the poles d_1, d_2, d_3 as
b = \frac{1}{d_1 d_2 d_3}, \quad b_1 = -b \bigl( d_1 d_2 + d_1 d_3 + d_2 d_3 \bigr), \quad b_2 = b \bigl( d_1 + d_2 + d_3 \bigr), \quad b_3 = -b, \quad \alpha = 1 + b_1 + b_2 + b_3.   (52)
Note that the value of the error (50) for the third-order VYV filter was provided in the original paper, and our result agreed with it within the margin of error. Finally, the authors scaled the poles to obtain Gaussian filters for other σ using
d_i \rightarrow d_i^{1/q}, \qquad \sigma^2 = \sum_{i=1}^{p} \frac{2\, d_i^{1/q}}{\bigl( d_i^{1/q} - 1 \bigr)^2}   (53)
and determining the relationship between q and σ. For the third-order filter, examining a set of (q, σ²) pairs, we approximated this relation by the least squares method as in [31] but with a second-order polynomial as in [32]:
\sigma^2 = 1.06042809\, q^2 + 4.62413131\, q + 4.56081613.   (54)
In this paper, in addition to the third-order filter, we also include the second-order VYV-style filter in comparison. It performs as follows:
y^+[k] = \alpha x[k] - b_1 y^+[k-1] - b_2 y^+[k-2], \qquad y^-[k] = \alpha y^+[k] - b_1 y^-[k+1] - b_2 y^-[k+2].   (55)
Minimizing (50) for σ_0 = 2, we obtained the pair of complex-conjugate poles
d_1 = 1.69593 + 0.5996\,i, \quad d_2 = \overline{d_1}.   (56)
The filter coefficients are
b = \frac{1}{d_1 d_2}, \quad b_1 = -b \bigl( d_1 + d_2 \bigr), \quad b_2 = b, \quad \alpha = 1 + b_1 + b_2,   (57)
and, according to (53), we scaled the poles to the power 1/q with q computed from
\sigma^2 = 1.02158395\, q^2 + 4.44944206\, q + 4.53954235.   (58)
An example of the approximation with the second-order and the third-order VYV filters is provided in Figure 6.
Applying the VYV filter is computationally less expensive than Deriche’s one. Whereas Deriche’s second-order filter requires 7 additions and 8 multiplications per pixel, the second-order VYV filter requires 4 additions and 6 multiplications and the third-order one 6 additions and 8 multiplications.
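A minimal C++ sketch of the second-order VYV-style filter for one row is given below. It rescales the pole pair (56) for the target σ via (58) and (53), derives the recursion coefficients as in (57), and runs the causal and anti-causal passes (55), treating samples outside the row as zero; the function name and these boundary conventions are assumptions for illustration.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Sketch of the second-order VYV-style filter: pole rescaling (53), (58),
// coefficients (57), and the two recursive passes of Equation (55).
std::vector<double> vyv2_row(const std::vector<double>& x, double sigma) {
    // q from sigma^2 = 1.02158395 q^2 + 4.44944206 q + 4.53954235, Equation (58).
    const double A = 1.02158395, B = 4.44944206, Cc = 4.53954235 - sigma * sigma;
    const double q = (-B + std::sqrt(B * B - 4.0 * A * Cc)) / (2.0 * A);

    const std::complex<double> d1 =
        std::pow(std::complex<double>(1.69593, 0.5996), 1.0 / q);  // rescaled pole
    const double d1d2  = std::norm(d1);      // d1 * conj(d1)
    const double d1pd2 = 2.0 * d1.real();    // d1 + conj(d1)

    const double b  = 1.0 / d1d2;            // Equation (57)
    const double b1 = -b * d1pd2;
    const double b2 = b;
    const double alpha = 1.0 + b1 + b2;

    const int n = static_cast<int>(x.size());
    std::vector<double> yp(n, 0.0), y(n, 0.0);
    // Causal pass: y+[k] = alpha x[k] - b1 y+[k-1] - b2 y+[k-2].
    for (int k = 0; k < n; ++k)
        yp[k] = alpha * x[k]
              - (k > 0 ? b1 * yp[k - 1] : 0.0)
              - (k > 1 ? b2 * yp[k - 2] : 0.0);
    // Anti-causal pass on the intermediate result.
    for (int k = n - 1; k >= 0; --k)
        y[k] = alpha * yp[k]
             - (k + 1 < n ? b1 * y[k + 1] : 0.0)
             - (k + 2 < n ? b2 * y[k + 2] : 0.0);
    return y;
}
```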

5. Complexity and Accuracy Comparison

The comparison of the methods in terms of computational complexity, accuracy, and extra memory is summarized in Table 1. The complexity is given as the number of arithmetic operations required per pixel for 2D image filtering with a Gaussian kernel. For several methods, it depends on their internal parameters, whose meaning is clarified in the subsection describing the corresponding approximation.
The accuracy of the methods was estimated by us independently as the mean square error (MSE) given by the expression
\mathrm{MSE} = \frac{1}{S} \sum_{n=1}^{S} \bigl( g_{\sigma_0}[n] - h[n] \bigr)^2   (59)
for S = 500 points uniformly taken on the interval [-3σ_0; 3σ_0] with σ_0 = 10. The original Deriche paper contained values of the error defined by (34) for the second-order filter and σ = 100, so we re-estimated its error. The PcParab method is the only one whose parameters were not adjusted by minimizing any error function because the coefficients of the parabolas were obtained by solving a system of linear equations. As a result, the obtained h[n] was simply substituted into expression (59) for this method.
The third-order VYV filter proved to be the most accurate of all the approximations considered. The second-order VYV and the second-order Deriche filters, as well as the PcParab method, were slightly less accurate. At the same time, these approximations are among the most computationally expensive. The running sums method with k = 5 has almost identical computational complexity but shows an order of magnitude lower accuracy. The first-order Deriche filter requires few operations but shows lower accuracy than the other IIR filters. The Stack and Bell blur approximations do not require multiplicative operations, while their accuracy is acceptable. Among the approximations with piecewise constant functions, the accuracy of Bell blur was higher than that of all the running sums variants since the number of steps in the Bell blur kernel is not limited and is shown (Figure 2 and Figure 3) to be significantly greater than for running sums. Nevertheless, for k = 5, the accuracy of the running sums method is very close to that of Bell blur.

6. Quantization and SIMD Implementation

Modern computers employ either an integer or a floating-point format to store numbers. Both formats use a specific number of bits for value storage, with more bits equating to higher precision. In the integer representation, the discrete values are evenly spaced. In the floating-point representation, neighboring discrete values are locally evenly spaced, but the step size increases exponentially as the value’s magnitude increases. This results in a vast dynamic range, enabling the storage of both very large and very small values in memory. While the floating-point representation is prevalent, it comes with a computational cost: a floating-point processing unit requires more logic and transistors than an integer one. So, using an integer format can be advantageous: the resulting values require less memory and can be processed faster. Furthermore, quantization enables implementation on devices with no floating-point unit.
We conducted our experiments with floating-point and low-precision integer representations. Let us describe our quantization approach. In the running sums and PcParab methods, the coefficients are real-valued, and we quantized them by rounding to fixed-point integers. Mathematically, it is written as
\hat{a} = \mathrm{round}\bigl( a \cdot 2^b \bigr),   (60)
where a \in [0; 1] is the coefficient to be rounded and b denotes the number of bits in the fixed-point representation.
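For illustration, a small C++ helper pair following Equation (60) is sketched below. The weight range, the accumulator widths, and the assumption that the quantized weights sum to roughly 2^b are ours; the actual implementations in the experiments handle normalization per method as described below.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Fixed-point quantization of real-valued weights in [0; 1], Equation (60).
std::vector<int32_t> quantize_weights(const std::vector<double>& w, int b) {
    std::vector<int32_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int32_t>(std::lround(w[i] * (1 << b)));
    return q;
}

// Integer dot product of quantized weights with 8-bit pixels, followed by a
// division by 2^b with rounding to nearest. Assumes q.size() == px.size()
// and that the quantized weights sum to approximately 2^b, so the result
// fits back into 8 bits.
uint8_t apply_quantized(const std::vector<int32_t>& q,
                        const std::vector<uint8_t>& px, int b) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < q.size(); ++i)
        acc += static_cast<int64_t>(q[i]) * px[i];
    return static_cast<uint8_t>((acc + (int64_t(1) << (b - 1))) >> b);
}
```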
We ran the experiments on Intel Core i9-9900KF (x86_64 architecture) and ARM Cortex-A73 (ARMv8 architecture) CPUs. For the DFT-based convolution, we used the implementation from the fftw3 open-source library [33]. All methods except the DFT-based convolution and direct Gaussian filtering were vectorized.
We implemented the SIMD vectorization using intrinsics of the SSE family for x86_64 and NEON intrinsics for ARMv8. We operated with 128-bit registers storing 4 single-precision floating-point values or 16 8-bit unsigned integer values. All the approximations except for the running sums method compute the answer recurrently; i.e., the result for the current element is based on the result for the previous one. Thus, direct vectorization along a row would be neither efficient nor meaningful. In the vectorized implementations, the image was preliminarily transposed (Figure 7) so that we go through the columns of the original image, writing the answer for 4 or 16 elements at a time when the 32-bit floating-point or 8-bit unsigned integer type, respectively, is used.
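The sketch below illustrates this scheme with SSE intrinsics for the causal pass of the first-order Deriche-style filter: after transposition, sample k of four neighboring 1D signals lies in four consecutive floats, so one 128-bit register carries the recursion state of four independent signals. The function name, the zero initial state, and the coefficient names b0 and e (as in the scalar sketch above) are our assumptions.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Causal pass y+[k] = b0 * x[k] + e * y+[k-1] applied to 4 independent,
// interleaved signals at once. `transposed` holds sample k of signal s at
// transposed[k * num_signals + s]; the output uses the same layout.
void deriche1_causal_x4(const float* transposed, float* out,
                        int len, int num_signals, float b0, float e) {
    const __m128 vb0 = _mm_set1_ps(b0);
    const __m128 ve  = _mm_set1_ps(e);
    for (int s = 0; s + 4 <= num_signals; s += 4) {  // 4 signals per register
        __m128 prev = _mm_setzero_ps();              // y+[k-1] for the 4 signals
        for (int k = 0; k < len; ++k) {
            const __m128 x = _mm_loadu_ps(transposed + k * num_signals + s);
            prev = _mm_add_ps(_mm_mul_ps(vb0, x), _mm_mul_ps(ve, prev));
            _mm_storeu_ps(out + k * num_signals + s, prev);
        }
    }
}
```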
The details of the quantized implementations are given below. In all the quantized implementations, the division used to normalize the result is performed in integers with rounding to the nearest value.
  • Stack blur and Bell blur
These are the only approximations that do not involve multiplications and construct the answer by a special summation of the input image pixels. Their only parameter is the radius, which defines the lengths of the summation regions, so there is nothing to apply Equation (60) to. The sums for the outgoing and incoming pixels, stored in the s^{out} and s^{in} accumulators, respectively, may be negative, so we use a 32-bit signed integer type for operating on them. The normalizing factor was not modified.
  • Running sums
The method sequentially sums the pixels and stores them in an array called an integral row. The range of possible intensity values for an 8-bit image is [0; 2⁸ − 1]; hence, the values of the integral row are greater than or equal to 0, and we stored them in a 32-bit unsigned integer type. The real-valued steps of the piecewise constant functions were quantized by setting b = 8 in Equation (60). The normalizing factor was modified in the same way.
  • PcParab
The method operates with the arrays y_C, y_F, and y_{F^2}, accumulating integer linear combinations of image pixels, and then multiplies their elements by the real coefficients of the parabolas. For these purposes, we used a 32-bit integer type for the accumulator arrays and obtained the quantized coefficients according to Equation (60) with b = 9 after a preliminary division by the coefficient with the minimum absolute value. The normalizing factor was modified in the same way.
  • Deriche and VYV filters
For IIR filters, quantized implementations were not successful as the values accumulated along the image rows were too large to fit into an appropriate accumulator type.
For FIR filters, we considered both floating-point and quantized implementations. All of them were vectorized. For IIR filters, the paper includes floating-point implementations and their vectorized variants.

7. Experimental Comparison

In this section, we present the results of our experimental comparison. In our computational experiments, we ran each method on 1000 randomly generated grayscale images of size 1024 × 1024. The quantized methods received 8-bit unsigned integer data as inputs. The non-quantized methods operated on 32-bit floating-point data, and their input was in the [0, 1] range. The numerical values of the runtime are presented in Table 2. We also added illustrations to demonstrate the visual effects of applying the considered approximations; one can see them in Figure 9 and Figure 10. Analyzing these illustrations shows that all the considered implementations demonstrate an acceptable visual effect.
Examining the numerical results, we can draw the following conclusions. Looking at the speedup from the vectorization of the computations, we can notice that vectorization significantly affects the approximations that require multiplicative operations. Thus, it sped up the Deriche filters by almost 3 times and the VYV filters by nearly 4 times. Furthermore, in the quantized implementations, vectorization demonstrated a more substantial speedup than in the floating-point implementations. This is because vectorization processes a larger number of elements at a time: for instance, a 128-bit register can accommodate 16 8-bit items. However, to prevent overflow, we need to convert the numbers into a 32-bit integer format for intermediate computations. Despite this constraint, the speedup was still significant, highlighting the advantage of vectorization in quantized filters. We can also see that the quantized implementation of direct Gaussian filtering already provides a significant reduction in the running time.
The results are presented in a more illustrative form in Figure 8, where the MSE is plotted along the x-axis and the running time along the y-axis; the figure also includes the results for σ = 5. Analyzing the plots, one can choose a method based on the priority of higher accuracy or lower running time:
  • The Stack blur quantized implementation showed the lowest running time on x86, although, on ARM, the quantized implementation of running sums with k = 3 appeared to be slightly faster. However, these approximations have relatively low accuracy.
  • Among the IIR filters, the second-order VYV filter demonstrated the best running time both on x86 and ARM and slightly outperformed Deriche’s approximations in terms of accuracy. The third-order VYV filter works more slowly on x86 and faster on ARM, providing the most accurate results.
  • The IIR filters outperformed almost all the floating-point methods except Stack blur on ARM for σ = 10 .
  • All in all, the second-order VYV filter has the best balance between speed and accuracy. Moreover, it requires a constant amount of additional memory. However, it uses floating-point input and computations, which can be inappropriate for some cases.
  • Among the integer-only methods, the quantized Bell blur and quantized running sums ( k = 4 and k = 5 ) have the best balance between the speed and accuracy on x86 and ARM, respectively. The Bell blur requires a constant amount of additional memory, while, for running sums, the additional memory depends linearly on the image size.

8. Conclusions

Approximations of the convolution with Gaussian filters offer a practical alternative to the computationally expensive approach of exact Gaussian smoothing. In this paper, we compared several of these methods in terms of computational complexity and accuracy. All the considered approximations demonstrated acceptable results and a 10–20× speedup compared to the naive approach. The choice between these approximations depends on the specific requirements of the application and hardware, striking a balance between precision and efficiency.
For floating-point input data and implementations, we recommend using the second-order Vliet–Young–Verbeek filter since the other methods superior to it in time or accuracy are significantly inferior in other aspects. However, in cases where accuracy is unimportant, one can use Stack blur as the fastest method.
For integer inputs, we recommend quantized methods. Bell blur or running sums are the most effective variants depending on the platform and additional memory use restrictions. These methods provide a combination of reasonably high accuracy and speed. Quantized Stack blur also works faster but is inferior in accuracy and, therefore, has limited use.

Author Contributions

Conceptualization, D.P.N.; methodology, E.E.L.; software, E.O.R.; validation, E.E.L. and D.P.N.; formal analysis, D.P.N.; writing—original draft preparation, E.O.R.; writing—review and editing, E.E.L.; visualization, E.O.R.; supervision, D.P.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the grant from the Ministry of Science and Higher Education of the Russian Federation, internal number 00600/2020/51896, Agreement dated 21 April 2022 No. 075-15-2022-319.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Ekaterina O. Rybakova, Elena E. Limonova and Dmitry P. Nikolaev were employed by the company Smart Engines Service LLC. The company was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Abbreviations

The following abbreviations are used in this manuscript:
CPU   Central processing unit
DFT   Discrete Fourier transform
FIR   Finite Impulse Response filter
FP   Floating-point
IIR   Infinite Impulse Response filter
MSE   Mean square error
SIMD   Single Instruction Multiple Data
VYV   Vliet–Young–Verbeek filter

References

  1. Devi, T.G.; Patil, N.; Rai, S.; Philipose, C.S. Gaussian blurring technique for detecting and classifying acute lymphoblastic leukemia cancer cells from microscopic biopsy images. Life 2023, 13, 348. [Google Scholar] [CrossRef] [PubMed]
  2. Mewada, H.; Al-Asad, J.F.; Almalki, F.A.; Khan, A.H.; Almujally, N.A.; El-Nakla, S.; Naith, Q. Gaussian-Filtered High-Frequency-Feature Trained Optimized BiLSTM Network for Spoofed-Speech Classification. Sensors 2023, 14, 6637. [Google Scholar] [CrossRef] [PubMed]
  3. Abuya, T.K.; Rimiru, R.M.; Okeyo, G.O. An Image Denoising Technique Using Wavelet-Anisotropic Gaussian Filter-Based Denoising Convolutional Neural Network for CT Images. Appl. Sci. 2023, 13, 12069. [Google Scholar] [CrossRef]
  4. Chekanov, M.; Shipitko, O.; Skoryukina, N. Study of Keypoints Detectors and Descriptors Performance on X-ray Images Compared to the Visible Light Spectrum Images. IEEE Access 2022, 10, 38964–38972. [Google Scholar] [CrossRef]
  5. Gayer, A.V.; Ershova, D.M.; Arlazarov, V.V. An accurate approach to real-time machine-readable zone detection with mobile devices. In Lecture Notes in Computer Science (LNCS); ICDAR 2023; Springer Nature Group: Cham, Switzerland, 2023; Volume 26, pp. 1–14. [Google Scholar] [CrossRef]
  6. Galletti, A.; Giunta, G.; Marcellino, L.; Parlato, D. An algorithm for gaussian recursive filters in a multicore architecture. In Proceedings of the 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), Prague, Czech Republic, 3–6 September 2017; pp. 507–511. [Google Scholar]
  7. Ma, Y.; Xie, K.; Peng, M. A Parallel Gaussian Filtering Algorithm Based on Color Difference. In Proceedings of the 2011 2nd International Symposium on Intelligence Information Processing and Trusted Computing, Wuhan, China, 22–23 October 2011; pp. 51–54. [Google Scholar]
  8. De Luca, P.; Galletti, A.; Marcellino, L. A Gaussian Recursive Filter Parallel Implementation with Overlapping. In Proceedings of the 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Sorrento, Italy, 26–29 November 2019; pp. 641–648. [Google Scholar]
  9. Vidal-Migallón, I.; Commowick, O.; Pennec, X.; Dauguet, J.; Vercauteren, T. GPU & CPU implementation of Young—Van Vliet’s Recursive Gaussian Smoothing Filter. Insight J. 2013, 16. [Google Scholar] [CrossRef]
  10. Takagi, H.; Fukushima, N. An efficient description with halide for iir gaussian filter. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 28–35. [Google Scholar]
  11. Bokovoy, A.V. Automatic control system’s architecture for group of small unmanned aerial vehicles. JITCS 2018, 1, 68–77. [Google Scholar]
  12. Elboher, E.; Werman, M. Efficient and accurate Gaussian image filtering using running sums. In Proceedings of the 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), Kochi, India, 27–29 November 2012; pp. 897–902. [Google Scholar]
  13. Klingemann, M. StackBlur. Available online: http://underdestruction.com/2004/02/25/stackblur-2004/ (accessed on 7 August 2023).
  14. vd Zwan, J. Stackblur and Quadratic Stackblur. Available online: https://observablehq.com/@jobleonard/mario-klingemans-stackblur (accessed on 7 August 2023).
  15. Deriche, R. Recursively Implementing the Gaussian and Its Derivatives; Technical Report 1893; INRIA, Unité de Recherche Sophia-Antipolis: Valbonne, France, 1993. [Google Scholar]
  16. Farnebäck, G.; Westin, C.-F. Improving Deriche-style Recursive Gaussian Filters. J. Math. Imaging Vision 2006, 26, 293–299. [Google Scholar] [CrossRef]
  17. van Vliet, L.J.; Young, I.T.; Verbeek, P.W. Recursive Gaussian derivative filters. In Proceedings of the Fourteenth International Conference on Pattern Recognition, Brisbane, QLD, Australia, 20 August 1998; Volume 1, pp. 509–514. [Google Scholar]
  18. Bozkurt, F.; Yaganoglu, M.; Günay, F.B. Effective Gaussian blurring process on graphics processing unit with CUDA. Int. J. Mach. Learn. Comput. 2015, 5, 57–61. [Google Scholar] [CrossRef]
  19. Ibrahim, N.M.; Abou ElFarag, A.; Kadry, R. Gaussian Blur through Parallel Computing. In Proceedings of the International Conference on Image Processing and Vision Engineering (IMPROVE 2021), Online, 28–30 April 2021; pp. 175–179. [Google Scholar]
  20. Moradifar, M.; Shahbahrami, A. Performance improvement of Gaussian filter using SIMD technology. In Proceedings of the 2020 International Conference on Machine Vision and Image Processing (MVIP), Qom, Iran, 18–20 February 2020; pp. 1–6. [Google Scholar]
  21. Zin Oo, N. The Improvement of 1D Gaussian Blur Filter using AVX and OpenMP. In Proceedings of the 2022 22nd International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 27 November–1 December 2022; pp. 1493–1496. [Google Scholar]
  22. Simd Library. Available online: https://ermig1979.github.io/Simd/help/group__gaussian__filter.html#gaf37e85de82125071158ae2db59fb5643 (accessed on 17 March 2024).
  23. Melatonin Blur library. Available online: https://github.com/sudara/melatonin_blur (accessed on 17 March 2024).
  24. Gupta, P. Accurate performance analysis of a fixed point FFT. In Proceedings of the 2016 Twenty Second National Conference on Communication (NCC), Guwahati, India, 4–6 March 2016; pp. 1–6. [Google Scholar]
  25. Koljonen, J.; Bochko, V.A.; Lauronen, S.J.; Alander, J.T. Fast fixed-point bicubic interpolation algorithm on FPGA. In Proceedings of the 2019 IEEE Nordic Circuits and Systems Conference (NORCAS), Helsinki, Finland, 29–30 October 2019; pp. 1–7. [Google Scholar]
  26. Sher, A.; Trusov, A.; Limonova, E.; Nikolaev, D.; Arlazarov, V.V. Neuron-by-Neuron Quantization for Efficient Low-Bit QNN Training. Mathematics 2023, 11, 2112. [Google Scholar] [CrossRef]
  27. Ryu, J.; Nishimura, T.H. Fast image blurring using lookup table for real time feature extraction. In Proceedings of the 2009 IEEE International Symposium on Industrial Electronics, Seoul, Republic of Korea, 5–8 July 2009; pp. 1864–1869. [Google Scholar]
  28. Chervyakov, N.; Lyakhov, P.; Nagornov, N.; Kaplun, D.; Voznesensky, A.; Bogayevskiy, D. Implementation of Smoothing Image Filtering in the Residue Number System. In Proceedings of the 2019 8th Mediterranean Conference on Embedded Computing, Budva, Montenegro, 10–14 June 2019; pp. 1–4. [Google Scholar]
  29. Valueva, M.V.; Lyakhov, P.A.; Nagornov, N.N.; Valuev, G.V. High-performance digital image filtering architectures in the residue number system based on the Winograd method. Comput. Opt. 2022, 46, 752–762. [Google Scholar] [CrossRef]
  30. Brigham, E.O.; Morrow, R.E. The fast Fourier transform. IEEE Spectrum 1967, 4, 63–70. [Google Scholar] [CrossRef]
  31. Tan, S.; Dale, J.L.; Johnston, A. Performance of three recursive algorithms for fast space-variant Gaussian filtering. Real-Time Imaging 2003, 9, 215–228. [Google Scholar] [CrossRef]
  32. Young, I.T.; van Vliet, L.J.; van Ginkel, M. Recursive Gabor filtering. IEEE Trans. Signal Process. 2002, 50, 2798–2805. [Google Scholar] [CrossRef]
  33. fftw3. Available online: https://www.fftw.org/ (accessed on 7 August 2023).
Figure 1. Convolution of an image with the kernel of size 5 × 5.
Figure 2. Approximation of the Gaussian with σ = 10 by the Stack blur and Bell blur methods.
Figure 3. Approximation of the Gaussian with σ = 10 by the running sums method for k = 3, 4, 5.
Figure 4. Approximation of the Gaussian with σ = 10 by three parabolas.
Figure 5. Approximation of the Gaussian with σ = 10 by the first-order and the second-order Deriche filters.
Figure 6. Approximation of the Gaussian transfer function with σ = 10 by the second-order and the third-order VYV filters.
Figure 7. Image processing scheme in sequential and vector implementations when pixel intensities are stored as 32-bit floating points.
Figure 8. Scatter plots of MSE and runtime results for σ = 5, 10 on Intel Core i9-9900KF (x86) and ARM Cortex-A73 (ARMv8).
Figure 9. Results of the considered approximations of Gaussian filter ( σ = 5 ): (a) original image; (b) Gaussian filter; (c) Gaussian filter (quantized); (d) Stack blur ( r = 24 ); (e) Stack blur (quantized) ( r = 24 ); (f) Bell blur ( r = 12 ); (g) Bell blur (quantized) ( r = 12 ); (h) DFT-based convolution; (i) running sums ( k = 3 ); (j) running sums (quantized) ( k = 3 ); (k) running sums ( k = 4 ); (l) running sums (quantized) ( k = 4 ); (m) running sums ( k = 5 ); (n) running sums (quantized) ( k = 5 ); (o) PcParab; (p) PcParab (quantized); (q) Deriche filter (the 1st order); (r) Deriche filter (the 2nd order); (s) VYV filter (the 2nd order); (t) VYV filter (the 3rd order).
Figure 10. Results of the considered approximations of Gaussian filter ( σ = 10 ): (a) original image; (b) Gaussian filter; (c) Gaussian filter (quantized); (d) Stack blur ( r = 24 ); (e) Stack blur (quantized) ( r = 24 ); (f) Bell blur ( r = 12 ); (g) Bell blur (quantized) ( r = 12 ); (h) DFT-based convolution; (i) running sums ( k = 3 ); (j) running sums (quantized) ( k = 3 ); (k) running sums ( k = 4 ); (l) running sums (quantized) ( k = 4 ); (m) running sums ( k = 5 ); (n) running sums (quantized) ( k = 5 ); (o) PcParab; (p) PcParab (quantized); (q) Deriche filter (the 1st order); (r) Deriche filter (the 2nd order); (s) VYV filter (the 2nd order); (t) VYV filter (the 3rd order).
Table 1. Comparison of the considered methods for Gaussian filtering of an image with size h × w in terms of the number of additions (# +) and multiplications (# ·) per pixel, error, and extra memory. For FIR filters, we use a kernel of size K × K.
Method | # + | # · | Error (FP) | Error (Quant.) | Extra Memory
Direct Gaussian | 2(K − 1) | 2K | depends on K | depends on K | O(1)
DFT-based convolution (FFT) | O(log K²) | O(log K²) | depends on K | - | impl. defined
Stack blur | 12 | 0 | 9.35 × 10⁻⁶ | 9.35 × 10⁻⁶ | O(1)
Bell blur | 28 | 0 | 4.40 × 10⁻⁶ | 4.40 × 10⁻⁶ | O(1)
Running sums, k = 3 | 12 | 6 | 1.43 × 10⁻⁵ | 1.43 × 10⁻⁵ | O(max{h; w})
Running sums, k = 4 | 16 | 8 | 7.91 × 10⁻⁶ | 7.91 × 10⁻⁶ | O(max{h; w})
Running sums, k = 5 | 20 | 10 | 5.11 × 10⁻⁶ | 5.11 × 10⁻⁶ | O(max{h; w})
PcParab | 22 | 16 | 1.92 × 10⁻⁷ | 5.22 × 10⁻⁶ | O(max{h; w})
Deriche (1st order) | 6 | 8 | 1.39 × 10⁻⁵ | - | O(1)
Deriche (2nd order) | 14 | 16 | 1.80 × 10⁻⁷ | - | O(1)
VYV (2nd order) | 8 | 12 | 1.39 × 10⁻⁷ | - | O(1)
VYV (3rd order) | 12 | 16 | 5.01 × 10⁻⁸ | - | O(1)
Table 2. Average runtime of the considered methods for σ = 10. The lowest value in each column is marked with an asterisk (*).
Method | Runtime, ns (x86), no SIMD | Runtime, ns (x86), with SIMD | Runtime, ns (ARMv8), no SIMD | Runtime, ns (ARMv8), with SIMD
Direct Gaussian | 81.3 | - | 281.5 | -
Direct Gaussian (quant.) | 12.0 | - | 128.0 | -
DFT-based convolution (FFT) | 10.7 | - | 86.1 | -
Stack blur | 6.0 | 3.5 | 20.4 | 17.7
Stack blur (quant.) | 3.8 * | 1.6 * | 15.5 * | 14.7
Bell blur | 6.7 | 4.3 | 31.6 | 26.8
Bell blur (quant.) | 5.7 | 2.6 | 22.5 | 22.0
Running sums, k = 3 | 8.3 | 5.6 | 40.7 | 23.2
Running sums, k = 3 (quant.) | 7.0 | 2.7 | 32.9 | 14.1 *
Running sums, k = 4 | 9.7 | 5.9 | 46.8 | 25.0
Running sums, k = 4 (quant.) | 8.3 | 3.0 | 39.4 | 15.7
Running sums, k = 5 | 11.3 | 6.2 | 53.2 | 27.2
Running sums, k = 5 (quant.) | 10.1 | 3.4 | 46.2 | 17.3
PcParab | 20.0 | 5.8 | 47.9 | 34.8
PcParab (quant.) | 15.0 | 4.9 | 65.2 | 29.8
Deriche (1st order) | 9.1 | 3.5 | 31.3 | 20.4
Deriche (2nd order) | 12.5 | 4.2 | 38.2 | 22.7
VYV (2nd order) | 12.2 | 3.4 | 36.0 | 18.9
VYV (3rd order) | 15.5 | 3.7 | 44.1 | 19.2