**1. Introduction**

Normalized cross-correlation (NCC) is an important mathematical tool in signal and image processing, used for feature matching, similarity analysis, motion tracking, object recognition, and related tasks [1–3]. Because of its high computational complexity, fast algorithms and dedicated hardware structures have been proposed to implement digital NCC efficiently and in real time [4,5].

Since correlation and convolution have similar computational structures, three main kinds of fast convolution algorithms can be applied to fast NCC [6,7]: (1) the Fast Fourier Transform (FFT)-based algorithm, (2) the polynomial-based algorithm, and (3) the decomposition algorithm. However, to our knowledge, each of these algorithms has limitations. The FFT-based algorithm is not well suited to the discrete domain, and it involves complex multiplications [8,9]. Both the polynomial-based algorithm and the decomposition algorithm require complex computational structures, and they often lack generality for arbitrary-length correlations [10,11].

Furthermore, some special algorithms for fast NCC have been presented [12,13]. The fast cross-correlation of binary sequences can be extended to other types of NCC sequences [14]. The estimation algorithm derives the scaling factor between the signal and the kernel, so it computes NCC using only additions at the cost of a small amount of noise [15]. Several methods, such as the pyramid method, have been used alongside NCC to reduce its search and computation times in image matching [3,7]. In addition, many parallel inner-product algorithms that can perform fast cross-correlation for NCC have been published [16,17]; among them, Distributed Arithmetic (DA) with look-up tables requires no multiplications but needs a large amount of Read-Only Memory (ROM) [18].

For hardware implementation of fast NCC, Very-Large-Scale Integration (VLSI) circuits have been applied, where systolic structures are popular due to their regularity and modularity [19–21]. Integrating the systolic array with the DA technique leads to a more efficient VLSI implementation of cross-correlation, although it uses many ROMs and address decoders [22,23]. The Residue Number System-based DA can reduce the number of ROMs and enhance throughput, but it requires extra encoding processes in the residue domain [24].

In this paper, we present a new algorithm and structure to implement digital NCC with a simple and fast procedure. The key idea is an NCC formula expressed in terms of a first-order moment, derived from the relationship between the inner-product and the first-order moment, so that the computational complexity of NCC is reduced to that of a first-order moment. To perform an arbitrary-length digital NCC, our algorithm first establishes the NCC formula based on a first-order moment of the correlation sequences, and then applies a multiplication-free fast algorithm from [25,26] to compute this first-order moment rapidly. For the hardware implementation of NCC, we develop a simple and scalable systolic array derived from the proposed algorithm, since the fast algorithm for the first-order moment is easily performed by a systolic structure [27]. The proposed algorithm and systolic array are further improved to reduce their addition complexity by exploiting an even-odd relationship in the computation of the first-order moment.

The rest of the paper is organized as follows. Section 2 establishes the NCC formula based on a first-order moment. Section 3 introduces a fast algorithm for the first-order moment and its systolic implementation. Sections 4 and 5 discuss the fast algorithm and the systolic array, both inspired by Section 3, that rapidly evaluate the NCC formula of Section 2. Section 6 presents a comparison and analysis to demonstrate the feasibility of the proposed algorithm and structure. Finally, Section 7 gives the conclusion.

#### **2. Normalized Cross-Correlation Based on First-Order Moment**

As the most complex operation in NCC, the inner-product of two correlation sequences is transformed into a first-order moment in order to decrease the computational complexity of fast NCC. To do this, consider two *N*-point digital sequences { *f*(*i*) } and { *g*(*i*) }, where { *f*(*i*) } is an arbitrary input sequence, and { *g*(*i*) } is the fixed correlation kernel with the value range *g*(*i*)∈{ 0, 1, 2, ... , *L* }. This section establishes an NCC formula for these two sequences that mainly consists of a first-order moment and a zero-order moment. The aim is to replace the complex computation of cross-correlation in NCC with the simpler computation of a first-order moment.

#### *2.1. Cross-Correlation*

Cross-correlation is an inner-product between two digital sequences. It is defined as

$$c(n) = f(n) \circ g(n) = \sum\_{i=0}^{N-1} f(n+i)g(i) \tag{1}$$

Equation (1) can be transformed into a first-order moment by exploiting the statistical characteristics of the inner-product operation. To do this, we define subsets *Sk* (*k* = 0, 1, 2, ... , *L*) that divide the index set *i*∈{0, 1, ... , *N* − 1} into *L* + 1 subsets, where *L* is the maximum value in the correlation kernel { *g*(*i*) }. Specifically,

$$S\_k = \left\{ i \, \middle| \, g(i) = k, \quad i \in \{0, 1, 2, \dots, N - 1\} \right\} \tag{2}$$

where *k* = 0, 1, 2, ... , *L*. In other words, *Sk* is the set of indices *i* for which *g*(*i*) = *k*. A new (*L* + 1)-point sequence { *ak*(*n*) } is then defined from the subsets *Sk* [28]:

$$a\_k(n) = \begin{cases} \sum\_{i \in S\_k} f(n+i) & \text{where } S\_k \neq \Phi \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where *k* = 0, 1, 2, ... , *L*, and "Φ" denotes an empty set.

Here *ak*(*n*) is the sum of those elements of the sequence { *f*(*n* + *i*) } whose index *i* satisfies *g*(*i*) = *k*. Computing { *ak*(*n*) } is therefore a statistical procedure that collects, for each value *k*, the total of the input samples that are weighted by *k* in the computation of *c*(*n*). The relationship between { *f*(*n* + *i*) } and { *ak*(*n*) } can thus be described as:

$$\sum\_{i=0}^{N-1} f(n+i) = \sum\_{k=0}^{L} a\_k(n), \tag{4a}$$

$$\sum\_{i=0}^{N-1} f(n+i)g(i) = \sum\_{k=0}^{L} a\_k(n)k = \sum\_{k=1}^{L} a\_k(n)k \tag{4b}$$

The sum $\sum\_{k=0}^{L} a\_k(n)$ in Equation (4a) is a zero-order moment of { *ak*(*n*) }, and the sum $\sum\_{k=1}^{L} a\_k(n)k$ in Equation (4b) is a first-order moment of { *ak*(*n*) }. As a result, Equation (1) can be transformed into:

$$c(n) = \sum\_{k=1}^{L} a\_k(n)k \tag{5}$$

From Equation (5), we obtain a new calculation formula for cross-correlation based on a first-order moment.
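The identity between Equations (1) and (5) can be checked numerically. The following is a minimal sketch, not the paper's implementation: the function names and the sample sequences are illustrative assumptions, and the moment is still evaluated with ordinary multiplications here, since only the regrouping of terms is being verified.

```python
# Hypothetical helper names and sample data, chosen only for illustration.

def bucket_sums(f, g, n, L):
    """Return [a_0(n), ..., a_L(n)] as defined in Equation (3)."""
    a = [0] * (L + 1)
    for i, k in enumerate(g):
        a[k] += f[n + i]          # index i belongs to S_k when g(i) = k
    return a

def cross_correlation_direct(f, g, n):
    """Equation (1): inner-product of the window f(n..n+N-1) with g."""
    return sum(f[n + i] * g[i] for i in range(len(g)))

def cross_correlation_moment(f, g, n, L):
    """Equation (5): first-order moment of the bucket sums a_k(n)."""
    a = bucket_sums(f, g, n, L)
    return sum(k * a[k] for k in range(1, L + 1))

f = [3, 1, 4, 1, 5, 9, 2, 6]      # arbitrary input sequence (assumed)
g = [2, 0, 1, 2]                  # fixed kernel with values in {0, ..., L}
L = 2

for n in range(len(f) - len(g) + 1):
    assert cross_correlation_direct(f, g, n) == cross_correlation_moment(f, g, n, L)
```

For the window starting at *n* = 0, the bucket sums are *a*0 = 1, *a*1 = 4 and *a*2 = 4, so both formulas give 1·4 + 2·4 = 12.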

#### *2.2. Normalized Cross-Correlation*

Normalized cross-correlation is more complex than cross-correlation, because it involves an inner-product between two difference sequences, obtained by subtracting from { *f*(*i*) } and { *g*(*i*) } their respective mean values. It is defined as

$$\rho(n) = \frac{\sum\_{i=0}^{N-1} \left[ f(n+i) - \overline{f}(n) \right] \left[ g(i) - \overline{g} \right]}{\left\{ \sum\_{i=0}^{N-1} \left[ f(n+i) - \overline{f}(n) \right]^2 \sum\_{i=0}^{N-1} \left[ g(i) - \overline{g} \right]^2 \right\}^{\frac{1}{2}}},\tag{6}$$
  $\text{where } \overline{f}(n) = \frac{1}{N} \sum\_{i=0}^{N-1} f(n+i) \text{ and } \overline{g} = \frac{1}{N} \sum\_{i=0}^{N-1} g(i).$ 

Expanding the products and substituting the definitions of the two mean values, Equation (6) can be rewritten as

$$\begin{array}{rcl}\rho(n) &=& \frac{\sum\_{i=0}^{N-1} f(n+i)g(i) - \overline{g}\sum\_{i=0}^{N-1} f(n+i) - \overline{f}(n)\sum\_{i=0}^{N-1} g(i) + N\overline{f}(n)\overline{g}}{\left\{\sum\_{i=0}^{N-1} \left[f(n+i)^2 - 2f(n+i)\overline{f}(n) + \overline{f}(n)^2\right] \sum\_{i=0}^{N-1} \left[g(i) - \overline{g}\right]^2\right\}^{\frac{1}{2}}}\\ &=& \frac{\sum\_{i=0}^{N-1} f(n+i)g(i) - \frac{1}{N}\sum\_{i=0}^{N-1} f(n+i)\sum\_{i=0}^{N-1} g(i)}{\left\{\left[\sum\_{i=0}^{N-1} f(n+i)^2 - \frac{1}{N}\left(\sum\_{i=0}^{N-1} f(n+i)\right)^2\right] \sum\_{i=0}^{N-1} \left[g(i) - \overline{g}\right]^2\right\}^{\frac{1}{2}}} \end{array} \tag{7}$$

If we set

$$b(n) = \sum\_{i=0}^{N-1} \left[ f(n+i) \right]^2 \tag{8}$$

and substitute Equations (4a), (4b) and (8) into Equation (7), the NCC expressed by Equation (6) can be converted to

$$\rho(n) = \frac{\sum\_{k=1}^{L} a\_k(n)k - \overline{g} \sum\_{k=0}^{L} a\_k(n)}{\left\{ \left[ b(n) - \frac{1}{N} \left( \sum\_{k=0}^{L} a\_k(n) \right)^2 \right] \sum\_{i=0}^{N-1} \left[ g(i) - \overline{g} \right]^2 \right\}^{\frac{1}{2}}} . \tag{9}$$

From Equation (9), we obtain a new calculation formula for NCC based on a first-order moment $\sum\_{k=1}^{L} a\_k(n)k$ and a zero-order moment $\sum\_{k=0}^{L} a\_k(n)$. The computational complexity of this NCC formula depends heavily on the complexity of $\sum\_{k=1}^{L} a\_k(n)k$ and *b*(*n*). Therefore, for a fast implementation of Equation (9), we introduce a fast algorithm and structure for $\sum\_{k=1}^{L} a\_k(n)k$ in Section 3, and an optimization method for *b*(*n*) in Section 4.1.
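To illustrate how the moment-based NCC formula relates to the definition in Equation (6), the sketch below evaluates both on the same data and compares the results. It is an illustration under stated assumptions rather than the proposed fast algorithm: the function names and sample sequences are hypothetical, floating-point arithmetic is used, and the moments are computed with plain multiplications.

```python
import math

def ncc_direct(f, g, n):
    """NCC from the definition (Equation (6)), using explicit mean values."""
    N = len(g)
    window = f[n:n + N]
    f_mean = sum(window) / N
    g_mean = sum(g) / N
    num = sum((fi - f_mean) * (gi - g_mean) for fi, gi in zip(window, g))
    den = math.sqrt(sum((fi - f_mean) ** 2 for fi in window)
                    * sum((gi - g_mean) ** 2 for gi in g))
    return num / den

def ncc_moment(f, g, n, L):
    """NCC from the zero- and first-order moments of a_k(n) and from b(n)."""
    N = len(g)
    a = [0] * (L + 1)
    for i, k in enumerate(g):
        a[k] += f[n + i]                            # bucket sums, Equation (3)
    m0 = sum(a)                                     # zero-order moment, k = 0..L
    m1 = sum(k * a[k] for k in range(1, L + 1))     # first-order moment
    b = sum(f[n + i] ** 2 for i in range(N))        # Equation (8)
    g_mean = sum(g) / N
    g_energy = sum((gi - g_mean) ** 2 for gi in g)  # depends only on the fixed kernel
    return (m1 - g_mean * m0) / math.sqrt((b - m0 * m0 / N) * g_energy)

f = [3, 1, 4, 1, 5, 9, 2, 6]      # illustrative input sequence (assumed)
g = [2, 0, 1, 2]                  # illustrative kernel with values in {0, ..., L}
L = 2

for n in range(len(f) - len(g) + 1):
    assert math.isclose(ncc_direct(f, g, n), ncc_moment(f, g, n, L))
```

Because the kernel is fixed, `g_mean` and `g_energy` could be precomputed once; only the two moments and *b*(*n*) change with *n*. The sketch assumes that no window of `f` is constant, since the denominator would then be zero.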

#### **3. The Fast Algorithm and Systolic Array for First-Order Moment**

Liu et al. presented an algorithm and a systolic array for the first-order moment in [25–27]. Their method is suitable for rapidly computing the first-order and zero-order moments in Equations (4a) and (4b). In this section, we introduce this algorithm and systolic array with the aim of implementing fast NCC using Equation (9). In addition, because the algorithm and array require many additions as a result of removing all multiplications, we also improve them to lower their addition complexity.

#### *3.1. The Fast Algorithm for First-Order Moment*

Following [25], Figure 1 shows a simple 1-network that represents a map transforming the two-dimensional vector (1, *x*) into the vector (1, (1 + *x*)). This map is denoted by *F*, that is,

$$F(1, x) = (1, \ (1 + x)).$$

**Figure 1.** The 1-network.

Some characteristic equations obtained from *F* are

$$F(a, \ ax) = (a, \ a(1+x)), \quad F(a+b, \ a+b) = F(a, \ a) + F(b, \ b) \tag{10}$$

Also,

$$F^2(1, x) = F(F(1, x)) = F(1, \ (1 + x)) = (1, \ 2 + x)$$

and by induction

$$F^{L-1}(1, x) = F(\dots F \dots F(1, x)) = (1, \ (L - 1 + x)).$$

Hence, we have

$$F^{L-1}(1,1) = F(\dots F \dots F(1,1)) = (1,L),$$

$$F^{L-1}(a,a) = (a,La). \tag{11}$$

To compute the first-order moment with this 1-network, let

$$\mathbf{a}\_k = (a\_k(n), \ a\_k(n)) \quad (k = 1, \ 2, \ \dots, L),$$

so that Equations (10) and (11) yield

$$\begin{aligned} F(F(\mathbf{a}\_k) + \mathbf{a}\_{k-1}) &= F(F(\mathbf{a}\_k)) + F(\mathbf{a}\_{k-1}) = F^2(\mathbf{a}\_k) + F(\mathbf{a}\_{k-1}) \\ &= (a\_k(n) + a\_{k-1}(n), \ 3a\_k(n) + 2a\_{k-1}(n)). \end{aligned}$$

Generally, the above equation is expanded into

$$\begin{aligned} F(F(\cdots F(F(\mathbf{a}\_L) + \mathbf{a}\_{L-1}) + \cdots) + \mathbf{a}\_2) + \mathbf{a}\_1 &= F^{L-1}(\mathbf{a}\_L) + \dots + F^2(\mathbf{a}\_3) + F(\mathbf{a}\_2) + \mathbf{a}\_1 \\ &= \left(\sum\_{k=1}^L a\_k(n), \ \sum\_{k=1}^L a\_k(n)k\right) \end{aligned} \tag{12}$$
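As a small consistency check of Equation (12), the case *L* = 3 can be expanded by hand. The expansion assumes only the definitions above, read as *F*(*u*, *w*) = (*u*, *u* + *w*) for a general vector, which is how Equation (10) acts on (*a*, *ax*):

$$\begin{aligned} F(\mathbf{a}\_3) &= (a\_3(n), \ 2a\_3(n)) \\ F(\mathbf{a}\_3) + \mathbf{a}\_2 &= (a\_3(n) + a\_2(n), \ 2a\_3(n) + a\_2(n)) \\ F(F(\mathbf{a}\_3) + \mathbf{a}\_2) &= (a\_3(n) + a\_2(n), \ 3a\_3(n) + 2a\_2(n)) \\ F(F(\mathbf{a}\_3) + \mathbf{a}\_2) + \mathbf{a}\_1 &= (a\_3(n) + a\_2(n) + a\_1(n), \ 3a\_3(n) + 2a\_2(n) + a\_1(n)) \end{aligned}$$

The first component is the sum of the *ak*(*n*) for *k* = 1, 2, 3 and the second is their first-order moment, using additions only.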

From Equation (12), the sum $\sum\_{k=1}^{L} a\_k(n)$ needed for Equation (4a) and the first-order moment $\sum\_{k=1}^{L} a\_k(n)k$ in Equation (4b) can both be obtained from an iterative implementation of the map *F*. This computational flow applies the map *F* recursively (*L* − 1) times and requires 3*L* additions and no multiplications [26]. Therefore, the fast algorithm for the first-order moment given by Equation (12) can be described in Algorithm 1 as a subroutine Moment [29]. Its computational structure is shown in Figure 2, which is an iterative structure of a 1-network with six adders and three latches. The total number of additions needed to compute the *N*-point first-order moments $\sum\_{k=1}^{L} a\_k(n)k$ (*n* = 0, 1, ... , *N* − 1) is 3*NL*.
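The recursion in Equation (12) can be sketched in a few lines. The code below is an illustrative reading of that equation, not a reproduction of Algorithm 1 or of the structure in Figure 2: the function names are hypothetical, and the map is assumed to act on a general vector as *F*(*u*, *w*) = (*u*, *u* + *w*), consistent with Equation (10).

```python
def apply_F(u, w):
    """One pass through the 1-network: (u, w) -> (u, u + w). One addition."""
    return u, u + w

def moments_by_additions(a):
    """Given a = [a_1, ..., a_L] (L >= 1), return the pair
    (sum of a_k, sum of k * a_k) using additions only, following Eq. (12)."""
    zero, first = a[-1], a[-1]            # start from the vector a_L = (a_L, a_L)
    for a_k in reversed(a[:-1]):          # k = L-1, L-2, ..., 1
        zero, first = apply_F(zero, first)            # apply the map F
        zero, first = zero + a_k, first + a_k         # add the vector (a_k, a_k)
    return zero, first

# Illustrative bucket sums a_1..a_4 (assumed values, not from the paper).
a = [5, 0, 2, 7]
zero_order, first_order = moments_by_additions(a)
assert zero_order == sum(a)
assert first_order == sum(k * a_k for k, a_k in enumerate(a, start=1))
```

Each loop iteration in this sketch costs three additions (one inside `apply_F`, two for the vector sum), which agrees in order with the 3*L* count quoted above. For the sample values the result is (14, 39).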
