**Algorithm 1 Moment** (*aL*(*n*), *aL*−1(*n*), ... , *a*0(*n*))

```
Define the array a with two elements
Initialize a ← ( aL(n), aL(n) )
for each k ∈ [2, L] do // Equation (12)
  a[1] ← a[1] + a[0] // 1-network F(a)
  a[1] ← a[1] + aL-k+1(n)
  a[0] ← a[0] + aL-k+1(n)
end for
a[0] ← a[0] + a0(n)
return a
```
**Figure 2.** The computational structure for first-order moment.
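
Algorithm 1 can be sketched in Python as follows. The function name `moment` and the convention of passing the coefficients as a list `[a_0, ..., a_L]` are illustrative assumptions, not part of the original; the loop body mirrors the three additions of one 1-network.

```python
def moment(coeffs):
    """Addition-only moments of coeffs = [a_0, a_1, ..., a_L] (hypothetical helper).

    Returns (zero_order, first_order) = (sum of a_k, sum of k * a_k).
    """
    L = len(coeffs) - 1
    if L < 1:
        raise ValueError("need at least a_0 and a_1")
    zero = first = coeffs[L]           # a <- (a_L, a_L)
    for k in range(2, L + 1):          # Equation (12): one 1-network per pass
        nxt = coeffs[L - k + 1]
        first = first + zero           # a[1] <- a[1] + a[0]
        first = first + nxt            # a[1] <- a[1] + a_{L-k+1}
        zero = zero + nxt              # a[0] <- a[0] + a_{L-k+1}
    zero = zero + coeffs[0]            # a_0 contributes only to the zero-order moment
    return zero, first


a = [4, 1, 3, 0, 2]                    # a_0 .. a_4, so L = 4
print(moment(a))                       # (10, 15)
```

For this input the direct sums are 4 + 1 + 3 + 0 + 2 = 10 and 1·1 + 2·3 + 3·0 + 4·2 = 15, and the loop performs 3(*L* − 1) + 1 = 3*L* − 2 additions.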

#### *3.2. The Systolic Array for First-Order Moment*

Equation (12) can be implemented by a systolic array that continuously generates the zero-order and first-order moments of { *ak*(*n*) } in parallel [27]. This systolic array, shown in Figure 3, is a serial arrangement of (*L* − 1) 1-networks extended from Figure 2. It uses 3*L* − 2 adders, *L* + 2 latches, and no multipliers. In each clock cycle, one sequence { *ak*(*n*) } is input into the systolic array and one pair of moments is output.

**Figure 3.** The systolic array for first-order moment.

In particular, to keep this parallel structure synchronized, the (*L* − 1) values *ak*(*n*) (*k* = 2, ... , *L*) should be input into their respective 1-networks at staggered times rather than simultaneously. In general, a single *ak*(*n*) (*k* > 0) is input into the (*L* − *k*)-th 1-network with a latency of *n* + 2(*L* − 1 − *k*) clock cycles. Hence, in Figure 3, an extra latch array delays each *ak*(*n*) before it is input into the corresponding 1-network. The number of latches and the resulting latency are shown in the note "[ ]", so that different *ak*(*n*) enter the different 1-networks at regular intervals. As a result, the total execution time of this systolic array to compute the *N*-point first-order moment of { *ak*(*n*) } (*n* = 0, 1, ... , *N* − 1) is

$$2L - 1 + 1 + N - 1 = 2L + N - 1$$

clock cycles.

#### *3.3. The Improvement of the Fast Algorithm and Systolic Array for First-Order Moment*

The algorithm in Section 3.1 requires many additions, which become computationally expensive as *N* grows. To reduce the number of additions, the algorithm is improved by means of an even–odd relationship that divides the first-order moment of the sequence { *ak*(*n*) } into two smaller moments. This even–odd relationship is:

$$\sum_{k=0}^{L} a_k(n) = \sum_{k=1}^{L/2} \left[ a_{2k-1}(n) + a_{2k}(n) \right] + a_0(n),\tag{13a}$$

$$\sum_{k=1}^{L} a_k(n)k = \sum_{k=1}^{L/2} a_{2k-1}(n) \cdot (2k - 1) + \sum_{k=1}^{L/2} a_{2k}(n) \cdot 2k = 2\sum_{k=1}^{L/2} \left[a_{2k-1}(n) + a_{2k}(n)\right]k - \sum_{k=1}^{L/2} a_{2k-1}(n). \tag{13b}$$

According to Equation (13), the fast algorithm described by Figure 2 can be improved to the new structure shown in Figure 4. The improved algorithm first performs *L*/2 additions to obtain the sequence { *a*2*k*−1(*n*) + *a*2*k*(*n*) }, as well as *L*/2 − 1 additions to accumulate *a*2*k*−1(*n*). Each *a*2*k*−1(*n*) + *a*2*k*(*n*) is then input successively into the map *F* for *L*/2 − 1 iterations. Finally, a left shift and 1 subtraction are applied to generate the first-order moment. The improved algorithm requires 5*L*/2 − 1 additions, about *L*/2 fewer than the algorithm in Figure 2, at the cost of a more complex structure. Although the sequence { *a*2*k*−1(*n*) + *a*2*k*(*n*) } could be divided again by the even–odd relationship to reduce the additions further, the structure of the fast algorithm would become too complex to be worthwhile.

**Figure 4.** The improved computational structure for first-order moment.
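
The even–odd decomposition of Equation (13) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the names `moment` and `improved_moment` are hypothetical, the coefficients are passed as a list `[a_0, ..., a_L]`, *L* is even, and the values are integers so that the doubling can be written as a left shift.

```python
def moment(coeffs):
    """Algorithm 1 on coeffs = [a_0, ..., a_L]; returns (sum a_k, sum k * a_k)."""
    zero = first = coeffs[-1]
    for nxt in reversed(coeffs[1:-1]):  # a_{L-1}, ..., a_1
        first = first + zero + nxt
        zero = zero + nxt
    return zero + coeffs[0], first


def improved_moment(coeffs):
    """Even-odd variant (Equation (13)); assumes even L >= 2 and integer data."""
    L = len(coeffs) - 1
    if L < 2 or L % 2:
        raise ValueError("L must be even and at least 2")
    odd_sum = 0                         # accumulates a_{2k-1}
    paired = [coeffs[0]]                # a_0 passes through unchanged
    for k in range(1, L // 2 + 1):      # Equation (13a)
        odd_sum += coeffs[2 * k - 1]
        paired.append(coeffs[2 * k - 1] + coeffs[2 * k])
    zero, half_first = moment(paired)   # half-length moment
    first = (half_first << 1) - odd_sum # Equation (13b): shift, then subtract
    return zero, first


print(improved_moment([4, 1, 3, 0, 2]))  # (10, 15)
```

The half-length call to `moment` runs only *L*/2 − 1 iterations of the map *F*, which is where the saving over the full-length loop comes from.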

Similarly, the systolic array in Figure 3 can be improved to the structure shown in Figure 5. This improved systolic array is a serial arrangement of *L*/2 − 1 1-networks extended from Figure 4. It requires 5*L*/2 − 3 adders and *L*/2 + 3 latches, fewer than the array in Figure 3, even though its structure is more complex. As a result, the total execution time of this systolic array to compute the *N*-point first-order moment of { *ak*(*n*) } (*n* = 0, 1, ... , *N* − 1) is reduced to

$$L + 1 + 1 + N - 1 = N + L + 1$$

clock cycles.

**Figure 5.** The improved systolic array for first-order moment.
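
To make the comparison concrete, the resource and timing figures quoted in the text for the two systolic arrays can be tabulated with a short Python sketch. The function names and the sample values of *L* and *N* are illustrative assumptions; the formulas are taken directly from Sections 3.2 and 3.3.

```python
def basic_array_cost(L, N):
    """Figures quoted in the text for the systolic array of Figure 3."""
    return {"adders": 3 * L - 2, "latches": L + 2, "cycles": 2 * L + N - 1}


def improved_array_cost(L, N):
    """Figures quoted in the text for the improved array of Figure 5 (L even)."""
    if L % 2:
        raise ValueError("L is assumed to be even")
    return {"adders": 5 * L // 2 - 3, "latches": L // 2 + 3, "cycles": N + L + 1}


for L, N in [(8, 64), (256, 64)]:       # sample sizes chosen only for illustration
    print(L, N, basic_array_cost(L, N), improved_array_cost(L, N))
```

For *L* = 8 and *N* = 64, for example, the basic array needs 22 adders, 10 latches and 79 clock cycles, whereas the improved array needs 17 adders, 7 latches and 73 clock cycles.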

#### **4. The Fast Algorithm for Normalized Cross-Correlation**

We apply the improved fast algorithm of Section 3.3 to compute the first-order and zero-order moments in Equation (9). The resulting fast algorithm for NCC removes most of the multiplications. First, Section 4.1 introduces some optimization methods that further reduce the number of additions.

#### *4.1. The Optimization Methods*

As the sequence { *g*(*i*) } is generally a fixed correlation kernel, both the mean *ḡ* = ∑*g*(*i*)/*N* and the terms [*g*(*i*) − *ḡ*]<sup>2</sup> in Equation (9) can be pre-computed and reused, which avoids their repeated computation [30].

Although *b*(*n*) in Equation (8) involves many additions and costly squaring operations, it can also be computed recursively from the previous value *b*(*n* − 1):

$$\begin{aligned} b(n) &= \sum_{i=0}^{N-1} f(n-1+i)^2 + \left[f(n+N-1)^2 - f(n-1)^2\right] \\ &= b(n-1) + \left[f(n+N-1) + f(n-1)\right]\left[f(n+N-1) - f(n-1)\right] \end{aligned} \tag{14}$$

Only the first value *b*(0) needs to be computed directly, using *N* multiplications and *N* − 1 additions, where each square is performed by a multiplication. Each subsequent *b*(*n*) (*n* = 1, 2, ... , *N* − 1) is then obtained from Equation (14) with only 1 multiplication, 2 additions and 1 subtraction.
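
The recursion of Equation (14) can be sketched in Python as follows. The function name `sliding_energy` is a hypothetical choice, and the sketch assumes *b*(*n*) is the sum of *f*(*n* + *i*)² over the *N*-point window starting at *n*; it returns *b*(*n*) for every window that fits inside the input.

```python
def sliding_energy(f, N):
    """Return [b(0), b(1), ...], where b(n) = sum of f[n+i]**2 for i in 0..N-1."""
    if N < 1 or len(f) < N:
        raise ValueError("need 1 <= N <= len(f)")
    b = [sum(x * x for x in f[:N])]     # b(0): N multiplications, N - 1 additions
    for n in range(1, len(f) - N + 1):
        new, old = f[n + N - 1], f[n - 1]
        # Equation (14): 1 multiplication, 2 additions, 1 subtraction
        b.append(b[-1] + (new + old) * (new - old))
    return b


print(sliding_energy([1, 2, 3, 4, 5], 3))  # [14, 29, 50]
```

Each update replaces the *N* squarings of a direct evaluation with a single product of a sum and a difference.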

#### *4.2. The Steps of the Fast Algorithm for NCC*

The proposed fast algorithm for NCC would include five steps:


The computational flow of this algorithm is illustrated in Algorithm 2. It requires *N* + 5*L*/2 + 1 additions, 3 subtractions and 5 multiplications per output value ρ(*n*). Therefore, computing the *N*-point NCC requires *N* − 1 + *N* (*N* + 5*L*/2 + 1) − 2 = *N* (*N* + 5*L*/2 + 2) − 3 additions and only *N* + *N* − 1 + 4*N* = 6*N* − 1 multiplications.

**Algorithm 2 Computing NCC** ( *n*, *f*, *g*, *b*(*n*-1) )

```
for each ak in the sequence { ak }: ak ← 0
for each i ∈ [0, N-1] do // Equation (3)
  k ← g(i)
  ak ← ak + f(n + i)
end for
s ← 0 // running sum of the odd-indexed a2k−1
for each k ∈ [1, L/2] do // Equation (13a)
  s ← s + a2k−1
  ak ← a2k−1 + a2k
end for
a ← Moment ( aL/2, aL/2−1, ... , a2, a1, a0) // Algorithm 1
a[1] ← (a[1] << 1) − s // Equation (13b)
Compute b(n) by b(n-1), f(n + N − 1) and f(n − 1) // Equation (14)
Compute ρ(n) by a[0], a[1] and b(n) // Equation (9)
return ρ(n)
```
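
The moment-related steps of Algorithm 2 can be sketched in Python as below. The sketch stops before the final evaluation of ρ(*n*), because Equation (9) is not reproduced in this section. The function name `correlation_by_moments` is hypothetical, and the code assumes that every *g*(*i*) is an integer in [0, *L*], that *L* is even, and that *f* holds integers so the doubling can be a left shift.

```python
def correlation_by_moments(f, g, n, L):
    """Return (sum_i f[n+i], sum_i f[n+i] * g[i]) without multiplying by g[i].

    Assumes integer g[i] in [0, L], even L >= 2, and integer f.
    """
    a = [0] * (L + 1)
    for i, level in enumerate(g):       # Equation (3): bucket f(n+i) by g(i)
        a[level] += f[n + i]
    s = 0                               # running sum of the odd-indexed a_{2k-1}
    paired = [a[0]]
    for k in range(1, L // 2 + 1):      # Equation (13a)
        s += a[2 * k - 1]
        paired.append(a[2 * k - 1] + a[2 * k])
    zero = first = paired[-1]           # Algorithm 1 on the half-length sequence
    for k in range(len(paired) - 2, 0, -1):
        first += zero
        first += paired[k]
        zero += paired[k]
    zero += paired[0]
    return zero, (first << 1) - s       # Equation (13b)


f = [3, 1, 4, 1, 5, 9, 2, 6]
g = [2, 0, 4, 1]
print(correlation_by_moments(f, g, n=2, L=4))  # (19, 37)
```

For the window *f*(2), ... , *f*(5) = 4, 1, 5, 9 the direct results are 4 + 1 + 5 + 9 = 19 and 4·2 + 1·0 + 5·4 + 9·1 = 37, which the addition-only path reproduces.
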
#### **5. The Systolic Array for Normalized Cross-Correlation**

We apply the improved systolic array in Figure 5 to design a parallel hardware structure for fast NCC. Figure 6 shows this systolic structure, which mainly consists of three parts: the module **A** to compute { *a*2*k*−1(*n*) + *a*2*k*(*n*) }, the module **M** to compute the first-order and zero-order moments of { *ak*(*n*) }, and the module **S** to compute *b*(*n*). In each cycle, the *N* points *f*(*n* + *i*) are input simultaneously into the systolic array and one NCC result ρ(*n*) is output. First, since the direct computation of { *a*2*k*−1(*n*) + *a*2*k*(*n*) } needs many adders, a simplified structure for the module **A** is discussed in Section 5.1.

**Figure 6.** The systolic array for fast normalized cross-correlations (NCCs).
