#### *5.2. The Module P*

The module **P** implements Equation (9) with 4 multipliers, 1 divider and 1 square-root extractor. In each cycle it receives *a<sub>k</sub>*(*n*)*k* and *b*(*n*), and outputs the corresponding ρ(*n*). Fast methods can be applied to the square-root operation. In addition, the fixed *ḡ* and [*g*(*i*) − *ḡ*]<sup>2</sup> are computed and saved in advance to avoid repeated computation.
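As one illustration of a fast square-root method suited to hardware, the sketch below uses the classic shift-and-subtract (digit-by-digit) integer square root, which needs only shifts, additions and comparisons. The source does not say which method the module **P** uses, so this choice and the function name `isqrt_shift_subtract` are assumptions made for illustration.

```python
def isqrt_shift_subtract(x: int) -> int:
    """Floor of sqrt(x) for a non-negative integer, using shifts, adds and compares only."""
    if x < 0:
        raise ValueError("x must be non-negative")
    result = 0
    # Start from the largest power of four not exceeding x (or 1 when x is 0).
    bit = 1 << (max(x.bit_length() - 1, 0) & ~1)
    while bit:
        if x >= result + bit:
            # The trial digit fits: subtract it and record it in the result.
            x -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result
```

Each loop iteration resolves one bit of the root, so a *w*-bit input needs about *w*/2 iterations.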

#### *5.3. The Systolic Array*

The systolic array in Figure 6 uses several modules to perform Equations (3), (9), (13) and (14), respectively, for NCC. Latches are needed between these modules to keep them synchronized and operating in parallel. The number of latches is indicated by the notation "[ ]". The module **M** from Figure 5 computes the first-order and zero-order moments based on Equation (13). The module **S** implements Equation (14) and generates *b*(*n*) using 1 multiplier, 1 accumulator and 1 subtractor. Finally, the module **P** generates the NCC ρ(*n*). The total number of adders in the systolic array ranges from 2*L* − 2 to 2*L* + *N* − 3, and the number of multipliers is 5.

The initial value of the accumulator in the module **S** is set to *b*(0). In the *n*-th clock cycle, *f*(*n* + *N* − 1) and *f*(*n* − 1) are input into the module **S**, which produces *b*(*n*) in three clock cycles. Then *b*(*n*) is output from the module **S** to the module **P** with a latency of *T<sub>A</sub>* + *L* − 1, so that *b*(*n*), *a<sub>k</sub>*(*n*) and *a<sub>k</sub>*(*n*)*k* arrive at the module **P** at the same time.
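The accumulator-based update in the module **S** can be pictured as a sliding-window recurrence. Equation (14) is not reproduced in this section, so the sketch below assumes, purely for illustration, that *b*(*n*) is the sum of squares of the *N* samples *f*(*n*), ..., *f*(*n* + *N* − 1). Under that assumption, each new value follows from the previous one by adding the contribution of the entering sample *f*(*n* + *N* − 1) and subtracting that of the leaving sample *f*(*n* − 1). The function name `sliding_energy` is hypothetical, and the sketch does not model the hardware operator count or the three-cycle pipeline.

```python
def sliding_energy(f, window):
    """Return [b(0), b(1), ...] with b(n) = sum of f[i]**2 for i in n .. n+window-1.

    ASSUMPTION: this definition of b(n) is illustrative and is not taken
    from Equation (14) of the source.
    """
    if window <= 0 or window > len(f):
        raise ValueError("window must be between 1 and len(f)")
    b = sum(x * x for x in f[:window])   # b(0): initial accumulator value
    out = [b]
    for n in range(1, len(f) - window + 1):
        entering = f[n + window - 1]     # sample that joins the window
        leaving = f[n - 1]               # sample that drops out of the window
        b += entering * entering - leaving * leaving
        out.append(b)
    return out
```

For example, `sliding_energy([1, 2, 3, 4, 5], 3)` returns `[14, 29, 50]`, with each value after the first obtained from its predecessor by a constant amount of work.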

#### **6. Comparisons**

The proposed algorithm and systolic structure are compared with existing methods to verify their effectiveness. The methods chosen for comparison also aim to reduce the number of multiplications.

#### *6.1. Algorithm Comparison*

Because correlation and convolution can share fast algorithms, we compare the proposed algorithm in Section 4 with some convolution algorithms, as well as a fast NCC algorithm, for computing an *N*-point cyclic NCC. The computational complexity of these algorithms is shown in Table 1, where a complex multiplication is counted as three real multiplications and three real additions, an "AND" operation is counted as an addition [31], and a subtraction is also counted as an addition.
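The cost convention for a complex multiplication can be illustrated with a well-known three-multiplication scheme. When one factor is a constant known in advance (as FFT twiddle factors are), its sum and difference terms can be precomputed, leaving three real multiplications and three real additions per product. The source does not state which scheme it assumes; the sketch below and the names `precompute_constant` and `complex_mul_3m3a` are illustrative.

```python
def precompute_constant(c, d):
    """Precompute terms for a fixed factor c + d*i (done once, not counted per product)."""
    return c, d - c, c + d


def complex_mul_3m3a(a, b, pre):
    """Multiply (a + b*i) by the precomputed constant using 3 multiplications and 3 additions."""
    c, d_minus_c, c_plus_d = pre
    k1 = c * (a + b)          # addition 1, multiplication 1
    k2 = a * d_minus_c        # multiplication 2
    k3 = b * c_plus_d         # multiplication 3
    # Additions 2 and 3 (the subtraction is counted as an addition).
    return k1 - k3, k1 + k2   # (real part, imaginary part)
```

As a check, (1 + 2i)(3 + 4i) = −5 + 10i, and `complex_mul_3m3a(1, 2, precompute_constant(3, 4))` returns `(-5, 10)`.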

From Table 1, the multiplication and addition complexities of the FFT-based algorithm are both *O*(*N* log<sub>2</sub>*N*), the DA-based algorithm has the lowest addition complexity, and the fast NCC algorithm requires no multiplications. The proposed algorithm uses *O*(*N*<sup>2</sup>) additions, more than the FFT-based and DA-based algorithms, and *O*(*N*) multiplications, more than the fast NCC algorithm. However, the FFT-based algorithm needs floating-point additions and multiplications, which are more complex than integer operations; the DA-based algorithm requires tedious address decoding and very large memories; and the fast NCC algorithm has the highest addition complexity and is not suitable for high-precision matching [15]. Figure 10 shows how the numbers of multiplications and additions of the four algorithms grow with *N*. The proposed algorithm clearly needs fewer multiplications than both the FFT-based and the DA-based algorithms, and fewer additions than the fast NCC algorithm when *N* > 320.
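For orientation, a direct evaluation of an *N*-point cyclic NCC costs on the order of *N*<sup>2</sup> multiplications, because each of the *N* lags needs *N* products. The sketch below is such a brute-force reference. It assumes the standard zero-mean definition of NCC (Equation (9) is not reproduced in this section, so the exact normalization is an assumption), and the function name `cyclic_ncc_direct` is hypothetical. It is meant only as a correctness baseline for small inputs, not as any of the algorithms compared above.

```python
import math


def cyclic_ncc_direct(f, g):
    """Brute-force zero-mean cyclic NCC of two equal-length sequences.

    ASSUMPTION: standard zero-mean NCC definition, not taken from Equation (9).
    """
    n_pts = len(f)
    if n_pts == 0 or len(g) != n_pts:
        raise ValueError("f and g must be non-empty and of equal length")
    # Template statistics are fixed, so they are computed once.
    g_mean = sum(g) / n_pts
    g_dev = [x - g_mean for x in g]
    g_energy = sum(d * d for d in g_dev)
    # In the cyclic case every lag covers all of f, so these are constant too.
    f_mean = sum(f) / n_pts
    f_dev = [x - f_mean for x in f]
    f_energy = sum(d * d for d in f_dev)
    denom = math.sqrt(f_energy * g_energy)
    rho = []
    for n in range(n_pts):  # N lags, N products each
        num = sum(f_dev[(n + i) % n_pts] * g_dev[i] for i in range(n_pts))
        rho.append(num / denom if denom > 0 else 0.0)
    return rho
```

If `g` is a cyclic shift of `f`, the returned sequence peaks at the lag equal to that shift, with a value of 1 up to rounding error.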

**Table 1.** Comparison of computational complexity.


**Figure 10.** Numbers of multiplications and additions of the four algorithms: (**a**) Multiplication; (**b**) Addition.

Wireless sensors and communication are an important application field for the proposed algorithm. Therefore, we compare the execution times of the five algorithms from Table 1 on a mobile phone, model "HUAWEI nova 2s (HWI-AL00)", running the operating system "Android 9". Figure 11 shows the execution time of these algorithms for computing a cyclic NCC on the phone with *N* from 100 to 6000. The execution time of the FFT-based algorithm grows like a step function, because the FFT length must be extended from *N* to 2<sup>⌈log<sub>2</sub>*N*⌉</sup>. Although the DA-based algorithm takes the least time, it needs too much memory to be worthwhile. From Figure 11, the proposed algorithm's execution time is less than the FFT-based algorithm's when *N* < 5500, and is very close to that of the fast NCC algorithm, without involving noise.
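The step-like growth follows from padding to the next power of two: every *N* between two consecutive powers of two is processed with the same FFT length. A minimal sketch, in which the helper name `fft_length` is hypothetical:

```python
def fft_length(n: int) -> int:
    """Smallest power of two that is >= n, i.e. 2**ceil(log2(n)) for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


# FFT lengths for a few values of N in the measured range (100 to 6000).
for n in (100, 128, 129, 2048, 2049, 4096, 4097, 6000):
    print(n, fft_length(n))
```

With this padding, *N* = 4097 and *N* = 6000 both use a length of 8192, which is consistent with a curve that stays nearly flat between powers of two and jumps just after each one.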

**Figure 11.** Comparison of the execution times of the five algorithms (ms).

In addition, the proposed algorithm has five advantages, as follows:


#### *6.2. Structure Comparison*

We compare the proposed systolic array in Section 5 with some existing hardware structures. Table 2 shows the hardware complexity of these structures for implementing an *N*-point cyclic NCC, where *N* = *PM* (*P* and *M* are two positive integers derived from [33]). Because the adder number and latency of the proposed array are not fixed but vary with the sequence {*g*(*i*)}, we give only their ranges according to Section 5.1. The execution time of the module **P** is assumed to be three clock cycles.


**Table 2.** Comparison of hardware complexity.

From Table 2, an advantage of the proposed systolic structure is that it needs no ROMs, whereas the other two structures use O(2*N*) ROMs, which are hardware-expensive when *N* > 16. The structure in [22] has the minimum latency, but its throughput is more than 1. The structure in [33] needs O(*P*) adders and O(*P*) latency, both of which increase rapidly with *N*.

The hardware complexity of the proposed structure depends on *L*. Furthermore, for long NCCs or two-dimensional NCCs, where *N* and *P* are larger than *L*, the adder number of the proposed structure is lower than that of the structure in [22], and its latency is lower than that of the structure in [33]. Figure 12 shows how the adder number and latency of the three structures grow with *N*, where the maximum adder number and latency of the proposed structure are used for the comparison. The proposed structure clearly has the fewest adders when *N* > 1800, and lower latency than the structure in [33] when *N* > 1500. Therefore, although additional O(*L*) latches are required for data storage and transfer, the proposed systolic array could be more efficient in the digital signal and image domain, where the maximum value of *L* is generally less than 256 [34].

**Figure 12.** Adder number and latency of the three structures: (**a**) Adder; (**b**) Latency.
