4.2.1. Internal Architecture

As shown in Figure 2, the inverse matrix updater contains five processing stages for updating the inverse matrices and four memory buffers that independently cache the inverse matrices associated with four successive pixels. The four separate memories are allocated to resolve the data dependency described in Section 2.2.2. Figure 2 also shows the specific calculations of each stage, the data flow, and the access mode of the inverse matrices. In addition, we arrange five blocks (Block A in Figure 3a, Block B in Figure 4a, Block C in Figure 5a, Block D in Figure 6a, and Block E in Figure 7a) to realize the last four processing stages in Figure 2, and we provide the corresponding C/C++ code written in HLS for these blocks on the right side of each figure. The #pragma directives used in this code are explained in Table A1 of Appendix A.
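The role of the four buffers can be illustrated with a minimal C++ sketch. It assumes that pixel *v* is mapped to buffer *v* mod 4, which is one simple way to guarantee that four successive pixels never touch the same memory; the actual mapping is defined by Figure 2 and Section 2.2.2. The names `InverseMatrixCache`, `buffer_index`, and `kDim`, the matrix dimension, and the `float` element type are placeholders introduced for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumBuffers = 4;  // four buffers, as stated in the text
constexpr std::size_t kDim = 3;         // hypothetical matrix dimension

// Placeholder element type; the design itself uses fixed-point data.
using Matrix = std::array<std::array<float, kDim>, kDim>;

// Assumed mapping: pixel v uses buffer (v mod 4), so any four
// successive pixels are assigned four different buffers.
constexpr std::size_t buffer_index(std::size_t v) { return v % kNumBuffers; }

struct InverseMatrixCache {
    std::array<Matrix, kNumBuffers> buffers{};  // zero-initialized

    // Access the inverse matrix cached for pixel v.
    Matrix& for_pixel(std::size_t v) { return buffers[buffer_index(v)]; }

    // Write element (l, m) of the inverse matrix cached for pixel v.
    void store(std::size_t v, std::size_t l, std::size_t m, float value) {
        assert(l < kDim && m < kDim);
        for_pixel(v)[l][m] = value;
    }
};
```

Under this assumed mapping, a write for pixel *v* and a read for pixel *v* + 1, *v* + 2, or *v* + 3 always address different memories, so consecutive pixels do not compete for the same buffer.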

**Figure 2.** Block diagram of inverse matrix updater.

**Figure 3.** (**a**) Hardware structure and (**b**) C/C++ code in HLS of Block A (*v* represents the pixel number, *l* represents the row number of the matrix **S**<sup>−1</sup>, and *m* represents the column number of the matrix **S**<sup>−1</sup>).

**Figure 4.** (**a**) Hardware structure and (**b**) C/C++ code in HLS of Block B.

**Figure 6.** (**a**) Hardware structure and (**b**) C/C++ code in HLS of Block D.

**Figure 7.** (**a**) Hardware structure and (**b**) C/C++ code in HLS of Block E.

The complete process of updating the inverse matrices with these blocks can be summarized in the five stages depicted in Figure 2.


It is also worth noting that the following design optimization strategies play an important role in improving the performance of the FPGA implementation.


All the other intermediate data are represented as a 32-bit signed fixed-point type (7-bit integer part, 24-bit fractional part). T and Q have a significant impact on the detection accuracy; therefore, they are assigned a higher data precision of up to 38 bits. The elements of the matrix F are obtained by accumulation operations, so more bits should be assigned to the integer part to avoid data overflow. Although T and Q have a larger bit-width of up to 38 bits, this hardly increases the logic resource consumption compared with the 32-bit signed fixed-point type, because only a single accumulation adder is allocated to compute T, while a single adder and a single divider are used to calculate Q. These data types can be easily defined and modified with the HLS *ap\_fixed* type.
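The overflow argument can be checked with a small sketch in plain C++ that emulates the fixed-point formats with integer arithmetic rather than the HLS library. The 32-bit layout (sign bit, 7 integer bits, 24 fractional bits) follows the text and would presumably correspond to `ap_fixed<32, 8>`, since the integer width of `ap_fixed` includes the sign bit. The split of the 38-bit type is not given here, so the sketch assumes that the 24 fractional bits are kept and the extra bits go to the integer part. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 24;  // fractional bits of the 32-bit type (from the text)

using fix32_t = std::int32_t;  // raw storage of the 32-bit fixed-point type
using acc_t = std::int64_t;    // stand-in for the wider (up to 38-bit) type

// Convert a real value to the 32-bit format, rounding to nearest.
// The caller must keep x inside the representable range [-128, 128).
inline fix32_t to_fix32(double x) {
    return static_cast<fix32_t>(std::llround(x * (1 << kFracBits)));
}

// Convert a raw fixed-point value (24 fractional bits) back to a real value.
inline double to_double(acc_t raw) {
    return static_cast<double>(raw) / (1 << kFracBits);
}

// True if a raw value fits in a signed fixed-point word of `width` bits.
inline bool fits_in(acc_t raw, int width) {
    const acc_t limit = acc_t{1} << (width - 1);
    return raw >= -limit && raw < limit;
}

// Sum n terms in the wide type, as a single accumulation adder would.
inline acc_t accumulate(const fix32_t* terms, int n) {
    assert(terms != nullptr && n >= 0);
    acc_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += terms[i];
    }
    return sum;
}
```

For example, accumulating four terms of 100.0 yields 400.0, which exceeds the range of the 32-bit type (below 128) but fits comfortably in a 38-bit word under the assumed layout. Because only the accumulator is widened, the extra cost is confined to that single adder.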
