*3.2. Design Challenges*

**Feedback.** Updating the inverse matrix involves a feedback path: the inverse matrix produced in the fourth stage must be passed back to the first stage as an input operand for the next update. Each stage is described by its own C/C++ function. To accelerate the update substantially, the dataflow optimization has to be applied directly to these functions so that the HLS tool is guided to implement task-level pipelining. Unfortunately, the HLS tool does not apply this optimization if it detects feedback among the functions. As a result, task-level pipelining cannot be achieved with HLS alone.
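The structure of this feedback can be sketched in plain C++. The stage arithmetic below is an assumption made for illustration (a Sherman-Morrison rank-one update of a symmetric inverse), not the operations of Table 1, and the names `stage1` to `stage4`, `update_once`, `run_updates`, and the band count `L = 4` are hypothetical. The point is only that the four functions form a feed-forward chain within one update, while the caller carries the stage-four result back into stage one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t L = 4;  // hypothetical number of bands
using Vec = std::array<float, L>;
using Matrix = std::array<Vec, L>;

// Stage 1 (assumed): u = inv * x.
static void stage1(const Matrix& inv, const Vec& x, Vec& u) {
    for (std::size_t i = 0; i < L; ++i) {
        u[i] = 0.0f;
        for (std::size_t j = 0; j < L; ++j) u[i] += inv[i][j] * x[j];
    }
}

// Stage 2 (assumed): scalar q = 1 / (1 + x^T u).
static void stage2(const Vec& x, const Vec& u, float& q) {
    float dot = 0.0f;
    for (std::size_t i = 0; i < L; ++i) dot += x[i] * u[i];
    q = 1.0f / (1.0f + dot);
}

// Stage 3 (assumed): f = u * u^T, valid when inv is symmetric.
static void stage3(const Vec& u, Matrix& f) {
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t j = 0; j < L; ++j) f[i][j] = u[i] * u[j];
}

// Stage 4 (assumed): inv_next = inv - q * f.
static void stage4(const Matrix& inv, const Matrix& f, float q, Matrix& inv_next) {
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t j = 0; j < L; ++j) inv_next[i][j] = inv[i][j] - q * f[i][j];
}

// One update: a feed-forward chain of four tasks. In an HLS flow, this is
// the function on which a dataflow directive would be requested.
void update_once(const Matrix& inv, const Vec& x, Matrix& inv_next) {
    Vec u;
    float q;
    Matrix f;
    stage1(inv, x, u);
    stage2(x, u, q);
    stage3(u, f);
    stage4(inv, f, q, inv_next);
}

// Successive updates: the stage-4 output is the stage-1 input of the next
// update. This loop-carried dependency is the feedback path.
Matrix run_updates(Matrix inv, const Vec* xs, std::size_t n) {
    assert(xs != nullptr || n == 0);
    for (std::size_t k = 0; k < n; ++k) {
        Matrix inv_next;
        update_once(inv, xs[k], inv_next);
        inv = inv_next;  // feedback from stage 4 to stage 1
    }
    return inv;
}
```

In this sketch, update `k + 1` cannot start its first stage until update `k` has finished its fourth stage, so consecutive updates cannot overlap across the four tasks.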

**Fanout.** Because a large number of bands is used, some registers must drive many loads, such as multipliers, which results in longer path delays and a lower clock frequency. For example, in the fourth stage described in Table 1, once the computation is parallelized, the scalar Q must be multiplied by *L* elements of a column of the matrix **F** simultaneously. The register holding Q therefore has a high fanout, driving as many as *L* downstream modules. In an RTL design, the high-fanout problem is easily solved by duplicating registers, but this is not easy with HLS.
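The idea of register duplication can be expressed at the C++ level as follows. This is a minimal sketch under assumptions: `scale_column`, `COPIES`, and the band count `L = 8` are hypothetical names and values, and the code only shows the intent, in which each copy of Q feeds a subset of the *L* multipliers. Because all copies hold the same value, a tool may treat them as redundant and merge them, which is one reason the technique is harder to control in HLS than in RTL.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t L = 8;       // hypothetical number of bands
constexpr std::size_t COPIES = 4;  // hypothetical number of duplicated registers
static_assert(L % COPIES == 0, "each copy must drive the same number of loads");

// Multiply one column of F by the scalar q. Instead of one q driving all L
// multipliers, each of the COPIES duplicates drives L / COPIES of them.
std::array<float, L> scale_column(const std::array<float, L>& f_col, float q) {
    std::array<float, COPIES> q_copy;
    for (std::size_t c = 0; c < COPIES; ++c) q_copy[c] = q;  // duplicate q

    std::array<float, L> out{};
    for (std::size_t i = 0; i < L; ++i) {
        const std::size_t group = i / (L / COPIES);  // which copy feeds multiplier i
        assert(group < COPIES);
        out[i] = q_copy[group] * f_col[i];
    }
    return out;
}
```

With `L = 8` and `COPIES = 4`, each copy would drive two multipliers rather than one register driving all eight.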
