#### *3.2. BNMF Algorithm*

The Bayesian Non-negative Matrix Factorization (BNMF) model [9] is another factorization model designed for CF-based RS. BNMF has demonstrated its superiority by providing more accurate predictions and recommendations than the PMF model. Like PMF, BNMF factorizes the rating matrix in a probabilistic way.

The main objective of BNMF is to provide an understandable probabilistic meaning for the latent factor space generated as a consequence of the factorization process. To achieve this, the model has been designed so that it better represents the interaction between users and items. Instead of assuming a continuous distribution to represent ratings, such as a Gaussian, a discrete distribution is used. This coincides with the reality of most CF systems, where users must rate items on a pre-set scale (e.g., 1–5 stars).
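As an illustration (not part of the original model specification), a discrete 1–5 star scale can be mapped onto the normalized [0, 1] support that Algorithm 2 below expects. A minimal Python sketch, assuming a linear mapping:

```python
def normalize_rating(r, r_min=1, r_max=5):
    """Map a discrete star rating onto the [0, 1] interval used by BNMF."""
    return (r - r_min) / (r_max - r_min)

# The five possible star values map onto five evenly spaced points in [0, 1].
print([normalize_rating(r) for r in range(1, 6)])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```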

Figure 2 contains a graphical representation of the BNMF model. The model is composed of the following random variables:


The model also contains the following hyper-parameters:


After training, each item factor is recovered from the variational parameters as:

*qik* ← *e+ik* / (*e+ik* + *e−ik*)
To compute predictions with the BNMF model, we must determine the conditional probability distribution of the non-observable random variables given a set of observations (i.e., the known ratings). Applying the variational inference technique [45], we obtain an algorithm to perform this task. Algorithm 2 details the training phase of the BNMF model. For further information about the inference process, see [9].

**Algorithm 2:** BNMF algorithm. The algorithm returns the latent factors for each user and item. Input ratings (*rui*) must be normalized.

```
input  : rui, α, β, K, R
output : puk, qik
temp   : γuk, e−ik, e+ik, λuik, λ0uik
Initialize γuk
Initialize e−ik
Initialize e+ik
repeat
    for each user u do
        for each item i rated by user u do
            for each factor k do
                λ0uik ← exp(Ψ(γuk) + r+ui·Ψ(e+ik) + r−ui·Ψ(e−ik) − R·Ψ(e+ik + e−ik))
            for each factor k do
                λuik ← λ0uik / (λ0ui1 + ··· + λ0uiK)
    for each item i do
        e+ik ← β
        e−ik ← β
    for each user u do
        γuk ← α
        for each item i rated by user u do
            for each factor k do
                γuk ← γuk + λuik
                e+ik ← e+ik + λuik·R·rui
                e−ik ← e−ik + λuik·R·(1 − rui)
until convergence
for each factor k do
    for each user u do
        puk ← γuk / (∑f=1..K γuf)
    for each item i do
        qik ← e+ik / (e+ik + e−ik)
```
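As a complement to the pseudocode, the following is a minimal Python sketch of the training loop. It is our own illustration of the update equations, not the paper's implementation: it assumes *r+ui* = *R*·*rui* and *r−ui* = *R*·(1 − *rui*) (consistent with the accumulation step above) and implements the digamma function Ψ with a standard asymptotic series, since Ψ is not available in Python's `math` module.

```python
import math, random

def digamma(x):
    # Asymptotic series for Ψ(x); the recurrence shifts x above 6 for accuracy.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def bnmf_train(ratings, num_users, num_items, K, R, alpha, beta, iters=50):
    """ratings: {(u, i): r} with r already normalized to [0, 1]."""
    random.seed(0)
    gamma = [[alpha + random.random() for _ in range(K)] for _ in range(num_users)]
    e_pos = [[beta + random.random() for _ in range(K)] for _ in range(num_items)]
    e_neg = [[beta + random.random() for _ in range(K)] for _ in range(num_items)]
    for _ in range(iters):
        # First half: compute the normalized λ from the current parameters.
        lam = {}
        for (u, i), r in ratings.items():
            raw = [math.exp(digamma(gamma[u][k])
                            + R * r * digamma(e_pos[i][k])
                            + R * (1 - r) * digamma(e_neg[i][k])
                            - R * digamma(e_pos[i][k] + e_neg[i][k]))
                   for k in range(K)]
            s = sum(raw)
            lam[(u, i)] = [x / s for x in raw]
        # Second half: reset to the priors, then accumulate over the ratings.
        gamma = [[alpha] * K for _ in range(num_users)]
        e_pos = [[beta] * K for _ in range(num_items)]
        e_neg = [[beta] * K for _ in range(num_items)]
        for (u, i), r in ratings.items():
            for k in range(K):
                gamma[u][k] += lam[(u, i)][k]
                e_pos[i][k] += lam[(u, i)][k] * R * r
                e_neg[i][k] += lam[(u, i)][k] * R * (1 - r)
    # Final latent factors: puk normalized per user, qik from the Beta means.
    p = [[g / sum(row) for g in row] for row in gamma]
    q = [[e_pos[i][k] / (e_pos[i][k] + e_neg[i][k]) for k in range(K)]
         for i in range(num_items)]
    return p, q
```

A fixed iteration count stands in for the convergence test of the pseudocode; in practice one would monitor the change of the variational parameters between iterations.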

**Figure 2.** Graphical representation of BNMF model.

#### **4. Hardware Designs for Embedded Applications**

In this section, we present the hardware implementations of PMF and BNMF. The purpose of both implementations is twofold. On the one hand, the operations of the algorithms are accelerated by using the parallelism that hardware provides; on the other hand, the energy consumption is reduced in comparison with usual microprocessors.

#### *4.1. PMF Design*


PMF was parallelized using *High-Level Synthesis* (HLS) technology [46]. HLS transforms C specifications (C, C++, SystemC, or OpenCL code) into a *Register Transfer Level* (RTL) implementation, which allows us to synthesize the design for any Xilinx FPGA. In this way, HLS facilitates the fast design of efficient circuits by parallelizing code automatically. Specifically, for this work, we used the Vivado HLS tool [47], which is analyzed in depth by O'Loughlin *et al.* [48].

The main parallelization strategy for PMF is described in [49]. As Algorithm 1 shows, after initialization two consecutive loops update the corresponding factor matrices for each user and item, and each of these loops can be parallelized. This pair of loops is then repeated sequentially over several iterations.
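For illustration only (Algorithm 1 is not reproduced here), the following sketch shows two consecutive PMF-style gradient update loops in which the iterations within each loop are mutually independent, which is the property the HLS design exploits. The learning rate, regularization, and data layout are our own assumptions:

```python
def pmf_epoch(ratings, p, q, lr=0.01, reg=0.1):
    """One epoch of PMF-style updates. ratings: {user: {item: rating}}."""
    # Loop 1: every user factor vector can be updated independently (in parallel).
    for u in range(len(p)):
        for i, r in ratings.get(u, {}).items():
            err = r - sum(pu * qi for pu, qi in zip(p[u], q[i]))
            p[u] = [pu + lr * (err * qi - reg * pu) for pu, qi in zip(p[u], q[i])]
    # Re-index the ratings by item for the second loop.
    by_item = {}
    for u, items in ratings.items():
        for i, r in items.items():
            by_item.setdefault(i, {})[u] = r
    # Loop 2: every item factor vector can be updated independently (in parallel).
    for i in range(len(q)):
        for u, r in by_item.get(i, {}).items():
            err = r - sum(pu * qi for pu, qi in zip(p[u], q[i]))
            q[i] = [qi + lr * (err * pu - reg * qi) for pu, qi in zip(p[u], q[i])]
    return p, q
```

The two loops must run one after the other because the item updates read the freshly updated user factors, which mirrors the sequential execution of the parallelized loops described above.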

#### *4.2. BNMF Design*

In this section, we show how we implemented the BNMF algorithm on a reconfigurable hardware platform. We had previously implemented two versions of PMF on the FPGA: a simple design without parallelism, used to check the viability of running a full recommender system on an embedded operating system and to analyze its performance, and a parallelized design intended to accelerate the operations. Having confirmed the viability of the same hardware platform and software tools with PMF, we therefore focused our efforts only on a parallelized design of BNMF. In what follows, we detail how we designed the BNMF algorithm for a high-performance implementation on the FPGA.

The main tools used for the design of the BNMF algorithm on an FPGA are summarized in Table 1. The Zedboard is a low-energy, low-cost prototyping board that mounts a programmable System-on-Chip (SoC) including an ARM processing architecture. Furthermore, it provides the elements and features needed to build a complete computing system based on Linux, Windows, or Android operating systems, among others, and to adapt it to the user's needs.


**Table 1.** Main tools for implementing BNMF in FPGA.

Figure 3 shows the architecture on which BNMF is implemented and executed. This architecture basically consists of three elements communicating over an AXI bus: the external memory, the multiprocessor system, and the programmable logic.


**Figure 3.** Basic architecture for BNMF on the Zynq Zedboard 7000.

As we did for PMF, we installed an embedded Linux OS (the Linaro distribution) on the board in order to run BNMF on the FPGA. This OS is launched from a separate partition on the SD card, so the changes made by the program are written to that partition. The Linaro filesystem is a complete Ubuntu-based Linux distribution with a graphical desktop. The advantage of using Linaro is that we can work with the ZedBoard just as if we were using a commercial processor; thus, the code executed on the ZedBoard and on the CPU is exactly the same.

#### *4.3. Parallelization Strategy*

In this section, we detail how the parallel implementation of the BNMF algorithm was designed. The results obtained with PMF encouraged us to improve performance through a more refined parallel design for BNMF.

The parallel design was developed mainly by programming with HLS. In addition, we manually tuned the design with the optimization directives provided by HLS, which increase fine-grained parallelism without modifying the C code and thus yield a higher-performance circuit. These directives let us control how certain loops and operations are parallelized. The most frequently used directives were those for unrolling loops or functions, which allow arrays to be processed in parallel. Directives for transferring data to the BNMF core were also used.

Figure 4 illustrates the parallelization strategy followed by the design. First, following Algorithm 2, we perform the random initializations of γ, *e*<sup>+</sup>, and *e*<sup>−</sup> in parallel, since they are matrices and highly parallelizable.

Next, four consecutive blocks implement parallel operations for computing different sections of the algorithm. These four blocks are executed sequentially because there are clear data dependencies between them.

The update of λ carries the greatest computational cost, since λ can be viewed as an array of matrices (one value per user, item, and factor). The parallelization consists in updating the elements of this array in parallel. We then also perform the updates of *e*<sup>+</sup> and *e*<sup>−</sup> in parallel. Finally, we compute the user factors *a* and *b* in parallel.
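To make the parallel structure of the λ update concrete, here is a small NumPy sketch (our own illustration, not the HLS code): the normalization over the K factors is independent for each (u, i) pair, so all pairs can be processed at once.

```python
import numpy as np

def normalize_lambda(lam0):
    """lam0: unnormalized λ0_uik values for every observed (u, i) pair,
    shaped (num_pairs, K). Normalizing across the factor axis is an
    independent operation per pair, so all rows can run in parallel."""
    return lam0 / lam0.sum(axis=1, keepdims=True)
```

In the hardware design, this per-element independence is what the unrolling directives exploit.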

**Figure 4.** Strategy for parallelizing BNMF.

#### **5. Performance Comparison**

In this section, we highlight the different results obtained by PMF and BNMF. First, we explain the datasets considered for the experiments. Next, we show the performance results in terms of computing time and energy consumption.

#### *5.1. Datasets*

Both PMF and BNMF were tested on four datasets of different characteristics that are widely used for this purpose: The Movies Dataset (Kaggle), Movielens-100K, Movielens-1M, and Netflix-100M (Table 2). These datasets gather the activity of many users rating movies with scores from 1 to 5, where each user has rated at least 20 movies.

We chose datasets of very different sizes to assess the impact of the matrix computations on the performance of the FPGA implementation. To give a rough idea, the product Users × Items ranges from 6.3M in Kaggle to 8495M in Netflix-100M.



#### *5.2. Experimental Procedure*

Figure 5 shows the phases of the experimental procedure followed in our research. First, we studied in depth the best way to parallelize BNMF, looking for operations that can be parallelized without altering the correctness of the remaining computations. Once the parallelization strategy was determined, we generated the parallel core using Xilinx Vivado HLS. The BNMF design was exported as an IP core, reachable by the processor and memory in the architecture described in Figure 3. Next, this core was loaded as a bitstream into the Linaro OS, and the aforementioned datasets were added to perform the tests. Finally, the BNMF algorithm was executed and the results were validated.

This experimental procedure was repeated once for each of the available datasets.

**Figure 5.** Experimental procedure.

#### *5.3. Timing Results*

In this section, we report the computing times obtained by the hardware implementations of the BNMF and PMF algorithms, and by an up-to-date microprocessor for comparison purposes.

With regard to the FPGA implementation, we measured the elapsed time using HLS, considering the same FPGA device and the required operating frequency. Once the design is synthesized, HLS reports whether the given frequency can be sustained by the FPGA device, as well as the number of clock cycles used by the hardware, from which we calculated the elapsed computing time.
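As a simple illustration of this calculation (the actual cycle counts come from the HLS reports and are not reproduced here):

```python
def elapsed_seconds(clock_cycles, freq_hz):
    """Elapsed FPGA time: HLS cycle count divided by the clock frequency."""
    return clock_cycles / freq_hz

def speedup(t_cpu, t_fpga):
    """Speedup of the FPGA implementation relative to the CPU."""
    return t_cpu / t_fpga

# Illustrative numbers only: 6.67e8 cycles at 667 MHz take exactly 1 second.
print(elapsed_seconds(6.67e8, 667e6))  # 1.0
```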

For the CPU experiments, we used an Intel i7-950 with a clock frequency of 3 GHz. Note that the RS implemented on the FPGA runs at a much lower frequency than the CPU: 667 MHz. The CPU runs code implementing the same operations described in the PMF and BNMF algorithms, with the same parameters and datasets.

Table 3 shows the computing time in seconds of the PMF and BNMF algorithms for the CPU and FPGA implementations on the four datasets. Two interesting conclusions can be drawn.

**Table 3.** Computing time (s) and FPGA speedup of PMF and BNMF algorithms for the CPU and FPGA implementations.


First, comparing the two algorithms, we observe that BNMF takes more computing time than PMF on the CPU but much less on the FPGA. The reason is simply that BNMF admits the greater degree of parallelization in the FPGA implementation. Second, we observe that the larger the dataset, the better the results obtained by the parallel FPGA implementation of BNMF. For both the Kaggle and the Movielens-100K datasets, the time results are very similar. However, the two largest datasets begin to show a greater computing-time difference between FPGA and CPU: for the Movielens-1M dataset, the FPGA achieves a speedup of almost ×5, and this speedup increases to ×8 for the Netflix-100M dataset.

In conclusion, an FPGA implementation is most attractive for the BNMF algorithm and larger datasets. As future work, it would be interesting to experiment with larger datasets from other types of data.

#### *5.4. Power Results*

Energy consumption is another important metric of computing-system performance. The RS algorithms have a certain energy impact on the hardware platforms, and knowing this impact is important because it helps us produce energy-aware designs of embedded RS. Bear in mind that embedded RS may face computing-intensive workloads when performing many predictions over time.

Xilinx Vivado provides the total on-chip power of the FPGA implementations. Table 4 shows the power in watts of the PMF and BNMF algorithms for the CPU and FPGA implementations on the four datasets. We observe that the power reduction in every FPGA implementation is very high (more than 80% on average). A clear advantage of implementing RS on FPGAs is therefore their low energy consumption compared with current CPUs.
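The reported reduction can be expressed as a simple relative difference; a small sketch with purely hypothetical wattages (the real values are those in Table 4):

```python
def power_reduction_pct(p_cpu, p_fpga):
    """Relative on-chip power reduction of the FPGA design versus the CPU."""
    return 100.0 * (p_cpu - p_fpga) / p_cpu

# Hypothetical illustration: 100 W on the CPU vs. 20 W on the FPGA is an 80%
# reduction, in line with the "more than 80% on average" figure reported above.
print(power_reduction_pct(100.0, 20.0))  # 80.0
```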

From the algorithmic point of view, Table 4 shows that BNMF yields a more significant power reduction than PMF. This fact, together with the computing-time reduction for large datasets deduced from Table 3, encourages us to consider BNMF the best algorithmic option for building embedded RS applications.


**Table 4.** Power (W) and FPGA power reduction of the PMF and BNMF algorithms for the CPU and FPGA implementations.
