*Article* **Performance of Two Approaches of Embedded Recommender Systems**

#### **Francisco Pajuelo-Holguera <sup>1</sup>, Juan A. Gómez-Pulido <sup>1,</sup>\* and Fernando Ortega <sup>2</sup>**


Received: 22 February 2020; Accepted: 21 March 2020; Published: 25 March 2020

**Abstract:** Nowadays, highly portable, low-energy computing environments require applications able to satisfy computing time and energy constraints. Furthermore, collaborative filtering based recommender systems are intelligent systems that use large databases and perform extensive matrix arithmetic. In this research, we present an optimized algorithm and a parallel hardware implementation as a good approach for running embedded collaborative filtering applications. To this end, we considered high-level synthesis programming for reconfigurable hardware technology. The design was tested in environments with usual parameters and real-world datasets, and compared to usual microprocessors running similar implementations. The performance results of the different implementations were analyzed in terms of computing time and energy consumption. The main conclusion is that the optimized algorithm is competitive for embedded applications when considering large datasets and parallel implementations based on reconfigurable hardware.

**Keywords:** embedded systems; collaborative filtering; recommender systems; parallelism; reconfigurable hardware; high-level synthesis

#### **1. Introduction**

Nowadays, in the framework of the information society, a large amount of information is generated from multiple, heterogeneous data sources, and the interactions of the users who produce or consume this information add to it. Representative examples can be found in areas such as e-commerce (users who buy and rate products) and the entertainment industry (users who rate series and movies). This information is usually stored in large databases that grow and update permanently and dynamically, constituting a source of knowledge about user behavior from which predictions and recommendations can be made. This is where recommender systems emerge.

*Recommender Systems* (RS) [1] are algorithmic techniques that allow users to obtain recommendations and predictions after intelligent processing of the data in large databases. RS give personalized recommendations to users according to their behavior when requesting and handling information [2,3]. In this sense, RS are also known as filters because they block the data not connected to the users' behavior.

Besides the analysis and recommendation of information, an important application of RS is the prediction of users' behavior. For example, in the *Predicting Student Performance* (PSP) problem [4], the score of an evaluation task in the academic environment can be predicted for a particular student when the RS treats it as a ranking prediction problem. Nevertheless, the most popular implementation of RS is *Collaborative Filtering* (CF) [5,6], which assumes that users with similar preferences in the past will have similar preferences in the future [7]. For example, if two users have rated the same movies positively, new movies that either of them rates positively might be liked by the other.

A matrix defines the relationship between users and items in CF. This matrix stores the (explicit or implicit) ratings of the users for the items, and has a high level of sparsity, because users only rate a small number of the available items. Popular online applications, such as e-commerce websites or movie databases, generate rating matrices composed of billions of ratings, where hundreds of thousands of users have rated hundreds of thousands of items.

The gaps of the sparse rating matrix can be filled [8] by means of the *Matrix Factorization* (MF) technique [9]. MF generates a scalable model for prediction purposes [10] composed of two matrices. A prediction is a combination of factors, obtained by multiplying the row corresponding to a user in the user-latent matrix by the column corresponding to an item in the item-latent matrix. In addition, MF assumes that users' ratings are conditioned by *K* latent factors describing the items of the system. MF algorithms try to find these hidden factors through the rating matrix.
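As a minimal illustration of this prediction step, the sketch below combines a hypothetical user-factor vector and item-factor vector (the values are invented for the example, with *K* = 3 latent factors) through a dot product:

```python
import numpy as np

# Hypothetical latent factors, for illustration only (K = 3).
p_u = np.array([0.9, 0.2, 0.5])  # row of the user-latent matrix for user u
q_i = np.array([0.8, 0.1, 0.4])  # row of the item-latent matrix for item i

# The predicted rating is the dot product of the two factor vectors.
r_hat = float(np.dot(p_u, q_i))  # 0.9*0.8 + 0.2*0.1 + 0.5*0.4 = 0.94
```

In a real system, these vectors are learned from the known ratings rather than chosen by hand.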

We would like to highlight the interest in implementing a CF algorithm in hardware for running embedded applications for several reasons. Firstly, we must bear in mind that CF involves large amounts of data because of the number of users and items in the databases. Prediction and data handling needs involve high computational effort, especially if real-time constraints are required. Therefore, the design of hardware circuits that accelerate some processes of the algorithm is especially interesting. Besides, possible embedded applications of CF require fast algorithms if they are to be performed on small, low-power computing environments. Therefore, we focus the research on implementing embedded applications of CF by considering *Field Programmable Gate Array* (FPGA) devices [11], under the *Reconfigurable Computing* (RC) [12] and System-on-Chip (SoC) [13] concepts.

We propose using FPGA devices for designing accelerated CF algorithms because this technology combines software flexibility with hardware performance by exploiting parallelism. Thus, if an embedded implementation is carefully designed to exploit these advantages, it can provide excellent results, even surpassing the performance delivered by usual microprocessors or *Central Processing Units* (CPUs) in similar experimental conditions [14]. Other design approaches based on different hardware technologies can also be explored. In this sense, *Graphics Processing Units* (GPUs) can be programmed using OpenCL for similar purposes, although their high power consumption could be a constraint when using them for embedded applications.

In summary, our proposal is to design an embedded, low-energy implementation of an efficient CF algorithm in order to run applications on highly portable, lightweight computing environments. Our approach was successfully tested considering several state-of-the-art datasets.

The remainder of this paper is structured as follows. We present some related works in Section 2. In Section 3, we discuss the basis of two approaches, basic and enhanced, of CF algorithms. Next, Section 4 explains the design and implementation of both algorithms, emphasizing the parallelization strategy considered for improving the performance results. Section 5 shows a performance comparison between the two approaches and usual microprocessors, detailing the state-of-the-art datasets considered, the experimental procedure followed, and the timing and power results. Finally, the conclusions of this paper are summarized in Section 6.

#### **2. Related Works**

RS are a good opportunity to provide advanced services to Internet users. Some classic examples of heterogeneous successful applications are PHOAKS [15] (it helps users locate useful information on the *World Wide Web* (WWW) by examining Usenet news messages), Referral Web [16] (it combines social networks and collaborative filtering), Fab [17] (it combines content-based information with collaborative filtering), Siteseer [18] (a conceptual recommender system for CiteSeerX), and many others. However, currently growing concepts in the Internet domain, such as the Internet of Things, autonomous driving, and augmented reality, among many others, are pushing toward new applications of RS. For example, we can find novel and advanced applications of RS in vehicles [19], voice-enabled devices [20], smartphones [21], and multimedia data for robustness [22], diversification [23], and real-time [24] recommendation aims, among many other examples.

In the context of an increasing application of RS, many research efforts are focused on improving their accuracy and reducing their limitations. In this regard, RS have some limitations, especially related to their complexity and the difficulty of understanding them: they represent black boxes that require personalized explanations related to the individuals' mental models [25], which has consequences in many areas, such as computer vision [26].

Computing systems based on low-performance, low-consumption microprocessors may be involved in some of these new fields of application of RS. Thus, there are environments where RS could run on such computing systems, for example smartphones and IoT devices. In fact, the demand for computing resources by RS may have limited their application in these areas and devices. In particular, mobile RS are an interesting area for online applications (social networks, e-commerce, and streaming platforms) in situations where the data volume can produce overload. These situations may occur more and more frequently, given the rapid increase in the use of mobile devices in a context of continuous growth and improvement of network infrastructure. The links between web and mobile RS are identified in [27] to provide guidelines for embedded RS in the mobile domain. We find some examples of mobile RS recommending different types of media to their users using a context-aware approach [28], or recommending photos by combining current contextual data with information found in the photos [29]. Other examples of mobile RS can be found in mobile news recommendation based on the current context and format [30], and in recommending music depending on the daily activities of a person [31] or on the passengers of a car [32].

For all the above reasons, the tools and technologies for designing and implementing embedded computing systems based on low-consumption devices can lead to the application of RS for many purposes in novel fields. Our proposal considers reconfigurable technology based on FPGA devices for implementing fast, low-power collaborative filtering algorithms for embedded applications. This proposal is in line with other works where *Machine Learning* (ML) functions and features have been implemented using similar technology for different purposes, mainly for acceleration tasks. Thus, we can find FPGA technology applied to *Convolutional Neural Networks* (CNN) [33], *Deep Learning* (DL) [34], K-Means clustering [35–37], and kernel density estimation [38], among others.

It is particularly interesting to explore the application of FPGAs to CF, especially for acceleration purposes. In this regard, there are some attempts to accelerate tasks involved in cloud services and large databases, such as Amazon [39]. We can find some examples of FPGA implementations of different aspects of RS algorithms, rather than the whole system itself. For example, a *Stochastic Gradient Descent* (SGD) algorithm [40] used for training some RS models is implemented on FPGA using single-precision floating-point arithmetic [41]. In this sense, our proposal takes a step forward, as we undertake the complete implementation of two CF algorithms capable of handling real datasets.

#### **3. Recommender Systems: Two Approaches**

In this section, we present two approaches of CF algorithms, detailing their mathematical descriptions and how they work.

#### *3.1. Basic Algorithm*

In the context of machine learning, the MF technique represents a well-known family of algorithms that split a matrix $X \in \mathbb{R}^{n \times m}$ into two matrices $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{k \times m}$, in such a way that $X \approx U \cdot V$ [42]. Note that the rank of the matrices $U$ and $V$ is much smaller than the rank of $X$, since $k \ll n$ and $k \ll m$. Therefore, the factorized matrices $U$ and $V$ contain a compact representation of the original matrix $X$.
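This low-rank split can be sketched numerically. The example below (with illustrative sizes, not those of a real dataset) builds a rank-$k$ matrix $X$ and recovers one valid factorization $X \approx U \cdot V$ via a truncated SVD, confirming the compact representation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 6, 5, 2  # illustrative sizes with k << n, m

# Build a rank-k matrix X, so an exact k-factor split exists.
U_true = rng.standard_normal((n, k))
V_true = rng.standard_normal((k, m))
X = U_true @ V_true

# A truncated SVD yields one valid factorization X ~= U . V.
u, s, vt = np.linalg.svd(X, full_matrices=False)
U = u[:, :k] * s[:k]   # n x k
V = vt[:k, :]          # k x m

err = np.max(np.abs(X - U @ V))  # near zero for a rank-k matrix
```

CF cannot use SVD directly because most entries of the rating matrix are unknown, which is why iterative learning algorithms such as the one described below are used instead.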

Applied to CF, MF-based RS factorize the sparse rating matrix $R \in \mathbb{R}^{n \times m}$ that contains the set of known ratings of $n$ users to $m$ items [43]. The fundamental assumption of these kinds of algorithms is that the ratings of the users to the items are conditioned by a subset of latent factors intrinsic to the users and items. For example, in a movie RS, it is assumed that the rating a user gives to a movie is conditioned by the genre of that movie. As a consequence of the factorization process, two new matrices are generated: $P \in \mathbb{R}^{n \times k}$, which represents the $k$ latent factors of the $n$ users; and $Q \in \mathbb{R}^{m \times k}$, which represents the $k$ latent factors of the $m$ items. Once the factorization is performed, the rating prediction $\hat{r}_{ui}$ of a user $u$ to an item $i$ can be computed as the dot product of the row vector $\vec{p}_u$ of the matrix $P$ that contains the latent factors of the user $u$ and the row vector $\vec{q}_i$ of the matrix $Q$ that contains the latent factors of the item $i$:

$$
\hat{r}_{ui} = \vec{p}_u \cdot \vec{q}_i^{\,T}. \tag{1}
$$

Hence, the learning process consists of finding the optimal parameters for the matrices *P* and *Q* such that

$$R \approx P \cdot Q^{T}. \tag{2}$$


This process is usually posed as an optimization problem in which the quadratic difference between the known ratings $r_{u,i}$ of the matrix $R$ and the predicted ones ($\vec{p}_u \cdot \vec{q}_i^T$) must be minimized:

$$\min_{\vec{p}_u, \vec{q}_i} \sum_{(u,i) \in R} \left(r_{u,i} - \vec{p}_u \cdot \vec{q}_i^T\right)^2. \tag{3}$$
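The objective in Equation (3) can be evaluated directly, summing only over the known entries of the rating matrix. The sketch below uses a tiny invented example (`np.nan` marks unrated entries, and the factor matrices are chosen by hand for illustration):

```python
import numpy as np

# Toy example: R holds known ratings; np.nan marks unrated entries.
R = np.array([[5.0, np.nan],
              [np.nan, 3.0]])
P = np.array([[1.0, 2.0], [0.5, 1.0]])  # user latent factors (K = 2)
Q = np.array([[2.0, 1.0], [1.0, 2.0]])  # item latent factors (K = 2)

# Quadratic loss of Equation (3), summed over known (u, i) pairs only.
loss = 0.0
for u, i in zip(*np.where(~np.isnan(R))):
    loss += (R[u, i] - P[u] @ Q[i]) ** 2
# (5 - 4)^2 + (3 - 2.5)^2 = 1.25
```

Minimizing this quantity over *P* and *Q* is exactly what the learning algorithm of the next paragraphs does by gradient descent.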

The most popular implementation of MF applied to CF is *Probabilistic Matrix Factorization* (PMF) [44]. PMF performs the factorization through a probabilistic model that represents the interaction between users and items in a CF context. Figure 1 contains a graphical representation of this probabilistic model. The figure contains three representational elements: circles that symbolize random variables; arrows between two variables that indicate dependence between those random variables; and rectangles that indicate repetitions of the random variables. The color of the circles indicates whether the random variables are observed (black) or must be learned (white). As we can observe, there exist three random variables: $R_{ui}$, which symbolizes the rating of the user $u$ to the item $i$; $P_u$, which symbolizes the latent factors of each user $u$; and $Q_i$, which symbolizes the latent factors of each item $i$. The arrows from $P_u$ and $Q_i$ to $R_{ui}$ denote that there exists a dependency between the rating of user $u$ to item $i$ and the latent factors of user $u$ and item $i$. PMF assumes a Gaussian distribution for all the random variables. $\sigma_R$, $\sigma_P$ and $\sigma_Q$ denote model hyper-parameters.

**Figure 1.** Graphical representation of PMF model.

Algorithm 1 summarizes PMF. The inputs are the rating matrix *R*, the number of latent factors *K*, and the hyper-parameters to control the learning process *λ* and *γ*. The outputs are the latent factors matrices *P* and *Q* learned from the rating matrix.

**Algorithm 1:** PMF algorithm.

```
input : R, K, λ, γ
output: P, Q

Create a random matrix P with U rows and K columns
Create a random matrix Q with I rows and K columns
repeat
    for each user u do                         // this loop can be parallelized per user
        for each item i rated by user u do
            error = R[u][i] − dotProduct(P[u], Q[i])
            for each factor k do
                P[u][k] += γ · (error · Q[i][k] − λ · P[u][k])
    for each item i do                         // this loop can be parallelized per item
        for each user u that has rated item i do
            error = R[u][i] − dotProduct(P[u], Q[i])
            for each factor k do
                Q[i][k] += γ · (error · P[u][k] − λ · Q[i][k])
until convergence
return P, Q
```
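A minimal software sketch of Algorithm 1 follows. It is a reference implementation for checking the logic, not the hardware design of later sections; the function name, the fixed epoch count standing in for the convergence test, and the toy rating matrix are all assumptions made for the example (`np.nan` marks unrated entries):

```python
import numpy as np

def pmf_train(R, K, lam=0.05, gamma=0.01, epochs=100, seed=0):
    """Sketch of Algorithm 1: learn P (users x K) and Q (items x K)
    by gradient descent over the known entries of R (np.nan = unrated)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.random((n_users, K))
    Q = rng.random((n_items, K))
    known = ~np.isnan(R)
    for _ in range(epochs):  # fixed epochs stand in for "until convergence"
        # Update user factors; iterations are independent per user (parallelizable).
        for u in range(n_users):
            for i in np.where(known[u])[0]:
                error = R[u, i] - P[u] @ Q[i]
                P[u] += gamma * (error * Q[i] - lam * P[u])
        # Update item factors; iterations are independent per item (parallelizable).
        for i in range(n_items):
            for u in np.where(known[:, i])[0]:
                error = R[u, i] - P[u] @ Q[i]
                Q[i] += gamma * (error * P[u] - lam * Q[i])
    return P, Q

# Toy usage: a 3 x 3 rating matrix with three unknown entries.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 4.0]])
P, Q = pmf_train(R, K=2, epochs=300)
predictions = P @ Q.T  # dense matrix of predicted ratings
```

The two outer loops correspond to the two parallelizable loops marked in Algorithm 1, which is what the FPGA design exploits.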
