Article

Aging Residual Factorization Machines: A Multi-Layer Residual Network Based on Aging Mechanisms

School of Mechanical and Information Engineering, Shandong University, Weihai 264209, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(11), 5318; https://doi.org/10.3390/app12115318
Submission received: 2 April 2022 / Revised: 18 May 2022 / Accepted: 19 May 2022 / Published: 24 May 2022

Abstract

With the rapid development of recommendation systems, the models and algorithms at their core have emerged one after another, and researchers have continually tried to optimize them. However, these models are structurally complex: popular deep neural networks often squeeze more out of the data by stacking additional hidden layers, while ignoring exploding and vanishing gradients and even outright network degradation. Researchers also pay far more attention to algorithms and models than to the datasets themselves, so methods for processing data and uncovering connections between data and models remain underexplored. Cold start is another problem that researchers have tried to solve since the birth of recommendation systems; recent studies offer promising ideas, but they still do not focus on the dataset. To fill this gap, this paper takes the long-tail distribution and cold start problems common to recommendation systems as its starting point, combines the residual network from computer vision with deep learning, and proposes an aging mechanism for datasets. We present a multi-layer residual network based on this aging mechanism, called Aging Residual Factorization Machines (ARFM). Parallel experiments with other models are carried out on three datasets of different sizes and categories, and the results show that ARFM achieves performance advantages across different recommendation tasks.

1. Introduction

It can be said that we now live in an era in which recommendation systems influence us everywhere. Whether we are shopping online, studying, working, or relaxing, a recommendation system can help us grasp the key information more efficiently. Recommendation systems have never affected people's lives as much as they do today, and the same holds for the algorithm engineers behind them, whose systems are designed to solve the problem of how users obtain the information they are interested in under conditions of information overload. Information is therefore the natural core of the algorithm and can be divided into item, user, and scene information. In the eyes of engineers, the mathematical meaning of a recommendation system is that, for a user U and a specific scene (context) C, and facing massive item information, one constructs a function f(U, I, C) that predicts the user's preference for a specific candidate item I, and then sorts all candidate items by predicted preference to generate the recommendation list.
Focusing on these basic issues, the development of efficient and cutting-edge recommendation algorithms and models has become a key task. Over the decades of development of recommendation systems, their models have made qualitative leaps, and the collaborative filtering algorithm [1] undoubtedly played a pivotal role in the early stage. User-based [2] and item-based recommendation algorithms [3] generated considerable value in recommendation systems over extended periods. Later, the proposal of matrix factorization [4] paved the way for factorization machines [5] and provided new ideas for subsequent models. Logistic regression [6], built around the classification problem, also became the core algorithm of classification-oriented models. Recommendation systems truly flourished after entering the era of deep learning, and the concept of the model has become increasingly important. The proposal and development of neural networks laid the foundation for deep-learning models. The wide and deep [7] model combined generalization and memory for the first time, providing new ideas for subsequent models. More recently, advances in computer vision and the emergence of CNNs [8] and RNNs [9] have driven a boom in models for new domains.
In this context, the rapid development of recommendation models has made algorithm designers focus on model structure and algorithm design, while little attention has been paid to exploring the datasets. Many studies of Internet data have found that a large amount of data on the Internet follows a power-law [10] distribution, also known as a long-tail distribution. The long-tail distribution behaves differently in different datasets but generally follows the rule of a hot few and a cold majority, which leads to users being recommended the most popular items regardless of whether they actually like them; this is particularly prominent in the cold start problem. The cold start problem asks how to make recommendations for items or users that are newly added to the recommendation system and lack sufficient ratings and interaction history. At present, most recommendation algorithms address cold start by recommending the most popular items [11,12,13], ignoring the characteristics of the items or users themselves, which we believe is undesirable. Therefore, alleviating the recommendation solidification caused by the long-tail distribution and diversifying the items recommended to different users has become the focus of our research.
As mentioned above, we reviewed a large body of literature and found that most research focuses on improving the performance of algorithms or models. Many of the results are impressive, but we do not think they solve the problem at its root. We therefore focus our research on datasets and try to find the relationship between models and datasets. After many experiments and attempts, we put forward the aging mechanism of the dataset in this paper. Applying this mechanism addresses the long-tail distribution and cold start problems at their root and significantly improves the final recommendation results.
In addition, our team pays close attention to multidisciplinary results. A major problem with current recommendation models is the ever-increasing number of hidden layers, which is more obvious in large application scenarios. In theory, additional hidden layers allow data features to be mined more accurately; in practice, once the number of hidden layers grows beyond a certain point, exploding and vanishing gradients appear and model performance regresses. To address this hidden danger, we drew inspiration from related achievements in computer vision, fused the well-known residual network from the convolutional field with the recommender system model, and achieved remarkable results.
In this paper, we propose Aging Residual Factorization Machines (ARFM), which can effectively mine the potential relationship between the dataset and the model and improve the accuracy and validity of the model through multi-domain crossover technology. The contributions of this study are as follows.
(1)
We developed a model focusing on the relationship between datasets and models, which effectively alleviates the recommendation solidification caused by the long-tail distribution of the data and improves the accuracy of the model.
(2)
We make full use of the multi-field crossover technology and integrate the inspiration obtained from the field of computer vision into the ARFM model so that the model can process data more accurately.
(3)
The experimental results show that ARFM is superior to previous similar models in terms of both recommendation and classification accuracy, which proves the rationality and effectiveness of the model.

2. Related Work

Before deep learning was widely developed and applied to recommendation systems, the core algorithm of the recommendation system was always collaborative filtering. The classical application of the collaborative filtering algorithm can be traced back to the mail filtering system of the Xerox Research Center in 1992 [14]. However, it was its deployment by Amazon [15] in 2003 that turned it into a well-known classic model, and all major models have been influenced by it ever since. Later, in the 2006 Netflix algorithm competition [16], the proposal of the matrix factorization algorithm caused a sensation in the industry and had a profound influence on the design of subsequent algorithms and models. In 2010, Rendle proposed the FM model [17], which replaces each single weight coefficient with the inner product of two vectors, i.e., it introduces a hidden vector, to better cope with data sparsity. Specifically, let $x$ be the feature vector, $w$ the weight vector, $n$ the number of features, and $j_1$ and $j_2$ the indices of the two features in a feature-crossing operation; the basic expression of FM is:
$$\mathrm{FM}(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} \left( w_{j_1} \cdot w_{j_2} \right) x_{j_1} x_{j_2}. \tag{1}$$
Using the hidden vectors, the number of weight parameters is reduced from $n^2$ to $nk$ (where $k$ is the dimension of the hidden vector), which significantly reduces the training cost under gradient descent. The FM model is a key model in the entire field of recommendation systems, and its basic concept is applied in this study to increase overall efficiency.
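To make the cost saving concrete, the sketch below (our own illustration, not from the paper) computes the second-order FM term with latent vectors using the standard $O(nk)$ reformulation; the variable names $x$, $V$, and $k$ follow the notation around Formula (1).

```python
import numpy as np

def fm_second_order(x, V):
    """Second-order FM interaction term.

    x: feature vector of shape (n,)
    V: hidden (latent) vectors of shape (n, k), one k-dimensional vector per feature.

    Uses the reformulation
        sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ],
    which costs O(n*k) instead of O(n^2).
    """
    xv = x @ V                   # shape (k,): sum_i v_{i,f} x_i for each factor f
    x2v2 = (x ** 2) @ (V ** 2)   # shape (k,): sum_i v_{i,f}^2 x_i^2
    return 0.5 * np.sum(xv ** 2 - x2v2)

# toy check against the naive pairwise sum
rng = np.random.default_rng(0)
x = rng.random(6)
V = rng.random((6, 4))
naive = sum(V[i] @ V[j] * x[i] * x[j] for i in range(6) for j in range(i + 1, 6))
assert np.isclose(fm_second_order(x, V), naive)
```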
In 2015, AutoRec [18], proposed by the Australian National University, officially opened the deep-learning era of recommendation systems by combining autoencoders and collaborative filtering. The AutoRec model resembles an MLP (multi-layer perceptron): it is a standard three-layer (including the input layer) neural network that combines the ideas of the autoencoder and collaborative filtering. More precisely, AutoRec is a standard autoencoder whose basic principle is to use the co-occurrence matrix from collaborative filtering to self-encode the item vector or user vector; the reconstruction is then used to estimate the user's scores on all items, which can be sorted to produce recommendations. Since then, various models and algorithms have mushroomed, and the Deep Crossing model [19] proposed by Microsoft in 2016 is one of the sources of inspiration for this study. Its biggest advance is to change the traditional approach to feature crossing so that the model is capable not only of second-order crossing but also of deep crossing. Its multiple-residual-unit layer adopts a multilayer residual network, which lets the model capture more information from nonlinear and combined features and gives full play to multi-field crossover technology. Another important work in 2016 was the Wide and Deep model proposed by Google [20], which introduced the concepts of memory and generalization for the first time, breaking with traditional model thinking. The ARFM proposed in this study is, to a certain extent, also an evolution of the Wide and Deep model. Memory ability can be understood as the model's ability to directly learn and use the co-occurrence frequency of items or features in the historical data; generalization ability can be understood as its ability to transfer feature correlations and to discover relationships between the final labels and rare features that are sparse or never appear. The wide part is responsible for the memory of the model, whereas the deep part is responsible for its generalization; combining the two network structures unites the advantages of both sides and became extremely popular. In subsequent improvements of Wide and Deep, the DeepFM [21] model in 2017 focused on the wide part, improving its feature combination ability using FM, whereas the NFM model [22] of the same year improved the deep structure by adding a feature cross-pooling layer, further strengthening deep data processing. Also in 2017, the AFM model [23] introduced an attention mechanism on top of the NFM model, contributing to the multi-field integration of recommendation systems.
On the other hand, the development of computer vision has also promoted the realization of multi-domain crossover technology. Convolutional networks are undoubtedly at the core of computer vision. As early as 1998, the emergence of LeNet [24] marked the rise of CNNs; although LeNet is a simple and small network by today's standards, it was the first to completely define the structure of a CNN, which proved crucial for subsequent development. In 2012, Krizhevsky and Hinton launched AlexNet [25], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with an error rate 10.9 percentage points lower than that of the runner-up. In 2014, Simonyan and Zisserman proposed the Visual Geometry Group (VGG) family of models [26], which served as the backbone network that ranked second in the classification task and first in the localization task of that year's ImageNet challenge. At the time, VGG was a very deep network, reaching a depth of 19 layers; this was a significant breakthrough, because in theory the fitting ability of a neural network should grow with its size. In 2015, CNNs underwent a qualitative change: Kaiming He proposed ResNet [27], which not only solved the problem of neural network degradation but also swept the ILSVRC and Common Objects in Context (COCO) competitions that year, winning first place in the classification, localization, detection, and segmentation tasks. This well-known residual network has had a phenomenal influence and provided a new idea for the design and optimization of almost all subsequent models.
Recommendation systems have continued to develop and improve, and the latest research of the past two years has raised many new problems and research directions. Among them, causal inference is a recent hot spot: the process by which the recommendation system produces its results is treated as causal inference, the concept of intervention is introduced, and previously unobserved confounding factors are made explicit. Around these concepts, causal-inference recommendation systems use two models, an exposure model and an outcome model, which together produce the recommendation list for the user. In the latest research [28,29,30], researchers have respectively studied reinforced anchor knowledge graph generation for news recommendation reasoning, socially aware self-supervised tri-training for recommendation, and model-agnostic counterfactual reasoning for eliminating popularity bias in recommendation systems. Causal inference is a very interesting entry point, but at present the connection between the exposure model and the outcome model is imperfect and the recommendation process is very time-consuming, so there is still a lot of room for progress. We will try to learn from it and propose effective solutions in follow-up research.
Another hotspot in recommendation systems is the development of Embedding technology. Embedding is the basis of all recommendation algorithms; one of its most important properties is its ability to transform feature data from high-dimensional sparse vectors into low-dimensional dense vectors, which is also its main role in this paper. In addition, the Embedding vector itself is an important feature vector with strong expressive ability. Given these core functions, the optimization and development of Embedding technology has long been a focus of recommendation researchers. The latest research proposes new Embedding techniques that work without an Embedding Table [31,32,33,34], and studies elastic embeddings for customized on-device recommenders, preference amplification in recommendation systems, and comprehensive analyses of network embedding methods for recommendation. These new techniques can ease the pressure of encoding ever-growing and increasingly sparse data, but, as mentioned above, the fundamental problem is that dataset scale keeps growing; if researchers cannot find the connection between the dataset's characteristics and the model, new algorithms can only play an easing role and are not a long-term solution.
Integrating computer vision technology, particularly residual networks, suggests that the adaptability of recommendation models can reach a new peak. To cope with the growing scale of datasets and to mine data better with the model, we propose a new aging mechanism, fuse it with a Wide and Deep backbone and an adjusted FM model so that the parts complement one another, and achieve a new level of performance. In addition, inspired by the Deep Crossing model, we apply the principle of the residual network to the proposed model, which further enhances feature crossing and model optimization with obvious results.

3. Materials and Methods

In Part A of this section, we explain the basic structure and algorithms of a series of models related to ours; in Part B, we formally introduce the ARFM model proposed in this paper and give a detailed explanation of its algorithm; in Part C, we summarize the model.
A.
Related models and algorithms
(1) Deep Crossing.
The inputs to the model are a set of individual features. The model has four types of layers: embedding, stacking, residual units, and scoring. The objective function is the log loss, although the Softmax or other functions can also be used:
$$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + \left( 1 - y_i \right) \log \left( 1 - p_i \right) \right], \tag{2}$$
where $N$ is the number of samples, $y_i$ is the label of each sample, and $p_i$ is the output of the single-node scoring layer.
Regarding the role of each layer in the model, the embedding layer converts sparse class features into dense embedding vectors. Stacking is the stitching of different embedding and numerical features to form a new feature vector that contains all features. The residual unit layer is key to the model. The main structure of the residual unit layer is a multilayer perceptron, which is realized by a multilayer residual network. Its structure is illustrated in Figure 1.
In Figure 1, $X$ represents the input vector and $F(X)$ the function that processes the input, namely the ReLU transformation in this article. By cross-combining each dimension of the feature vector through a multilayer residual network, the model can capture more information on nonlinear and combined features, which enhances its expressive ability.
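For readers who prefer code, the following minimal NumPy sketch shows the idea behind one residual unit as described around Figure 1; the two inner layers, their shapes, and the parameter names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, W0, b0, W1, b1):
    """One Deep-Crossing-style residual unit (hypothetical parameter names).

    x              : input vector of dimension d
    W0, b0, W1, b1 : weights/biases of the two inner fully connected layers,
                     shaped so the unit's output has the same dimension d.

    The input x is added back before the final ReLU, so the inner layers only
    need to learn the residual F(x) rather than the full mapping.
    """
    f_x = relu(x @ W0 + b0) @ W1 + b1   # F(x): two-layer transformation
    return relu(f_x + x)                # identity shortcut, then ReLU

# toy usage with random parameters
d, h = 8, 16
rng = np.random.default_rng(1)
x = rng.random(d)
W0, b0 = rng.random((d, h)), np.zeros(h)
W1, b1 = rng.random((h, d)), np.zeros(d)
print(residual_unit(x, W0, b0, W1, b1).shape)   # (8,)
```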
(2) Wide and Deep
According to Google, the model can be divided into four stages: sparse features, dense embeddings, hidden layers, and output units. The sparse features feed the input data into the wide and deep parts, which are followed by the remaining three stages. In the wide part, features are combined through the cross-product transformation, with the following formula:
$$\phi_k(X) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\}, \tag{3}$$
where $X$ is the set of input feature vectors, $d$ is the total number of features, and $c_{ki}$ is a Boolean variable that equals 1 when the $i$-th feature belongs to the $k$-th combination feature and 0 otherwise; $x_i$ denotes the value of the $i$-th feature.
The deep part is the feedforward neural network. For category features, the original input is a feature string and the sparse high-dimensional category features are first converted into low-dimensional dense real vectors, which are typically called embedding vectors. These low-dimensional dense vectors are fed into the hidden layer of the neural network during the forward propagation. Specifically, each hidden layer performs the following calculations:
$$a^{(l+1)} = f\left( W^{(l)} a^{(l)} + b^{(l)} \right), \tag{4}$$
where $l$ is the layer index; $f$ is the activation function, usually ReLU; and $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights of layer $l$, respectively. Finally, the wide and deep parts are joined through the fully connected layer, and the output passes through a logistic unit. The prediction of the logistic regression model is as follows:
$$P(Y = 1 \mid \mathbf{x}) = \sigma\left( w_{wide}^{T} \left[ \mathbf{x}, \phi(\mathbf{x}) \right] + w_{deep}^{T} a^{(l_f)} + b \right), \tag{5}$$
where $Y$ is the binary classification label, $\sigma(\cdot)$ is the sigmoid function, $\phi(\mathbf{x})$ is the cross-product transformation of the original feature vector $\mathbf{x}$, $b$ is the bias term, $l_f$ denotes the final layer, and $w_{wide}$ and $w_{deep}$ are the weight vectors of the wide and deep parts, respectively.
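A small sketch of the two formulas above may help; the helper names, shapes, and the toy example are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def cross_product_transform(x, masks):
    """Wide-part cross-product features phi_k(x) from Formula (3), a minimal sketch.

    x     : binary feature vector of length d (one-hot / multi-hot encoded)
    masks : boolean array of shape (K, d); masks[k, i] plays the role of c_{ki},
            i.e., whether feature i participates in the k-th combination.

    phi_k(x) = prod_i x_i ** c_{ki} is 1 only when every feature in the k-th
    combination is active, so it memorizes specific feature co-occurrences.
    """
    return np.array([np.prod(x[mask]) for mask in masks])

def wide_and_deep_predict(x, phi, a_lf, w_wide, w_deep, b):
    """Formula (5): sigmoid over the joint wide + deep logit (shapes are assumptions)."""
    logit = w_wide @ np.concatenate([x, phi]) + w_deep @ a_lf + b
    return 1.0 / (1.0 + np.exp(-logit))

# toy usage: two features crossed into one combination feature
x = np.array([1.0, 1.0, 0.0])
masks = np.array([[True, True, False]])
print(cross_product_transform(x, masks))   # [1.] -> both features active together
```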
(3) Deep FM
As mentioned above, DeepFM's improvement over Wide and Deep lies in replacing the wide part with FM. Specifically, the output of the FM component is the sum of an addition unit and inner-product units:
$$y_{FM} = \langle w, x \rangle + \sum_{j_1=1}^{d} \sum_{j_2=j_1+1}^{d} \langle V_{j_1}, V_{j_2} \rangle\, x_{j_1} x_{j_2}, \tag{6}$$
where $w \in \mathbb{R}^d$ and $V_i \in \mathbb{R}^k$. The addition unit focuses on first-order features, whereas the inner-product units focus on second-order feature interactions.
(4) AFM
The AFM model introduces an attention mechanism by adding an attention network between the feature crossover layer and final output layer. The role of the attention network is to provide weights for each cross-feature. The pooling process of the AFM after adding the attention score is as follows:
$$f_{Att}\left( f_{PI}(\varepsilon) \right) = \sum_{(i, j) \in R_x} a_{ij} \left( v_i \odot v_j \right) x_i x_j, \tag{7}$$
where $f_{PI}(\varepsilon) = \{ (v_i \odot v_j)\, x_i x_j \}_{(i,j) \in R_x}$ (with $\odot$ denoting the element-wise product and $R_x = \{ (i, j) \mid i \in x, j \in x, j > i \}$), $v_i$ and $v_j$ are the hidden-layer weight vectors corresponding to $x_i$ and $x_j$, and $a_{ij}$ is the attention score.
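The attention-weighted pooling of Formula (7) can be sketched as follows; the attention network itself is abstracted into a callable supplied by the caller, which is an assumption made purely for illustration.

```python
import numpy as np

def afm_pooling(x, V, attention):
    """Attention-weighted pairwise interaction pooling, a sketch of Formula (7).

    x         : feature vector of shape (n,)
    V         : latent vectors of shape (n, k)
    attention : callable mapping an interaction vector (k,) to a scalar score
                a_ij; in AFM this is a small attention network, here it is
                passed in as an assumption.
    """
    n, k = V.shape
    pooled = np.zeros(k)
    for i in range(n):
        for j in range(i + 1, n):
            interaction = (V[i] * V[j]) * (x[i] * x[j])   # element-wise product term
            pooled += attention(interaction) * interaction
    return pooled

# toy usage with a trivial stand-in attention score
rng = np.random.default_rng(0)
x, V = rng.random(4), rng.random((4, 3))
print(afm_pooling(x, V, attention=lambda inter: float(np.sum(inter))))
```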
B.
Details of ARFM model
As mentioned above, compared with traditional machine-learning models, the biggest improvements in ARFM lie in the innovative aging mechanism and the integration of multiple residual layers into the model. The following sections explain the structure and algorithmic principles of these two parts in detail.
The starting point for the study of the aging mechanism is our understanding of the long-tail distribution of the datasets. The so-called long-tail distribution refers to the fact that, in a recommendation system, only a small number of items are popular and frequently viewed, while most items are rarely exposed to users. To illustrate the long-tail distribution of user behavior, we selected raw data from the Delicious and CiteULike datasets for analysis. Figure 2 shows the distribution curves of item popularity in the Delicious and CiteULike datasets: the horizontal axis represents the popularity $K$ of an item and the vertical axis the total number of items with popularity $K$. Figure 3 shows the distribution curves of user activity in the same datasets: the horizontal axis represents user activity $K$ and the vertical axis the total number of users with activity $K$.
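To make the plotted quantity concrete, the snippet below (our own, using a hypothetical list of (user, item) interaction pairs) counts how many items fall at each popularity level $K$, which is exactly what the long-tail curves display; swapping the roles of user and item gives the user-activity curve.

```python
from collections import Counter

def popularity_distribution(interactions):
    """For each popularity level K, count how many items were interacted with
    exactly K times: the quantity plotted on the long-tail curve (Figure 2)."""
    item_popularity = Counter(item for _, item in interactions)   # popularity K of every item
    return Counter(item_popularity.values())                      # number of items having popularity K

# hypothetical interaction log: (user_id, item_id)
logs = [(1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "b"), (1, "c")]
print(popularity_distribution(logs))   # Counter({3: 1, 2: 1, 1: 1})
```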
Under the influence of a long-tail distribution, new items therefore cannot be well exposed, only popular items receive attention, and the efficiency of the recommendation system suffers. This makes the starting point of the aging mechanism very necessary.
In addition, while studying the long-tail distribution, a new question arises: does the length of time a user has spent on a platform affect the user's choice among items of different popularity? New users tend to browse popular items because they are unfamiliar with the site and can only click on popular items on the front page, whereas long-term users gradually begin browsing unpopular items. Figure 4 shows the relationship between user activity and item popularity in the MovieLens dataset, where the horizontal axis is user activity and the vertical axis is the average popularity of items rated highly by all users at a given activity level.
The long-tail distribution also manifests in the cold start problem [35]. Cold start can be divided into user cold start and item cold start, corresponding to the situations in which new users or new items enter the recommendation system without any historical behavior data. Approached from this new perspective, addressing the long-tail distribution also mitigates cold start, killing two birds with one stone.
To apply a multi-residual network, we imitated the way the Deep Crossing model uses its residual layer and integrated a multi-residual layer into the optimized and improved Wide and Deep model, so that the model can perform further multi-feature crossing and improve its efficiency. Based on these two parts, we built the ARFM model as follows.
Figure 5 shows the basic framework of ARFM and clearly shows how the data flow. Specifically, the initial data are compressed into feature vectors by the embedding layer and then divided into two types, category features and numerical features, which enter the FM layer and the hidden layer, respectively. After processing, these two kinds of vectors enter the residual unit layer, where they are processed together with the original data. The result then enters the aging attention layer, and the final output is produced. The following sections introduce the key layers and algorithms of the model.
(1) Embedding Layer
The main functions of Embedding are described in the previous section. Specifically, all input data in ARFM can be divided into two categories: numerical features and category features. Numerical features remain unchanged and are directly used as part of the input layer, while category features become sparse feature values after one-hot processing. These two parts are combined as the input of the input layer.
The Embedding layer uses the training network to calculate weight values for all input features and stores the results in the Embedding Table. The last step is to transform the sparse one-hot encoding into the Embedding-processed feature value, which only requires looking up the corresponding entry in the Embedding Table. Figure 6 shows how Embedding works.
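A minimal sketch of this lookup behaviour is given below; the class name, initialization scheme, and toy feature values are illustrative assumptions rather than the authors' code.

```python
import numpy as np

class EmbeddingTable:
    """Minimal embedding lookup, assuming integer category ids as input.

    Each categorical feature value owns one trainable k-dimensional row; a
    one-hot input then reduces to indexing that row, so the sparse vector is
    mapped to a dense vector without an explicit matrix multiplication.
    """
    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.01, size=(vocab_size, dim))

    def lookup(self, category_id):
        return self.table[category_id]

# numerical features stay as-is; category features are looked up and concatenated
emb = EmbeddingTable(vocab_size=100, dim=8)
dense_part = np.array([0.5, 1.2])      # numerical features
cat_part = emb.lookup(42)              # category feature with id 42
model_input = np.concatenate([dense_part, cat_part])
print(model_input.shape)               # (10,)
```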
(2) FM Layer
The FM layer is based on an improvement of the wide part in the Wide and Deep model, and its basic principle is essentially the same as that of the FM part in DeepFM described above, except for the way its output is connected upward. Specifically, we have Formula (8):
$$\hat{y}_{FM} = b + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j, \tag{8}$$
where $b$ is the bias term and can theoretically be set to 0. In practical application, owing to the particularity of the ARFM model, $b$ can be assigned the value $data_{input}$, which has the same effect as adding $data_{input}$ at the Residual Units Layer. $w_i$ is the weight learned by the Embedding of the previous layer, one per feature, and $v_i$ is the implicit (latent) vector that the model learns automatically at each iteration.
It can be seen from Formula (8) that the FM layer computes both first-order features and second-order feature interactions, so it has a stronger expressive ability than a linear model.
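Formula (8) can be written compactly as follows; as in the earlier FM sketch, the pairwise term uses the standard $O(nk)$ reformulation, and treating $b$ as a plain scalar is a simplifying assumption on our part.

```python
import numpy as np

def arfm_fm_layer(x, w, V, b=0.0):
    """Sketch of Formula (8): bias + first-order weights + pairwise interactions.

    x : feature vector (n,), w : per-feature weights (n,), V : latent vectors (n, k).
    The paper notes that b may also be replaced by the raw input, mirroring the
    shortcut added later in the Residual Units Layer; here b is a plain scalar.
    """
    first_order = w @ x
    xv = x @ V
    second_order = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return b + first_order + second_order
```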
(3) Hidden Layer
The hidden layer is not a single network layer but a small neural network, equivalent to the deep part in Wide and Deep. Our processing still inherits the relevant parts of DeepFM. The input of the hidden layer comes from the Embedding layer below it. The intermediate layers are processed by two ReLU functions, and the final result is output by a sigmoid function. Figure 7 shows the concrete structure of the Hidden Layer.
Next, we define the input and output of each layer:
$$a^{(0)} = \left[ e_1, e_2, \ldots, e_m \right], \tag{9}$$
where $a^{(0)}$ represents the output of the Embedding layer and $e_i$ represents the vector corresponding to the $i$-th feature:
$$a^{(l+1)} = \sigma\left( W^{(l)} a^{(l)} + b^{(l)} \right). \tag{10}$$
Here $a^{(l)}$, $W^{(l)}$, and $b^{(l)}$ are the output, weight matrix, and bias of layer $l$, respectively, and $\sigma$ is the activation function. The final output of the Hidden Layer is therefore:
$$y_{DNN} = \sigma\left( W^{(H)} a^{(H)} + b^{(H)} \right), \tag{11}$$
where H represents the number of Hidden Layer layers.
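Putting Formulas (9) to (11) together, a forward pass of the Hidden Layer can be sketched as below; the list-of-matrices interface is an assumption made for brevity, not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer_forward(a0, weights, biases):
    """Forward pass of the Hidden Layer stack (sketch of Formulas (9)-(11)).

    a0      : concatenated embedding vectors, a^(0) = [e_1, ..., e_m]
    weights : list of weight matrices W^(0) ... W^(H)
    biases  : list of bias vectors  b^(0) ... b^(H)

    Intermediate layers use ReLU; the last layer is squashed with a sigmoid,
    matching y_DNN = sigma(W^(H) a^(H) + b^(H)).
    """
    a = a0
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return sigmoid(weights[-1] @ a + biases[-1])
```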
(4) Residual Units Layer
The problem the residual network solves is the exploding and vanishing gradients that occur when a network has too many hidden layers. In ARFM, we use the identity-mapping principle of the residual network to reuse the original input data several times, further enhancing the ability of feature expression. The identity principle is shown in Formula (12):
$$a^{(l)} \rightarrow z^{(l+1)} \rightarrow a^{(l+1)} \rightarrow z^{(l+2)} \rightarrow a^{(l+2)}, \tag{12}$$
where $a^{(l)}$ is our original input.
In a general network:
$$a^{(l+2)} = g\left( z^{(l+2)} \right). \tag{13}$$
After adding the original data through the residual network, Formula (14) can be obtained:
$$a^{(l+2)} = g\left( z^{(l+2)} + a^{(l)} \right). \tag{14}$$
Therefore, as shown in Figure 1, the input of the residual unit layer consists of the data before and after processing. In ARFM, the data before processing are the original input data, denoted $data_{input}$. The processed data can be further divided into two parts: $data_{FM}$ from the FM layer and $data_{Hidden}$ from the hidden layer. The final equation is therefore:
$$data_{output} = \mathrm{ReLU}\left( data_{FM} + data_{Hidden} + data_{input} \right). \tag{15}$$
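Formula (15) itself is a one-liner; the sketch below assumes the three branches have already been brought to a common dimension, which the model's preceding layers are responsible for.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_units_layer(data_fm, data_hidden, data_input):
    """Sketch of Formula (15): combine the processed branches with the identity shortcut.

    data_fm, data_hidden, data_input are assumed to have been projected to a
    common dimension; the raw input is added back unchanged so its signal is
    preserved even if the FM and Hidden branches degrade.
    """
    return relu(data_fm + data_hidden + data_input)
```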
(5) Aging Attention Layer
In the research description at the beginning of this section, we summarize two situations in which an aging mechanism has been proposed.
a. The proportion of popular items is very small, and a vast majority of unpopular items cannot be effectively recommended to users.
b. Users who are highly active or have been using the platform for a long time tend to seek out unpopular items, whereas new users tend to accept popular items.
Therefore, the aging mechanism should be applied in the same direction from the perspective of both items and users: the more popular an item or the more active a user, the smaller the proportion it should occupy in the recommendation system. We add two parameters, user activity $user_K$ and item popularity $item_K$, which are handled by a function $f$ that scales them to between 0 and 1. The two parameters are then combined with the attention mechanism, and the corresponding weight is assigned to the data using the parameter-weight idea of attention. The attention mechanism, a technique that has recently become very popular in recommendation systems, originates from the natural human habit of selective attention. A typical example is browsing a web page: users selectively pay attention to certain areas of the page and ignore others. Based on this phenomenon, it is often profitable to consider the influence of the attention mechanism on the prediction results during modeling.
Similar to the feature crossing of traditional models such as NFM, the feature embedding vectors of different domains are crossed in the feature-crossing pooling layer and then fed to an output layer composed of a multilayer neural network. The problem lies in the addition and pooling operations, which treat all cross features equally regardless of how differently they affect the result; this discards a large amount of valuable information.
Therefore, the attention mechanism in the model is mainly used to assign a weight to each input neuron, reflecting the different importance of different inputs. Specifically, after the residual layer, the attention layer in ARFM calculates the weight of each neuron's input through the sigmoid function and then multiplies the weight with the input vector to obtain the new weighted vector. Figure 8 shows the main mechanics of the attention layer.
Thus, we ended up with the following formula for the Aging Attention layer:
$$Output = y \cdot Input \cdot \left( 1 - f(user_K) \right) + (1 - y) \cdot Input \cdot \left( 1 - f(item_K) \right), \tag{16}$$
where $y$ is 0 or 1: when the input is a user vector, $y = 1$; when the input is an item vector, $y = 0$. In this manner, the same optimization can be applied flexibly according to the object the input vector represents. It is also worth noting that the multiplication in Equation (16) is applied correspondingly, so that each user or item is scaled by its own weight derived from its activity or popularity.
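A hedged sketch of the Aging Attention weighting follows; the paper does not spell out the exact form of the scaling function $f$, so min-max scaling is used here purely as an assumption.

```python
import numpy as np

def min_max_scale(k, k_min, k_max):
    """Stand-in for the scaling function f: map a raw popularity/activity count
    into [0, 1] (the exact form of f is an assumption here)."""
    return (k - k_min) / (k_max - k_min + 1e-12)

def aging_attention(vec, k, k_min, k_max, is_user):
    """Sketch of Formula (16): down-weight very active users / very popular items.

    vec     : input vector for one user or one item
    k       : that user's activity or that item's popularity
    is_user : True for a user vector (y = 1), False for an item vector (y = 0)
    """
    weight = 1.0 - min_max_scale(k, k_min, k_max)
    return weight * vec   # the hotter the object, the smaller its weight

# toy usage: a very popular item is strongly suppressed
item_vec = np.array([0.4, 0.9, 0.1])
print(aging_attention(item_vec, k=95, k_min=1, k_max=100, is_user=False))
```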
C.
Summary of ARFM model
In general, the ARFM model integrates structures from other models, such as DeepFM, and adds new structures and algorithms on top of the generalization and memory concepts of the Wide and Deep model. To better perform feature crossing and mine potential connections between features, we drew inspiration from the Deep Crossing model and added a multi-layer residual network to optimize the network structure. To address the long-tail distribution and cold start problems, we explored the relationship between item popularity and user activity in the dataset, proposed an aging mechanism, added it to the model, and thereby broke the recommendation dead cycle. Compared with other models, the ARFM model therefore has the following advantages:
(1)
Compared with the Deep Crossing model, the ARFM model has a more complex and multi-layered network structure and a higher recommendation accuracy, built on the solid algorithmic foundation of the Wide and Deep model.
(2)
Compared with the DeepFM and AFM models, the addition of a multi-layer residual network based on multidisciplinary crossover gives the model a higher degree of feature crossing and utilization and further improves accuracy without an obvious decrease in speed.
(3)
Compared with other traditional machine learning models, ARFM improves both speed and accuracy, and its aging mechanism is suitable for various application scenarios.

4. Experiments and Results

A.
Settings
Before conducting the experiments on the ARFM model, we discuss two key issues that the model must address.
Q1. Does our model have smaller loss values than other models when tested in various scenarios?
Q2. Is our model more accurate than the others when tested in a variety of situations?
(1) Dataset
When selecting datasets, we considered both type and size. In terms of type, we chose one dataset based on user data and one based on items. The user dataset is the data extracted by Becker from the 1994 census database [36]; each record contains a person's age, occupation, and other attributes, and the dataset can be used to predict income. The MovieLens dataset [37] was selected as the item-based dataset, and the 1 M and 10 M versions were chosen to cover different sizes and to achieve multi-class prediction of movie ratings.
(2) Input data processing
The processing of the input data is relatively simple. First, we remove useless or invalid records from the dataset; this step ensures that we obtain the correct encoding sequence when one-hot encoding the category features. Next, one-hot coding is applied to the category features, including the user's job type, gender, residence, and education level, and the movie's classification, rating, director, leading actor, and other features. This step yields coding sequences of different sparsity. In particular, we clipped features with more than 100 categories to ensure that the model would not stall because of an excessively large input. Finally, we combine the processed one-hot encodings and the original numerical features into the ARFM input $data_{input}$.
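The clipping step can be illustrated as follows; keeping the most frequent categories and pooling the rest into a shared bucket is our assumption, since the paper only states that features with more than 100 categories were clipped.

```python
from collections import Counter
import numpy as np

def one_hot_with_clipping(values, max_categories=100):
    """Sketch of the one-hot step: encode a categorical column, keeping only the
    `max_categories` most frequent values; everything else falls into a shared
    'other' bucket so the input width stays bounded."""
    top = [v for v, _ in Counter(values).most_common(max_categories)]
    index = {v: i for i, v in enumerate(top)}
    other = len(index)                                    # shared bucket for rare values
    encoded = np.zeros((len(values), len(index) + 1))
    for row, v in enumerate(values):
        encoded[row, index.get(v, other)] = 1.0
    return encoded

# hypothetical job-type column, clipped to the 2 most frequent categories
jobs = ["engineer", "teacher", "teacher", "artist", "engineer"]
print(one_hot_with_clipping(jobs, max_categories=2))
```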
(3) Evaluation methodology
The evaluation of our model primarily focused on the two questions above, considering the loss value and the accuracy rate. The dataset was divided into training and test sets according to a certain ratio; the model was trained on the training set and evaluated on the test set. For the loss, the loss function we chose is the log loss; that is,
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + \left( 1 - y_i \right) \log \left( 1 - h_\theta(x_i) \right) \right], \tag{17}$$
Formula (17) is the general expression of the loss function, where $m$ is the number of samples, $\theta$ is the parameter vector of the model, and $h_\theta(x) = \sigma(\theta \cdot x + bias)$ is the predicted probability.
For accuracy evaluation, we summed the squared deviations between the predicted results and the true values to obtain a total deviation value, and the accuracy was then measured as the proportion of correct predictions among all predictions.
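For reference, the two evaluation quantities can be computed as in the sketch below; the log loss matches Formula (17), while the thresholded accuracy shown here is a simplified binary-case stand-in for the squared-deviation procedure described above.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Formula (17): average negative log-likelihood of the binary labels."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def accuracy(y_true, y_pred, threshold=0.5):
    """Share of predictions on the correct side of the decision threshold."""
    return np.mean((y_pred >= threshold).astype(float) == y_true)

# toy usage with hypothetical predictions
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])
print(log_loss(y_true, y_pred), accuracy(y_true, y_pred))
```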
(4) Baselines
Finally, our model will be compared with the following models:
(a)
Deep crossing model: As an inspiration for our multiple residual network model, we compared and observed an improvement in the effect of our model.
(b)
Wide and deep model: As the framework of the basic structure of our model and the originator of the memory-plus-generalization structure, we compared and observed an improvement in the effect of our model.
(c)
Deep FM model: As an improvement on the wide and deep models and the reference object for the FM part of our model, we compared and observed an improvement in the effect of our model.
(d)
AFM model: As an application of the attention mechanism, it provides basic ideas for our model, and we compared and observed an improvement in the effect of our model.
(5) Parameter setting
For the hidden layers in the ARFM model, dense layers with 512, 256, and 128 units were set, respectively, and a ReLU layer was added after each dense layer. In the experimental setup, 80% of the data was used as the training set and the remaining 20% as the test set. The batch size was 128 and the number of epochs was set to 15.
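The reported hyperparameters can be expressed as the following hedged Keras sketch; the paper does not name the framework it used, and the input dimension, optimizer, and loss shown here are illustrative assumptions rather than the authors' configuration.

```python
from tensorflow import keras

def build_hidden_stack(input_dim):
    """Hedged sketch of a hidden-layer stack with the reported sizes
    (512/256/128 dense units, each followed by ReLU)."""
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for units in (512, 256, 128):
        x = keras.layers.Dense(units)(x)
        x = keras.layers.ReLU()(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)

model = build_hidden_stack(input_dim=64)        # input_dim is an assumed placeholder
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=128, epochs=15, validation_split=0.2)
```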
B.
Loss (Q1)
To address the first question, i.e., the loss value of the model, we ran the model on the three datasets and compared it with the other models. In this section, we describe the effects of the model on the different datasets and draw corresponding conclusions.
(1) Figure 9 shows the performance results for dataset 1:
(2) Figure 10 shows the performance results for dataset 2:
(3) Figure 11 shows the performance results for dataset 3:
(4) Conclusions:
In this section, we compared the loss values of the models on the different datasets. It is evident from the figures that the loss of the ARFM model is excellent overall: it is second only to the Wide and Deep model on Datasets 1 and 2, the gap narrows, and it finally reaches the optimal value on Dataset 3. This shows that the ARFM model performs well on all types of datasets and is especially suited to large datasets.
C.
Accuracy (Q2)
In this section, we focus on the accuracy of the model using different datasets to address the second question. The experimental results are presented in this section:
(1) Figure 12 shows the performance results for dataset 1:
(2) Figure 13 shows the performance results for dataset 2:
(3) Figure 14 shows the performance results for dataset 3:
(4) Conclusions:
Through the comparison in this section, we found that, within a reasonable fluctuation range, the accuracy of the ARFM model was excellent on all datasets and reached the maximum value among all models. This shows that ARFM has a very high overall quality and excellent robustness and generalization ability.
D.
Data summary:
In this section, we present the results of ARFM and the other models in tables and report the accuracy improvement of ARFM over each baseline.
(1) Table 1 shows the loss:
(2) Table 2 lists their accuracies:
(3) Table 3 shows the performance growth:
E.
Comprehensive analysis of model performance
According to the data tables and experimental results, the improvement of ARFM over the other models is considerable: ARFM maintains a steady advantage in both the loss value and the final prediction accuracy. In this part we also need to add an explanation about time consumption. Owing to equipment load and dataset fluctuation, the training and prediction times fluctuate considerably; on average, however, the ARFM model performed only slightly better than the Deep Crossing model. We attribute this mainly to the aging mechanism, i.e., the use of the Aging Attention Layer, which makes optimization for the dataset particularly important. The effect is most obvious on the 10 M MovieLens dataset, so we infer that the performance advantage of the ARFM model is more obvious on larger datasets. In the next stage, we will focus on the time complexity of the model and its sensitivity to datasets of different sizes.

5. Discussion

According to the experimental results in Section 4, the ARFM model proposed in this study exhibits fairly good performance in terms of both loss value and accuracy. The findings are as follows:
(1)
The model with the added multi-residual network layer has a stronger feature-crossing ability, as can be seen from the improvement in accuracy; at the same time, there is no obvious loss of running speed, indicating that the residual network adapts very well to the model and does not improve the effect at the cost of time.
(2)
The combination of the attention layer and the aging mechanism proposed in this study works surprisingly well, perhaps precisely because both, in essence, explore how data or features influence the recommendation effect and the model output. By adding the aging effect as a weight to the attention network, the differences between features are highlighted and more accurate results are obtained.
(3)
Compared to other models, the ARFM model proposed in this study has various advantages. Regardless of the loss or results, the overall stability and average result data of the ARFM model were better than those of the other models, thereby proving the validity of the model.
(4)
Through experiments, we found that the performance of ARFM differs somewhat across datasets. It can be clearly observed from the loss plots that ARFM performed better than the other models on datasets with a larger scale and more samples, indicating that the model is more suitable for relatively large and complex scenarios.
In this study, the ARFM model was compared with four other models. Specifically, the performance of ARFM on Dataset 1 was not outstanding: the loss value was slightly higher than that of the well-optimized Wide and Deep model, and the accuracy was only slightly higher than that of the other models. On the other two datasets, however, its advantage was very clear. It can also be observed that the accuracy of the Deep Crossing model is lower than that of the other models. As training progresses, the ARFM model is more stable and efficient in terms of both the reduction of the loss value and the improvement of accuracy. Apart from ARFM, the AFM model performs best among the baselines; however, the ARFM model has a higher accuracy and a smaller loss value than AFM, which is sufficient to demonstrate the advantages of ARFM.

6. Conclusions and Future Work

The starting point of this study was to solve the long-tail distribution and cold start problems. Through statistical analysis, we found that the long-tail distribution of user activity and item popularity is clear and affects the final recommendation results. We therefore propose an aging mechanism that aims to achieve more balanced recommendations for unpopular items and new users by suppressing the influence of user activity and item popularity. We innovatively combined the aging mechanism with the attention mechanism and introduced it into the model as a weight, with excellent effect. In addition, drawing on multidisciplinary crossover technology, we introduced the well-known residual network from convolutional networks into the model, which strengthens feature crossing and further increases the generalization ability of the model. In the final experiments, we used three different datasets to show that ARFM has clear advantages in various respects.
The significance and contribution of this study are novel. Starting from the structure and characteristics of the dataset, we explore the long-tail distribution and cold start problems it causes and propose a new aging mechanism to optimize them. Compared with other algorithms or models that only explore network layer structures and mathematical formulas, the ARFM model truly combines model and dataset. The final experimental results show that ARFM has great potential for practical application and future development. We hope the results of this paper can provide other researchers with new ideas and lead to further progress in mining both datasets and models.
In future work, we hope to further optimize the aging mechanism and make it more diverse and suitable for various situations, which will further improve the generalizability of the model and the final recommendation accuracy. In addition, we want to conduct experiments on more and larger datasets to verify whether our model has greater advantages when dealing with large datasets, or whether there are potential risks; this requires more time spent studying the datasets, which we find quite challenging. Finally, we will continue to improve the ARFM structure so that the functions of all layers are better integrated and the overall benefit of the model is improved.

Author Contributions

Conceptualization, H.Y.; methodology, H.Y.; software, H.Y.; validation, H.Y.; formal analysis, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, J.Y.; supervision, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 61971268.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available online at the UCI Machine Learning Repository (Adult Data Set) and at GroupLens (MovieLens).

Acknowledgments

We thank the National Natural Science Foundation of China for funding our work, grant number 61971268.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
FM	Factorization Machine
DeepFM	Deep Factorization Machine
NFM	Neural Factorization Machine
AFM	Attentional Factorization Machine
ReLU	Rectified Linear Unit

References

  1. Genius, T.V. An Integrated Approach to TV & VOD Recommendations Archived 6 June 2012 at the Wayback Machine; Red Bee Media: London, UK, 2012. [Google Scholar]
  2. Liu, H.; Kong, X.; Bai, X.; Wang, W.; Bekele, T.M.; Xia, F. Context-Based Collaborative Filtering for Citation Recommendation. IEEE Access 2015, 3, 1695–1703. [Google Scholar] [CrossRef]
  3. Wu, Y.; Wei, J.; Yin, J.; Liu, X.; Zhang, J. Deep Collaborative Filtering Based on Outer Product. IEEE Access 2020, 8, 85567–85574. [Google Scholar] [CrossRef]
  4. Shan, H.; Banerjee, A. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010; pp. 1025–1030. [Google Scholar]
  5. Luo, X.; Zhou, M.; Xia, Y.; Zhu, Q. An efficient non-negative MatrixFactorization-Based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Informat. 2014, 10, 1273–1284. [Google Scholar]
  6. Wright, R.E. Logistic regression. In Reading and Understanding Multivariate Statistics; Grimm, L.G., Yarnold, P.R., Eds.; American Psychological Association: Washington, DC, USA, 1995; pp. 217–244. [Google Scholar]
  7. Zheng, Z.; Yang, Y.; Niu, X.; Dai, H.; Zhou, Y. Wide and Deep Convolutional Neural Networks for Electricity-Theft Detection to Secure Smart Grids. IEEE Trans. Ind. Inform. 2018, 14, 1606–1615. [Google Scholar] [CrossRef]
  8. Vedaldi, A.; Lenc, K. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 689–692. [Google Scholar]
  9. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010. [Google Scholar]
  10. Siess, J.; Anderson, C. The Long Tail: Why the Future of Business is Selling Less of More; One Person Library: Hachette, UK, 2006. [Google Scholar]
  11. Zheng, J.; Ma, Q.; Gu, H.; Zheng, Z. Multi-view Denoising Graph Auto-Encoders on Heterogeneous Information Networks for Cold-start Recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 2338–2348. [Google Scholar]
  12. Briand, L.; Salha-Galvan, G.; Bendada, W.; Morlon, M.; Tran, V.-A. A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 2601–2609. [Google Scholar]
  13. Lang, L.; Zhu, Z.; Liu, X.; Zhao, J.; Xu, J.; Shan, M. Architecture and Operation Adaptive Network for Online Recommendations. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 3139–3149. [Google Scholar]
  14. Goldberg, D.; Nichols, D.; Oki, B.M.; Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM 1992, 35, 61–70. [Google Scholar] [CrossRef]
  15. Linden, G.; Smith, B.; York, J. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80. [Google Scholar] [CrossRef] [Green Version]
  16. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  17. Rendle, S. Factorization machines. In Proceedings of the IEEE International Conference on Data Mining, Sydney, Australia, 13 December 2010; pp. 995–1000. [Google Scholar]
  18. Sedhain, S.; Menon, A.K.; Sanner, S. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 111–112. [Google Scholar]
  19. Shan, Y.; Hoens, T.R.; Jiao, J. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 255–262. [Google Scholar]
  20. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10. [Google Scholar]
  21. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247. [Google Scholar]
  22. He, X.; Chua, T.S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 355–364. [Google Scholar]
  23. Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; Chua, T.S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv 2017, arXiv:1708.04617. [Google Scholar]
  24. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  25. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Processing Syst. 2012, 25. [Google Scholar] [CrossRef]
  26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  28. Liu, D.; Lian, J.; Liu, Z.; Wang, X.; Sun, G.; Xie, X. Reinforced Anchor Knowledge Graph Generation for News Recommendation Reasoning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 1055–1065. [Google Scholar]
  29. Yu, J.; Yin, H.; Gao, M.; Xia, X.; Zhang, X.; Hung, N.Q.Y. Socially-aware self-supervised tri-training for recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 2084–2092. [Google Scholar]
  30. Wei, T.; Feng, F.; Chen, J.; Wu, Z.; Yi, J.; He, X. Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 1791–1800. [Google Scholar]
  31. Chen, T.; Yin, H.; Zheng, Y.; Huang, Z.; Wang, Y.; Wang, M. Learning elastic embeddings for customizing on-device recommenders. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 138–147. [Google Scholar]
  32. Kang, W.C.; Cheng, D.Z.; Yao, T.; Yi, X.; Chen, T.; Hong, X. Learning to embed categorical features without embedding tables for recommendation. arXiv 2020, arXiv:2010.10784. [Google Scholar]
  33. Kalimeris, D.; Bhagat, S.; Kalyanaraman, S.; Weinsberg, U. Preference Amplification in Recommender Systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event Singapore, 14–18 August 2021; pp. 805–815. [Google Scholar]
  34. Zhang, S.; Chen, H.; Ming, X.; Cui, L.; Yin, H.; Xu, G. Where are we in embedding spaces? A comprehensive analysis on network embedding approaches for recommender systems. arXiv 2021, arXiv:2105.08908. [Google Scholar]
  35. Zhang, Z.K.; Liu, C.; Zhang, Y.C.; Zhou, T. Solving the cold-start problem in recommender systems with social tags. EPL (Europhys. Lett.) 2010, 92, 28002. [Google Scholar] [CrossRef]
  36. UCI Machine Learning Repository: Adult DataSet. Available online: https://archive.ics.uci.edu/ml/machine-learning-databases/adult (accessed on 2 April 2022).
  37. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2015, 5, 1–19. [Google Scholar] [CrossRef]
Figure 1. Residual element structure principle.
Figure 2. The long-tail distribution of item popularity.
Figure 3. The long-tail distribution of user activity.
Figure 4. The relationship between user activity and item popularity in the MovieLens dataset.
Figure 5. Basic framework of ARFM model.
Figure 6. Working principle of Embedding layer.
Figure 7. The structure of Hidden Layer.
Figure 8. Main principles of the Attention layer.
Figure 9. Loss performance of each model on dataset 1.
Figure 10. Loss performance of each model on dataset 2.
Figure 11. Loss performance of each model on dataset 3.
Figure 12. Accuracy performance of each model on dataset 1.
Figure 13. Accuracy performance of each model on dataset 2.
Figure 14. Accuracy performance of each model on dataset 3.
Table 1. Summary of Loss value of each model.

Model            Dataset 1   Dataset 2   Dataset 3
ARFM             0.3206      0.4177      0.4328
DeepCrossing     0.3169      0.4503      0.4835
Wide and Deep    0.2650      0.3894      0.4511
DeepFM           0.3295      0.4350      0.4570
AFM              0.3327      0.4633      0.4649

Table 2. Summary of Accuracy of each model.

Model            Dataset 1   Dataset 2   Dataset 3
ARFM             0.8489      0.8386      0.8271
DeepCrossing     0.8269      0.8163      0.8069
Wide and Deep    0.8450      0.8206      0.8118
DeepFM           0.8495      0.8220      0.8115
AFM              0.8427      0.8238      0.8157

Table 3. The performance growth of ARFM.

Baseline         Improvement
DeepCrossing     +2.15%
Wide and Deep    +1.24%
DeepFM           +1.05%
AFM              +1.08%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
