1. Introduction
We now live in an era in which recommendation systems influence us everywhere. Whether shopping online, studying, working, or relaxing, a recommendation system helps us grasp key information more efficiently. Never before have recommendation systems affected people's lives as much as they do now, and the same holds for the algorithm engineers behind them, whose systems are designed to solve the problem of how users obtain the information they are interested in under conditions of information overload. Information is therefore naturally the core that the algorithm must focus on, and it can be divided into item, user, and scene information. In the eyes of engineers, the mathematical task of a recommendation system is, for a user (user) in a specific scene (context) facing massive item information, to construct a function $f(U, I, C)$ that predicts the user's preference for a specific candidate item (item), and then to sort all candidate items by predicted preference to generate the recommendation list.
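This scoring-and-ranking formulation can be made concrete with a short sketch; `predict` here is a hypothetical stand-in for the learned function $f(U, I, C)$, and the toy scores are invented for illustration:

```python
def recommend(user, context, candidates, predict, k=3):
    """Score every candidate item with a preference function f(U, I, C)
    and return the top-k items sorted by predicted preference."""
    scored = [(item, predict(user, item, context)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]

# Toy preference function standing in for a trained model.
toy_predict = lambda u, i, c: {"a": 0.9, "b": 0.2, "c": 0.7}[i]
top = recommend("u1", "evening", ["a", "b", "c"], toy_predict, k=2)
```

The model itself can be arbitrarily complex; the ranking step on top of it is always this simple sort by predicted preference.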
Focusing on these basic issues, the development of efficient and cutting-edge recommendation algorithms and models has become a key task. During the decades of development of recommendation systems, models have made qualitative leaps, and the collaborative filtering algorithm [1] undoubtedly played a pivotal role in the early stage. User-based [2] and item-based [3] recommendation algorithms generated considerable value in recommendation systems over extended periods. Later, the proposal of matrix factorization [4] opened the door to the emergence of factorization machines [5] and provided new ideas for subsequent model algorithms. Logistic regression [6], grounded in the classification problem, also became the core algorithm of classification-based models. Recommendation systems truly flourished after entering the era of deep learning, when the concept of the model became increasingly important. The proposal and development of neural networks laid the foundation for deep-learning models: the wide and deep model [7] combined generalization and memory for the first time, providing new ideas for subsequent models, and the rise of computer vision, with the emergence of CNNs [8] and RNNs [9], triggered a boom in the development of models in new domains.
In this context, the rapid development of recommendation models has led algorithm designers to focus on model structure and algorithm design, with little attention paid to exploring the datasets. Specifically, many studies of Internet data have found that a large amount of data on the Internet follows a distribution called the power law [10], also known as the long-tail distribution. The long-tail distribution behaves differently in different datasets but generally conforms to the pattern of a hot few and a cold majority, which leads to users being recommended the most popular items regardless of whether they actually like them; this is particularly prominent in the cold-start problem. The cold-start problem asks how to make recommendations for items or users newly added to the recommendation system, for which sufficient ratings and interaction history are not available. At present, most recommendation algorithms handle cold start by recommending the most popular items [11,12,13], ignoring the characteristics of the items or users themselves, which we think is undesirable. Therefore, alleviating the recommendation solidification caused by the long-tail distribution and diversifying the items recommended to different users became the focus of our research.
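The hot-few/cold-majority pattern is easy to check on an interaction log; the click data below are invented for illustration, with a small head of popular items and a long tail of rarely clicked ones:

```python
from collections import Counter

# Hypothetical interaction log: each entry is the item a user clicked.
clicks = ["i1"] * 50 + ["i2"] * 20 + ["i3"] * 10 + ["i4"] * 5 + \
         ["i5", "i6", "i7", "i8", "i9", "i10"]  # the long tail

popularity = Counter(clicks)
total = sum(popularity.values())
head = sum(n for _, n in popularity.most_common(2))  # top-2 "hot" items
print(f"top-2 items take {head / total:.0%} of all clicks")
```

On real logs the same few lines reveal how strongly exposure concentrates on the head of the distribution.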
As mentioned before, we reviewed a large body of literature and found that most research focuses on improving the performance of algorithms or models. Many of the results are impressive, but we do not believe they fundamentally solve the problem. We therefore focused our research on the datasets and tried to find the relationship between models and datasets. After many experiments and attempts, we put forward the aging mechanism of the dataset in this paper. Applying this mechanism can fundamentally address the long-tail distribution and cold-start problems and significantly improve the final score of the recommendation system.
In addition, our team pays close attention to multidisciplinary results. A major problem with current recommendation models is the growing number of hidden layers, which is especially evident in large application scenarios. Theoretically, more hidden layers allow data features to be mined more accurately; in practice, however, once the number of hidden layers grows beyond a certain point, exploding and vanishing gradients appear and model performance regresses. To address this hidden danger, we drew inspiration from the achievements of computer vision, fused the well-known multi-residual network from the convolutional field with the recommender system model, and achieved remarkable results.
In this paper, we propose the aging residual factorization machine (ARFM). It can effectively mine the potential relationship between the dataset and the model, and it improves the accuracy and validity of the model using multi-domain crossover technology. The contributions of this study are as follows.
- (1)
We developed a model that focuses on the relationship between datasets and models, which alleviates, to a certain extent, the recommendation solidification problem caused by the long-tail distribution of the data and improves the accuracy of the model.
- (2)
We make full use of the multi-field crossover technology and integrate the inspiration obtained from the field of computer vision into the ARFM model so that the model can process data more accurately.
- (3)
The experimental results show that ARFM is superior to previous similar models in both recommendation and classification accuracy, which proves the rationality and effectiveness of the model.
2. Related Work
Before deep learning was widely developed and applied to recommendation systems, the core algorithm of recommendation was always collaborative filtering. Its classical application can be traced back to the mail filtering system of the Xerox Research Center in 1992 [14], but it was Amazon's use of it in 2003 [15] that made it a well-known classic model, and all major models since have been influenced by it. Later, in the 2006 Netflix algorithm competition [16], the proposal of the matrix factorization algorithm caused a sensation in the industry and had a profound influence on the design of subsequent algorithms and models. In 2010, Rendle proposed the FM model [17], which uses the inner product of two vectors to replace a single weight coefficient, that is, introduces hidden vectors, to better address data sparsity. Specifically, let the feature vector be $\mathbf{x} = (x_1, \ldots, x_n)$, the weight vector be $\mathbf{w}$, the number of features be $n$, and $i$ and $j$ be the subscripts of the two feature vectors in the feature-crossing operation. The basic expression of FM is:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

By using implicit vectors, the number of cross-weight parameters is reduced from $n(n-1)/2$ to $nk$ (where $k$ is the dimension of the hidden vector), which significantly reduces the training cost under gradient descent. The FM model is a key model in the entire field of recommendation systems, and in this study its basic concept was applied to increase overall efficiency.
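The pairwise term of FM can be computed in $O(nk)$ rather than $O(n^2 k)$ via the well-known reformulation $\sum_{i<j}\langle \mathbf{v}_i,\mathbf{v}_j\rangle x_i x_j = \tfrac{1}{2}\sum_f\big[(\sum_i v_{if}x_i)^2 - \sum_i (v_{if}x_i)^2\big]$. A NumPy sketch, checked against the naive double loop on random data:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction: bias + linear term + pairwise interactions.
    x: (n,) features, w: (n,) weights, V: (n, k) latent vectors.
    The pairwise term uses the O(nk) reformulation
    0.5 * sum_f[(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    linear = w0 + w @ x
    xv = x[:, None] * V                                  # (n, k)
    pairwise = 0.5 * np.sum(xv.sum(0) ** 2 - (xv ** 2).sum(0))
    return linear + pairwise

# Verify against the naive O(n^2 k) double loop.
rng = np.random.default_rng(0)
x, w, V = rng.normal(size=4), rng.normal(size=4), rng.normal(size=(4, 2))
naive = 0.1 + w @ x + sum((V[i] @ V[j]) * x[i] * x[j]
                          for i in range(4) for j in range(i + 1, 4))
assert np.isclose(fm_predict(x, 0.1, w, V), naive)
```

This linear-time trick is what makes FM training with gradient descent cheap even for large feature spaces.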
In 2015, AutoRec [18], proposed by the Australian National University, officially opened the era of deep learning for recommendation systems by combining autoencoders with collaborative filtering. The AutoRec model resembles an MLP (multilayer perceptron), a standard three-layer (including input layer) neural network, but it combines the ideas of the autoencoder and collaborative filtering. More precisely, AutoRec is a standard autoencoder structure whose basic principle is to use the co-occurrence matrix of collaborative filtering to self-encode the item vector or user vector; the self-encoding results are then used to obtain the user's scores on all items, which can be sorted for recommendation. Since then, various models and algorithms have mushroomed. The deep crossing model [19] proposed by Microsoft in 2016 is one of the sources of inspiration for this study: its biggest advance was to change the traditional method of feature crossing so that the model is capable not only of second-order crossing but also of deep crossing. Its multiple-residual-unit layer adopts a multilayer residual network, enabling the model to capture more information on nonlinear and combination features and giving full play to the advantages of multifield crossover technology. Another important work of 2016 was the wide and deep model proposed by Google [20], which introduced the concepts of memory and generalization for the first time, breaking with traditional model thinking and creating a distinctive line of models. The ARFM proposed in this study is, to a certain extent, also an evolution of the wide and deep model. Memory ability can be understood as the model's ability to directly learn and use the co-occurrence frequency of items or features in the historical data, while generalization ability can be understood as the model's ability to transfer feature correlations and discover the correlation between rare features, which are sparse or have never even appeared, and the final labels. The wide part is responsible for the memory of the model, whereas the deep part is responsible for its generalization. Combining the two network structures unites the advantages of both sides, and this design became extremely popular. Among subsequent improvements of the wide and deep model, the DeepFM model [21] of 2017 focused on the wide part, improving its feature-combination ability with FM, whereas the NFM model [22] of the same year focused on improving the deep structure by adding a feature cross-pooling layer, further enhancing the deep data-processing capability. Also in 2017, the AFM model [23], proposed by Alibaba, introduced an attention mechanism on the basis of NFM, contributing to the multifield integration of recommendation systems.
On the other hand, the development of computer vision has also strongly promoted the realization of multi-domain crossover technology. Convolutional networks are undoubtedly at the core of computer vision. As early as 1998, the emergence of LeNet [24] marked the rise of CNNs. Although LeNet is a simple and small network from today's perspective, it was the first to completely define the structure of a CNN, which was crucial for its subsequent development. In 2012, Krizhevsky and Hinton launched AlexNet [25], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a margin of 10.9 percentage points over second place. Later, in 2014, Simonyan and Zisserman proposed the Visual Geometry Group Network (VGG) series [26], which served as the base network that ranked second in the classification task and first in the localization task of that year's ImageNet challenge. At the time, VGG was a very deep network, reaching a depth of 19 layers. This was a significant breakthrough because, theoretically, the fitting ability of a neural network should increase with the size of the model. In 2015, CNNs ushered in a qualitative change: Kaiming He proposed ResNet [27], which not only solved the problem of neural network degradation but also swept the competitors in the ILSVRC and Common Objects in Context (COCO) competitions of that year, winning first place in the classification, localization, detection, and segmentation tasks. This well-known residual network has had a phenomenal influence, providing a new idea for the design and optimization of almost all subsequent models.
Recommendation systems continue to develop and improve, and the latest research of the past two years has raised many new problems and research directions. Among them, causal inference is a recent hot spot: the process by which the recommendation system produces its results is regarded as causal inference, introducing the concept of interventions and raising the issue of unobserved confounding factors. Around these concepts, causal-inference recommendation systems use two models, an exposure model and an outcome model, which together provide the user with a recommendation list. In the latest research [28,29,30], researchers have respectively studied anchor-enhanced knowledge-graph generation for news recommendation reasoning, a tri-training recommendation system with socially aware self-supervision, and the elimination of popularity bias via a model-agnostic counterfactual reasoning framework. Causal inference is a very interesting entry point, but at present the connection between the exposure model and the outcome model is imperfect, and recommendation via causal inference is very time-consuming, so there is much room for progress. We will try to learn from this line of work and propose effective solutions in future research.
Another hotspot of recommendation systems is the development of Embedding technology, which underlies all recommendation algorithms. One of its most important properties is the ability to transform feature data from high-dimensional sparse vectors into low-dimensional dense vectors, which is also Embedding's main role in this paper. In addition, the Embedding vector is itself a very important and expressive feature vector. With so many important core functions, the optimization and development of Embedding technology has long been a focus of recommendation system researchers. The latest research has proposed new Embedding techniques without an Embedding Table [31,32,33,34], studying flexible embedded and non-embedded feature modeling on customized devices, preference amplification in recommendation systems, and comprehensive analyses of network Embedding methods for recommendation. The new techniques can ease the pressure of embedding ever-growing and increasingly sparse encodings, but, as we mentioned above, the fundamental problem is that datasets keep growing in scale; if researchers cannot find the connection between the characteristics of the model and the dataset, new algorithms can only play this easing role and are not, in the long run, the key to solving the problem.
Integrating computer vision technology, particularly residual networks, suggests that the adaptability of recommendation system models can reach a new peak. To cope with the growing scale of datasets and mine data better with the model, we propose a new aging mechanism on top of the wide and deep framework and fuse it with an adjusted FM model, so that the parts complement each other and achieve a new level of effectiveness. In addition, inspired by the deep crossing model, we apply the principle of the residual network to the proposed model, which further enhances feature crossing and model optimization with obvious results.
3. Materials and Methods
In this section, Part A explains the basic structures and algorithms of a series of models related to ours; Part B formally introduces the ARFM model proposed in this paper and gives a detailed explanation of its algorithm; and Part C summarizes the model.
- A.
Related models and algorithms
(1) Deep Crossing.
The inputs to the model are a set of individual features. The model has four types of layers: embedding, stacking, residual units, and scoring. The objective function is the log loss, although the Softmax function or other functions can also be used:

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \big)$$

where $N$ is the number of samples, $y_i$ is the label of each sample, and $p_i$ is the output of the single-node scoring layer.
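The log loss just defined can be computed directly; a minimal sketch, with predictions clipped away from 0 and 1 for numerical stability:

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels.
    Predictions are clipped into [eps, 1 - eps] so log() never sees 0."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

loss = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
```

Confident correct predictions drive the loss toward zero, while confident wrong ones are penalized heavily, which is why log loss is the usual objective for click-through-style binary targets.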
Regarding the role of each layer in the model: the embedding layer converts sparse categorical features into dense embedding vectors, and the stacking layer stitches the different embeddings and numerical features together into a new feature vector that contains all features. The residual unit layer is the key to the model. Its main structure is a multilayer perceptron realized by a multilayer residual network, as illustrated in Figure 1, where $X$ represents the input vector and $F(X)$ represents the function that processes the input, namely the ReLU-based mapping in this article. By cross-combining each dimension of the feature vector through a multilayer residual network, the model can capture more information on nonlinear and combination features, thereby enhancing its expressive ability.
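A single residual unit of this kind can be sketched in NumPy; the layer sizes and the two-layer ReLU form of $F$ are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, W1, b1, W2, b2):
    """One Deep-Crossing-style residual unit: a small MLP F(x),
    an identity shortcut, then a final ReLU over F(x) + x."""
    h = relu(W1 @ x + b1)        # first transform (expand)
    fx = W2 @ h + b2             # project back to the input dimension
    return relu(fx + x)          # identity shortcut + activation

rng = np.random.default_rng(1)
x = rng.normal(size=8)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)
out = residual_unit(x, W1, b1, W2, b2)
```

Because the shortcut passes `x` through unchanged, gradients can flow around `F` even when the transform layers saturate, which is the property the residual unit layer relies on.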
(2) Wide and Deep
According to Google, the model can be divided into four stages: sparse features, dense embeddings, hidden layers, and output units. The sparse features feed the input data into the wide and deep parts, which are followed by the other three stages. In the wide part, features are combined through the cross-product transformation:

$$\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\}$$

where $\mathbf{x}$ is the input feature vector, $d$ is the total number of features, and $c_{ki}$ is a Boolean variable: when the $i$th feature belongs to the $k$th combination feature, the value of $c_{ki}$ is 1; otherwise, it is 0. $x_i$ denotes the value of the $i$th feature.
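For binary features, the cross-product transformation reduces to a logical AND over the member features; a sketch with hypothetical feature names:

```python
def cross_product_transform(x, combos):
    """Wide-part cross features: phi_k(x) = prod_i x_i^{c_ki}.
    `combos` lists, for each cross feature k, the indices i with c_ki = 1,
    so phi_k is 1 only when every member feature is active."""
    return [int(all(x[i] for i in idxs)) for idxs in combos]

# Hypothetical binary features: [gender=female, language=en, country=us]
x = [1, 1, 0]
phi = cross_product_transform(x, [(0, 1), (1, 2)])
# AND(gender=female, language=en) fires; AND(language=en, country=us) does not.
```

These hand-crafted conjunctions are exactly the memorization signal the wide part feeds into the final logistic unit.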
The deep part is a feedforward neural network. For categorical features, the original input is a feature string, and the sparse high-dimensional categorical features are first converted into low-dimensional dense real vectors, typically called embedding vectors. During forward propagation, these low-dimensional dense vectors are fed into the hidden layers of the neural network. Specifically, each hidden layer performs the following calculation:

$$a^{(l+1)} = f\big(W^{(l)} a^{(l)} + b^{(l)}\big)$$

where $l$ is the layer number and $f$ is the activation function, for which ReLU is usually used. $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights of the $l$th layer, respectively. Finally, the wide and deep parts are joined through a fully connected layer, and the output passes through a logistic unit. The prediction of the logistic regression model is as follows:
$$P(Y = 1 \mid \mathbf{x}) = \sigma\big(\mathbf{w}_{wide}^{T} [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T} a^{(l_f)} + b\big)$$

where $Y$ is the binary classification label, $\sigma$ is the sigmoid function, $\phi(\mathbf{x})$ is the cross-product transformation of the original feature vector $\mathbf{x}$, $b$ is the bias term, $a^{(l_f)}$ is the activation of the final layer $l_f$, and $\mathbf{w}_{wide}$ and $\mathbf{w}_{deep}$ are the weight vectors of the wide and deep parts, respectively.
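The joint prediction can be sketched as one logistic unit over the wide features (raw plus crossed) and the deep part's top activations; all weights and inputs below are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_deep_predict(x_wide, a_deep, w_wide, w_deep, b):
    """Final Wide & Deep unit: a single logistic layer over the
    concatenation of wide features and the deep top activation."""
    logit = sum(w * x for w, x in zip(w_wide, x_wide)) \
          + sum(w * a for w, a in zip(w_deep, a_deep)) + b
    return sigmoid(logit)

p = wide_deep_predict(x_wide=[1, 0, 1], a_deep=[0.3, 0.7],
                      w_wide=[0.5, -0.2, 0.1], w_deep=[1.0, -0.5], b=0.0)
```

Because both parts share one output unit, they are trained jointly: the wide weights learn corrections for feature combinations the deep part generalizes over too aggressively.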
(3) Deep FM
As mentioned above, DeepFM's improvement over wide and deep lies in replacing the wide part with FM. Specifically, the output of the FM component is the sum of the addition and inner-product units:

$$y_{FM} = \langle \mathbf{w}, \mathbf{x} \rangle + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle V_i, V_j \rangle \, x_i x_j$$

where $\mathbf{w} \in \mathbb{R}^{d}$ and $V_i \in \mathbb{R}^{k}$. The addition unit captures first-order features, whereas the inner-product units capture second-order feature interactions.
(4) AFM
The AFM model introduces an attention mechanism by adding an attention network between the feature-crossing layer and the final output layer. The role of the attention network is to provide a weight for each cross feature. The pooling process of AFM after adding the attention scores is:

$$f_{Att}\big(f_{PI}(\mathcal{E})\big) = \sum_{(i,j) \in \mathcal{R}_x} a_{ij} \, (\mathbf{v}_i \odot \mathbf{v}_j) \, x_i x_j$$

where $\odot$ denotes the element-wise product, $\mathbf{v}_i$ and $\mathbf{v}_j$ respectively represent the hidden-layer weight vectors corresponding to features $i$ and $j$, and $a_{ij}$ is the attention score.
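The attention-weighted pooling can be sketched directly from this formula; the latent vectors and attention scores below are hypothetical (in AFM itself the scores come from a small learned attention network):

```python
import numpy as np

def afm_pooling(V, x, att_scores):
    """Attention-weighted pairwise-interaction pooling:
    sum over pairs (i, j) of a_ij * (v_i ⊙ v_j) * x_i * x_j,
    where ⊙ is the element-wise product and a_ij the attention score."""
    n = len(x)
    out = np.zeros(V.shape[1])
    for i in range(n):
        for j in range(i + 1, n):
            out += att_scores[(i, j)] * (V[i] * V[j]) * x[i] * x[j]
    return out

V = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # latent vectors
x = np.array([1.0, 1.0, 1.0])                        # feature values
scores = {(0, 1): 0.7, (0, 2): 0.1, (1, 2): 0.2}     # hypothetical attention
pooled = afm_pooling(V, x, scores)
```

Setting all scores equal recovers plain NFM-style sum pooling; the attention scores are precisely what lets informative cross features dominate the pooled vector.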
- B.
Details of ARFM model
As mentioned above, compared with the traditional machine-learning model, the largest improvement in ARFM lies in the innovative proposal of the aging mechanism and the integration of multiple residual layers into the model. The following section explains in detail the structure and algorithm principles of these two parts.
The starting point for the study of the aging mechanism is our understanding of the long-tail distribution of the datasets. The so-called long-tail distribution refers to the fact that, in a recommendation system, the popular and frequently viewed items make up only a small fraction of all items, while the vast majority of items are barely exposed to users. To illustrate the long-tail distribution of user behavior, we selected raw data from the Delicious and CiteULike datasets for analysis.
Figure 2 shows the distribution curves of item popularity in the Delicious and CiteULike datasets; the horizontal axis represents the popularity $k$ of an item, and the vertical axis represents the total number of items with popularity $k$.
Figure 3 shows the distribution curves of user activity in the Delicious and CiteULike datasets; the horizontal axis represents user activity $k$, and the vertical axis represents the total number of users with activity $k$.
Under the influence of a long-tail distribution, new items cannot be well exposed, only popular items receive attention, and the efficiency of the recommendation system suffers. It is thus clear that the aging mechanism addresses a genuine need.
In addition, while studying the distribution of the long tail, a new question arises: Will a user’s time using a certain platform affect the user’s choice of items with different popularity? New users tend to browse popular items because they are unfamiliar with the site and can only click on popular items on the front page, whereas older users gradually begin browsing unpopular items.
Figure 4 shows the relationship between user activity and item popularity in the MovieLens dataset, where the horizontal axis is user activity and the vertical axis is the average popularity of items rated highly by all users at a certain activity level.
The problem of the long-tail distribution also appears in the cold-start problem [35]. Cold start can be divided into user cold start and item cold start, corresponding to the situations in which a new user or a new item enters the recommendation system with no historical behavior data. From this new perspective, solving the cold-start and long-tail distribution problems together kills two birds with one stone.
To apply a multi-residual network, we imitated the application of the deep crossing model to the residual layer and integrated the multi-residual layer based on the optimized and improved wide and deep model so that the model could further perform multi-feature crossing and improve the model efficiency. Based on these two parts, we built an ARFM model, as follows:
Figure 5 shows the basic framework of ARFM and clearly shows how the data flow. Specifically, the initial data are compressed into feature vectors by the embedding layer and then divided into two types, category features and numerical features, which enter the FM layer and the hidden layer, respectively. After processing, the two resulting vectors enter the residual unit layer, where they are processed together with the original data. The results then enter the aging attention layer, which produces the final output. The following sections introduce the key layer structures and algorithms used in the model.
(1) Embedding Layer
The main functions of Embedding are described in the previous section. Specifically, all input data in ARFM can be divided into two categories: numerical features and category features. Numerical features remain unchanged and are directly used as part of the input layer, while category features become sparse feature values after one-hot processing. These two parts are combined as the input of the input layer.
The Embedding layer uses the training network to calculate the weight values of all input features and stores the results in the Embedding Table. The last step, transforming the sparse one-hot encoding into the embedded eigenvalues, then reduces to directly querying the corresponding Embedding Table.
Figure 6 shows how Embedding works:
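The lookup just described can be sketched as follows; the table and vocabulary are hypothetical, and the sketch also checks that indexing the table row equals the classical one-hot matrix product:

```python
import numpy as np

# Hypothetical embedding table: one k-dimensional row per category value.
rng = np.random.default_rng(42)
k, vocab = 4, {"action": 0, "comedy": 1, "drama": 2}
embedding_table = rng.normal(size=(len(vocab), k))

def embed(category):
    """One-hot encoding followed by a matrix product is equivalent
    to a row lookup, so we simply index the table directly."""
    return embedding_table[vocab[category]]

one_hot = np.array([0.0, 1.0, 0.0])       # "comedy"
via_matmul = one_hot @ embedding_table    # classical formulation
via_lookup = embed("comedy")              # what is actually done
assert np.allclose(via_matmul, via_lookup)
```

This equivalence is why production systems never materialize the one-hot vectors: the Embedding Table query is the whole operation.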
(2) FM Layer
The FM layer is based on an improvement of the Wide part of the Wide and Deep model, and its basic principle is essentially the same as that of the FM part of DeepFM described above, except for the way its output connects upward. Specifically, we have Formula (8):

$$y_{FM} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j \tag{8}$$

where $w_0$ is the bias term and can theoretically be set to 0; in practical application, owing to the particularity of the ARFM model, assigning $w_0$ the original input has the same effect as adding the original input at the Residual Units Layer. $w_i$ is the weight learned by the Embedding in the previous layer, one per feature, and $\mathbf{v}_i$ represents the implicit vector that the model learns by itself, updated automatically at each iteration.
It can be seen from Formula (8) that the function of the FM layer is to calculate the interaction between first-order and second-order features, so it has stronger expression ability than a linear model.
(3) Hidden Layer
The hidden layer is not a single network layer but a small neural network, equivalent to the operation of the Deep part in Wide and Deep. Our processing still inherits the relevant parts of DeepFM. The input of the hidden layer comes from the Embedding layer below; the intermediate layers are processed by two ReLU functions, and the final result is output by a Sigmoid function.
Figure 7 shows the concrete structure of the Hidden Layer:
Next, we define the input and output of each layer:

$$\mathbf{e} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m]$$

where $\mathbf{e}$ represents the output of the Embedding layer and $\mathbf{e}_i$ represents the vector corresponding to the $i$th feature:

$$\mathbf{a}^{(1)} = \sigma\big(\mathbf{W}^{(1)} \mathbf{e} + \mathbf{b}^{(1)}\big)$$

with $\mathbf{a}^{(1)}$, $\mathbf{W}^{(1)}$, and $\mathbf{b}^{(1)}$ being, respectively, the output, parameter matrix, and bias of the first layer, and $\sigma$ the activation function. So, the final output of the Hidden Layer is:

$$y_{DNN} = \sigma\big(\mathbf{W}^{(H)} \mathbf{a}^{(H-1)} + \mathbf{b}^{(H)}\big)$$

where $H$ represents the number of Hidden Layer layers.
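The forward pass just described, embeddings in, ReLU layers in the middle, a sigmoid output at the end, can be sketched in NumPy; the layer widths are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer_forward(e, params):
    """Forward pass of the hidden (deep) part: the embedding vector e
    passes through ReLU layers, then a sigmoid output unit."""
    a = e
    for W, b in params[:-1]:
        a = relu(W @ a + b)            # ReLU hidden layers
    W_out, b_out = params[-1]
    return sigmoid(W_out @ a + b_out)  # sigmoid output unit

rng = np.random.default_rng(7)
e = rng.normal(size=8)                 # embedding-layer output
params = [(rng.normal(size=(16, 8)), np.zeros(16)),
          (rng.normal(size=(16, 16)), np.zeros(16)),
          (rng.normal(size=(1, 16)), np.zeros(1))]
y = hidden_layer_forward(e, params)
```

With two entries before the output tuple, this matches the two-ReLU structure described in the text.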
(4) Residual Units Layer
The problem the residual network solves is the exploding and vanishing gradients that occur when a network has too many hidden layers. In ARFM, we use the identity-mapping principle of the residual network to reuse our original input data several times, further enhancing the ability of feature expression. The identity principle is shown in Formula (12):

$$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x} \tag{12}$$

where $\mathbf{x}$ is our original input. After adding the original data through the residual network, Formula (14) can be obtained:

$$\mathbf{x}_{l+1} = F(\mathbf{x}_l) + \mathbf{x}_l \tag{14}$$

Therefore, as shown in Figure 1, the input of the residual unit layer is composed of data before and after processing. In ARFM, the data before processing are the original input data, denoted by $\mathbf{x}_0$. The processed data can be further divided into two parts: $y_{FM}$ from the FM layer and $y_{DNN}$ from the hidden layer. Therefore, the final equation is:

$$\mathbf{x}_{R} = \mathbf{x}_0 + y_{FM} + y_{DNN}$$
(5) Aging of the attention layer
In the research description at the beginning of this section, we summarize two situations in which an aging mechanism has been proposed.
a. The proportion of popular items is very small, and a vast majority of unpopular items cannot be effectively recommended to users.
b. Users who are highly active or have used the platform for a long time tend to seek unpopular items, whereas new users tend to accept popular items.
Therefore, the aging mechanism should be applied in the same direction from the perspective of both items and users: the more popular an item or the more active a user, the smaller the proportion it should occupy in the recommendation results. We add two parameters, user activity and item popularity, each handled by a function $f$ that scales the parameter to between 0 and 1. The two parameters are then combined with the attention mechanism, and the corresponding weight is assigned to the data using the parameter-weight concept of attention. The attention mechanism, one of the hottest recent techniques in recommendation systems, originates from the natural human habit of selective attention. A typical example is browsing a webpage: users selectively pay attention to certain areas of the page and ignore others. Based on this phenomenon, it is often profitable to consider the influence of the attention mechanism on the prediction results during modeling.
Similar to the feature crossing of traditional models such as NFM, the feature embedding vectors of different domains are summed by the feature-crossing pooling layer and fed into an output layer composed of a multilayer neural network. The problem lies in the sum-pooling operation, which treats all cross features equally, regardless of their different impacts on the result. This treatment discards a large amount of valuable information.
Therefore, the attention mechanism in the model is mainly applied to assign a weight to each input neuron, reflecting the different importance of different inputs. Specifically, after the residual layer, the attention layer in ARFM calculates the corresponding weight of each neuron's input through the sigmoid function and then multiplies the weight with the input vector to obtain the new weighted vector.
Figure 8 shows the main mechanics of the attention layer.
Thus, we end up with the following formula for the Aging Attention layer:

$$\hat{\mathbf{x}} = \big( y \cdot f(a_u) + (1 - y) \cdot f(p_v) \big) * \mathbf{x} \tag{16}$$

where $f(a_u)$ and $f(p_v)$ are the scaled user activity and item popularity, and $y$ is 0 or 1: when the input $\mathbf{x}$ is a user vector, $y = 1$; when the input is an item vector, $y = 0$. In this manner, the same optimization can be applied flexibly according to the object the input vector represents. It is also worth noting that the operator $*$ in Equation (16) represents element-wise multiplication, meaning that each user or item receives a different weight based on its activity or popularity.
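One possible shape of this aging weighting, under the assumption that $f$ is a sigmoid-style squashing whose complement down-weights hot items and highly active users, is sketched below; the function names and the scale constant are hypothetical, not the paper's exact implementation:

```python
import math

def aging_weight(raw, scale=100.0):
    """Hypothetical aging function f: squash raw popularity/activity
    into (0, 1) with a sigmoid, then invert it, so hotter items and
    more active users receive SMALLER weights."""
    squashed = 1.0 / (1.0 + math.exp(-raw / scale))
    return 1.0 - squashed

def age_vector(vec, y, user_activity, item_popularity, scale=100.0):
    """Down-weight a user vector (y=1) by activity, or an item
    vector (y=0) by popularity, via element-wise multiplication."""
    w = y * aging_weight(user_activity, scale) + \
        (1 - y) * aging_weight(item_popularity, scale)
    return [w * v for v in vec]

hot = age_vector([1.0, 1.0], y=0, user_activity=0, item_popularity=500)
cold = age_vector([1.0, 1.0], y=0, user_activity=0, item_popularity=5)
# The cold item keeps a larger share of its signal than the hot one.
```

Whatever its exact functional form, the mechanism's defining property is this monotone inversion: popularity buys a smaller, not larger, weight.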
- C.
Summary of ARFM model
In general, the ARFM model integrates structures from other models, such as DeepFM, and adds new structures and algorithms on the basis of the generalization and memory concepts of the wide and deep model. To better perform feature crossing and mine potential connections between features, we were inspired by the deep crossing model and added a multi-residual network to optimize the network structure. To address the long-tail distribution and cold-start problems, we explored the relationship between item popularity and user activity in the dataset, proposed an aging mechanism, added it to the model, and thereby broke the recommendation dead cycle. Compared to other models, the ARFM model therefore has the following advantages:
- (1)
Compared with the deep crossing model, the ARFM model has a more complex and three-dimensional network structure and a higher recommendation accuracy based on the excellent algorithm foundation of wide and deep models.
- (2)
Compared with the deep FM and AFM models, the addition of a multi-residual network based on multidisciplinary crossover results in a model with a higher degree of feature crossover and utilization and further improves the accuracy without an obvious speed decrease.
- (3)
Compared with other traditional machine learning models, ARFM improves both speed and accuracy, and the aging mechanism in the model is suitable for various application scenarios.
4. Experiments and Results
- A.
Settings
Before conducting the experiments on the ARFM model, we discuss two key issues that the model must address.
Q1. Does our model have smaller loss values than other models when tested in various scenarios?
Q2. Is our model more accurate than the others when tested in a variety of situations?
(1) Dataset
When selecting datasets, we considered both type and size. In terms of type, we chose one dataset based on user data and one based on items. The user dataset is the data extracted by Becker from the 1994 census database [36]; each record contains a person's age, work, and other attributes, and income can be predicted from this dataset. The MovieLens dataset [37] was selected as the item-based dataset, and its 1 M and 10 M versions were selected according to size to perform multiclass classification of movie ratings.
(2) Input data processing
The processing of the input data is relatively simple. First, we remove useless or invalid data from the dataset; this ensures that we obtain a correct encoding sequence when applying one-hot encoding to the categorical features. Next, one-hot encoding is applied to the categorical features, including the user's job type, gender, residence, and education level, as well as the movie's genre, rating, director, leading actor, and other features. This step yields encoding sequences of different degrees of sparsity. In particular, we clipped features with more than 100 categories to ensure that the model would not stall because of an excessive number of inputs. Finally, we combine the processed one-hot encodings with the original numerical features to form the ARFM input.
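The clipping step described above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code: the function and variable names are our own, and we assume (as one common choice) that clipped categories are merged into a single shared "other" slot.

```python
from collections import Counter

def one_hot_clipped(values, max_categories=100):
    """One-hot encode a categorical column, clipping rare categories.

    Categories beyond the `max_categories` most frequent ones are mapped
    to a shared 'other' slot, mirroring the clipping step described above.
    (Names and the 'other'-slot choice are illustrative assumptions.)
    """
    counts = Counter(values)
    kept = [cat for cat, _ in counts.most_common(max_categories)]
    index = {cat: i for i, cat in enumerate(kept)}
    other = len(index)  # shared slot for clipped categories
    width = other + 1
    encoded = []
    for v in values:
        row = [0] * width
        row[index.get(v, other)] = 1
        encoded.append(row)
    return encoded

# Example: encode a small 'occupation' column
rows = one_hot_clipped(["engineer", "teacher", "engineer", "nurse"])
```

Each row is a sparse 0/1 vector; in practice these vectors are concatenated with the raw numerical features to form the model input.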
(3) Evaluation methodology
The evaluation of our model primarily focused on the two aforementioned questions, considering the loss size and the accuracy rate. The dataset was divided into training and test sets according to a certain ratio; the model was trained on the training set and evaluated on the test set. For the loss, we chose the log loss; that is,

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\Big(1-h_{\theta}\big(x^{(i)}\big)\Big)\right] \qquad (17)$$

Formula (17) is the general expression of the loss function, where $m$ is the number of samples, $\theta$ is the parameter vector of the model, and $h_{\theta}(x^{(i)})$ is the model's predicted probability for the $i$-th sample.
For accuracy evaluation, we summed the squared deviations between the predicted results and the true values to obtain a total deviation value; the accuracy is then the proportion of the remaining correct predictions among all predictions.
(4) Baselines
Finally, our model is compared with the following models:
- (a)
Deep crossing model: As the inspiration for our multi-residual network, it serves as a baseline against which we measure the improvement achieved by our model.
- (b)
Wide and deep model: As the framework of our model's basic structure and the originator of the memorization-plus-generalization design, it serves as a baseline against which we measure the improvement achieved by our model.
- (c)
Deep FM model: As an improvement on the wide and deep model and the reference for the FM part of our model, it serves as a baseline against which we measure the improvement achieved by our model.
- (d)
AFM model: As an application of the attention mechanism that provided basic ideas for our model, it serves as a baseline against which we measure the improvement achieved by our model.
(5) Parameter setting
For the hidden layers of the ARFM model, dense layers of 512, 256, and 128 units were set, respectively, with a ReLU layer added after each dense layer. For the experimental setup, 80% of the data was used as the training set and the remaining 20% as the test set. The batch size was 128, and the number of epochs was set to 15.
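The 512–256–128 hidden-layer stack with ReLU activations can be sketched as a forward pass in NumPy. This is only an illustration of the layer shapes: the random weights stand in for learned parameters, and the input width of 300 is an arbitrary placeholder for the concatenated one-hot and numerical features.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, in_dim, out_dim):
    """One dense layer followed by ReLU (weights randomly initialized
    here for illustration; the real model learns them during training)."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.01
    b = np.zeros(out_dim)
    return np.maximum(x @ w + b, 0.0)

# A batch of 128 samples, matching the batch size used in the experiments.
x = rng.standard_normal((128, 300))
h = dense_relu(x, 300, 512)
h = dense_relu(h, 512, 256)
h = dense_relu(h, 256, 128)
```

Each layer narrows the representation, ending in a 128-dimensional hidden vector per sample, from which the final prediction head would be computed.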
- B.
Loss (Q1)
To address the first question, i.e., the loss value of the model, we conducted experiments on three datasets and compared the results with those of the other models. In this section, we describe the model's performance on the different datasets and draw the corresponding conclusions.
(1)
Figure 9 shows the performance results for dataset 1:
(2)
Figure 10 shows the performance results for dataset 2:
(3)
Figure 11 shows the performance results for dataset 3:
(4) Conclusions:
In this section, we compare the loss values of the models on the different datasets. It is evident from the figures that the loss of the ARFM model is excellent overall: it is second only to the wide and deep model on Datasets 1 and 2, with the gap narrowing, and it finally reaches the optimal value on Dataset 3. This shows that the ARFM model performs well on all types of datasets and is particularly well suited to large datasets.
- C.
Accuracy (Q2)
In this section, we focus on the accuracy of the model on the different datasets to address the second question. The experimental results are as follows:
(1)
Figure 12 shows the performance results for dataset 1:
(2)
Figure 13 shows the performance results for dataset 2:
(3)
Figure 14 shows the performance results for dataset 3:
(4) Conclusions:
The comparison in this section shows that, within a reasonable fluctuation range, the accuracy of the ARFM model was excellent on all datasets, reaching the maximum value among all the models. The ARFM model thus has very high overall quality, excellent robustness, and strong generalization ability.
- D.
Data summary:
In this section, we present the results of the ARFM and the other models in tabular form and, finally, report the accuracy gains of the ARFM over the other models.
(1)
Table 2 lists their accuracies:
(2)
Table 3 shows the performance growth:
- E.
Comprehensive analysis of model performance
According to the data tables and experimental results, the improvement of the ARFM over the other models is quite remarkable. It can be seen intuitively that the ARFM maintains a steady advantage in both loss value and final prediction accuracy. In particular, time consumption requires an additional explanation. Owing to equipment processing and dataset fluctuation, the training and prediction times fluctuate considerably. On average, however, the ARFM model performed only slightly better than the deep crossing model. Our analysis suggests that this is probably due to the aging mechanism, namely the use of the Aging Attention Layer, which makes the optimization of our dataset particularly important. This is most obvious on the 10 M MovieLens dataset, so we infer that the performance advantage of the ARFM model is more pronounced on larger datasets. In the next stage, we will focus on the time complexity of the model and its sensitivity to datasets of different sizes.