1. Introduction
With the increasing convenience of information access through the Internet, enormous volumes of data are generated as users consume videos, commodities, news, music, and more. For example, the transaction volume of Tmall's Double 11 in 2022 reached CNY 557.1 billion, and this explosion of data causes the problem of information overload [1]. A recommendation system models users' consumption preferences from their behavior, predicts the items they may be interested in, and provides personalized recommendation services. It can also bring commercial value to enterprises: for example, 80% of the movies watched on Netflix come from its recommendation system [2], which is why researchers and multimedia service providers pay increasing attention to recommendation systems. Current recommendation systems can be divided into three categories: traditional model-based, deep learning-based, and deep reinforcement learning-based. However, the following limitations still exist.
Firstly, the quality of recommendation results depends mainly on the data exchanged between users and the system, yet most recommendation systems do not fully exploit the cross relationships within these data. A recommendation framework captures user preferences from user data, item data, user–item interaction data, and statistical data, and none of these data points exist independently. Early machine learning (ML) practitioners sought to improve the expressiveness of their models by manually identifying feature crosses [3]. Recently, Wang et al. proposed the Deep & Cross Network (DCN) [4] and its improved version, DCN-V2 [5]. This framework effectively learns explicit and implicit feature crosses and achieved high accuracy in ranking experiments within Google's production system.
Second, most recommendation models assume that user interests are static and do not consider the impact of long-term returns. In reality, however, a recommendation system depends strongly on users' continuous decision-making behaviors. While short-term returns are essential, failing to account for the long-term returns of a recommendation may bias its behavior. Xie et al. proposed a meta-learning framework (LSTTM) for online recommendation; deployed on WeChat Top Stories, it achieved remarkable results [6]. Shivaram et al. proposed an attention-based recommendation framework that increases attention to specific topic words, avoiding excessive focus on generally popular terms and reducing the homogenization effect in recommendations [7]. As a concrete example, suppose the system recommends a news article and the user responds with actions such as likes and favorites. These actions indicate that the user may be interested in the topic, but the user may not welcome a stream of related news over the long term if their long-term preferences are ignored: news is highly time-sensitive, and recommending stale articles is unlikely to be attractive.
Reinforcement learning, an integral part of artificial intelligence, was first proposed to solve optimal control problems. It aims to maximize the expected cumulative reward through continuous trial and error between an agent and its environment. In practical recommendation applications, users' behavioral characteristics change constantly as they interact with the system. Only by dynamically adjusting recommendation actions according to users' real-time behavior can long-term revenue be maximized, which matches the characteristics of reinforcement learning. Reinforcement learning approaches to recommendation mainly comprise value-based and policy-based methods. The dynamic recommendation process based on reinforcement learning is shown in Figure 1, which includes three parts: the user, the agent (recommendation system), and the item list. Users first view the recommended items in the list and then give feedback according to their preferences, including clicks, favorites, forwarding, etc. In particular, the items users act on are recorded in order. Based on the user's feedback, the agent continually learns to adjust its recommendation actions and predicts the items the user may be interested in.
In this study, we present a deep deterministic policy gradient-based reinforcement learning recommendation method called DDRCN, which comprises dual Actor and Critic networks for recommendation action generation and action value evaluation, respectively. In particular, we use a deep cross network to process the basic features of the user data and to fit a state representation for strategy selection and value estimation. During action selection, the Actor policy network fits the policy function through the deep cross network and directly outputs recommended actions, which are saved to the item recommendation pool. The Critic network estimates the value of the user's current state and action, and the model is trained by the two networks together. Extensive experiments show that the proposed recommendation framework achieves good recommendation results. The contributions of this paper are as follows:
We provide a deep deterministic policy gradient recommendation framework, DDRCN, that fuses deep cross networks. The framework uses the Actor-Critic approach, which maximizes the cumulative reward of recommendations through the continuous exploration and exploitation of the Actor and Critic networks, combined with greedy strategies.
In this recommendation framework, we fit the data features between users and items, utilizing a deep cross network. The deep cross network consists of a Deep network and a Cross network, and the two networks work together to compute the cross relationship between the features.
We conducted recommendation experiments on real-world movie and music datasets, and the experimental results show that our proposed model outperforms its competitors in terms of recommendation accuracy and ranking quality.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 introduces the preliminary knowledge. Section 4 elaborates on the proposed recommendation framework. Section 5 discusses the experimental results, and Section 6 concludes the paper.
3. Preliminaries
Reinforcement learning involves an environment and an agent; the goal of maximizing the cumulative reward is achieved through their continuous interaction, which can be modeled as a Markov Decision Process (MDP), abstracted as a five-tuple $(S, A, R, P, \gamma)$. $S$ denotes the set of states, $A$ denotes the set of actions, $R$ is the reward function, $P$ is the transition probability, and $\gamma$ is the discount coefficient. In this setting, the recommendation system is abstracted as the agent, the user as the environment, each recommendation performed by the system as an action, and the user's behavioral characteristics as states. In each recommendation round, the system dynamically updates its action based on the user's state and feedback to maximize the cumulative profit of the recommendations. The MDP formulation of the recommendation process is as follows:
State space $S$: $S$ is the set of environment states, and $s_t \in S$ represents the state of the agent at time $t$, i.e., the interaction between the user and the recommendation system at time $t$.
Action space $A$: $A$ is the set of actions the agent can take, and $a_t \in A$ represents the action taken by the agent at time $t$. In particular, actions here refer to continuous action vectors.
Reward $R$: The recommendation system recommends actions based on the user's state and behavior, and the user provides feedback (click, rating, retweet, etc.). The system receives an immediate reward $r(s_t, a_t)$ based on this feedback.
State transition $P$: When the recommendation system recommends action $a_t$ at time step $t$, the user's state is transferred from $s_t$ to $s_{t+1}$.
Discount factor $\gamma$: $\gamma \in [0, 1]$ indicates the importance of future rewards; $\gamma$ close to 1 emphasizes long-term rewards, while $\gamma$ close to 0 emphasizes immediate rewards.
Figure 2 illustrates the interaction process between the agent and the environment. The agent gives an action $a_t$ according to the current state $s_t$. After receiving the action, the environment transitions from $s_t$ to $s_{t+1}$ and rewards the agent with $r_t$ for its behavior. The agent receives the reward $r_t$ and the state $s_{t+1}$, takes the next action $a_{t+1}$, and so on. In particular, the agent's action is not a recommended item or a recommendation sequence, but a continuous parameter vector. Taking the inner product of this vector with the item embeddings yields each item's rating, and items are recommended to the user in order of rating. In this paper, we construct a recommendation framework with an Actor-Critic network, where the Actor-network is a policy network and the Critic-network is a state value estimation network; through these dual networks, the model acts toward high cumulative rewards. By contrast, Q-learning stores Q-values (the values of state-action pairs) in a Q-table and updates them by repeatedly traversing the table, an approach unsuitable for large-scale scenarios. SA2C [21] is a variant of the Actor-Critic network for recommendation scenarios that introduces supervised learning to guide the Actor-network toward correct actions. In this paper, the Actor-Critic network is trained with a deep deterministic policy gradient; the model is converged by maintaining an Actor Target network and a Critic Target network.
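To make this interaction loop concrete, the sketch below mirrors the agent-environment cycle in Python. The `RecEnvironment` and `RecAgent` classes are hypothetical stand-ins rather than any part of the paper's implementation: the environment simulates user feedback with random states and a placeholder reward, and the agent maps a state to a continuous action vector.

```python
# Minimal sketch of the recommendation MDP loop described above.
# `RecEnvironment` and `RecAgent` are hypothetical stand-ins, not the
# paper's implementation.

import numpy as np

class RecEnvironment:
    """Simulated user: yields states s_t, rewards r_t, next states s_{t+1}."""
    def __init__(self, state_dim=16):
        self.state_dim = state_dim
        self.state = np.zeros(state_dim, dtype=np.float32)

    def reset(self):
        self.state = np.random.randn(self.state_dim).astype(np.float32)
        return self.state

    def step(self, action):
        # Placeholder for user feedback (click, rating, retweet, ...).
        reward = float(np.dot(self.state, action) > 0)
        self.state = np.random.randn(self.state_dim).astype(np.float32)
        return self.state, reward

class RecAgent:
    """Maps a state s_t to a continuous action vector a_t."""
    def act(self, state):
        return np.tanh(state)  # placeholder policy

env, agent = RecEnvironment(), RecAgent()
s = env.reset()
total_reward = 0.0
for t in range(10):      # one short interaction episode
    a = agent.act(s)     # agent gives action a_t for state s_t
    s, r = env.step(a)   # environment yields r_t and s_{t+1}
    total_reward += r
print("cumulative reward:", total_reward)
```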
4. The Proposed Framework
4.1. The Architecture
As mentioned in Section 1, traditional and deep learning recommendation methods treat user decision behavior as static and consider only immediate rewards, ignoring long-term returns. To address these shortcomings, we propose a deep deterministic policy gradient recommendation method incorporating a deep cross network, which comprises two main parts: an Actor policy network and a Critic value network. In particular, the Actor-network contains a state representation module, as shown in Figure 3.
4.1.1. Actor Policy Neural Networks
The Actor network outputs actions based on user state features, as shown in the left part of the architecture. The user behavior vector includes user features, item features, statistical features, and scene features. These vectors are passed through the state representation module (a deep cross network) to obtain the user state representation vector. The state is a generalized representation of the user's preferences, and the state at time $t$ is defined as follows:

$$s_t = f(H_t),$$

where $f$ is the state representation function and $H_t = \{h_1, h_2, \ldots, h_t\}$ represents the embedding vectors of the items with which the user has interacted in the recommendation system. When the agent recommends an item according to the policy, if the user gives positive feedback on the recommended action, then $H_t$ is updated to $H_{t+1}$; otherwise, $H_{t+1}$ remains equal to $H_t$. The user feature representation vector is the input, and the recommended action is output after three layers with activation functions. The action under state $s$ is defined as follows:

$$a = \pi_\theta(s),$$

where $a$ represents the continuous action vector output by the Actor network and $\pi_\theta$ represents the selection strategy; here we use $\epsilon$-greedy. Specifically, the score of each candidate item is calculated as follows [22]:

$$\mathrm{score}_i = h_i \, a^\mathsf{T},$$

where $h_i$ is the embedding vector of candidate item $i$. Finally, the recommendation results are obtained by ranking the items according to their scores.
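A minimal PyTorch sketch of this action-selection procedure follows. The layer widths, the `epsilon` value, and the helper names are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of the Actor policy network with epsilon-greedy exploration and
# inner-product scoring of candidate items. Dimensions and epsilon are
# assumed values, not the paper's configuration.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three-layer network mapping a state vector to an action vector."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

def select_action(actor, state, action_dim, epsilon=0.1):
    """Epsilon-greedy: take a random action with probability epsilon."""
    if torch.rand(1).item() < epsilon:
        return torch.rand(action_dim) * 2 - 1   # explore
    with torch.no_grad():
        return actor(state)                     # exploit

def rank_items(action, item_embeddings, k=10):
    """Score each candidate item by the inner product h_i . a,
    then return the indices of the top-k items."""
    scores = item_embeddings @ action
    return torch.topk(scores, k).indices

# Usage with toy dimensions:
actor = Actor(state_dim=64, action_dim=32)
state = torch.randn(64)
a = select_action(actor, state, action_dim=32)
top_items = rank_items(a, item_embeddings=torch.randn(1000, 32))
```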
4.1.2. Critic Value Neural Networks
The Critic network is a deep Q network, as shown on the right of the model architecture figure. It estimates the quality of a state-action pair, namely the Q-value function: the true value function $Q^\pi(s, a)$ is approximated by a neural network $Q_w(s, a)$. The Q-value is a scalar that allows the model to be updated and optimized to improve the action. We update the Actor network with a deterministic policy gradient approach [23], formulated as follows:

$$\nabla_\theta J \approx \frac{1}{N} \sum_i \nabla_a Q_w(s, a) \big|_{s = s_i,\, a = \pi_\theta(s_i)} \, \nabla_\theta \pi_\theta(s) \big|_{s = s_i},$$

where $J$ denotes the expectation of the Q-value and $N$ is the batch size. The Critic network is updated using temporal difference learning [24], as follows:

$$y_i = r_i + \gamma Q_{w'}\!\left(s_{i+1}, \pi_{\theta'}(s_{i+1})\right), \qquad L = \frac{1}{N} \sum_i \left(y_i - Q_w(s_i, a_i)\right)^2,$$

where $y_i$ represents the temporal difference target value, $\gamma$ represents the discount rate, $L$ represents the mean square error loss, $w'$ represents the parameters of the Critic Target network, and $\theta'$ represents the parameters of the Actor Target network.
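These two updates can be written compactly in code. The sketch below assumes an `Actor` module like the one sketched in Section 4.1.1 and a simple MLP Critic; it illustrates one schematic DDPG update step under those assumptions, not the paper's exact training code.

```python
# One DDPG update step following the equations above: TD target from the
# target networks, MSE loss for the Critic, deterministic policy gradient
# for the Actor. Networks and hyperparameters are illustrative.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q_w(s, a): scores a state-action pair with a scalar Q-value."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99):
    # s, a: shape (N, dim); r: shape (N, 1); s_next: shape (N, dim)
    s, a, r, s_next = batch

    # Critic: TD target y_i = r_i + gamma * Q_{w'}(s_{i+1}, pi_{theta'}(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e., maximize Q_w(s, pi_theta(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```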
4.1.3. State Representation Module
The state representation module encodes user characteristics and provides the input from which both neural networks make their predictions. The BINN framework [25] provides an item embedding method for user interactions, and we adopt it for the subsequent recommendation tasks. We use a deep cross network to mine the cross relationships among the recommendation data and obtain the user state representation vector. Its advantages include fully mining the cross relationships between features, preventing vanishing gradients, and being memory- and computation-friendly, as shown in Figure 4.
The input of the state representation module is the concatenation of the user feature vector, item feature vector, statistical feature vector, and scene feature vector from the recommendation data. The DCN includes a Deep Network and a Cross Network. The Deep Network is a fully connected network defined as follows [5]:

$$h_{l+1} = \mathrm{ReLU}(W_l h_l + b_l),$$

where $h_l$ represents the feature embedding vector at layer $l$, and $W_l$ and $b_l$ are the weight matrix and bias, whose output is passed to the next layer through the ReLU activation function. The Cross Network mines feature interactions by decomposing the weight matrix into subspaces and performing explicit feature crossing. The feature vector at each layer is defined as follows [5]:

$$x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l,$$

where $x_0$ represents the concatenation of the initial feature embedding vectors. The outputs of the Deep Network and the Cross Network are stacked, and the user state representation vector is finally produced through a single linear mapping.
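The cross layer admits a compact implementation. The following sketch follows the published DCN-V2 formulation [5]; the dimensions, layer counts, and class names are illustrative assumptions rather than the paper's exact code.

```python
# Sketch of the state representation module: a DCN-V2-style cross network
# and a deep (MLP) network run in parallel over the concatenated feature
# embedding x0; their outputs are stacked and linearly projected to the
# user state vector. Dimensions are assumed.

import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """DCN-V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl

class StateRepresentation(nn.Module):
    def __init__(self, in_dim, state_dim, n_cross=3, hidden=128):
        super().__init__()
        self.cross = nn.ModuleList(CrossLayer(in_dim) for _ in range(n_cross))
        self.deep = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.out = nn.Linear(in_dim + hidden, state_dim)

    def forward(self, x0):
        xl = x0
        for layer in self.cross:      # explicit feature crossing
            xl = layer(x0, xl)
        deep_out = self.deep(x0)      # implicit feature interactions
        return self.out(torch.cat([xl, deep_out], dim=-1))

# x0 = concat(user, item, statistical, scene feature embeddings)
state = StateRepresentation(in_dim=96, state_dim=64)(torch.randn(8, 96))
```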
4.2. Training Process
We train the framework with the deep deterministic policy gradient algorithm. The training process includes two stages: state transition generation (lines 4–12) and parameter updating (lines 13–20), as shown in Algorithm 1.
In the state transition generation phase, we initialize the Actor and Critic network parameters and fit the user's state representation vector with the deep cross network. Then, according to the state $s_t$ and the policy $\pi_\theta$, the recommended action vector $a_t$ is calculated, and its inner product with the embedding vectors of the candidate items gives each item's score. The reward $r_t$ is then calculated based on the user's current state and action, and the quadruple $(s_t, a_t, r_t, s_{t+1})$ is saved in the experience pool.
During the parameter update stage, we first sample a small batch of quadruples from the experience replay pool, then calculate the objective function and the loss function according to lines 14–15 of Algorithm 1. The parameters of the Actor network, the Critic network, and the target networks are updated through the soft-replace strategy.
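The experience replay pool and the soft-replace step can be sketched as follows; the buffer size and the coefficient `tau` are assumed hyperparameters, not values reported in the paper.

```python
# Sketch of experience-replay sampling and the soft-replace (Polyak)
# update of the target networks; buffer size and tau are assumptions.

import random
from collections import deque
import torch

# Each stored element is a tuple of tensors (s, a, r, s_next),
# with r shaped (1,) so that stacking yields an (N, 1) batch.
buffer = deque(maxlen=100_000)

def sample_batch(n):
    """Sample a small batch of quadruples from the replay pool."""
    batch = random.sample(buffer, n)
    s, a, r, s_next = map(torch.stack, zip(*batch))
    return s, a, r, s_next

def soft_replace(target_net, net, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```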
4.3. Evaluation Process
For reinforcement learning, the most direct way to evaluate the framework is to let users interact with the recommendation system and assess its quality in the real environment. However, the online environment carries potential business risks and deployment costs, so we use offline evaluation to complete this task.
The offline evaluation tests the effect of the policy learned by the framework, as shown in Algorithm 2. The initial item and state representation are observed from the offline log. The recommendation agent calculates the recommended action vector $a_t$ according to the current state $s_t$ and the policy $\pi_\theta$. Then, each item's score is calculated, and item $i_t$ is recommended according to the score. Finally, the current reward $r_t$ is calculated from the offline log, the user's state $s_t$ is updated, and the recommended items are deleted from the candidate pool.
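A sketch of this offline evaluation loop is given below. The `logged_reward` lookup is a hypothetical stand-in for the offline-log machinery, and the state update from logged feedback is omitted for brevity; it is an illustration of the loop's structure, not the paper's implementation.

```python
# Sketch of the offline evaluation loop of Algorithm 2: score candidates,
# recommend the top-k, collect rewards from the offline log, and remove
# recommended items from the candidate pool. Helpers are hypothetical.

import torch

def logged_reward(item_id, offline_log):
    """Hypothetical lookup: reward observed for this item in the log."""
    return offline_log.get(item_id, 0.0)

def offline_evaluate(actor, item_embeddings, offline_log, init_state,
                     steps=20, k=10):
    state = init_state
    candidates = list(range(item_embeddings.shape[0]))
    total = 0.0
    for _ in range(steps):
        with torch.no_grad():
            action = actor(state)                  # a_t = pi_theta(s_t)
        idx = torch.tensor(candidates)
        scores = item_embeddings[idx] @ action     # inner-product scores
        top = idx[torch.topk(scores, min(k, len(candidates))).indices]
        for item in top.tolist():
            total += logged_reward(item, offline_log)  # reward from log
            candidates.remove(item)                    # drop recommended item
        # (state update from the logged feedback omitted in this sketch)
    return total
```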
Algorithm 1: Training Process
Algorithm 2: Offline Evaluation Process
6. Conclusions
In this paper, we propose a deep deterministic policy gradient recommendation framework incorporating deep cross networks, which exploits the cross relationships in the data and addresses the continuous decision-making problem in recommendation systems. It is based on the Actor-Critic architecture, where the Actor network and the Critic network are responsible for recommendation action generation and value estimation, respectively. The dual networks interact continuously with feedback until the model converges. We use a deep cross network to mine the feature relationships in the data, and a deep deterministic policy gradient approach, combined with a greedy policy, to train DDRCN toward recommendations with high cumulative returns. We evaluate DDRCN with the Precision and NDCG metrics from the information retrieval domain. We conduct experiments on the publicly available MovieLens and Lastfm datasets, and the results outperform existing baseline methods in terms of Precision@5, Precision@10, NDCG@5, and NDCG@10. In future research, we will consider issues such as recommendation efficiency and explore deeper recommendation data features such as semantics.