Article

Personalized News Recommendation Method with Double-Layer Residual Connections and Double Multi-Head Self-Attention Mechanisms

1 Key Laboratory in Media Convergence of Yunnan Province, Kunming 650504, China
2 School of Software, Yunnan University, Kunming 650504, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(13), 5667; https://doi.org/10.3390/app14135667
Submission received: 17 March 2024 / Revised: 19 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Advances in Recommender Systems and Information Retrieval)

Abstract

With the development of the Internet, the amount of news and messages available to users keeps growing, so there is an urgent need to help users access the information they are interested in. Many existing methods feed text directly into a pre-trained model, which limits the effectiveness of text feature extraction. The personalized news recommendation model discussed in this article enhances feature extraction from news articles. It consists of a candidate news module, a historically accessed news module, and an access prediction module. Because news titles accurately summarize news content, a model with double multi-head attention mechanisms and double residual structures (DDM) is used to better capture the features of the news articles historically accessed by users, thereby improving recommendation quality. The candidate news module helps the model learn representations of news that users are likely to select from the news titles. The user historical click news module enables the model to learn personalized user representations from previously browsed news. The model was tested on MIND-small, achieving an AUC of 0.6665, an MRR of 0.3205, an nDCG@5 of 0.3532, and an nDCG@10 of 0.4158. These results indicate that the model performs well on the downstream task of processing news-title texts.

1. Introduction

With the development of the Internet, a huge amount of news is generated at every moment all over the world, which leads to an information overload problem [1,2]. Therefore, methods that help users find news of interest to them are very important. The key challenges of news recommendation are understanding user interests or preferences from their reading history and creating personalized matches [3] with a large number of news articles.
The accurate matching of user interests with candidate news is key to personalized news recommendations. Many methods have been proposed to improve the accuracy of matching user interests with candidate news [4]. Traditional recommendation methods often use categorical features (e.g., news ID, news category) or bag-of-words (tokens or n-grams) to model news contents [5]. For instance, classical recommendation models like the traditional LibFM model [6] and the DeepFM model [7] are both based on factorization machines. Unlike recommendations in many other domains, news has a time-sensitive nature and tends to become outdated. Therefore, traditional ID-based methods often encounter the cold start problem [8].
Many modern methods leverage deep learning techniques to learn news features and user interests in order to address the cold start problem. Numerous research efforts have attempted end-to-end learning of news representations, such as embedding words into low-dimensional vectors and applying popular network architectures like CNNs and attention mechanisms to learn hidden news representations, enhancing recommendation accuracy. Although these neural news recommendation models have shown improved performance, how to better model the complex correlations between user interests and news still requires further exploration. For example, Okura et al. (2017) [9] proposed learning news representations with autoencoders and learning user representations from accessed news with GRUs. However, GRUs are computationally expensive and fail to capture word contexts. Wu et al. [10] used a news representation model that learns news representations from titles using a CNN and employs user ID embeddings as queries for word-level and news-level attention networks. Wang et al. (2018) introduced a deep-knowledge-aware network for news recommendations (DKN) [8], which preprocesses news using a knowledge graph, learns news representations from news titles using a three-channel CNN, and calculates relevance by aggregating candidate articles with browsed articles. However, the CNN fails to capture long-range word contexts and, thus, cannot model relationships between accessed news items [11]. In the neural news recommendations with multi-head self-attention (NRMS) model [11], although multi-head attention mechanisms are utilized to process news, the single layer of multi-head attention is too simplistic to extract news features effectively.
The work in this paper is inspired by many classical papers in the field of news recommendation, as well as by the format of news headlines in real life. From studying various news recommendation models, we found that the most important step is the processing of the information in news headlines. For textual data, the key is to enable the model to better understand the relationships between words and between sentences. For example, the BERT [12] model uses two pre-training tasks to help the model better understand the relationships between words and between sentences. However, in real life, most news headlines consist of only one sentence. Therefore, as news titles accurately summarize news content, we believe that effectively handling the relationships between words and improving feature extraction in this setting is likely to have a positive impact on the final recommendation results.
Therefore, this paper proposes a model with DDM that has an improved feature extraction efficiency for the preprocessing of news headlines. This model consists of three main modules: the historically accessed news module, the candidate news module, and the final prediction module. The objective is to enhance news-feature extraction, as both historically accessed news and candidate news contain news information. Our created structure is applied to process the textual information of historically accessed news and candidate news for a user. In the model, there are two layers of residual connections and two rounds of multi-head attention mechanisms. Additionally, an additive attention mechanism is employed to select important words and news, aiming to enhance the model’s effectiveness.
The main contributions of this paper are as follows:
  • The designed double residual connections and double multi-head attention mechanisms in this paper are capable of better capturing the word-level interaction information between a user’s historically accessed news and candidate news;
  • In this paper, inner and outer double residual structures were designed to enhance the model’s effectiveness and prevent the overfitting of word-level interaction information between a user’s historically accessed news and candidate news during model training;
  • The proposed method in this paper was validated by conducting a substantial number of experiments with the real-world MIND, released by Microsoft Research. Furthermore, this paper conducted a series of ablation studies to further explore the efficacy of our model.

2. Related Works

In this section, this article will introduce some related work in the field of news recommendations.
Traditional news recommendation methods mainly rely on collaborative filtering algorithms, content-based recommendation algorithms, and hybrid recommendation algorithms combined with manual feature engineering to represent users and news [5]. For example, collaborative filtering is a classic recommendation algorithm that can effectively utilize user behavior information to capture user interests and recommend similar news, and is also relatively easy to implement and explain [13,14,15]. Content-based recommendation algorithms [16,17] are mainly used for recommending text-based items, which is a method that utilizes users’ historical behaviors, user features, and item features to provide recommendations for users. It first constructs user features, then item features, and finally recommends to users based on user features and item features. Hybrid recommendation algorithms [18,19] combine various types of recommendation algorithms through methods such as weighting, switching, blending, stacking, cascading, and feature combination, aiming to enhance the performance and effectiveness of recommendation systems. Therefore, many early news recommendation algorithms adopted these methods. Manual feature engineering is a commonly used feature representation method. For news recommendations, some news-related features such as keywords, topics, etc., are utilized to construct news-feature vectors, in addition to some user attributes and behavior information, such as user age, geographical location, etc. The advantages of these methods lie in their ability to utilize abundant domain knowledge to build high-quality feature vectors and improve the accuracy of recommendations. However, these methods also have some limitations. For instance, collaborative filtering algorithms encounter the “cold start” problem and perform poorly when user behavior is limited. Content-based recommendation algorithms typically rely on pre-defined features to describe items, which need to be extracted or specified manually. Due to this dependency, the algorithms may struggle to accurately capture the latest features or attribute changes of emerging and complex items, thereby resulting in recommendations that lack sufficient novelty and personalization. On the other hand, manual feature engineering is constrained by feature sparsity and scalability, making it difficult to handle large-scale and diverse data.
Therefore, in recent years, researchers have begun to explore more intelligent methods, such as deep learning, to address these challenges. For instance, DKNs preprocess news through a knowledge graph, then learn news representations from news titles using a three-channel CNN and calculate relevance by aggregating candidates and news. NPA [2] applies personalized attention mechanisms to model user interests in different contexts and news. NRMS [11] utilizes a multi-head attention mechanism to construct word-level news-encoding modules. FIM [20] uses dilated convolution to extract multi-scale text features, and also employs 3D convolutions and MaxPooling to achieve fine-grained matching between accessed and candidate news at each semantic level. After 2021, many news recommendation methods have incorporated BERT modules for feature extraction in the preprocessing upstream part; some even combining multimodal methods by integrating information such as images to comprehensively process news, such as the UVCAN model [21]. However, this paper primarily focuses on studying the impact of structural optimization of the downstream part of preprocessing on the feature extraction effectiveness of news titles, so there is no comparison with incorporating BERT and multimodal news recommendation methods in this paper.
This paper explores the advantages of some previous methods. However, unlike models such as DKN, NPA, and FIM, the DDM model introduced in this paper does not utilize CNNs; instead, it directly processes news features using a multi-head attention mechanism. This is also in contrast to the NRMS model, where the attention mechanism is employed simplistically without fully exploiting the model’s potential. Therefore, this paper aims to investigate the influence of multiple multi-head attention mechanisms and residual structures on model effectiveness. By comparing various classic models, it is evident that the utilization of multi-head attention mechanisms significantly impacts the final results. In the model designed in this paper, the two-layer residual connections and two multi-head attention mechanism modules can better focus on word-level interaction information between historical user-accessed news and candidate news.

3. Methods of DDM

Currently, the architectures of news recommendation models based on deep learning are quite similar. They typically consist of a candidate news module, a user historical click news module, and a final click prediction module. The essence of model learning can be simply understood as importing user historical click news into the user historical click news module for feature extraction to simulate user interests. Subsequently, candidate news is processed through the candidate news module for feature extraction. Finally, the extracted features of the candidate news are compared with the simulated user interests, and the final evaluation prediction is made through the click prediction module. Models such as DKN [8], NRMS [11], NPA [2], and FIM [20] essentially follow this architecture. Their differences mainly lie in the way and method of feature extraction.
The structure of the DDM news recommendation model presented in this paper is illustrated in Figure 1. It consists of a candidate news module, a historically accessed news module, and a final click prediction module. The red boxes above each key structure contain their respective names. In Figure 1, the left side represents the integration of multi-head attention mechanisms and single-layer residuals. The central DDM part is the structure created in this work. The upper box represents the accessed news module, and the lower box represents the candidate news module. The model starts from the accessed news and candidate news in the graph.

3.1. Candidate News Module

The candidate news module is used to learn news representations from news titles and consists of three layers. The first layer is word embedding, which is used to convert a news title from a sequence of words into a low-dimensional embedded vector sequence. Suppose a news title is represented by $M$ words $[W_1, W_2, W_3, \ldots, W_M]$. Through this layer, it is transformed into a vector sequence $[E_1, E_2, E_3, \ldots, E_M]$.
The second layer consists of the word-level two-layer residual connections and two multi-head attention mechanism modules designed in this paper. The interaction between words is crucial when learning news representations. However, long-distance interactions are often challenging to capture using CNNs. Additionally, a single word may interact with multiple words within the same news article. Therefore, in this paper, two-layer residual connections and two multi-head self-attention modules are employed to capture the interactions between words and learn the contextual representations of the words.
The specific formulation of the multi-head self-attention mechanism in this module can be expressed as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where Q, K, and V, respectively, denote the query, key, and value matrices, obtained by projecting the word representation matrix $E_t$ with different learned projection matrices, as described in the equations below. $1/\sqrt{d_k}$ is a scaling factor that produces a softer attention distribution and helps prevent vanishing gradients. Multi-head self-attention applies $h$ attention functions in parallel and concatenates their outputs:
$h_{i,k}^{w} = \mathrm{Attention}\left(E_t W_k^{Q},\ E_t W_k^{K},\ E_t W_k^{V}\right)$

$h_{i}^{w} = \left[h_{i,1}^{w},\ h_{i,2}^{w},\ \ldots,\ h_{i,h}^{w}\right]$
$E_t$ denotes the matrix of word representations, while $W_k^{Q}$, $W_k^{K}$, and $W_k^{V}$ are the learnable parameters of the $k$-th attention head. With multi-head attention, information from different representation subspaces at different positions can be learned jointly, which helps capture matching signals between different words. $h_i^{w}$ denotes the representation of the $i$-th word, obtained by concatenating the outputs of the individual self-attention heads. The following introduces the core contribution of this article. The equations for the DDM module are as follows:
$H = h_i^{w}$

$H_1 = \mathrm{LN}(\mathrm{MA}(H) + H)$

$H_2 = \mathrm{LN}(\mathrm{MA}(H_1) + H_1)$

$H_3 = \mathrm{LN}(\mathrm{MA}(H_2) + H)$
LN denotes layer normalization, while $\mathrm{MA}(\cdot)$ denotes a multi-head attention layer that processes the output of the previous layer. Through multiple iterations, we found that this set of equations gives the best experimental performance. The DDM module significantly enhances model performance compared with many classical models; specific results are presented in the experimental section. In addition, $H_1$, $H_2$, and $H_3$ have the same shape as $h_i^{w}$, so $H_3$ can be regarded as a reinforced version of $h_i^{w}$. In the subsequent equations, $h_i^{w}$ denotes this reinforced output, i.e., $h_i^{w} = H_3$.
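To make the structure above concrete, the following is a minimal PyTorch sketch of the DDM block that follows the equations as printed; the class name DDMBlock and its constructor arguments are illustrative and are not the authors' released implementation (the head count and dimension follow the settings reported in Section 4.3).

```python
import torch
import torch.nn as nn

class DDMBlock(nn.Module):
    """Sketch of the DDM stack: multi-head self-attention layers with inner
    residual connections (H1, H2) and an outer residual back to the input H
    (H3), each followed by layer normalization, as in the equations above."""

    def __init__(self, dim: int = 300, n_heads: int = 20):
        super().__init__()
        self.ma1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ma2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ma3 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) contextual word representations h_i^w
        h1 = self.ln1(self.ma1(h, h, h)[0] + h)      # H1 = LN(MA(H) + H)
        h2 = self.ln2(self.ma2(h1, h1, h1)[0] + h1)  # H2 = LN(MA(H1) + H1)
        # H3 = LN(MA(H2) + H), outer residual back to the input; per the
        # description in Section 4.3 ("two multi-head attention mechanisms"),
        # this third attention call could alternatively be omitted.
        h3 = self.ln3(self.ma3(h2, h2, h2)[0] + h)
        return h3
```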
The third layer is an additive attention network. Different words within the same news article may have varying importance when representing said news. Therefore, various methods for selecting important words are compared in this article, including additive attention mechanisms, local attention mechanisms, and three other methods. Through ablation experiments, the additive attention mechanism that best suits the dataset and structure of this article is selected. The attention weight a i w for the i-th word in a news headline is calculated as follows:
$a_i^{w} = q_w^{\top} \tanh\left(V_w \times h_i^{w} + v_w\right)$

$\alpha_i^{w} = \frac{\exp\left(a_i^{w}\right)}{\sum_{j=1}^{M} \exp\left(a_j^{w}\right)}$
In this equation, V w and v w are projection parameters, and q w is the query vector. The final representation of the news is obtained by taking the weighted sum of the contextual word representations, which can be expressed as follows:
$r = \sum_{i=1}^{M} \alpha_i^{w} h_i^{w}$
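As an illustration, here is a minimal sketch of the additive attention pooling described by the three equations above; the class name and the query dimension (200) are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention pooling: a_i = q^T tanh(V h_i + v),
    alpha_i = softmax(a_i), r = sum_i alpha_i h_i."""

    def __init__(self, dim: int = 300, query_dim: int = 200):
        super().__init__()
        self.proj = nn.Linear(dim, query_dim)               # V and bias v
        self.query = nn.Parameter(torch.randn(query_dim))   # query vector q

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) word (or news) representations
        scores = torch.tanh(self.proj(h)) @ self.query       # (batch, seq_len)
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)  # attention weights
        return (alpha * h).sum(dim=1)                         # weighted sum, (batch, dim)
```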

3.2. Historically Accessed News Module

In this module, the model mainly deals with a user’s historically accessed news. The module consists of two layers. We believe that a user’s historically accessed news represents the user’s interests over a period of time. Therefore, processing a user’s historically accessed news will result in a mathematical representation of the user’s real-life interests. User interests do not tend to change in the short term. Therefore, there is inevitably a correlation between the news that users have browsed and the news they want to browse. This correlation will inevitably lead to interactions between keywords in different news articles. In this article, these correlations and interactions are used to make predictions and recommendations.
In the first layer, this article once again uses DDM to enhance the accuracy of news recommendations by determining the interactions between news articles. The specific formula in this part is embodied in the multi-head self-attention mechanism:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
where Q, K, and V represent the query, key, and value matrices, respectively, obtained by projecting the input representation matrix $E_t$ with different learned projection matrices. $1/\sqrt{d_k}$ is a scaling factor that produces a softer attention distribution and helps avoid vanishing gradients. Multi-head self-attention applies $h$ attention functions in parallel and concatenates their outputs:
$h_{i,k}^{n} = \mathrm{Attention}\left(E_t W_k^{Q},\ E_t W_k^{K},\ E_t W_k^{V}\right)$

$h_{i}^{n} = \left[h_{i,1}^{n},\ h_{i,2}^{n},\ \ldots,\ h_{i,h}^{n}\right]$
where $E_t$ here denotes the matrix of representations of the browsed news, and $W_k^{Q}$, $W_k^{K}$, and $W_k^{V}$ are the learnable parameters of the $k$-th attention head. With multi-head attention, information from different representation subspaces at different positions can be learned jointly, which helps capture matching signals between different news items. $h_i^{n}$ denotes the representation of the $i$-th news item, obtained by concatenating the outputs of the individual self-attention heads. The following equations describe the DDM module at the news level.
$H = h_i^{n}$

$H_1 = \mathrm{LN}(\mathrm{MA}(H) + H)$

$H_2 = \mathrm{LN}(\mathrm{MA}(H_1) + H_1)$

$H_3 = \mathrm{LN}(\mathrm{MA}(H_2) + H)$
LN denotes layer normalization, and $\mathrm{MA}(\cdot)$ denotes a multi-head attention layer that processes the output of the previous layer. The best experimental results are obtained with this arrangement of layers; specific results are given in the experimental section. Moreover, $H_1$, $H_2$, and $H_3$ have the same shape as $h_i^{n}$, so $H_3$ can be regarded as a reinforced version of $h_i^{n}$. In the subsequent formulas, $h_i^{n}$ denotes this reinforced output, i.e., $h_i^{n} = H_3$.
The second layer is an additive attention network. Different news articles may carry varying amounts of information that represent a user. An additive attention mechanism is used here to select important news articles to learn a more informative user representation. In the experimental section, this article also compares the additive attention mechanism to other methods such as the proximity attention mechanism. The attention weight calculation for the i-th news article is as follows:
$a_i^{n} = q_n^{\top} \tanh\left(V_n \times h_i^{n} + v_n\right)$

$\alpha_i^{n} = \frac{\exp\left(a_i^{n}\right)}{\sum_{j=1}^{N} \exp\left(a_j^{n}\right)}$
where $V_n$ and $v_n$ are projection parameters, and $q_n$ is the query vector of the attention network. $N$ denotes the number of browsed news articles. The final representation of the user's historically accessed news is obtained as the weighted sum of the news representations, expressed as follows:
$u = \sum_{i=1}^{N} \alpha_i^{n} h_i^{n}$
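Putting the two layers together, the following is a hedged sketch of how the user encoder could be composed from the same building blocks; DDMBlock and AdditiveAttention are the illustrative classes sketched in Section 3.1, and the title encoder is assumed to map embedded titles of shape (batch, title_len, dim) to news vectors of shape (batch, dim).

```python
import torch
import torch.nn as nn

class UserEncoder(nn.Module):
    """Sketch: encode each historically accessed title into a news vector r_i,
    then apply news-level DDM and additive attention to obtain the user vector u."""

    def __init__(self, title_encoder: nn.Module, dim: int = 300):
        super().__init__()
        self.title_encoder = title_encoder        # word-level DDM + additive attention
        self.news_ddm = DDMBlock(dim)             # captures interactions between news
        self.news_attn = AdditiveAttention(dim)   # selects informative news

    def forward(self, history_emb: torch.Tensor) -> torch.Tensor:
        # history_emb: (batch, n_news, title_len, dim) embedded words of browsed titles
        b, n, l, d = history_emb.shape
        news_vecs = self.title_encoder(history_emb.view(b * n, l, d)).view(b, n, d)
        return self.news_attn(self.news_ddm(news_vecs))  # u: (batch, dim)
```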

3.3. Click Prediction Module

This module is used to predict the probability of a user clicking on a news article. In this study, the result of the user’s historically accessed news is denoted as u, and the result of a candidate news article is denoted as r k . The predicted click probability score, denoted as y ^ , is computed as the dot product between the user representation vector and the news representation vector, that is:
$\hat{y} = u^{\top} r_k$

3.4. Model Training

In this study, the negative sampling technique is used for model training. For each news article that a user views (a positive sample), the model randomly selects K news articles that are similar but were not accessed by the user (negative samples). The order of these news articles is shuffled to avoid potential positional bias. The click probability scores of the positive news article and the K negative news articles are denoted as $\hat{y}^{+}$ and $[\hat{y}_1^{-}, \hat{y}_2^{-}, \hat{y}_3^{-}, \ldots, \hat{y}_K^{-}]$, respectively. These scores are normalized with the softmax function to compute the posterior click probability of the positive sample as follows:
$p_i = \frac{\exp\left(\hat{y}_i^{+}\right)}{\exp\left(\hat{y}_i^{+}\right) + \sum_{j=1}^{K} \exp\left(\hat{y}_{i,j}^{-}\right)}$
In this study, the problem of predicting click probabilities is reformulated as a pseudo (K+1)-way classification task, and the model is trained with a loss function that computes the negative log-likelihood over the set $\mathcal{S}$ of all positive samples. The loss function is formulated as follows:
$\mathcal{L} = -\sum_{i \in \mathcal{S}} \log p_i$
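For illustration, here is a minimal sketch of the dot-product scoring (Section 3.3) and the (K+1)-way softmax loss above; placing the positive score first and applying cross-entropy with target index 0 is equivalent to the printed negative log-likelihood. The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def click_scores(user_vec: torch.Tensor, news_vecs: torch.Tensor) -> torch.Tensor:
    # user_vec: (batch, dim); news_vecs: (batch, K + 1, dim), positive sample first
    return torch.einsum("bd,bkd->bk", user_vec, news_vecs)   # y_hat = u^T r_k

def training_loss(user_vec: torch.Tensor, news_vecs: torch.Tensor) -> torch.Tensor:
    scores = click_scores(user_vec, news_vecs)                # (batch, K + 1)
    # softmax over one positive and K negatives; negative log-likelihood of the positive
    target = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, target)
```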

4. Experiments

4.1. Datasets and the Experimental Setup

Experiments were conducted on the real-world MIND dataset collected from MSN News (https://msnews.github.io/). MIND has two versions: MIND-large and MIND-small. In this paper, MIND-small is used; it is a smaller version of MIND obtained by uniformly sampling daily user behavior logs. Detailed statistics are shown in Table 1.

4.2. Evaluating Indicator

In the news recommendation field, several evaluation metrics are commonly used: AUC, MRR, nDCG@5, and nDCG@10. The metric values reported in this paper are averaged over the impression logs. These metrics are used to compare the proposed model against several classical baseline models.
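As a reference, the sketch below gives per-impression implementations of the ranking metrics (AUC can be computed with scikit-learn's roc_auc_score); these follow the standard MIND-style definitions and are then averaged over all impression logs, but the function names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # AUC over the scores of one impression

def mrr(labels: np.ndarray, scores: np.ndarray) -> float:
    # mean reciprocal rank of the clicked news within one impression
    order = np.argsort(scores)[::-1]
    ranks = np.arange(1, len(labels) + 1)
    pos = labels.sum()
    return float(np.sum(labels[order] / ranks) / pos) if pos > 0 else 0.0

def dcg_at_k(labels: np.ndarray, scores: np.ndarray, k: int) -> float:
    top = labels[np.argsort(scores)[::-1][:k]]
    return float(np.sum(top / np.log2(np.arange(2, len(top) + 2))))

def ndcg_at_k(labels: np.ndarray, scores: np.ndarray, k: int) -> float:
    ideal = dcg_at_k(labels, labels, k)       # DCG of the ideal ordering
    return dcg_at_k(labels, scores, k) / ideal if ideal > 0 else 0.0

# auc = roc_auc_score(labels, scores)  # computed per impression, then averaged over logs
```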

4.3. Model and Training Details

In the model training experiments, MIND-small was used to determine the parameters. The self-attention network has 20 heads. The batch size for training the model was set to 80, the model was trained for five epochs, the learning rate for the model was set to 2 × 10−5, the optimizer used for training the model was Adam [22], the length of news titles was set to 30, and the number of historical reading records was set to 50. The word embedding dimension was set to 300, and it was initialized using Glove embedding [23]. The number of positive articles extracted for each user was 50, and the number of negative articles extracted was 4. The DDM module utilizes three residual connections and two multi-head attention mechanisms. The specific dataset and code have been uploaded to the network (https://github.com/zzy-d/DDM.git).
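For convenience, the hyperparameters reported above are collected here into a single configuration dictionary (the key names are illustrative):

```python
config = {
    "attention_heads": 20,        # heads in each self-attention network
    "batch_size": 80,
    "epochs": 5,
    "learning_rate": 2e-5,        # optimized with Adam
    "title_length": 30,           # words kept per news title
    "history_length": 50,         # historical reading records per user
    "word_embedding_dim": 300,    # initialized with GloVe embeddings
    "negative_samples": 4,        # K negatives per positive sample
}
```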

4.4. Performance Evaluation

This article primarily evaluates the performance of the proposed method by comparing it with several baseline methods, including the following:
(1)
LibFM [6]. A factorization-machine-based recommendation method that extracts features from both user-accessed and candidate news and concatenates them as the input vector to the model.
(2)
DeepFM [7]. Similar to LibFM, it utilizes factorization machines for recommendations.
(3)
DKN [8]. This approach preprocesses the news using knowledge graphs and learns news representations from news titles using a three-channel convolutional neural network (CNN).
(4)
NPA [2], which applies personalized attention mechanisms to model users’ interests in different contexts and news.
(5)
NAML [24]. A neural news recommendation method with attentive multi-view learning.
(6)
LSTUR [25]. A neural news recommendation method that utilizes GRUs to learn user representations.
(7)
NRMS [11]. A neural news recommendation method that uses single-layer multi-head self-attention to learn user and news representations in the pre-training phase.
(8)
FIM [20], which leverages dilated convolutions for multi-scale text feature extraction and utilizes 3D convolutions with MaxPooling for fine-grained matching between browsed and candidate news at each semantic level.
(9)
KIM [26]. This approach primarily employs a graph co-attention network.
(10)
DDM. The method presented in this paper.
To ensure a fair comparison, all methods only use news titles. The application results of these methods on MIND-small are summarized in Table 2.
From Table 2, it can be observed that traditional feature-based deep factorization models such as LibFM and DeepFM exhibit a relatively lower performance, while end-to-end models such as DKN, NPA, NAML, NRMS, and FIM exhibit relatively better results. This indicates the superiority of end-to-end approaches. Furthermore, the use of knowledge graphs and a three-channel CNN in the DKN for news-title feature extraction results in a lower performance compared to the attention-based NPA, NAML, NRMS, and FIM models. Establishing knowledge graphs in the KIM model is still not an ideal method.
The method proposed in this paper consistently outperforms other baseline methods in all evaluation metrics. This significant improvement demonstrates that the enhanced two-layer residual connections and two multi-head attention mechanisms improve the news-feature extraction. The enhancement in the pre-training phase contributes to an overall improvement in the news recommendation model’s performance.

4.5. Ablation Study

4.5.1. Ablation Experiment with DDM and GRU

In a previous section, it was noted that some scholars have incorporated GRU modules into news recommendation models and achieved good results, so the two approaches are compared here. This article attempted to integrate a GRU module into the news recommendation model. The experimental results are shown in Table 3.
From the experimental results, it can be seen that, although adding the GRU module improves the performance compared to only using a multi-head attention mechanism module, this improvement is not significant. However, the performance improvement is quite significant when using the DDM module. We believe that integrating the GRU module does not effectively exploit the model’s potential. It exhibits a relatively low effectiveness in extracting features from news headlines and fails to achieve a more ideal overall recommendation effect. Therefore, this article abandoned the approach of incorporating a GRU module.
Once again, the superiority of the DDM structure over other structures is demonstrated in the data above.

4.5.2. Comparison of Residual Structure and Attention Layers

In order to study the effectiveness of improved multi-layer residual connections and multiple attention mechanisms in enhancing news-feature extraction, each part is evaluated by disabling another part. Firstly, a single multi-head attention mechanism is used to preprocess the news. This is then compared with the performance of a single-layer model after the addition of residual structures, as shown in the left panel of Figure 2. Next, the performance of the model with added residual structures and a multi-layer multi-head attention mechanism is compared, as illustrated in the middle panel of Figure 2. Finally, the model with two-layer residual connections and two rounds of multi-head attention mechanisms is compared with the other models, as depicted in the right panel of Figure 2.
From the left panel of Figure 2, it can be observed that the model with residual structures demonstrates better performance in extracting features from news data. Therefore, in subsequent experiments, models with residual structures are used. We believe that this result can be attributed to the fact that the residual structures improve the overall model fitting, ultimately enhancing the model’s effectiveness. Since the addition of residual structures represents a significant change from no residual structures, the performance improvement is notable. From the middle panel of Figure 2, it is evident that there is a significant difference in the effectiveness of multi-head attention mechanisms with different numbers of layers for extracting news features. According to the experimental results, using a module with two rounds of multi-head attention mechanisms and internal residuals yields better results in extracting features from news data. We believe that this outcome is due to the limitations of a single round of multi-head attention mechanisms, while three rounds of multi-head attention could lead to overfitting. Therefore, the two-round multi-head attention mechanism proves to be the most effective. From the right panel of Figure 2, it can be observed that the presence or absence of a dual-layer residual structure, both internally and externally, also has an impact on the effectiveness of news-feature extraction. Based on the experimental results, it can be concluded that the model with both internal and external dual-layer residual structures performs better in extracting features from news data. We believe that this is due to the improved overall model fitting achieved by incorporating dual-layer residuals, thereby enhancing the training effectiveness of the model.

4.5.3. Comparison of Additive Attention Mechanisms

In this section, this article will demonstrate the impact of the additive attention mechanism [11], scaled dot-product attention mechanism [27], local attention mechanism [28], adaptive attention mechanism [29], and self-attention mechanism [30] on optimizing news data. Based on the experimental results presented here, it is possible to determine which mechanism yields the best performance. Table 4 shows the results of each method after five epochs.
The additive attention mechanism calculates attention weights by computing the similarity between a query vector and a key vector and applying the attention weights to a value vector. Its computation involves two steps: first, a neural network or linear transformation maps the query vector and key vector into the same dimensional space; second, the similarity between the mapped query vector and key vector is computed. The additive attention mechanism is suitable for short sequences and small-scale attention processing. From the line chart, it can be seen that the performance of the additive attention mechanism gradually improves, which makes it highly suitable for short-sequence tasks such as news-headline processing.

For scaled dot-product attention, attention weights are calculated by computing the dot product between a query vector and a key vector, scaling it, and then normalizing the scores into attention weights with the softmax function.

The self-attention mechanism establishes associations between different positions in the input sequence. It calculates attention weights by mapping queries, keys, and values through different linear transformations, and applies the attention weights to the values to generate the final representation.

The local attention mechanism is distance-based and is commonly used to handle longer input sequences. It considers the distance between each position in the input sequence and the target position and computes attention weights by normalizing this distance, so that positions closer to the target position receive higher attention weights. The local attention mechanism is suitable for tasks with clear positional relationships in the sequence, such as machine translation and speech recognition.

Adaptive attention can flexibly control attention by learning weight parameters according to the requirements of the task, adaptively weighting the input features.
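To make the difference between the two main candidates concrete, the following is a small sketch of the two scoring rules under comparison (additive versus scaled dot-product); the tensor shapes are simplified and the function names are illustrative.

```python
import torch

def additive_scores(keys: torch.Tensor, V: torch.Tensor, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # a_i = q^T tanh(V k_i + v): scores from a small feed-forward projection
    # keys: (seq_len, d), V: (p, d), v: (p,), q: (p,)
    return torch.tanh(keys @ V.T + v) @ q          # (seq_len,)

def scaled_dot_product_scores(keys: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # a_i = q^T k_i / sqrt(d_k): scores from a scaled dot product
    # keys: (seq_len, d), q: (d,)
    return (keys @ q) / (keys.shape[-1] ** 0.5)    # (seq_len,)
```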
Comparing these mechanisms, it is evident from Figure 3 that the additive attention mechanism yields the best results in our model. Therefore, we consider the additive attention mechanism to be well-suited for the characteristics of short sequences and small-scale attention processing in this study. In comparison to other methods, it is particularly suitable for handling news titles.

4.6. The Effectiveness of DDM

In this section, this article explores the overall effectiveness of using both internal and external residual layers and the dual multi-head attention mechanism in our model. Our ablation experiments showed that the combination of internal and external residual layers and two attention mechanisms is highly effective. This is because modeling the interaction between words and selecting important words helps the model learn informative news representations, and the internal and external residual layers together with the two attention mechanisms can better fit such interactions and select important words. News-level attention is also useful, because modeling the interaction between news items and selecting important news items aid in learning informative user representations [11]. However, we also experimented with adding further residual and multi-head attention mechanisms at the news level and found that this negatively impacted the results. This suggests that it is important to modify the model in appropriate places: too many residual structures may disrupt previously extracted news-level features, and too many instances of the multi-head attention mechanism may result in overfitting.

5. Discussion

This article proposes a news recommendation method based on internal and external residual layers and a dual multi-head attention mechanism. The core of our method consists of a candidate news module and a historically accessed news module. In these two modules, the interactions between words and between news items are modeled, and internal and external residual layers and dual attention mechanisms are applied to learn contextual word and news representations. Additionally, an additive attention mechanism is used to select important words and news features in order to obtain more informative user and news representations. Unlike models such as DKN, NPA, and FIM, the DDM model does not employ a CNN but directly processes news features with multi-head attention mechanisms. This also contrasts with the NRMS model, in which the attention mechanism is employed in a simplistic manner that does not fully exploit the model's potential. Comparative experiments and ablation studies show that using CNNs is not ideal and that employing GRUs is less effective than directly using the DDM structure, validating the effectiveness of this approach.
In future work, we will strive to improve our method in the following directions. Firstly, the positional information of words and news is not considered in the model in this paper, although it could be useful for learning more accurate news and user representations; some news recommendation methods have already used BERT to preprocess news features, such as UNBERT and Minder (these models differ substantially from the method proposed in this paper, so they were not compared here). Therefore, we will explore position encoding techniques, such as the DeBERTaV3 [31] method in natural language processing, combined with word positions, news access types, timestamps, user genders, etc., to further strengthen our method. Secondly, we will explore how to effectively incorporate various types of news data, such as news content and image and video information, into the framework. Extracting important information from news content could positively influence news recommendation results. This is a new direction in this area, quite different from traditional news recommendation methods, and it poses challenges regarding model and dataset requirements.

Author Contributions

Conceptualization, D.Z. (Dehai Zhang) and Z.Z.; methodology, D.Z. (Dehai Zhang); software, Z.Z.; validation, Z.Z., D.Z. (Dehai Zhang), and D.Z. (Di Zhao); formal analysis, Y.C. and L.X.; investigation, J.W.; resources, Z.W. and L.X.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z.; visualization, Z.Z.; supervision, Z.Z.; project administration, D.Z. (Dehai Zhang) and D.Z. (Di Zhao); funding acquisition, D.Z. (Dehai Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by (i) the National Natural Science Foundation of China (NSFC) under Grant No. 62366059 and (ii) the Open Foundation of the Key Laboratory in Media Convergence of Yunnan Province under Grant No. 220235202.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

All data included in this study are available upon request by contacting the corresponding author.

Acknowledgments

The authors thank Zetao Zhang for the help provided.

Conflicts of Interest

We declare that we do not have any commercial or associative interests that may represent a conflict of interest in connection with the work submitted.

Abbreviations

The following abbreviations are used in this manuscript:
DDM: Double-layer residual connections and double multi-head self-attention mechanisms
BERT: Pre-training of deep bidirectional transformers for language understanding
LibFM: Factorization machines with libFM
DeepFM: A factorization-machine-based neural network for CTR prediction
DKN: Deep knowledge-aware network for news recommendations
NAML: Neural news recommendation with attentive multi-view learning
LSTUR: Neural news recommendation with long- and short-term user representations
NPA: Neural news recommendation with personalized attention
NRMS: Neural news recommendation with multi-head self-attention
FIM: Fine-grained interest matching for neural news recommendations
KIM: Personalized news recommendations with knowledge-aware interactive matching

References

1. Phelan, O.; McCarthy, K.; Bennett, M.; Smyth, B. Terms of a feather: Content-based news recommendation and discovery using Twitter. In Proceedings of the European Conference on Information Retrieval, Dublin, Ireland, 18–21 April 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 448–459.
2. Wu, C.; Wu, F.; An, M.; Huang, J.; Huang, Y.; Xie, X. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2576–2584.
3. Wu, F.; Qiao, Y.; Chen, J.-H.; Wu, C.; Qi, T.; Lian, J.; Liu, D.; Xie, X.; Gao, J.; Wu, W.; et al. MIND: A large-scale dataset for news recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3597–3606.
4. Raza, S.; Ding, C. News recommender system: A review of recent progress, challenges, and opportunities. Artif. Intell. Rev. 2022, 55, 749–800.
5. Zhang, Q.; Li, J.; Jia, Q.; Wang, C.; Zhu, J.; Wang, Z.; He, X. UNBERT: User-news matching BERT for news recommendation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; Volume 21.
6. Rendle, S. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. (TIST) 2012, 3, 1–22.
7. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247.
8. Wang, H.; Zhang, F.; Xie, X.; Guo, M. DKN: Deep knowledge-aware network for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1835–1844.
9. Okura, S.; Tagami, Y.; Ono, S.; Tajima, A. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1933–1942.
10. Wu, C.; Wu, F.; An, M.; Qi, T.; Huang, J.; Huang, Y.; Xie, X. Neural news recommendation with heterogeneous user behavior. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4876–4885.
11. Wu, C.; Wu, F.; Ge, S.; Qi, T.; Huang, Y.; Xie, X. Neural news recommendation with multi-head self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6390–6395.
12. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
13. Liu, J.; Dolan, P.; Pedersen, E.R. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, Hong Kong, China, 7–10 February 2010; pp. 31–40.
14. Capelle, M.; Frasincar, F.; Moerland, M.; Hogenboom, F. Semantics-based news recommendation. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, Craiova, Romania, 6–8 June 2012; pp. 1–9.
15. Son, J.-W.; Kim, A.-Y.; Park, S.-B. A location-based news article recommendation with explicit localized semantic analysis. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 28 July–1 August 2013; pp. 293–302.
16. Garcin, F.; Dimitrakakis, C.; Faltings, B. Personalized news recommendation with context trees. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 105–112.
17. Pazzani, M.J.; Billsus, D. Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization; Springer: Berlin/Heidelberg, Germany, 2007; pp. 325–341.
18. Balabanović, M.; Shoham, Y. Fab: Content-based, collaborative recommendation. Commun. ACM 1997, 40, 66–72.
19. Koren, Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008.
20. Wang, H.; Wu, F.; Liu, Z.; Xie, X. Fine-grained interest matching for neural news recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 836–845.
21. Liu, S.; Chen, Z.; Liu, H.; Hu, X. User-video co-attention network for personalized micro-video recommendation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019.
22. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
23. Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
24. Wu, C.; Wu, F.; An, M.; Huang, J.; Huang, Y.; Xie, X. Neural news recommendation with attentive multi-view learning. arXiv 2019, arXiv:1907.05576.
25. An, M.; Wu, F.; Wu, C.; Zhang, K.; Liu, Z.; Xie, X. Neural news recommendation with long- and short-term user representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 336–345.
26. Qi, T.; Wu, F.; Wu, C.; Huang, Y. Personalized news recommendation with knowledge-aware interactive matching. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021.
27. Tuan, N.M.D.; Minh, P.Q.N. Multimodal fusion with BERT and attention mechanism for fake news detection. In Proceedings of the 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), Hanoi, Vietnam, 19–21 August 2021; pp. 1–6.
28. Manakul, P.; Gales, M.J.F. Long-span summarization via local attention and content selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 22–27 May 2021.
29. Huang, J.; Han, Z.; Xu, H.; Liu, H. Adapted transformer network for news recommendation. Neurocomputing 2022, 469, 119–129.
30. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155.
31. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv 2021, arXiv:2111.09543.
Figure 1. The framework of our DDM approach.
Figure 2. Comparison of residual structure and attention layers.
Figure 3. Comparison of the results of various optimization mechanisms.
Table 1. Statistics of the dataset.

                      Train        Dev          Test
User                  162,898      47,187       88,898
News                  76,904       53,897       57,856
Impressions           199,998      50,002       100,000
Positive samples      300,357      75,183       153,963
Negative samples      7,060,083    1,779,492    3,740,561
Table 2. The results of different methods.

Method      AUC      MRR      nDCG@5    nDCG@10
LibFM       59.74    26.33    27.95     34.29
DeepFM      59.89    26.21    27.74     34.06
DKN         61.75    27.05    28.90     35.38
NPA         63.21    29.11    31.70     37.81
NAML        65.50    30.39    33.08     39.31
LSTUR       64.38    29.46    31.89     38.17
NRMS        64.83    30.01    32.52     38.92
FIM         65.02    30.26    32.91     39.10
KIM         66.25    31.62    34.97     41.16
DDM         66.65    32.05    35.32     41.58
Improv.     0.40     0.43     0.35      0.42
Table 3. Ablation experiment with DDM and GRU.

Ways                                     AUC      MRR      nDCG@5    nDCG@10
Pure multi-head attention                64.83    30.01    32.52     38.92
With incorporation of the GRU module     65.00    30.45    33.35     39.92
DDM module                               66.65    32.05    35.32     41.58
Table 4. Results of various optimization mechanisms.

Ways                                       AUC      MRR      nDCG@5    nDCG@10
Self-attention mechanism                   65.05    30.79    33.73     40.21
Scaled dot-product attention mechanism     65.33    31.27    34.19     40.62
Local attention mechanism                  65.54    30.75    33.65     40.29
Adaptive attention mechanism               65.67    31.39    34.42     40.83
Additive attention mechanism               66.65    32.05    35.32     41.58