Article

MKGCN: Multi-Modal Knowledge Graph Convolutional Network for Music Recommender Systems

1 School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
2 National Forestry and Grassland Administration Engineering Research Center for Forestry-Oriented Intelligent Information Processing, Beijing 100083, China
3 National Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing 100700, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(12), 2688; https://doi.org/10.3390/electronics12122688
Submission received: 30 April 2023 / Revised: 13 June 2023 / Accepted: 14 June 2023 / Published: 15 June 2023
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)

Abstract

With the emergence of online music platforms, music recommender systems are becoming increasingly crucial in music information retrieval. Knowledge graphs (KGs) are a rich source of semantic information for entities and relations, allowing for improved modeling and analysis of entity relations to enhance recommendations. Existing research has primarily focused on the modeling and analysis of structural triples, while largely ignoring the representation and information processing capabilities of multi-modal data such as music videos and lyrics, which has hindered the improvement and user experience of music recommender systems. To address these issues, we propose a Multi-modal Knowledge Graph Convolutional Network (MKGCN) to enhance music recommendation by leveraging the multi-modal knowledge of music items and their high-order structural and semantic information. Specifically, there are three aggregators in MKGCN: the multi-modal aggregator aggregates the text, image, audio, and sentiment features of each music item in a multi-modal knowledge graph (MMKG); the user aggregator and item aggregator use graph convolutional networks to aggregate multi-hop neighboring nodes on MMKGs to model high-order representations of user preferences and music items, respectively. Finally, we utilize the aggregated embedding representations for recommendation. In training MKGCN, we adopt the ratio negative sampling strategy to generate high-quality negative samples. We construct four different-sized music MMKGs using the public dataset Last-FM and conduct extensive experiments on them. The experimental results demonstrate that MKGCN achieves significant improvements and outperforms several state-of-the-art baselines.

1. Introduction

Modern online music services have changed the way people search for and listen to music, offering an extensive array of diverse song catalogues while concurrently enhancing user experiences through personalized optimization [1,2]. In this context, the development of music information retrieval [3] has become crucial in enhancing user experience and improving the profitability of these platforms [4]. Music recommender systems, as the core technology of music information retrieval, can provide personalized music recommendations to users based on their preferences and behavioral patterns, thereby increasing user satisfaction and loyalty and ultimately promoting revenue growth for the music platform [5,6]. The importance of music recommender systems is therefore paramount.
Traditional content-based recommendation methods [4] usually only consider the features of the music itself, neglecting the potential relations between music and other entities, such as artists, albums, and playlists. As a result, they fail to uncover deeper semantic information behind the music [7]. Collaborative filtering (CF) methods [8,9], on the other hand, require a large amount of user behavior data, making them less effective for new users and in cold-start scenarios. Additionally, with the development of the mobile internet, the data used for recommendation have become more specific and diverse, including user ratings, music tags, and multi-modal data such as text, images, audio, and sentiment analysis of the music itself. Therefore, there are still challenges in effectively utilizing side information and multi-modal data to enhance the performance of music recommender systems.
In order to effectively utilize side information to enhance recommendation, researchers have proposed integrating side information into CF, such as social networks [10], user/item attributes [11], images [12], and context [13]. Another effective method is to supplement item features through knowledge graphs (KGs) and then use them for recommendation models. KGs are directed heterogeneous graphs composed of many triples, where nodes correspond to entities and edges correspond to relations [14]. In the music KGs discussed in this paper, music items and their tags, such as artists and music genres, are treated as entities, and relations between entities, such as “sings” (who sings the song), are treated as edges. For example, in the right side of Figure 1, the triple (Love Story, music.singer, Taylor Swift) indicates that the singer “Taylor Swift” sings the song “Love Story”. KGs contain rich entities and relations, which can provide recommendation systems with rich structural and semantic information. Furthermore, researchers have gradually focused on modeling the relations between entities to improve the accuracy and interpretability of recommendations, in addition to using KGs to model the rich tag features of items [15,16].
In Figure 2, we can observe that users visiting a music website typically first encounter the song poster, followed by the text description, and then they may inspect the quality of the lyrics before deciding whether to listen to the song. This process highlights the potential knowledge-level connections between different modalities of data such as text, image, and audio, which can be leveraged by recommendation systems. For music items, the audio and sentiment modalities present a unique opportunity for enhancing the performance of music recommender systems. Therefore, MMKGs can integrate multi-modal data to comprehensively represent user and item information, offering an advantage over single-modal KGs that only contain structural triples. This can improve the accuracy and interpretability of recommendation systems and expand the range of application scenarios for such systems.
In our proposed MKGCN, we aim to address the limitations of existing music recommendation methods and leverage the benefits of graph convolutional networks and MMKGs. We design three key components in MKGCN: (1) The multi-modal aggregator, which can be thought of as a multi-channel convolution kernel with a size of 1 × 1. Each channel corresponds to a particular modality of data and it enhances entity embeddings by aggregating multi-modal data on MMKGs. (2) The user aggregator aggregates the user’s historical interaction items on the collaborative knowledge graph (CKG, as defined in Section 3.3) to generate the user embedding representations. (3) The item aggregator takes into account the user’s preferences for different relations when aggregating neighbor entities on MMKGs. It acts as a weight to aggregate different neighbors and characterizes both the semantic information of KG and the user’s personalized interest in the relation. Furthermore, during the training of MKGCN, we avoid using the traditional random negative sampling strategy, as it may lead to sampling items that users might be interested in as negative samples. Instead, we adopt a ratio negative sampling strategy, which has been proven effective and will be discussed in detail in the experimental section. To evaluate MKGCN’s performance, we construct four music MMKGs of varying sizes and sparsity based on the Last-FM public music dataset, since there is a lack of public music MMKGs. Experimental results demonstrate that MKGCN outperforms several current state-of-the-art baseline models.
The main contributions of this paper are summarized as follows:
  • To the best of our knowledge, MKGCN is the first music recommender system based on multi-modal knowledge graphs. We put a strong emphasis on fully leveraging the multi-modal data of music items, particularly the audio-domain and sentiment features of music, to provide more accurate and personalized music recommendations for users.
  • We propose a novel music recommender system called Multi-modal Knowledge Graph Convolutional Network (MKGCN). Inspired by CNN, we design three aggregators that can effectively integrate different types of data, and propagate entity embeddings through the user’s historical interaction items and diverse adjacency relations of entities on the KG, achieving a high-order integration of user preferences and music items.
  • We conduct extensive experiments to validate the efficacy of MKGCN and have also made our implementation code and self-made music multi-modal knowledge graph publicly available to researchers for replication and further research. The code and dataset can be accessed at https://github.com/QuXiaolong0812/mkgcn (accessed on 20 April 2023).
Throughout this paper, we utilize several abbreviations to represent key concepts and terms. To aid readers in understanding these abbreviations, we provide a summary of the most frequently used ones in Table 1.

2. Related Work

2.1. Convolutional Neural Networks

In recent years, convolutional neural networks (CNNs) have shown impressive performance in the domains of video [17] and images [18]. However, when it comes to non-Euclidean data structures, such as social networks and knowledge graphs, CNNs' efficacy is limited. To address this issue, researchers have proposed graph convolutional networks (GCNs), which extend CNNs to the non-Euclidean domain. By integrating the features and label information of a central node and its neighboring nodes, GCNs produce a regular, fixed-size representation for each node in the graph that can then be processed by convolution operations. In this way, GCNs can combine multi-scale information to create higher-level representations, effectively utilizing both the graph structure information and the attribute information. Due to their powerful modeling capabilities, GCNs have found widespread use in recommender systems [19,20]. There are two primary ways for GCNs to perform convolution operations: (1) spectral graph convolution, which performs an eigendecomposition of the graph Laplacian matrix, and (2) spatial graph convolution, which leverages the spatial structure of the graph to aggregate the representations of neighboring nodes, making each node's neighborhood representation uniform and regular and therefore convenient for convolution operations [21]. KGCN [22] samples the neighbors around a node and dynamically computes local convolutions based on the sampled neighbors to enhance the item embedding representation. LightGCN [23] proposes a lightweight GCN that learns user and item embeddings by linearly propagating them on a user–item bipartite graph and uses the weighted sum of the embeddings learned at all layers as the final embedding.
MKGCN adopts the spatial form of graph convolution. While GCNs have been effective in modeling high-order representations of items in KGs, they often neglect the modeling of user preferences. To address this limitation, we draw inspiration from the work of [22] and construct a collaborative knowledge graph. By using a GCN to model both music items and user preferences simultaneously, we are able to enhance our music recommendations.
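To make the spatial form of graph convolution concrete, the following minimal PyTorch sketch (our illustration, not code from KGCN or LightGCN) lets each node average its neighbors' features and pass the result through a learned linear map; the class name, adjacency format, and dimensions are assumptions for the example.

```python
# A minimal spatial graph-convolution layer: each node averages its
# neighbours' features and passes the result through a learned linear map.
import torch
import torch.nn as nn

class SimpleSpatialGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (num_nodes, in_dim) node features
        # adj: (num_nodes, num_nodes) adjacency matrix with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)   # node degrees
        neighbor_mean = adj @ x / deg                        # average neighbour features
        return torch.relu(self.linear(neighbor_mean))        # transform + nonlinearity

# toy usage: 4 nodes, 8-dimensional features
x = torch.randn(4, 8)
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0],
                                   [1, 0, 1, 0],
                                   [0, 1, 0, 1],
                                   [0, 0, 1, 0]], dtype=torch.float)
layer = SimpleSpatialGCNLayer(8, 16)
print(layer(x, adj).shape)   # torch.Size([4, 16])
```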

2.2. Multi-Modal Knowledge Graph

Multi-modal knowledge graphs (MMKGs) have become increasingly important in the field of artificial intelligence due to the prevalence of multi-modal data in various domains. MMKGs integrate information from different modalities, such as text, image, and audio, into traditional KGs, which typically only contain structural triples, to improve the performance of downstream KG-related tasks [24]. Figure 3 illustrates the two main approaches for constructing MMKGs. (Please note that the face images in the figure are sourced from the open-source Generated Faces dataset. The dataset can be accessed via the link: https://generated.photos/datasets#, accessed on 6 June 2023.) The first approach, attribute-based MMKGs, considers multi-modal data as specific attribute values of entities or concepts, such as the “poster” and “audio” of a music entity in Figure 3a. The second approach, entity-based MMKGs, treats multi-modal data as separate entities in the KG, as shown by the image-based representation of singer and country entities in Figure 3b. However, entity-based MMKGs do not fuse multi-modal data and therefore limit the exploitation of multi-modal information [25,26].
In our work, we recognize that the multi-modal data associated with each music item, such as posters and audio, are unique and cannot be shared across different items. Therefore, we construct an equal amount of multi-modal data for each music item, resulting in an attribute-based MMKG, where each music entity has its own set of multi-modal data. We will elaborate on the MMKG construction process in detail in the experiments section.

2.3. Recommendations with MMKGs

As MMKGs are a relatively new concept, there is limited related work on MMKG-based recommender systems. This paper proposes three classifications of existing MMKG-based recommender systems from the perspective of multi-modal feature fusion [27]: (1) The feature-fusion method, also known as the early-fusion method, concatenates features extracted from different modalities into a single high-dimensional feature vector that is then fed into a downstream task. For instance, MKGAT [28] first performs separate feature representations of multi-modal data such as text, image, and triples, and then aggregates the embedding representation of the feature vectors from each modality to make recommendations. However, this method is limited in its ability to model complex relations between modalities. (2) The result-fusion method, also known as the post-fusion method, obtains decisions based on each modality and then integrates these decisions by applying algebraic combination rules for multiple prediction class labels (e.g., maximum, minimum, sum, mean, etc.) to obtain the final result. For example, MMGCN [29] is based on a knowledge graph of three different modalities (text, image, and audio) and then performs user–item interaction predictions on all three knowledge graphs simultaneously. It then linearly aggregates the prediction scores for each modality to obtain the final prediction score. However, this method cannot capture the interconnections between different modalities and requires corresponding multi-modal data for each item. (3) The model-fusion method is a deeper fusion method that produces more optimal joint discriminative feature representations for classification and regression tasks. For instance, MKRLN [30] generates path representations by combining structural and visual information of entities and incorporates the idea of reinforcement learning to iteratively introduce visual features to determine the next step in path selection.
Existing MMKG-based recommendation systems mainly focus on textual and visual multi-modal data, which may not be sufficient for music recommendation, as they neglect the important audio and sentiment features of music. In addition, due to the unique characteristics of the MMKGs constructed for music, we propose an enhanced feature-fusion method in MKGCN, which will be explained in more detail in the experiments section. It is worth noting that, in contrast to the multi-model work in [31], our work uses multi-modal data to enhance the embedding representation of users and items, rather than using multi-models to explore which machine learning approach is best suited for the downstream task of recommendation.

2.4. Negative Sampling Strategy

Negative sampling is a process that involves constructing negative samples that are the opposite of positive samples using certain strategies. Most existing recommendation systems employ random negative sampling strategies to generate negative samples, such as KGCN [22], RippleNet [32], and CKAN [33]. However, the random negative sampling strategy may result in sampling items of potential interest to the user as negative samples, causing the model to fail to converge to the optimal value.
To address this issue, this paper draws on the popularity hypothesis and on adversarial learning methods from computer vision, and introduces two additional sampling methods: static negative sampling and hard negative sampling. Static negative sampling [34] involves setting different sampling probabilities for different samples. One common method is to assume that less-exposed items are colder and therefore have a higher probability of being sampled as negative samples, while more-exposed items have a lower probability of being sampled as negative samples. Hard negative sampling [35] is based on the classification results of the model and typically samples, as the negative sample, the item with the highest similarity to the positive sample and the smallest prediction score for the user.
In the experiments section, we will discuss the contributions of these three negative sampling strategies to the recommendation results.

3. Problem Formulation

With reference to previous work [22,24,32], we give the following conceptual and notational definitions before formally introducing MKGCN.

3.1. User–Item Interaction Matrix

In a typical recommendation scenario, we have users U = {u_1, u_2, …, u_M} and recommendation items (in this paper, music items) V = {v_1, v_2, …, v_N}, where M and N are the number of users and items, respectively. The user–item interaction matrix Y = {y_{uv} | u ∈ U, v ∈ V} (of size M × N) is defined as in Equation (1):
y_{uv} = \begin{cases} 1, & \text{if } u \text{ interacted with } v; \\ 0, & \text{otherwise}. \end{cases}
If y u v equals 1, this indicates that there is a historical interaction between user u and item v, such as clicking, browsing, downloading, etc.; if it equals 0, this means that no interaction has occurred but it does not mean that user u is not interested in item v. This is the point that we need to pay attention to when we do negative sampling.

3.2. Multi-Modal Knowledge Graph

We explained in Section 2.2 that attribute-based multi-modal knowledge graphs are used in this paper, so, referring to [24], we define the multi-modal knowledge graph as G_{mmkg} = {E, R, A, T_R, T_A}, where E, R, and A represent entities, relations, and attributes, respectively, and T_R ⊆ E × R × E and T_A ⊆ E × A × E represent relation triples and attribute triples, respectively. For instance, the triple (e_1, r, e_2) ∈ T_R indicates that the relation between entity e_1 ∈ E and entity e_2 ∈ E is r ∈ R, and the triple (e_3, a, e_4) ∈ T_A indicates that entity e_3 ∈ E has the attribute e_4 ∈ E, where a ∈ A is the attribute type and e_4 is multi-modal data. Some example triples are given in Table 2.
It is worth noting that the entity set E contains the music item set V , which means V has a one-to-one mapping into E but not onto E .

3.3. Collaborative Knowledge Graph

Collaborative knowledge graphs are used for higher-order modeling of user preferences. Based on the user–item interaction matrix, we can filter the set of user–item pairs, defined as UI = {(u, v) | u ∈ U, v ∈ V, y_{uv} ∈ Y and y_{uv} = 1}. Since the entity set obtained by aligning the music item set V is a subset of the entity set E, we can define the collaborative knowledge graph as G_{ckg} = {(u, interaction, e) | (u, v) ∈ UI, u ∈ U, v ↔ e ∈ E}, where ↔ denotes aligning the music item v to the entity e in the MMKG. In Figure 4, we give an illustration of a collaborative knowledge graph.

3.4. Problem Description

Given a user–item interaction matrix Y and a multi-modal knowledge graph G m m k g , we try to predict the interaction probability of user u to an item v that u has not interacted with before. Thus, our goal is to learn an interaction probability prediction function F ( · ) from positive and generated negative samples as follows:
\hat{y}_{uv} = F(u, v \mid Y, G_{mmkg}, \Theta),
where ŷ_{uv} ∈ (0, 1) denotes the probability of recommending the item v to the user u, and Θ represents the model parameters of F(·).

4. Methodology

In this section, we introduce the MKGCN proposed in this paper. As shown in Figure 5, MKGCN comprises four main layers: (1) The alignment and knowledge propagation layer, which aligns music items to entities in MMKGs and then obtains high-order neighbor entities through knowledge propagation. (2) The multi-modal aggregator layer, where the multi-modal aggregator aggregates multi-modal data to enhance entity embedding representation. (3) The GCN aggregator, where the user aggregator and item aggregator recursively propagate embeddings from neighbors to update entity representation to model the user preference and music items. (4) The prediction layer, which uses the representations of the user and item output by the aggregation layer for recommendation probability prediction.

4.1. Alignment and Knowledge Propagation Layer

As explained in Section 3.2 and Section 3.3, the role of the alignment layer is a one-to-one mapping of the music item v to the entity e in the MMKG. Thanks to the connectivity of the knowledge graph, entity e can be iteratively propagated outward layer by layer along the relation edges, which in turn yields the set of its high-order neighbors. For example, for an entity e, its lth-order neighbors are obtained by propagating its (l−1)th-order neighbors along edges in the MMKG; we define the lth-order neighbors of e, denoted N(e)^l, as follows:
N(e)^l = \{ e_2 \mid (e_1, r, e_2) \in T_R \ \text{and}\ e_1 \in N(e)^{l-1} \},
where T_R ⊆ G_{mmkg} are relation triples in the MMKG, not attribute triples, e_1 ∈ E and e_2 ∈ E, respectively, represent the head node and tail node of the relation triples, and r ∈ R represents the relation. In particular, for user u, N(e)^0 is the set of the user's historical interaction items; for the item v, N(e)^0 is the entity after its alignment.
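As a concrete illustration of this layer-by-layer propagation, the short Python sketch below expands a seed entity along relation triples to collect its lth-order neighbor sets; the helper names and the toy triples are our own and only illustrate the idea.

```python
# Layer-by-layer knowledge propagation: starting from a seed entity set,
# expand along relation triples to collect the l-th order neighbour sets.
from collections import defaultdict

def build_adjacency(relation_triples):
    """relation_triples: iterable of (head, relation, tail) tuples."""
    adj = defaultdict(list)
    for h, r, t in relation_triples:
        adj[h].append((r, t))
    return adj

def lth_order_neighbors(seed_entities, adj, l):
    """Return the list [N^0, N^1, ..., N^l] of neighbour sets."""
    layers = [set(seed_entities)]
    for _ in range(l):
        frontier = {t for e in layers[-1] for _, t in adj.get(e, [])}
        layers.append(frontier)
    return layers

triples = [("LoveStory", "music.singer", "TaylorSwift"),
           ("TaylorSwift", "person.nationality", "USA")]
print(lth_order_neighbors(["LoveStory"], build_adjacency(triples), 2))
# [{'LoveStory'}, {'TaylorSwift'}, {'USA'}]
```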

4.2. Multi-Modal Aggregation Layer

The role of the multi-modal aggregation layer is to perform multi-modal data aggregation for each entity node in MMKGs to enhance the entity representation. In MKGCN, we extract seven types of multi-modal data for each music item and they belong to four aspects: text, image, audio, and sentiment. We define the multi-modal data notation m i as follows:
m_i = \{ e' \mid (e, a, e') \in T_A \}, \quad i = 1, 2, \ldots, 7,
where T_A ⊆ G_{mmkg} is the set of attribute triples in the MMKG, e ∈ E and e′ ∈ E, respectively, represent the head node and tail node of the attribute triples, a ∈ A is the attribute, and i is the index of the multi-modal type. We will introduce the preprocessing of multi-modal data in the experiments section.
Before the multi-modal data formally enter the multi-modal aggregator, the dimensionalities of the different modal data are inconsistent, so we first perform a principal component analysis (PCA) dimensionality reduction operation to make the data dimensionality of each modality consistent with the embedding dimension d of the entities in the MMKG. From the viewpoint of graph convolution shown in Figure 6, each modality corresponds to one input channel, and the multi-modal aggregator is equivalent to a convolution kernel with a multi-channel input and single-channel output of size 1 × 1. The multi-modal aggregator then aggregates the multi-modal data of each entity to obtain a multi-modal aggregated representation vector of the entity. It is worth noting that, in the modal aggregation, we also treat the original embedding of the entity itself as a modality, which means that, for an entity e, embedding(e) = m_8.
In MKGCN, we implement three multi-modal aggregators (8 × R^d → R^d).
  • Sum multi-modal aggregator sums all modal representation vectors and then performs the nonlinear transformation:
    \mathrm{multi\_agg}_{sum} = \sigma\big(W_m \cdot \textstyle\sum_{i=1}^{8} m_i + b_m\big),
    where W_m and b_m are trainable transformation weights and biases, respectively, m_i represents the representation vector of the i-th modality, and σ is an activation function, such as LeakyReLU. Please note that the same symbols in the other two aggregators have the same meanings, so they are not repeated.
  • Concat multi-modal aggregator connects the representation vectors of multi-modal data before applying the nonlinear transformation:
    \mathrm{multi\_agg}_{concat} = \sigma\big(W_m \cdot \mathrm{concat}(m_1, \ldots, m_8) + b_m\big).
  • Max multi-modal aggregator selects the modal vector with the largest amount of characterization information to replace the current entity embedding vector and then performs a nonlinear transformation:
    \mathrm{multi\_agg}_{max} = \sigma\big(W_m \cdot \mathrm{Max}(m_1, \ldots, m_8) + b_m\big).
An entity e in the MMKG is passed through the multi-modal aggregator to obtain its multi-modal reinforced representation e^m, denoted as follows:
e^m = \mathrm{multi\_agg}(m_1, \ldots, m_8),
where m_i (i = 1, …, 7) are the multi-modal attributes of e, and m_8 is the original embedding of e.
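The following PyTorch sketch illustrates the three aggregation strategies over the eight modal vectors of one entity. It is a minimal re-implementation of the idea, not the released MKGCN code; the class name and dimensions are assumptions, and the max strategy is implemented as an element-wise maximum over modalities, which is one plausible reading of the description above.

```python
import torch
import torch.nn as nn

class MultiModalAggregator(nn.Module):
    def __init__(self, dim, mode="concat", n_modal=8):
        super().__init__()
        self.mode = mode
        in_dim = dim * n_modal if mode == "concat" else dim
        self.proj = nn.Linear(in_dim, dim)        # W_m, b_m
        self.act = nn.LeakyReLU()

    def forward(self, modal_vecs):
        # modal_vecs: (n_modal, d) -- the 7 PCA-reduced modal vectors plus the entity embedding
        if self.mode == "sum":
            x = modal_vecs.sum(dim=0)
        elif self.mode == "max":
            x = modal_vecs.max(dim=0).values      # element-wise max over modalities (assumption)
        else:                                     # "concat"
            x = modal_vecs.reshape(-1)
        return self.act(self.proj(x))             # sigma(W_m . x + b_m)

m = torch.randn(8, 16)                            # 8 modalities, embedding dimension d = 16
print(MultiModalAggregator(16, "concat")(m).shape)   # torch.Size([16])
```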
The multi-modal aggregator is one of the key components of MKGCN and we will discuss the effect of different aggregation strategies in the experiments section.

4.3. GCN Aggregation Layer

The GCN aggregation layer models the high-order representations of users and items, respectively. In Figure 6, we visualize the aggregation process of the GCN aggregator from the GCN perspective: for each entity, MKGCN samples k neighbors around it (if an entity has fewer than k neighbors, sampling is repeated with replacement until k neighbors are obtained; if it has more than k, then k neighbors are sampled randomly; the number of samples in Figure 6 is 8). Then, starting from the outermost neighbors, MKGCN aggregates the neighbor representations down layer by layer using the user/item aggregator until layer 0 is reached. Finally, MKGCN obtains the reinforced user/item embedding representation.
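A minimal sketch of this fixed-size neighbor sampling, assuming each entity has at least one neighbor (the helper name is ours):

```python
import random

def sample_neighbors(neighbors, k, rng=random):
    # neighbors: non-empty list of neighbour entity ids for one entity
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)                   # sample k distinct neighbours
    return [rng.choice(neighbors) for _ in range(k)]      # repeat sampling (with replacement)

print(sample_neighbors(["Taylor Swift", "pop"], k=4))
```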

4.3.1. User Aggregator

MKGCN models user preferences based on their historical interaction items. For user u, we collect the user's historical interaction items from the collaborative knowledge graph, notated as h_u = {v_1^u, v_2^u, …, v_n^u}. After alignment to entities in the MMKG, we obtain the entity set used to model the user's preferences, notated as h_u = {e_1^u, e_2^u, …, e_n^u}. Note that the entity embedding here is no longer the original embedding representation but the multi-modal reinforced embedding representation. Subsequently, we sample k entities from h_u to model the user embedding representation according to a certain aggregation strategy. We implement the following three user aggregators (k × R^d → R^d) in MKGCN:
  • Max user aggregator selects the largest item representation vector in the user historical interaction items and then performs a nonlinear transformation:
    \mathrm{user\_agg}_{max} = \sigma\big(W_u \cdot \mathrm{Max}(e_1^u, \ldots, e_k^u) + b_u\big),
    where W_u and b_u are trainable transformation weights and biases, respectively, e_i^u ∈ h_u (i = 1, …, k) represent the k multi-modal-aggregated items sampled from the historical interaction items of user u, and σ is an activation function, such as LeakyReLU. The same symbols in the mean user aggregator have the same meanings, so they are not repeated.
  • Mean user aggregator sums all entity representations and averages them, and then performs a nonlinear transformation:
    \mathrm{user\_agg}_{mean} = \sigma\big(W_u \cdot \tfrac{1}{k} \textstyle\sum_{i=1}^{k} e_i^u + b_u\big).
  • Multi-head attention user aggregator uses a multi-headed self-attention mechanism to capture user preferences for different music items. Specifically, the aggregator first processes the embedding vectors of historical items using the self-attention mechanism to compute attention weights and weighted embedding vectors:
    q_i = k_i = v_i = e_i^u, \quad i = 1, 2, \ldots, k,
    z_i = \mathrm{softmax}\!\left(\frac{(q_i W_q)(K W_K)^{\top}}{\sqrt{k}}\right) V W_v, \quad i = 1, 2, \ldots, k,
    where W_q, W_K, W_v ∈ R^{d×k} are the learnable parameter matrices, K and V stack the k key and value vectors, softmax is the activation function, k is the neighbor sampling size, and z_i ∈ R^d is the weighted embedding vector of the i-th historical item. Finally, the weighted embedding vectors are average-pooled along dimension 0 to obtain the embedding vector of the user:
    \mathrm{user\_agg}_{multi\text{-}head} = \tfrac{1}{k} \textstyle\sum_{i=1}^{k} z_i.
The final user embedding representation vector ū ∈ R^d of user u is obtained as follows:
\bar{u} = \mathrm{user\_agg}(e_1^u, \ldots, e_k^u),
where e_i^u ∈ h_u (i = 1, …, k) represent the k multi-modal-aggregated items sampled from the historical interaction items of user u.
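For illustration, here is a small PyTorch sketch of the max and mean user aggregators over the k multi-modal-enhanced item embeddings (the multi-head attention variant is omitted for brevity); the class name and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class UserAggregator(nn.Module):
    def __init__(self, dim, mode="mean"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(dim, dim)           # W_u, b_u
        self.act = nn.LeakyReLU()

    def forward(self, item_vecs):
        # item_vecs: (k, d) multi-modal-enhanced embeddings of the sampled historical items
        if self.mode == "max":
            x = item_vecs.max(dim=0).values       # element-wise max over the k items
        else:                                     # "mean"
            x = item_vecs.mean(dim=0)             # average of the k item embeddings
        return self.act(self.proj(x))             # sigma(W_u . x + b_u)

hist = torch.randn(8, 16)                         # k = 8 sampled items, d = 16
print(UserAggregator(16, "mean")(hist).shape)     # torch.Size([16])
```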
The user aggregator is also one of the key components of MKGCN and we will discuss the effect of the user aggregator in the experiments section.

4.3.2. Item Aggregator

The item aggregator is more complex than the multi-modal aggregator and the user aggregator. Upon closer examination of the item convolution kernel (yellow matrix) in Figure 6, we can observe that the shaded parts within each small square are different, representing distinct convolution kernel weights. In practice, we determine the weights that the item aggregator assigns while aggregating the representation vectors of neighboring nodes by modeling the user's preferences for different relations. This approach enables us to determine whether a user who has listened to the songs “Love Story” and “exile” should be recommended the song “Wildest Dreams” because of her preference for the singer “Taylor Swift” or because of her preference for “pop” music.
We use the function g: R^d × R^d → R to calculate the preference of user u for relation r:
\pi_r^u = g(u, r),
where u ∈ R^d and r ∈ R^d are the embedding representations of user u and relation r, respectively, d is the embedding dimension, and g is the inner product operation.
For the entity e, we can obtain its l-order neighbor sets {N(e)^0, N(e)^1, …, N(e)^l}, where N(e)^0 is e itself. For each entity, we perform neighbor sampling with a fixed sample size k (the same as for the user aggregator). Therefore, the number of neighbors in the lth-order neighbor set of e is k^l (possibly containing duplicates). For an entity e^{l-1} ∈ N(e)^{l-1}, we aggregate it with its k neighboring entities e^l ∈ N(e)^l. We define the linear representation of the neighboring entities of e^{l-1} as follows:
\tilde{e}^{\,l-1} = \sum_{e^l \in N(e)^l} \tilde{\pi}_{r_{e^{l-1}, e^l}}^{u} \, e^l,
where r_{e^{l-1}, e^l} is the relation between e^{l-1} and e^l, and \tilde{\pi}_r^u is the normalized user–relation score:
\tilde{\pi}_{r_{e^{l-1}, e^l}}^{u} = \frac{\exp(\pi_{r_{e^{l-1}, e^l}}^{u})}{\sum_{e^l \in N(e)^l} \exp(\pi_{r_{e^{l-1}, e^l}}^{u})}.
When calculating the aggregation of entities and neighbors, the user’s preference score for the relation acts as a filter, filtering out items of interest to the user to give greater weight.
After obtaining the current entity representation e^{l-1} and the neighbor representation \tilde{e}^{\,l-1}, referring to [22], we implement three item aggregators (R^d × R^d → R^d) in MKGCN:
  • Sum item aggregator sums the two vectors and then performs the nonlinear transformation:
    \mathrm{item\_agg}_{sum} = \sigma\big(W_v \cdot (e^{l-1} + \tilde{e}^{\,l-1}) + b_v\big),
    where W v and b v are trainable transformation weights and biases, respectively, and σ is an activation function, such as LeakyReLU. Same symbols in the other two aggregators represent the same meanings, so they are not repeated.
  • Concat item aggregator concats two vectors and then performs a nonlinear transformation:
    \mathrm{item\_agg}_{concat} = \sigma\big(W_v \cdot \mathrm{concat}(e^{l-1}, \tilde{e}^{\,l-1}) + b_v\big).
  • Neighbor item aggregator directly replaces the entity representation with a linear combination of the neighbors aggregation:
    \mathrm{item\_agg}_{neighbor} = \sigma\big(W_v \cdot \tilde{e}^{\,l-1} + b_v\big).
According to the above derivation, after iteratively aggregating the l levels of neighbors from the outside in, we obtain the final item embedding representation v̄ ∈ R^d of the item v, which incorporates its high-order neighbors:
\bar{v} = \mathrm{item\_agg}(e^0, \tilde{e}^{\,0}),
where e^0 is the embedding of item v after multi-modal aggregation and \tilde{e}^{\,0} is the linear representation of the neighbors of item v.
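The sketch below illustrates one step of this relation-aware aggregation in PyTorch: the user–relation scores are softmax-normalized and used to weight the neighbor embeddings, after which one of the three update rules is applied. It is an illustrative re-implementation under assumed shapes, not the authors' code; for the concat mode the linear layer would need input size 2d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate_neighbors(user_emb, rel_embs, neighbor_embs, self_emb, proj, mode="sum"):
    # user_emb: (d,), rel_embs and neighbor_embs: (k, d), self_emb: (d,)
    scores = rel_embs @ user_emb                              # pi_r^u = inner product of user and relation
    attn = F.softmax(scores, dim=0)                           # normalised user-relation scores
    neigh = (attn.unsqueeze(1) * neighbor_embs).sum(dim=0)    # weighted neighbour combination e_tilde
    if mode == "sum":
        x = self_emb + neigh
    elif mode == "concat":                                    # proj must then be nn.Linear(2 * d, d)
        x = torch.cat([self_emb, neigh], dim=0)
    else:                                                     # "neighbor": drop the entity's own embedding
        x = neigh
    return F.leaky_relu(proj(x))                              # sigma(W_v . x + b_v)

d, k = 16, 4
proj = nn.Linear(d, d)
out = aggregate_neighbors(torch.randn(d), torch.randn(k, d), torch.randn(k, d),
                          torch.randn(d), proj, mode="sum")
print(out.shape)   # torch.Size([16])
```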
We will discuss the effects of three different item aggregators in the experiments section.

4.4. Prediction Layer

After obtaining the user representation ū and the music item representation v̄, we predict the interaction probability of user u for music item v by using the prediction function:
\hat{y}_{uv} = f(\bar{u}, \bar{v}),
where ū and v̄ are the final embedding representations of user u and item v, respectively, and f(·) is the vector inner product operation.
We use Bayesian personalized ranking (BPR) loss [36] as the loss function, which assumes that we observe higher scores for positive samples than for negative samples. The formula is defined as follows:
L_{loss} = -\sum_{(u, v_i, v_j) \in O} \ln \sigma\big(\hat{y}(u, v_i) - \hat{y}(u, v_j)\big) + \lambda \|\Theta\|_2^2,
where O = {(u, v_i, v_j) | (u, v_i) ∈ R^+, (u, v_j) ∈ R^-} represents the training set, R^+ represents the positive sample set, and R^- represents the negative sample set. σ is the sigmoid function, Θ is the parameter set of the model, and λ is the L2 regularization factor.
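A minimal PyTorch sketch of this BPR objective over a batch of (user, positive item, negative item) embeddings is shown below; the function name, batch shapes, and the regularization coefficient are assumptions for the example.

```python
import torch
import torch.nn.functional as F

# BPR loss: maximise the margin between positive and negative prediction scores,
# with an L2 penalty on the model parameters.
def bpr_loss(user_emb, pos_emb, neg_emb, params, l2_lambda=1e-5):
    pos_score = (user_emb * pos_emb).sum(dim=1)      # y_hat(u, v_i) via inner product
    neg_score = (user_emb * neg_emb).sum(dim=1)      # y_hat(u, v_j)
    loss = -F.logsigmoid(pos_score - neg_score).mean()
    reg = sum(p.pow(2).sum() for p in params)        # ||Theta||_2^2
    return loss + l2_lambda * reg

u  = torch.randn(32, 16, requires_grad=True)         # batch of 32 users, d = 16
vi = torch.randn(32, 16, requires_grad=True)         # positive items
vj = torch.randn(32, 16, requires_grad=True)         # sampled negative items
print(bpr_loss(u, vi, vj, [u, vi, vj]))
```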

5. Experiments

In this section, we evaluate the performance of MKGCN on real music datasets. We introduce the datasets, baseline models, experimental setup, and experimental results in turn.

5.1. Datasets

As there are currently no existing music multi-modal knowledge graphs that can be directly utilized, we construct four multi-modal knowledge graphs ( M 3 KGs for short) of varying sizes using the publicly available Music4All-Onion dataset [37]. Music4All-Onion is a comprehensive multi-modal music dataset consisting of 109,269 tracks and additional audio, video, and metadata features. It also provides 252,984,396 listening records from 119,140 users who utilized the Last-FM online music platform.
The construction process of the M 3 KGs is divided into two main parts: triple preprocessing and multi-modal data preprocessing.
Triple processing: First, we remove music items with missing multi-modal data. Next, we remove tags with fewer than five hits and subsequently remove songs with fewer than five tags. Using the weight values (ranging from 1 to 100) assigned to each music item tag in Music4All-Onion, we obtain four music datasets of varying sizes and the corresponding user–music item rating data based on weight value thresholds of >75, >50, >25, and >0. Finally, we utilize Microsoft Satori to construct four knowledge graphs and select the subset of triples with a confidence score greater than 0.9.
Multi-modal data processing: Music4All-Onion provides metadata for each music project corresponding to audio, lyrics, and video. (1) For audio, we utilize the OpenSMILE audio processing tool to extract mel-frequency cepstral coefficients (MFCCs), pitch-related features, and sentiment-related features. Additionally, we use Essentia to extract the integrated audio signal, which consists of a series of time-domain, frequency-domain, rhythm, and pitch features, including statistics such as the mean and standard deviation of the signal. (2) For lyrics, we utilize a pre-trained Word2Vec [38] model to assign a vector representation to each word in the lyrics. We then average all word vectors of a song to obtain a single vector representation for that song. Additionally, we use an extended canonical lexicon of English words to map the words in the lyrics to affective value, evocation, and dominance values to determine the sentimental content of the lyrics. (3) For videos (or posters), we extract image frames at a rate of one frame per second and convert them to vectors using a pre-trained ResNet [39]. Finally, we aggregate the vectors to the track level using the maximum and average values.
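As a toy illustration of the track-level feature construction described above, the snippet below averages stand-in word vectors for a lyric and max-/mean-pools per-frame image vectors for a poster; the vectors and dimensions are made up for the example and do not reproduce the actual Word2Vec or ResNet pipelines.

```python
import numpy as np

# Lyric vector: mean of (stand-in) word vectors over the words of a song.
word_vectors = {"love": np.array([0.2, 0.1]),
                "story": np.array([0.0, 0.4])}
lyrics = ["love", "story", "love"]
lyric_vec = np.mean([word_vectors[w] for w in lyrics if w in word_vectors], axis=0)

# Poster/video vector: max- and mean-pooling of per-frame image features.
frame_vecs = np.random.rand(30, 4)                 # e.g., 30 frames, 4-dim features
poster_vec = np.concatenate([frame_vecs.max(axis=0), frame_vecs.mean(axis=0)])

print(lyric_vec, poster_vec.shape)                 # (2,) lyric vector, (8,) poster vector
```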
In summary, we construct four M 3 KGs and utilize the number of the interaction data in the dataset as the naming suffix. The basic statistical information for these four datasets is presented in Table 3.

5.2. Baselines

We compare MKGCN with the following state-of-the-art models, categorized as follows: the first two are KG-free methods, the middle two are KG-based methods, and the last two are MMKG-based methods.
  • SVD [40] is a CF-based recommendation framework that models user interactions using inner products.
  • CKE [12] is a recommendation framework that combines textual knowledge, visual knowledge, structural knowledge, and CF.
  • RippleNet [32] is a representative of KG-based methods, which is a memory-network-like approach that propagates users’ preferences on the KG for recommendation.
  • KGCN [22] is a recommendation framework that utilizes GCN aggregation of neighbors and modeling of relations on KGs to improve recommendations.
  • MMGCN [29] constructs bipartite graphs for each modality, representing the relationship between users and items. These bipartite graphs are then trained using GCN. Subsequently, a decision message is generated for each modality and these messages are fused together to obtain the final decision message.
  • MKGAT [28] incorporates visual and textual modal information for each entity on the knowledge graph and aggregates it to the entity representation. It then utilizes GCN to aggregate neighbors and improve the entity representation.

5.3. Experiment Setup

We use the Xavier initializer [41] to initialize the parameters of MKGCN and optimize MKGCN using the Adam optimizer [42]. The batch size, learning rate, and λ coefficient are explored over the following values: [64, 128, 512, 1024, 2048], [0.05, 0.01, 0.005, 0.001], and [10^{-5}, 10^{-4}, …, 10^{1}, 10^{2}], respectively.
Following the previous work [22,28,29,32], for each dataset, we randomly select 80% of the interaction data as the training set and the remaining 20% as the test set, and then take 20% of the training set as the validation set to determine the selection of hyperparameters. We test the performance of MKGCN in click-through rate (CTR) prediction and top-K recommendation scenarios, respectively. In CTR prediction, we select AUC (area under the curve) and F1 values as the evaluation metrics. (It is worth noting that, in the CTR prediction scenario, we evaluate the overall recommendation performance of the model, which implies that the metrics are dimensionless.) In top-K recommendation, we select Recall@K and NDCG@K as the evaluation metrics, where K ranges over [1, 2, 5, 10, 20, 50, 100]. We repeat each experiment five times and report the average as the result.
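For reference, a minimal computation of Recall@K and NDCG@K for a single user with binary relevance might look like the following (our helper functions, not the evaluation code used in the paper):

```python
import numpy as np

def recall_at_k(ranked, ground_truth, k):
    # fraction of the user's test items that appear in the top-k ranked list
    hits = sum(1 for item in ranked[:k] if item in ground_truth)
    return hits / max(len(ground_truth), 1)

def ndcg_at_k(ranked, ground_truth, k):
    # DCG with binary relevance, normalised by the ideal DCG
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in ground_truth)
    ideal_hits = min(len(ground_truth), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [3, 7, 1, 9, 4]      # items sorted by predicted score
truth = {1, 4, 8}             # the user's held-out test items
print(recall_at_k(ranked, truth, 5), ndcg_at_k(ranked, truth, 5))
```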

5.4. Performance Comparison with Baselines

In this section, we compare and analyze the performance of the MKGCN and the baseline models. The results of all methods in CTR prediction and top-K recommendation are presented in Table 4 and Figure 7 and Figure 8. The experimental results of all methods are obtained by selecting the optimal hyperparameter values.
Based on the experimental results, we draw the following conclusions.
  • SVD and CKE achieve the lowest performance, which suggests that collaborative-filtering-based algorithms are not able to effectively utilize side information, leading to lower recommendation accuracy. CKE outperforms SVD because it incorporates visual and textual multi-modal data, demonstrating that the use of multi-modal data can enhance the performance of recommendation systems.
  • RippleNet and KGCN are both knowledge-graph-based methods and their performance is better than that of collaborative-filtering-based methods, indicating that knowledge graphs can effectively utilize side information to improve recommendation accuracy. The performance of RippleNet is worse than that of KGCN because RippleNet is a memory-based network that does not model relations, whereas KGCN considers users’ preferences for different relations, demonstrating that modeling relation preferences can improve the performance of the recommender system.
  • MKGCN, MKGAT, and MMGCN are all multi-modal knowledge-graph-based methods and achieve superior performance over the KG-based and CF-based methods, highlighting the importance of multi-modal knowledge graphs for enhancing recommender systems. Both MKGAT and MMGCN utilize textual and visual multi-modality, but MKGAT outperforms MMGCN, indicating that its feature-fusion strategy captures more complementary information between modalities than MMGCN's decision-based fusion. However, MKGCN achieves the best performance and outperforms MKGAT, thanks to its richer multi-modal data and efficient negative sampling strategy.

5.5. Effect of Component Setting

In this section, we investigate the impact of different components on the performance of MKGCN. We use a single-variable approach to explore the effect of individual components on MKGCN’s performance, which involves controlling for the other variables to be the same. However, it should be noted that this approach does not guarantee that the other variables are optimal.

5.5.1. Effect of Parameters Setting

Table 5 presents the optimal settings for the embedding dimension d, the neighbor sampling size k, and the number of propagation layers l that affect the performance of MKGCN. Following the previous work [22,32,43], the search spaces for embedding dimension d, neighbor sampling size k, and propagation layers l are {4,8,16,32,64,128,256}, {2,4,8,16,32}, and {1,2,3,4}. We employ a univariate approach to find the optimal value of the parameters. For each parameter, we incrementally select larger values and stop when the recommended performance steadily decreases.
Regarding the embedding dimension, we observe that initially increasing d leads to better performance as larger d can encode more user and entity information. However, excessively large values of d can lead to overfitting, while too small values may result in less differential information being represented in the multi-modal data. With respect to the number of neighbor sampling, we note that small values of k may not have enough capacity to merge neighborhood information, while excessively large values of k may be misleading due to noise. Moreover, the sparsity of node neighbors can also affect the value of k. As for the number of propagation layers l, we find that one or two layers can enhance item representations but, as l increases, significant noise can be introduced, which may degrade the performance of the recommendation system.

5.5.2. Effect of Aggregators

To comprehensively assess the performance of the aggregators, we fix the number of training iterations for MKGCN at 50 epochs. Figure 9 shows the performance of MKGCN on three different aggregators with varying aggregation strategies. Our observations are as follows:
  • The concat strategy of the multi-modal aggregator outperforms the max representation strategy and the sum strategy, as it retains information from multiple modalities simultaneously by concatenating their features. In contrast, the max strategy only considers the maximum value in each modality’s feature, which may result in the loss of some important details. Similarly, the sum strategy may also lose some crucial details and, in addition, the sum value may be less accurate for modalities that contain outliers or noise.
  • The mean strategy yields the best performance for user aggregators. The mean strategy calculates the average of the historical interaction item representation vector and can effectively represent the user’s interests. Conversely, the max strategy can only capture the most salient user preferences but not the full range of their preferences, since it uses the largest value of the item representation vector as the embedding representation of the user. The multi-head attention strategy is not suitable for users with sparse historical interaction data and hence performs the worst.
  • The concat and neighbor strategies for the item aggregator achieve comparable results, but the neighbor strategy performs better overall. This can be attributed to the fact that the neighbor strategy directly employs a linear combination of the neighbor nodes, rather than the current entity, which facilitates better utilization of the information from neighbor nodes compared to the concat strategy. The sum strategy achieves the worst results, mainly due to information loss, as values of different dimensions may cancel each other out.

5.5.3. Effect of Negative Sampling Strategy

Similarly, we fix the number of epochs to 50 to assess the impact of negative sampling strategies on the performance of MKGCN. As shown in Figure 10, the ratio negative sampling strategy outperforms the hard negative sampling strategy and random negative sampling performs the worst. Random negative sampling struggles to identify real negative samples in sparse datasets, often generating noisy samples that mask the genuine user interests. In contrast, the ratio negative sampling strategy ranks items according to their number of interactions, increasing the likelihood of selecting items with fewer interactions as negative samples, thereby simulating the sampling of cold items as negative samples and better preserving the original data structure. This improves the quality of negative samples. The hard negative sampling strategy, on the other hand, only samples items that do not appear in the positive sample as negative samples, which may miss some potential negative samples.
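A minimal sketch of such popularity-aware (ratio) negative sampling is given below; the inverse-count weighting is an assumption for illustration rather than the exact formula used in MKGCN.

```python
import numpy as np

# Items the user has not interacted with are sampled as negatives with a
# probability inversely related to their interaction counts, so colder items
# are more likely to be chosen as negatives.
def ratio_negative_sample(user_pos_items, item_interaction_counts, n_neg, rng=None):
    rng = rng or np.random.default_rng()
    candidates = [i for i in item_interaction_counts if i not in user_pos_items]
    weights = np.array([1.0 / (item_interaction_counts[i] + 1) for i in candidates])
    probs = weights / weights.sum()
    return list(rng.choice(candidates, size=n_neg, replace=True, p=probs))

counts = {0: 500, 1: 30, 2: 3, 3: 120}            # item id -> number of interactions
print(ratio_negative_sample({0}, counts, n_neg=2))
```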

5.5.4. Effect of Multi-Modal Data

We investigate the effect of different modes on the performance of MKGCN using a series of ablation experiments. To this end, we design four variants.
(1) MKGCN-mm: MKGCN solely utilizes structural triple data and does not aggregate multi-modal data.
(2) MKGCN-ae: MKGCN employs only text and image modality embeddings, and does not utilize audio and sentiment feature embeddings.
(3) MKGCN-e: MKGCN incorporates embeddings from the text, image, and audio modal features, but does not utilize sentiment feature embeddings.
(4) MKGCN-a: MKGCN employs embeddings from the text, image, and sentiment modal features, but does not utilize audio feature embeddings.
As shown in Figure 11, we compare MKGCN with four variants. It is evident that MKGCN-mm performs the worst as it does not utilize any modal information, emphasizing the importance of multi-modality in improving recommendations. MKGCN-ae follows the mainstream multi-modal recommendation systems by only utilizing text and image modal information, resulting in an improvement over MKGCN-mm, but not significant enough to demonstrate the importance of text and image information for music recommendation systems. In contrast, MKGCN-a and MKGCN-e achieve performance close to that of MKGCN, indicating that incorporating audio and sentiment multi-modal data can significantly enhance music recommendation. This aligns with our intuition that, when we listen to music, we care more about its melody, tone, and the sentiment it evokes, rather than its history and story.

6. Discussion

Our experimental results show that our proposed MKGCN method can provide more accurate recommendations by using the rich modal information in the music multi-modal knowledge graph. We compare MKGCN with six baseline models: SVD [40], CKE [12], RippleNet [32], KGCN [22], MMGCN [29], and MKGAT [28]. By comparing SVD and CKE, we confirm that multi-modal data benefit the recommender system; by comparing RippleNet and KGCN, we confirm that relational modeling benefits the performance of the recommender system. Finally, we compare MKGCN with MMGCN and MKGAT. All three are based on multi-modal knowledge graphs, but the latter two can only use visual and textual information; MKGCN achieves the best performance, which confirms the importance of using the audio and sentiment features of the music items themselves in music recommendation. This suggests that, in subsequent research, we should try to exploit the rich multi-modal information of the item itself, in conjunction with the item's characteristics, to improve the performance of the recommender system as much as possible.
In addition, it must be noted that the ratio negative sampling strategy also played a very helpful role in the model training of MKGCN, which confirms that random negative sampling may sample items of potential interest to users as negative, while ratio negative sampling is able to model user preferences more realistically based on popularity.
At the same time, there are still some limitations: currently we cannot effectively model the interactions between modalities, such as how to maximize the complementarity between modalities. In addition, how to confirm which modality determines a user’s preference for an item, i.e., the interpretability of multi-modal recommendations, is also a problem currently waiting to be solved. These will be the focus of our future work.

7. Conclusions

In this paper, we introduce MKGCN (Multi-modal Knowledge Graph Convolutional Network) for music item recommendation, which is the first work to focus on multi-modal knowledge graphs for music recommendation. Compared to existing multi-modal knowledge graph-based recommendation systems, MKGCN leverages a wider range of song-related modal information, such as MFCC, lyric sentiment, audio sentiment, and more. MKGCN employs three aggregators: the multi-modal aggregator enhances entity representation by aggregating multi-modal information; the user aggregator represents user embeddings by aggregating historical item embeddings; the item aggregator enhances item embeddings by layer-wise aggregating higher-order neighbor entities of items and employs attention scores to capture user preferences for different relations during aggregation. In addition, we use a more efficient ratio negative sampling strategy when training MKGCN. We construct four multi-modal knowledge graphs of music with different sizes based on the real music dataset Last-FM and conduct extensive experiments on them. The experimental results demonstrate that MKGCN outperforms the baseline models in CTR prediction and top-K recommendation.
MKGCN employs an early feature-fusion method that fails to fully leverage the interaction between different modalities. Therefore, we plan to investigate fine-grained modeling and interaction methods between different modalities in our future work. Additionally, we also plan to explore a multi-task learning framework that combines the recommender system task with the multi-modal knowledge graph representation learning task, aiming to further improve the accuracy of the recommender system. We believe that these future directions could lead to significant improvements in the performance of multi-modal recommendation systems for music.

Author Contributions

All authors contributed to the writing and revisions; conceptualization, X.C., X.Q. and D.L.; investigation, X.Q. and Y.L.; methodology, X.Q. and D.L.; software, X.Q.; supervision, D.L. and X.Z.; validation, X.Q. and Y.Y.; writing—original draft, X.Q.; writing—review and editing, X.C. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Beijing Forestry University Science and Technology Innovation Program Project (BLX2014-27) and the CACMS Innovation Fund (CI2021A00512 and CI2021A05403).

Data Availability Statement

The implementation code of MKGCN and a multi-modal knowledge graph ( M 3 KG-6k) are available at https://github.com/QuXiaolong0812/mkgcn, accessed on 20 April 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hagen, A.N. The playlist experience: Personal playlists in music streaming services. Pop. Music. Soc. 2015, 38, 625–645.
  2. Kamehkhosh, I.; Bonnin, G.; Jannach, D. Effects of recommendations on the playlist creation behavior of users. User Model. User Adapt. Interact. 2020, 30, 285–322.
  3. Burgoyne, J.A.; Fujinaga, I.; Downie, J.S. Music information retrieval. In A New Companion to Digital Humanities; Wiley: Hoboken, NJ, USA, 2015; pp. 213–228.
  4. Murthy, Y.V.S.; Koolagudi, S.G. Content-based music information retrieval (cb-mir) and its applications toward the music industry: A review. ACM Comput. Surv. CSUR 2018, 51, 1–46.
  5. Schedl, M.; Zamani, H.; Chen, C.W.; Deldjoo, Y.; Elahi, M. Current challenges and visions in music recommender systems research. Int. J. Multimed. Inf. Retr. 2018, 7, 95–116.
  6. Schedl, M.; Gómez, E.; Urbano, J. Music information retrieval: Recent developments and applications. Found. Trends Inf. Retr. 2014, 8, 127–261.
  7. Wu, L.; He, X.; Wang, X.; Zhang, K.; Wang, M. A survey on accuracy-oriented neural recommendation: From collaborative filtering to information-rich recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 4425–4445.
  8. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174.
  9. Zhang, H.R.; Min, F.; Zhang, Z.H.; Wang, S. Efficient collaborative filtering recommendations with multi-channel feature vectors. Int. J. Mach. Learn. Cybern. 2019, 10, 1165–1172.
  10. Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, E.; Tang, J.; Yin, D. Graph neural networks for social recommendation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 417–426.
  11. Wang, H.; Zhang, F.; Hou, M.; Xie, X.; Guo, M.; Liu, Q. Shine: Signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018; pp. 592–600.
  12. Zhang, F.; Yuan, N.J.; Lian, D.; Xie, X.; Ma, W.Y. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 353–362.
  13. Sun, Y.; Yuan, N.J.; Xie, X.; McDonald, K.; Zhang, R. Collaborative intent prediction with real-time contextual data. ACM Trans. Inf. Syst. TOIS 2017, 35, 1–33.
  14. Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; Melo, G.D.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge graphs. ACM Comput. Surv. CSUR 2021, 54, 1–37.
  15. Duan, H.; Liu, P.; Ding, Q. RFAN: Relation-fused multi-head attention network for knowledge graph enhanced recommendation. Appl. Intell. 2023, 53, 1068–1083.
  16. Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.S. Kgat: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 950–958.
  17. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 12175–12185.
  18. Sun, Y.; Xue, B.; Zhang, M.; Yen, G.G.; Lv, J. Automatically designing CNN architectures using the genetic algorithm for image classification. IEEE Trans. Cybern. 2020, 50, 3840–3854.
  19. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 2017 Annual Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  20. Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 974–983.
  21. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and deep locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014.
  22. Wang, H.; Zhao, M.; Xie, X.; Li, W.; Guo, M. Knowledge graph convolutional networks for recommender systems. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3307–3313.
  23. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 25–30 July 2020; pp. 639–648.
  24. Zhu, X.; Li, Z.; Wang, X.; Jiang, X.; Sun, P.; Wang, X.; Xiao, Y.; Yuan, N.J. Multi-modal knowledge graph construction and application: A survey. IEEE Trans. Knowl. Data Eng. 2022, 1, 1–20.
  25. Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 225–234.
  26. Pezeshkpour, P.; Chen, L.; Singh, S. Embedding Multimodal Relational Data for Knowledge Base Completion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3208–3218.
  27. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394.
  28. Sun, R.; Cao, X.; Zhao, Y.; Wan, J.; Zhou, K.; Zhang, F.; Wang, Z.; Zheng, K. Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020; pp. 1405–1414.
  29. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445.
  30. Tao, S.; Qiu, R.; Ping, Y.; Ma, H. Multi-modal knowledge-aware reinforcement learning network for explainable recommendation. Knowl.-Based Syst. 2021, 227, 107217.
  31. Vyas, P.; Vyas, G.; Dhiman, G. RUemo—The Classification Framework for Russia-Ukraine War-Related Societal Emotions on Twitter through Machine Learning. Algorithms 2023, 16, 69.
  32. Wang, H.; Zhang, F.; Wang, J.; Zhao, M.; Li, W.; Xie, X.; Guo, M. Ripplenet: Propagating user preferences on the knowledge graph for recommender systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 417–426.
  33. Wang, Z.; Lin, G.; Tan, H.; Chen, Q.; Liu, X. CKAN: Collaborative knowledge-aware attentive network for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 25–30 July 2020; pp. 219–228.
  34. Togashi, R.; Otani, M.; Satoh, S. Alleviating cold-start problems in recommendation through pseudo-labelling over knowledge graph. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Online, 8–12 March 2021; pp. 931–939. [Google Scholar]
  35. Chen, Y.; Wang, X.; Fan, M.; Huang, J.; Yang, S.; Zhu, W. Curriculum meta-learning for next POI recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Online, 14–18 August 2021; pp. 2692–2702. [Google Scholar]
  36. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618. [Google Scholar]
  37. Moscati, M.; Parada-Cabaleiro, E.; Deldjoo, Y.; Zangerle, E.; Schedl, M. Music4All-Onion—A Large-Scale Multi-faceted Content-Centric Music Recommendation Dataset (Version v0). In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022. [Google Scholar] [CrossRef]
  38. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  40. Koren, Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 426–434. [Google Scholar]
  41. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  43. Wang, H.; Zhang, F.; Zhang, M.; Leskovec, J.; Zhao, M.; Li, W.; Wang, Z. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 968–977. [Google Scholar]
Figure 1. Illustration of knowledge-graph-enhanced music recommender systems.
Figure 2. An example of a music information retrieval website.
Figure 3. Illustration of the multi-modal knowledge graph. The left subfigure shows an attribute-based MMKG and the right subfigure shows an entity-based MMKG.
Figure 4. Illustration of a collaborative knowledge graph. The blue nodes represent users, the light yellow nodes represent the music items that a user has interacted with, and the orange nodes represent the corresponding entities in the MMKGs after item alignment.
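To make the structure in Figure 4 concrete, the following is a minimal sketch of a toy collaborative knowledge graph built with networkx. The user identifiers and the "interact" relation label are illustrative assumptions; the item–entity triples are taken from Table 2.

```python
# Toy collaborative knowledge graph (CKG) in the spirit of Figure 4.
import networkx as nx

ckg = nx.MultiDiGraph()

# User-item interactions (blue -> light yellow nodes in Figure 4); user IDs are hypothetical.
ckg.add_edge("user_1", "exile", relation="interact")
ckg.add_edge("user_2", "Love Story", relation="interact")

# MMKG triples attached to the aligned item entities (orange nodes in Figure 4), from Table 2.
ckg.add_edge("Taylor Swift", "exile", relation="Singer.Music")
ckg.add_edge("exile", "exile_lyrics.txt", relation="Music.Lyrics")
ckg.add_edge("Love Story", "Pop", relation="Music.Genre")

print(ckg.number_of_nodes(), ckg.number_of_edges())  # 7 nodes, 5 edges
```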
Figure 5. Illustration of the proposed MKGCN model. It contains three key components: the multi-modal aggregator, the item aggregator, and the user aggregator.
Figure 6. Illustration of the three main aggregators. The matrices represent the entity v and its l-order neighbors (here, l = 2). The blue matrix is the original embedding of the entity, and the green matrices are the multi-modal data embeddings. The purple matrix group denotes the multi-modal aggregator, and the pink matrix is the entity representation after multi-modal aggregation. The red and dark red matrices denote the enhanced entity representations obtained by aggregating neighboring entities, and the yellow matrices denote the user/item aggregator, where different shading indicates the weights assigned to different neighbors.
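As a complement to Figure 6, here is a minimal NumPy sketch of the two-stage aggregation it depicts: an entity embedding is first fused with its modality embeddings, and the result is then combined with weighted neighbor embeddings. The mean-based fusion, the softmax weighting, the sum combination, and the tanh nonlinearity are assumptions for illustration, not the exact MKGCN formulation.

```python
# Minimal sketch of the aggregation flow in Figure 6 (illustrative assumptions:
# mean fusion over modalities, softmax neighbor weights, "sum" combination, tanh).
import numpy as np

d = 16  # embedding dimension (hypothetical)

def multi_modal_aggregate(entity_emb, modal_embs):
    """Fuse an entity embedding with its text/image/audio/sentiment embeddings."""
    fused = entity_emb + modal_embs.mean(axis=0)   # assumed fusion: entity + mean of modalities
    return np.tanh(fused)

def neighbor_aggregate(center_emb, neighbor_embs, relation_scores):
    """Weight sampled neighbors by softmaxed relation scores and combine with the center."""
    weights = np.exp(relation_scores) / np.exp(relation_scores).sum()
    neighborhood = (weights[:, None] * neighbor_embs).sum(axis=0)
    return np.tanh(center_emb + neighborhood)      # "sum"-style aggregation, one of several options

rng = np.random.default_rng(0)
entity = rng.normal(size=d)            # blue matrix: original entity embedding
modalities = rng.normal(size=(4, d))   # green matrices: text, image, audio, sentiment
neighbors = rng.normal(size=(8, d))    # sampled 1-hop neighbor embeddings
scores = rng.normal(size=8)            # per-neighbor relation scores (toy values)

enhanced = multi_modal_aggregate(entity, modalities)
item_repr = neighbor_aggregate(enhanced, neighbors, scores)
print(item_repr.shape)  # (16,)
```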
Figure 7. Recall@K in top-K recommendation.
Figure 8. NDCG@K in top-K recommendation.
Figure 9. Performance of multi-modal aggregator, user aggregator, and item aggregator with different aggregation strategies.
Figure 10. Effect of different negative sampling strategies on the performance of MKGCN.
Figure 11. Performance of different variants of MKGCN.
Table 1. Summary of key abbreviations, sorted by order of appearance in this paper.

Abbreviation | Expansion
KG | Knowledge graph
MMKG | Multi-modal knowledge graph
CF | Collaborative filtering
CKG | Collaborative knowledge graph
CNN | Convolutional neural network
GCN | Graph convolutional network
PCA | Principal component analysis
M³KG | Music multi-modal knowledge graph
MFCC | Mel-frequency cepstral coefficient
Table 2. Example RDF triples in attribute-based MMKGs.

Triple | Head | Relation/Attribute | Tail
T_R | Justin Bieber | Singer.Nation | Canada
T_R | Taylor Swift | Singer.Music | exile
T_R | Love Story | Music.Genre | Pop
T_A | exile | Music.Poster | exile_poster.png
T_A | exile | Music.Lyrics | exile_lyrics.txt
T_A | Saturn | Music.audio | Saturn.mp3
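For reference, the triples in Table 2 can be written down directly as (head, relation, tail) tuples; the split into structural and attribute sets below mirrors the table's T_R/T_A grouping, and the variable names are purely illustrative.

```python
# The example triples from Table 2 as plain (head, relation, tail) tuples.
structural_triples = [  # T_R: structural triples
    ("Justin Bieber", "Singer.Nation", "Canada"),
    ("Taylor Swift",  "Singer.Music",  "exile"),
    ("Love Story",    "Music.Genre",   "Pop"),
]
attribute_triples = [   # T_A: attribute (multi-modal) triples
    ("exile",  "Music.Poster", "exile_poster.png"),
    ("exile",  "Music.Lyrics", "exile_lyrics.txt"),
    ("Saturn", "Music.audio",  "Saturn.mp3"),
]
```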
Table 3. Basic statistics of the four M³KGs. #users denotes the number of users, #items denotes the number of items, and #interactions denotes the number of interaction records. #entities, #relations, and #triples represent the numbers of entities, relations, and triples in the M³KGs.

Statistic | M³KG-12M | M³KG-3M | M³KG-20k | M³KG-6k
#users | 52,431 | 44,548 | 9843 | 5999
#items | 51,842 | 19,123 | 5268 | 4836
#interactions | 11,987,329 | 3,209,509 | 196,754 | 56,492
#entities | 60,312 | 23,781 | 7513 | 6775
#relations | 76 | 51 | 26 | 7
#triples | 445,866 | 152,921 | 42,170 | 36,264
Table 4. The results of AUC and F1 in CTR prediction.

Model | M³KG-12M (AUC / F1) | M³KG-3M (AUC / F1) | M³KG-20k (AUC / F1) | M³KG-6k (AUC / F1)
SVD | 0.729 / 0.612 | 0.743 / 0.639 | 0.738 / 0.635 | 0.731 / 0.640
CKE | 0.767 / 0.689 | 0.792 / 0.704 | 0.766 / 0.697 | 0.744 / 0.673
RippleNet | 0.852 / 0.792 | 0.796 / 0.723 | 0.785 / 0.707 | 0.780 / 0.702
KGCN | 0.909 / 0.843 | 0.887 / 0.799 | 0.867 / 0.781 | 0.811 / 0.721
MMGCN | 0.937 / 0.885 | 0.906 / 0.832 | 0.885 / 0.814 | 0.876 / 0.756
MKGAT | 0.952 / 0.897 | 0.935 / 0.850 | 0.913 / 0.843 | 0.899 / 0.812
MKGCN | 0.973 / 0.916 | 0.960 / 0.896 | 0.942 / 0.867 | 0.918 / 0.841
In the original table, the winner is shown in bold and the runner-up is underlined; MKGCN is the winner and MKGAT the runner-up on all four datasets.
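As a reminder of how the two metrics in Table 4 are typically obtained for CTR prediction, the snippet below scores toy predictions with scikit-learn; the 0.5 decision threshold for F1 and the toy labels/probabilities are assumptions, not values from the paper.

```python
# Sketch of AUC and F1 computation for CTR prediction (toy data, assumed 0.5 threshold).
from sklearn.metrics import roc_auc_score, f1_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]                         # ground-truth interactions (toy data)
probs = [0.91, 0.20, 0.75, 0.64, 0.43, 0.12, 0.58, 0.49]  # predicted click probabilities (toy data)

auc = roc_auc_score(labels, probs)                            # ranking quality over all thresholds
f1 = f1_score(labels, [1 if p >= 0.5 else 0 for p in probs])  # harmonic mean of precision and recall
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```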
Table 5. The optimal setting of d, k, and l for MKGCN in four datasets.

Parameter | M³KG-12M | M³KG-3M | M³KG-20k | M³KG-6k
d | 128 | 128 | 64 | 32
k | 8 | 8 | 16 | 8
l | 2 | 1 | 1 | 1
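For convenience, the settings in Table 5 can be collected into a single configuration mapping, as in the sketch below. Reading d as the embedding dimension, k as the neighbor sampling size, and l as the number of aggregation layers (the neighbor order in Figure 6) is an assumption based on common KGCN-style conventions, and the dictionary layout itself is illustrative.

```python
# Optimal hyperparameters from Table 5, keyed by dataset (layout is illustrative).
best_hyperparams = {
    "M3KG-12M": {"d": 128, "k": 8,  "l": 2},
    "M3KG-3M":  {"d": 128, "k": 8,  "l": 1},
    "M3KG-20k": {"d": 64,  "k": 16, "l": 1},
    "M3KG-6k":  {"d": 32,  "k": 8,  "l": 1},
}

config = best_hyperparams["M3KG-20k"]
print(config["d"], config["k"], config["l"])  # 64 16 1
```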
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
