A MOOC Course Data Analysis Based on an Improved Metapath2vec Algorithm

Xu, Congcong; Feng, Jing; Hu, Xiaomin; Xu, Xiaobin; Li, Yi; Hou, Pingzhi

doi:10.3390/sym15061178


by Congcong Xu ¹, Jing Feng ¹,*, Xiaomin Hu ², Xiaobin Xu ¹, Yi Li ¹ and Pingzhi Hou ¹

¹ Department of Automation, Hangzhou Dianzi University, Hangzhou 310018, China

² Department of Science, Hangzhou Dianzi University, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Symmetry 2023, 15(6), 1178; https://doi.org/10.3390/sym15061178

Submission received: 19 April 2023 / Revised: 24 May 2023 / Accepted: 29 May 2023 / Published: 31 May 2023

(This article belongs to the Section Computer)


Abstract:

Many real-world scenarios can be naturally modeled as heterogeneous graphs, which contain both symmetric and asymmetric information. How to learn useful knowledge from such graphs has become a research hot spot in artificial intelligence. This paper presents an improved Metapath2vec algorithm, Metapath–GloVe, which combines meta-path random walks, used to capture semantic and structural information between different nodes of a heterogeneous network, with the GloVe model, which takes advantage of a global text representation. To verify the feasibility and effectiveness of the model, node clustering and link prediction experiments were conducted on a self-generated ideal dataset and on MOOC course data. The experimental results on these tasks show that the Metapath–GloVe algorithm learns consistently better embeddings of heterogeneous nodes and better characterizes the heterogeneous network structure and the characteristics of nodes, which supports the effectiveness and scalability of the proposed method in heterogeneous network mining tasks. Extensive experiments also show that the Metapath–GloVe algorithm outperforms the non-negative matrix factorization (NMF) algorithm, obtaining better clustering results and more accurate predictions in the video recommendation task.

Keywords:

heterogeneous graph embedding; meta-path; GloVe; clustering; video recommendation

1. Introduction

In the real-world, many scenarios, such as transportation networks, social networks, communication networks, etc., can be effectively stored and represented using a graph structure. In the graph, nodes represent the objects in the scenario and edges represent the relationships between objects [1]. Therefore, research can be conducted on analyzing the structure and properties of graph structures to understand such real-world complex systems. With the increasing complexity of real-world data, graph structures are characterized as massive, high-dimensional, sparse, heterogeneous, complex, and dynamic [2]. Both symmetry and asymmetry information exist in the graph. For example, in the social network, the marriage relationship is symmetric while the filiation relationship is asymmetric [3]. Therefore, how to learn useful knowledge from the graph has become one of the hot spots of research in artificial intelligence.

Graph embedding is an important technical tool for graph knowledge discovery. It aims to map graph data into a low-dimensional vector that can correctly represent some important information of the original graph [4]. Moreover, graph embedding enables practical application problems such as node/graph classification [5], node clustering [6], connection prediction [7], and recommendation systems [8]. Initially, much of the research on graph embedding focuses on the homogeneous networks, which consist of only one node type and edge type [9]. However, the relationships in the real-world are naturally modeled as heterogeneous networks, which contain richer and more complex semantics and structures. As more than one node type and edge type are provided, the graph embedding techniques for homogeneous networks cannot be directly applied to heterogeneous networks [10]. Therefore, how to learn embedding of heterogeneous networks has become one of the hot topics of research in artificial intelligence [11].

Initially, matrix decomposition methods [12], e.g., singular value decomposition (SVD) [13] and non-negative matrix factorization (NMF) [14], were used to generate latent features in heterogeneous networks. The main idea of this kind of method is to represent the association information of a graph as a matrix and then decompose that matrix to generate a low-dimensional vector for each node. However, decomposing large-scale matrices is usually computationally expensive, and there are also statistical performance drawbacks [15]. Moreover, matrix decomposition cannot easily incorporate the features of different types of nodes and contexts, so it may fail to obtain enough effective information. Subsequently, random walk-based approaches were proposed to learn the embeddings of nodes in a graph. This kind of method evolved from the Word2vec model in natural language processing [16]: different walk strategies are used to obtain node sequences from the graph, and then the Skip-Gram word embedding model is used to learn the node embeddings. For example, DeepWalk [17] randomly samples the neighbors of nodes in the network to form fixed-length random walk sequences, while the Node2vec model [18] obtains random walk sequences by adjusting the parameters of Breadth-First Search (BFS) and Depth-First Search (DFS) strategies. However, these two random walk strategies do not distinguish the forward direction, and when applied to a heterogeneous graph with various node types, they easily lead to statistical bias.

Therefore, Metapath2vec [19] was proposed. It generates sequences of heterogeneous nodes by introducing meta-paths to guide the random walk, which captures the semantic and structural information between different types of nodes and ensures the correctness of semantic changes; the nodes in each sequence are then treated as words, and the Skip-Gram model is adopted to learn the embeddings of the heterogeneous graph. In essence, Metapath2vec transforms heterogeneous graph embedding into a word embedding problem and can capture the structural and semantic relevance of different types of nodes and relations. The word embedding model is therefore the key to achieving good embedding results. For example, the Skip-Gram model [20] is a local context window approach and a mainstream model for word vector learning, whereas global matrix decomposition models [21] (e.g., LSA) learn word vectors from global co-occurrence statistics. The former may be better at analogy tasks, but it is trained on separate local context windows with limited window lengths and does not make good use of the global statistics of the data. The latter makes effective use of statistical information but fails to capture semantic information and so performs poorly in tasks such as word similarity [22].

The Global Vectors for Word Representation (GloVe) model [23] combines the global statistical information of matrix decomposition (LSA) with the advantage of local co-occurrence windows. Its core idea is to use the number of co-occurrences between words for training. It trains word vectors efficiently and can carry more semantic information. Zheng Yanan et al. [24] extracted text features with GloVe, compared them with Word2vec word vectors, and then used an SVM to classify text; their experiments show that GloVe features give better text classification results. Shishir Kulkarni et al. [25] used second-order random walks to create a corpus and then generated node embeddings of the graph by adapting the GloVe algorithm; their experiments show that GloVe is also effective in extracting node features of a graph.

In this paper, an algorithm named Metapath–GloVe is proposed to improve the efficiency of the Metapath2vec algorithm. The main idea of this method is to train the heterogeneous node sequences, which are obtained by random walk based on Metapath2vec algorithm, and then use the GloVe model to complete the learning of node embedding in heterogeneous graphs [26]. Following that, the proposed algorithm is applied to analyze the MOOC course data, which are collected from the MOOC online learning platform [27]. The dataset is modeled into a heterogeneous graph by extracting the users, videos, course, and the asymmetric relationship among them. Based on the constructed heterogeneous graph, the proposed algorithm is used to transform the structured information of the heterogeneous graph into vector information. Subsequently, based on the learned user and video embedding vectors, a joint Spectral Co-clustering algorithm is used to perform association analysis of users and videos. Then, a classification-based link prediction model is constructed by obtaining the embedding vectors of user–video links to make video recommendations to users.

The main contributions of this paper can be summarized as follows: (1) A heterogeneous graph embedding learning algorithm is proposed by training GloVe word embedding model on meta-path random walks, which can improve the efficiency of heterogeneous node embedding learning. (2) The proposed algorithm is applied on the MOOC course data to learn embeddings of users and videos, which lays the foundation for the realization of subsequent analysis tasks. (3) The proposed method has better results than traditional methods for user and video association analysis and video recommendation experiments on MOOC course data.

The rest of the paper is organized as follows: Section 2 introduces the proposed improved Metapath2vec algorithm in the context of MOOC data; Section 3 presents the application of the algorithm to MOOC data analysis; Section 4 gives the experimental results and their analysis; Section 5 summarizes the paper.

2. Learning MOOC Course Data Node Embedding Based on Metapath–GloVe

In this section, the proposed algorithm Metapath–GloVe is applied to the context of MOOC data. Firstly, the relationship between users, videos, and courses in MOOC data is extracted, and a heterogeneous graph of MOOC data is constructed. Secondly, random walk sequences, which contain nodes in the type user and video, are obtained from the heterogeneous graph by implementing meta-path random walk. Then, in view of the advantages of the GloVe model, this model is used to learn the node embedding of users and videos based on the meta-path sequences. Finally, an overall framework of the proposed algorithm for learning node embeddings in heterogeneous graphs is given.

2.1. The Construction of Heterogeneous Graph from MOOC Course Data

In general, a graph is structured data, essentially a collection of vertices (nodes) connected by edges. A graph is usually represented by $G = (V, E)$, where $V$ and $E$ denote the set of all nodes and the set of all edges, respectively. The types of nodes and edges are described by a node type mapping function $φ : V \to T_{v}$, such that $\forall v \in V, φ(v) \in T_{v}$, where $T_{v}$ is the set of node types, and an edge type mapping function $ψ : E \to T_{e}$, such that $\forall e \in E, ψ(e) \in T_{e}$, where $T_{e}$ is the set of edge types. Graphs can be classified as homogeneous or heterogeneous according to the types of nodes and edges. Homogeneous graphs have only one node type and one edge type, i.e., $|T_{v}| + |T_{e}| = 2$, and usually have a simple network structure. Heterogeneous graphs have more than one node type or edge type, i.e., $|T_{v}| + |T_{e}| > 2$, and contain richer semantic information and a more complex network structure.

In this paper, users, videos, courses, and their relationships are extracted from MOOC course data, and a heterogeneous graph is built as shown in Figure 1. The heterogeneous graph contains three types of nodes (users, videos, and courses), and different types of relationships exist between different nodes. Specifically, if a user watches a video, there is a link between the user and the video, and if a video belongs to a course, there is a link between the video and the course. In the following experiments, we mainly analyze the relationship between users and videos in this heterogeneous graph, and courses are treated as labels to evaluate the proposed algorithm. Thus, we adopt $B(U, V, E)$ to represent the object graph, where $U$ and $V$ stand for users and videos, respectively, and $E$ is the set of relationships between users and videos.

2.2. The Acquisition of Meta-Path Sequences Based on Heterogeneous Random Walk

Once the heterogeneous graph is obtained, a heterogeneous random walk is adopted to acquire node sequences with multiple node types. A pre-set meta-path scheme $ρ$ is introduced in the heterogeneous graph to guide the random walk, so that the generated node sequences capture the semantic and structural correlations between different types of nodes. A meta-path scheme $ρ$ is defined as in Equation (1).

$$V_{1} \overset{r_{1}}{\to} V_{2} \overset{r_{2}}{\to} \dots V_{t} \overset{r_{t}}{\to} V_{t + 1} \dots \overset{r_{l - 1}}{\to} V_{l}$$ (1)

where $R = r_{1} \circ r_{2} \circ \dots \circ r_{l - 1}$ denotes the composite relation between node types $V_{1}$ and $V_{l}$. A meta-path scheme built from this composition expresses a higher-order relation (across nodes of different types), so it can capture more complex and richer semantic relations than a first-order relation (between nodes of the same type).

Given a meta-path scheme $ρ$, the transition probability at step $i$ of the meta-path random walk is defined in Equation (2).

$$p(v^{i + 1} \mid v_{t}^{i}, ρ) = \begin{cases} \frac{1}{|N_{t + 1}(v_{t}^{i})|}, & (v^{i + 1}, v_{t}^{i}) \in E,\ φ(v^{i + 1}) = t + 1 \\ 0, & (v^{i + 1}, v_{t}^{i}) \in E,\ φ(v^{i + 1}) \neq t + 1 \\ 0, & (v^{i + 1}, v_{t}^{i}) \notin E \end{cases}$$ (2)

where $v_{t}^{i}$ is a node of type $V_{t}$, and $N_{t + 1}(v_{t}^{i})$ denotes the neighbors of node $v_{t}^{i}$ that are of type $V_{t + 1}$. The random walk is therefore biased by the predefined meta-path scheme $ρ$. A meta-path-based random walk thus captures the semantic and structural information among different types of nodes, ensures the correctness of semantic changes, and effectively avoids the statistical bias caused by a high proportion of one node type, so that the heterogeneous graph can be fed into the subsequent model to complete the node embedding.

In our MOOC course heterogeneous graph $B(U, V, E)$, two meta-paths are constructed to form the meta-path scheme shown in Equation (3).

$$ρ : \begin{cases} R_{1} : U - V - U \ (\text{user-video-user}) \\ R_{2} : V - U - V \ (\text{video-user-video}) \end{cases}$$ (3)

In detail, $R_{1}$ describes the relationship between users who have clicked on the same video, and $R_{2}$ describes the relationship among videos that have been clicked by the same user. Biased random walks are guided by the meta-path scheme to obtain node sequences with multiple node types.
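As an illustration of Equations (2) and (3), the following minimal sketch walks the U - V - U pattern on a toy bipartite graph, choosing uniformly among the neighbors of the type required by the meta-path. The function name `metapath_walk`, the toy watch data, and the walk length are assumptions made for this example and are not taken from the paper's implementation.

```python
import random

def metapath_walk(user_to_videos, video_to_users, start_user, walk_length, rng):
    """Walk U - V - U - ... on a bipartite graph.

    At every step the next node is drawn uniformly from the neighbours whose
    type matches the meta-path, i.e. with probability 1/|N_{t+1}(v)| as in
    Equation (2); all other nodes have probability 0.
    """
    walk = [start_user]
    current, on_user = start_user, True
    while len(walk) < walk_length:
        adjacency = user_to_videos if on_user else video_to_users
        neighbours = adjacency.get(current, [])
        if not neighbours:  # dead end: no neighbour of the required type
            break
        current = rng.choice(neighbours)
        walk.append(current)
        on_user = not on_user
    return walk

# Hypothetical toy data: user id -> watched video ids (video ids start at 10000).
user_to_videos = {0: [10000, 10001], 1: [10001], 2: [10001, 10002]}
video_to_users = {}
for user, videos in user_to_videos.items():
    for video in videos:
        video_to_users.setdefault(video, []).append(user)

rng = random.Random(0)
# Two walks of length 7 per user (both numbers are illustrative choices).
walks = [metapath_walk(user_to_videos, video_to_users, u, 7, rng)
         for u in user_to_videos for _ in range(2)]
```

Walks for the V - U - V meta-path can be obtained the same way by starting from a video node and swapping the roles of the two adjacency dictionaries.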

2.3. Learning Node Embedding Based on GloVe Model

Once the node sequences are obtained, the sequences are treated as a corpus and the nodes as words, and the GloVe model [23] is adopted to learn the node embeddings. The GloVe model uses both the global statistical features and the local context features of the corpus. Its procedure is as follows.

Step 1: The node sequences generated by the random walk are treated as the corpus and used as the input of the model.

Step 2: A global co-occurrence matrix is generated. A context window traverses the corpus from beginning to end, and the number of times two nodes occur in the same window is counted. The co-occurrence matrix is denoted $X$, whose element $X_{i, j}$ is the number of times the words $i$ and $j$ appear together in a window.

Step 3: An approximate relationship between the word vectors and the co-occurrence matrix is constructed as shown in Equation (4).

$$u_{i}^{T} v_{j} + b_{i} + b_{j} = \log(X_{i, j})$$ (4)

where $u_{i}$ and $v_{j}$ are the word vectors of the words $i$ and $j$, and $b_{i}$ and $b_{j}$ are the bias terms of the two word vectors. The equation means that the inner product of the word vectors, plus the bias terms, should approximate the logarithm of the corresponding co-occurrence count.

Step 4: The loss function $L$ is constructed from this approximate relationship as a weighted squared error, as shown in Equation (5).

$$L = \sum_{i, j = 1}^{N} f(X_{i, j}) {(u_{i}^{T} v_{j} + b_{i} + b_{j} - \log(X_{i, j}))}^{2}$$ (5)

where $N$ is the dimension of the co-occurrence matrix and $f(x)$ is the weight function given in Equation (6). $f(x)$ is non-decreasing, so the weight does not decrease as the number of co-occurrences increases. In this paper, $x_{\max} = 100$ and $α = 0.75$.

$$f(x) = \begin{cases} {(x / x_{\max})}^{α}, & x < x_{\max} \\ 1, & x \geq x_{\max} \end{cases}$$ (6)
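Steps 2 to 4 can be sketched in a few lines of plain Python. This is a hedged illustration only: the helper names (`cooccurrence`, `weight`, `glove_loss`), the toy walks, and the window size are assumptions, the counts are plain window counts as described in Step 2, and the sum in Equation (5) is taken over non-zero entries so that the logarithm is defined.

```python
import math
from collections import defaultdict

def cooccurrence(walks, window):
    """Step 2: count how often two nodes fall inside the same context window."""
    counts = defaultdict(float)
    for walk in walks:
        for i, center in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, walk[j])] += 1.0
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Equation (6): (x / x_max)^alpha below x_max, capped at 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(counts, u, v, b_u, b_v):
    """Equation (5), summed over the non-zero co-occurrence entries only."""
    total = 0.0
    for (i, j), x in counts.items():
        dot = sum(a * b for a, b in zip(u[i], v[j]))
        total += weight(x) * (dot + b_u[i] + b_v[j] - math.log(x)) ** 2
    return total

# Hypothetical node sequences, e.g. produced by a meta-path random walk.
walks = [["u0", "v0", "u1", "v1"], ["u1", "v0", "u0", "v0"]]
X = cooccurrence(walks, window=1)  # the window size is an illustrative choice
```

With $x_{\max} = 100$ and $α = 0.75$, a pair seen 3 times receives a weight of about 0.07, while any pair seen 100 times or more receives the full weight of 1.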

In this paper, AdaGrad [28] is adopted to train the model. First, all non-zero pairs $(i, j)$ in the co-occurrence matrix $X$ are randomly sampled as training data, and the corresponding word vectors $u_{i}$, $v_{j}$ and the two biases $b_{i}$ and $b_{j}$ are randomly initialized, so that the initial loss value can be calculated. Next, the gradient of the loss is computed and, with a given learning rate, $u_{i}$, $v_{j}$, $b_{i}$, and $b_{j}$ are updated. This process is repeated until the given iteration condition is reached. Finally, the output vectors are the embedding feature representations of the nodes. The obtained embeddings express the semantic information of the whole heterogeneous network and the relationships between nodes, and can be used for node classification, clustering, or similarity search.
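The training loop described above might look as follows in plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the embedding dimension, number of epochs, learning rate, zero-initialized biases, and toy co-occurrence counts are all hypothetical, and returning the sum of the two vectors of a node follows the convention of the original GloVe work, which this paper does not confirm.

```python
import math
import random

def train_glove(counts, dim=4, epochs=100, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Fit u_i^T v_j + b_i + b_j ~ log X_ij with per-parameter AdaGrad steps."""
    rng = random.Random(seed)
    nodes = sorted({node for pair in counts for node in pair})
    u = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}
    v = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}
    bu = {n: 0.0 for n in nodes}
    bv = {n: 0.0 for n in nodes}
    # AdaGrad state: running sum of squared gradients for every parameter.
    gu = {n: [0.0] * dim for n in nodes}
    gv = {n: [0.0] * dim for n in nodes}
    gbu = {n: 0.0 for n in nodes}
    gbv = {n: 0.0 for n in nodes}
    pairs = list(counts.items())
    history = []
    for _ in range(epochs):
        rng.shuffle(pairs)  # visit the non-zero pairs in random order
        loss = 0.0
        for (i, j), x in pairs:
            f = (x / x_max) ** alpha if x < x_max else 1.0
            err = sum(a * b for a, b in zip(u[i], v[j])) + bu[i] + bv[j] - math.log(x)
            loss += f * err * err
            g = 2.0 * f * err  # gradient of the pair loss w.r.t. either bias
            for k in range(dim):
                grad_u, grad_v = g * v[j][k], g * u[i][k]
                gu[i][k] += grad_u ** 2
                gv[j][k] += grad_v ** 2
                u[i][k] -= lr * grad_u / (math.sqrt(gu[i][k]) + 1e-8)
                v[j][k] -= lr * grad_v / (math.sqrt(gv[j][k]) + 1e-8)
            gbu[i] += g * g
            gbv[j] += g * g
            bu[i] -= lr * g / (math.sqrt(gbu[i]) + 1e-8)
            bv[j] -= lr * g / (math.sqrt(gbv[j]) + 1e-8)
        history.append(loss)
    # Assumption: use u + v as the final embedding of each node.
    embeddings = {n: [a + b for a, b in zip(u[n], v[n])] for n in nodes}
    return embeddings, history

# Hypothetical symmetric co-occurrence counts for four nodes.
counts = {("u0", "v0"): 3.0, ("v0", "u0"): 3.0, ("u1", "v0"): 2.0,
          ("v0", "u1"): 2.0, ("u1", "v1"): 1.0, ("v1", "u1"): 1.0}
embeddings, history = train_glove(counts)
```

The per-epoch loss stored in `history` gives a simple way to monitor the stopping condition mentioned above.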

2.4. An Overall Process of Metapath–GloVe Algorithm

In this paper, the proposed Metapath–GloVe algorithm is used to analyze MOOC course data collected from the MOOC platform. The overall process of learning the node embeddings of MOOC course data based on Metapath–GloVe is shown in Figure 2. First, users, videos, courses, and the relationships among them are extracted from the MOOC course data to construct the heterogeneous graph shown in Figure 1. In this graph, we focus on the relationship between users and videos, while courses are treated as labels for evaluation. Then, meta-paths are pre-set to guide the random walk and obtain heterogeneous node sequences. These sequences are input to the GloVe model as a text corpus, from which the co-occurrence matrix is counted. Word vector training is then performed on this co-occurrence matrix, and the final embedding feature matrix is obtained by optimizing the loss function. Based on the learned node embeddings, tasks such as association partitioning and user–video link prediction can be carried out.

3. Data Analysis of MOOC Course Data Based on Metapath–GloVe

After learning the node embeddings based on Metapath–GloVe, two methods are proposed to analyze the MOOC course data: the node embedding-based association partitioning method and the node embedding-based user–video link prediction method. The details are shown in the following.

3.1. User/Video Association Analysis

Based on the learned node embeddings, a clustering algorithm can be adopted to achieve association analysis. The experimental process is shown in Figure 3, and Figure 4 demonstrates the effect of clustering analysis on MOOC course data. In detail, Figure 4a shows the adjacency matrix of the graph, Figure 4b shows the learned node embedding feature matrix, and Figure 4c shows the result of the clustering analysis. Figure 4a,b show no regular pattern, whereas in Figure 4c the association between users and videos can be extracted easily after clustering.

In this paper, Spectral Co-clustering [29] is implemented on the embeddings of the user and video to perform association partitioning. The processing is shown as follows.

Step 1: An input matrix $A$ is constructed by setting users and videos as the rows and columns of the matrix, respectively. Each element is the cosine similarity of the corresponding user and video embeddings.

Step 2: The input matrix $A$ is normalized as follows.

$$A_{n} = R^{- \frac{1}{2}} A C^{- \frac{1}{2}}$$ (7)

where $R$ is the diagonal matrix whose $i$-th diagonal entry equals $\sum_{j} A_{i j}$, and $C$ is the diagonal matrix whose $j$-th diagonal entry equals $\sum_{i} A_{i j}$.

Step 3: Singular value decomposition is applied to the normalized matrix $A_{n}$, as shown in Equation (8).

$$A_{n} = U \Sigma V^{T}$$ (8)

This yields a partition of the rows and columns of $A$: a subset of the left singular vectors gives the row partitions, and a subset of the right singular vectors gives the column partitions.

Step 4: The matrix $Z$, which provides the required partitioning information, is calculated by Equation (9).

$$Z = \begin{bmatrix} R^{- 1 / 2} U \\ C^{- 1 / 2} V \end{bmatrix}$$ (9)

Here, the columns of $U$ are $u_{2}, u_{3}, \dots, u_{λ + 1}$, and the columns of $V$ are defined similarly.

Step 5: The clustering result is obtained by running k-means on the rows of $Z$.
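Steps 1 and 2 of this procedure can be illustrated with a small, dependency-free sketch. The function names and the toy embedding vectors are assumptions for this example; the sketch also assumes that all cosine similarities are positive, since negative row or column sums would make the square roots in Equation (7) undefined.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_similarity(user_vecs, video_vecs):
    """Step 1: A_ij = cos(user_i, video_j); Step 2: scale A as in Equation (7)."""
    A = [[cosine(u, v) for v in video_vecs] for u in user_vecs]
    row_sums = [sum(row) for row in A]                      # diagonal of R
    col_sums = [sum(row[j] for row in A)                    # diagonal of C
                for j in range(len(video_vecs))]
    # Assumes every row and column sum is positive.
    scaled = [[A[i][j] / math.sqrt(row_sums[i] * col_sums[j])
               for j in range(len(video_vecs))]
              for i in range(len(user_vecs))]
    return A, scaled

# Hypothetical 2-dimensional embeddings of two users and three videos.
user_vecs = [[1.0, 0.1], [0.1, 1.0]]
video_vecs = [[1.0, 0.2], [0.2, 1.0], [0.9, 0.1]]
A, A_scaled = normalized_similarity(user_vecs, video_vecs)
```

The remaining steps (singular value decomposition and k-means on the rows of $Z$) are omitted here; scikit-learn's `SpectralCoclustering` estimator implements a procedure of this form and may be a convenient starting point.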

3.2. User–Video Link Prediction Method Based on Node Embeddings

In this section, a link prediction task is implemented on the MOOCCube data to evaluate the performance of the Metapath–GloVe algorithm. Users are treated as nodes $U$, videos are treated as nodes $V$, and the relations between them are set as user–video links ($U$-$V$). The process of link prediction is shown in Figure 5.

Step 1: The user–video pairs with existing links are considered positive node pairs and form the set $E$ of link relationships; all information is integrated to build a user–video bipartite graph $B(U, V, E)$.

Step 2: Randomly select 20% of the positive node pairs as test samples and use the remaining 80% as training samples.

Step 3: All unlinked user–video pairs are treated as negative links, from which the same numbers of negative links are randomly selected to form the test and training sets, respectively.

Step 4: Generate a new user–video bipartite graph $B'(U, V, E')$ by removing the positive test links, and learn the node embeddings of $B'(U, V, E')$ by using Metapath–GloVe.

Step 5: The edge embedding values of all positive and negative node pairs in the training and test sets are computed by the following formula.

$$\mathit{value} = |a - b|$$ (10)

where $a$ is the feature vector of the user node in a link, and $b$ is the feature vector of the video node in the same link.

Step 6: The positive node pairs in the training and test sets are assigned a label value of 1, and the negative node pairs are assigned a label value of 0.

Step 7: The edge embedding values are used as inputs and the label values as outputs to train different classifiers.

Step 8: The trained classifiers, i.e., a Bagging classifier, a Stacking classifier, and a Neural Network (MLP) classifier, are used to predict the links in the test set.
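Steps 2, 3, and 5 can be sketched as follows. The helper names, the toy link set, and the random seed are assumptions for illustration, and Equation (10) is read here as an element-wise absolute difference, which is one plausible interpretation rather than a detail confirmed by the text.

```python
import random

def build_link_samples(positive_links, users, videos, test_ratio=0.2, seed=0):
    """Split positive links 80/20 and draw equally many negative links."""
    rng = random.Random(seed)
    positives = sorted(positive_links)
    rng.shuffle(positives)
    n_test = int(round(test_ratio * len(positives)))
    test_pos, train_pos = positives[:n_test], positives[n_test:]
    existing = set(positive_links)
    # Every unlinked user-video pair is a candidate negative link.
    negatives = [(u, v) for u in users for v in videos if (u, v) not in existing]
    rng.shuffle(negatives)
    test_neg = negatives[:len(test_pos)]
    train_neg = negatives[len(test_pos):len(test_pos) + len(train_pos)]
    return train_pos, test_pos, train_neg, test_neg

def edge_feature(a, b):
    """Equation (10), assumed element-wise: |a - b| for two embedding vectors."""
    return [abs(x - y) for x, y in zip(a, b)]

# Hypothetical toy graph: 5 users, 5 videos, 10 watch links.
users = list(range(5))
videos = [10000 + k for k in range(5)]
positive_links = [(u, 10000 + (u + k) % 5) for u in users for k in range(2)]
train_pos, test_pos, train_neg, test_neg = build_link_samples(
    positive_links, users, videos)
```

The node embeddings would then be learned on the graph without `test_pos`, and `edge_feature` applied to each pair gives the classifier inputs, with label 1 for positive pairs and 0 for negative pairs.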

4. Experiments and Results Analysis

In this paper, the two methods proposed in Section 3 are implemented on the MOOC course dataset to perform user/video association analysis and user–video link prediction. To validate the effectiveness and scalability of the proposed methods, extensive experiments were conducted as follows.

4.1. Result of User/Video Association Analysis on MOOC Course Dataset

In the MOOC course dataset, the label of users is not available. To better evaluate the proposed method, synthetic datasets are generated with explicit labels. Therefore, we implement the association analysis method on both a synthetic dataset and the MOOC course dataset.

4.1.1. Synthetic Data

In this paper, synthetic datasets are generated based on the scheme shown in Figure 6. Suppose there are two node types, $T_{1}$ and $T_{2}$, and each node type has two node clusters, $T_{1} : [C_{1}, C_{2}]$ and $T_{2} : [C_{3}, C_{4}]$. The number of nodes in each cluster is set to 100; thus, the size of the synthetic graph is 200 × 200, and the nodes are labelled from 1 to 399. Following the clustering principle that connections are dense within a cluster and sparse between clusters, we adopt the connection probabilities $P = [P_{1}, P_{2}, P_{3}, P_{4}]$ to represent the degree of sparsity of connections. As shown in the figure, $P_{1}, P_{4}$ are the connection probabilities within clusters and $P_{2}, P_{3}$ are the connection probabilities between clusters. When $P$ is given, edges are generated randomly according to these probabilities. Three sets of connection probabilities, $P = [0.9, 0.1, 0.1, 0.9]$, $P = [0.7, 0.3, 0.3, 0.7]$, and $P = [0.6, 0.4, 0.4, 0.6]$, were used in the experiments, and the data distribution is shown in Figure 7. Figure 7a has a clear cluster division; as the within-cluster probability decreases and the between-cluster probability increases, the noise rises noticeably.
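A generator of this kind can be sketched as a 2 × 2 block model. The mapping of $P_{1}, \dots, P_{4}$ to blocks (rows for the clusters of $T_{1}$, columns for the clusters of $T_{2}$) is an assumption about the layout in Figure 6, and the function names and seed are hypothetical.

```python
import random

def synthetic_bipartite(p, cluster_size=100, seed=0):
    """Random adjacency matrix between T1 nodes (rows) and T2 nodes (columns).

    Assumed block layout: p = [P1, P2, P3, P4] with
    P1: C1-C3, P2: C1-C4, P3: C2-C3, P4: C2-C4.
    """
    rng = random.Random(seed)
    n = 2 * cluster_size
    block_prob = [[p[0], p[1]], [p[2], p[3]]]
    return [[1 if rng.random() < block_prob[i // cluster_size][j // cluster_size] else 0
             for j in range(n)]
            for i in range(n)]

def block_density(adj, rows, cols):
    """Fraction of possible edges that are present in one block."""
    cells = [adj[i][j] for i in rows for j in cols]
    return sum(cells) / len(cells)

adj = synthetic_bipartite([0.9, 0.1, 0.1, 0.9])
within = block_density(adj, range(0, 100), range(0, 100))     # close to P1
between = block_density(adj, range(0, 100), range(100, 200))  # close to P2
```

Passing `[0.7, 0.3, 0.3, 0.7]` or `[0.6, 0.4, 0.4, 0.6]` produces the noisier settings described above.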

In this paper, we evaluate the clustering quality using a combination of commonly used internal and external indices. Our evaluation includes one internal evaluation index, the Silhouette Coefficient (S), which needs no ground-truth labels, and four external evaluation indices, which compare the clustering with the true labels: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI), and Fowlkes–Mallows Scores (FMI). Specifically, the Silhouette Coefficient (S) measures the degree of cohesion and separation of clusters, with a value range of [−1,1]. The ARI measures the degree of agreement between the clustering results and the true classification, with a value range of [−1,1]. The AMI measures the similarity of the clustering results to the true classification, with a value range of [−1,1]. The NMI also measures the similarity of the clustering results to the true classification, with a value range of [0,1]. Lastly, the FMI is the geometric mean of the precision and recall of the clustering results with respect to the true labels, with a value range of [0,1]. For all of these indices, higher values indicate better clustering results and lower values indicate poorer clustering quality.

In this paper, the Metapath–GloVe algorithm, together with the NMF and Metapath2vec algorithms, is implemented on the three generated datasets for association analysis evaluation. The important hyperparameters of each method are shown in Table 1. Node embeddings are learned by each method and then fed into the clustering algorithm. The average evaluation results of the five indicators are shown in Table 2. As the table shows, the cluster relation is distinct in the case [0.9,0.1,0.1,0.9], and all three methods achieve a perfect result. As the within-cluster probability decreases and the between-cluster probability increases, the performance of all methods drops to different degrees. However, the proposed Metapath–GloVe method achieves the best result in the other two cases.

4.1.2. MOOC Course Dataset

In this section, the association analysis is implemented on the MOOC course dataset. In detail, the MOOC course dataset is extracted from the MOOCCube data for the concept named “K_Data_Structure_Computer_Science_Technology”, which includes data on courses, videos, users, and their relationships. Table 3 provides details on the specific data analyzed in this paper. User information is treated as node type U: each user node is given the node attribute “id”, numbered from 0, so the user node “id” values are 0–1049. Video information is treated as node type V: each video node is given the node attribute “video”, numbered from 10,000, so the video node “video” values are 10,000–10,586. The MOOC course dataset also includes the relationships between videos and courses, where videos are contained within courses. The videos are classified into 12 categories based on the course category, which serve as labels for the external evaluation indicators. Table 4 provides a detailed list of these categories.

Similarly, the Metapath–GloVe algorithm, together with the NMF and Metapath2vec algorithms, is implemented on the MOOC course dataset. The optimal parameter values are selected through grid search, and the important hyperparameters of each method on this dataset are shown in Table 5. For clustering, the average value of each group is again used as the final evaluation value; the results for the five indicators are shown in Table 6, and Figure 8 shows the clustering visualization results for each method.

As seen from Table 6, the proposed method achieves the highest values of all the indicators on the MOOC course dataset. In addition, we used the function perf_counter() in the time library to measure the running time of each algorithm. Because NMF is not iterative and does not need mapping and relationship identification, it saves running costs: its running time of 14 s is the shortest. However, the visualization in Figure 8a shows that the NMF algorithm cannot effectively divide the nodes into different groups, and the results are not ideal. Metapath2vec has the longest running time (4413.6 s). In the Metapath2vec clustering results in Figure 8b, two types of label nodes, six and seven, are missing, which means the videos of types “ARM Microcontrollers and Embedded Systems (Spring 2019)” and “Operating system (autonomous mode)” cannot be detected. The proposed Metapath–GloVe algorithm needs 2252.9 s, roughly half the running time of Metapath2vec. Moreover, the clustering result of the Metapath–GloVe algorithm in Figure 8c is missing only the nodes labeled six, which indicates that the Metapath–GloVe model performs better.

In addition, we compare the clustering results of the Metapath–GloVe and Metapath2vec algorithms statistically. The null hypothesis is that there is no significant difference in NMI between the Metapath–GloVe and Metapath2vec algorithms. A t-test gives a p value of 1.9721 × 10⁻³⁷, which is much less than 0.05, so we reject the null hypothesis: there is a significant difference in NMI between Metapath–GloVe and Metapath2vec. Similarly, Metapath–GloVe outperforms Metapath2vec in all metrics, which indicates that Metapath–GloVe performs better on clustering.

4.2. User–Video Link Prediction

In this section, the link prediction method described in Section 3.2 is implemented on the MOOC course data. In detail, 20% of the user–video links are selected for prediction evaluation.

(1): Evaluation indicators

In this paper, four metrics, namely accuracy, precision, recall, and F1 value, are adopted to evaluate the results of link prediction. They are calculated as shown in Equations (11)–(14).

\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(11)

\mathrm{precision} = \frac{TP}{TP + FP}

(12)

\mathrm{recall} = \frac{TP}{TP + FN}

(13)

F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

(14)

where TP is the number of positive cases that the model predicts correctly, FN is the number of positive cases that the model predicts incorrectly, FP is the number of negative cases that the model predicts incorrectly, and TN is the number of negative cases that the model predicts correctly.
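Equations (11)–(14) translate directly into code. The sketch below is a minimal illustration; the function name link_prediction_metrics and the counts passed to it are hypothetical and are not taken from the experiments. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def link_prediction_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = link_prediction_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```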

(2): Experimental results

The node clustering results make clear that NMF learns the node features of this data poorly; therefore, this experiment compares only two algorithmic models, Metapath2vec and Metapath–GloVe. Figure 9 visualizes the confusion matrix, Table 7 gives examples of the comparative experimental results, and Table 8 shows the average results of 10 runs per model.

Table 7 lists some examples of user–video link prediction results. For example, the Metapath2vec algorithm fails to predict that user No. 998 has viewed video No. 10329 of the course “College Computer Fundamentals (Spring 2019)”, whereas the Metapath–GloVe algorithm predicts this link. Metapath2vec also misclassifies the negative pair of user No. 209 and video No. 10343 of the same course as a link. Combined with the confusion matrix visualization in Figure 9, the difference between the two models can be seen more intuitively: the Metapath–GloVe model misclassifies fewer positive examples and achieves a higher recommendation accuracy, so it is slightly better than Metapath2vec. Table 8 shows that the accuracy of Metapath–GloVe is higher than that of Metapath2vec for every classifier, and that the Bagging classifier achieves the highest accuracy and F1 value among them.
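As a small worked example, the ten example pairs of Table 7 can be tallied into confusion-matrix counts. The labels and predictions below are copied from Table 7 (1 denotes a link, 0 no link); the function confusion_counts is a hypothetical helper. These ten pairs are only examples from the test set, so the counts do not reproduce Figure 9 or Table 8.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 means a link exists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Five positive pairs followed by five negative pairs, in Table 7 order.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
metapath2vec_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
metapath_glove_pred = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for name, pred in [("Metapath2vec", metapath2vec_pred),
                   ("Metapath-GloVe", metapath_glove_pred)]:
    tp, tn, fp, fn = confusion_counts(y_true, pred)
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn}")
```

On these ten examples, Metapath2vec makes four errors (two false negatives and two false positives), while Metapath–GloVe makes two (one of each).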

Similarly, the null hypothesis is that there is no significant difference between the link prediction accuracy of the Metapath–GloVe and Metapath2vec algorithms. A t-test gives a p value of 2.8635 × 10⁻¹³, which is far below 0.05, so we reject the null hypothesis: the difference between the link prediction accuracy of the Metapath–GloVe and Metapath2vec algorithms is significant.

5. Summary

As many real-world scenarios are commonly represented by heterogeneous graphs, graph embedding has become an important technique for graph analysis and applications. In this paper, an algorithm is proposed to improve the embedding of heterogeneous graphs; it combines meta-path random walk, which obtains sequences of heterogeneous nodes, with a GloVe model, which learns word vectors from global text statistics. The proposed model not only uses meta-path random walk to capture semantic and structural information among different nodes, but also utilizes global statistical information in the vectorized representation of heterogeneous graph nodes. In the MOOC course data, rich auxiliary information, such as viewing order and viewing duration, is usually associated with each video. It will be interesting to study how to combine such auxiliary information to further improve the efficiency of heterogeneous node embedding learning.

Author Contributions

Conceptualization, J.F. and X.H.; methodology, C.X.; software, C.X.; writing—original draft preparation, J.F.; writing—review and editing, X.X.; visualization, Y.L.; supervision, X.H.; project administration, X.X.; funding acquisition, P.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Zhejiang Province Public Welfare Technology Application Research Project (LGF21F020013) and the Zhejiang Province Key R&D projects (2021C03142).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. A heterogeneous graph of MOOC course data.

Figure 2. The process of learning node embedding of MOOC course data.

Figure 3. User/video association analysis experimental flowchart.

Figure 4. Association analysis of MOOC course data.

Figure 5. The flowchart of user–video link prediction on MOOC course data.

Figure 6. Scheme of generating the synthetic dataset.

Figure 7. Distribution of synthetic dataset with different connection probabilities.

Figure 8. Visualization of clustering results of MOOC course dataset.

Figure 9. Visualization of confusion matrix.

Table 1. Hyperparameters of various methods on the self-generated synthetic dataset.

Method	Important Hyperparameters
NMF	k = 150
Metapath2vec	The meta-path random walk step size is 20,000 and the dimension is 150. The window size is 30, the learning rate is 0.01, and the number of iterations is 20.
Metapath-GloVe	The meta-path random walk step size is 20,000, the dimension is 150, the window size is 30, the learning rate is 0.01, and the number of iterations is 20.

Table 2. Evaluation results of various methods on the self-generated ideal dataset.

Connection Probabilities	Method	S	ARI	NMI	AMI	FMI
(0.9,0.1,0.1,0.9)	NMF	0.2166	1.0	1.0	1.0	1.0
	Metapath2vec	0.299	1.0	1.0	1.0	1.0
	Metapath-GloVe	0.9374	1.0	1.0	1.0	1.0
(0.7,0.3,0.3,0.7)	NMF	−0.1326	0.0010	0.0144	0.0083	0.6651
	Metapath2vec	0.0120	0.0046	0.0081	0.0034	0.5016
	Metapath-GloVe	0.5974	1.0	1.0	1.0	1.0
(0.6,0.4,0.4,0.6)	NMF	−0.1156	0.0097	0.0170	0.0127	0.6089
	Metapath2vec	0.0142	0.1674	0.1292	0.1260	0.5824
	Metapath-GloVe	0.033	0.5468	0.4512	0.4492	0.7728

Table 3. Entities in the MOOC course dataset and various contact statistics.

Entity	Quantity	Contact	Quantity
User	1050	User-video	36,677
Video	587	Video-Course	587
Courses	12	Course-Concept	12
Concept	1

Table 4. Name of 12 types of courses.

Course Number	Course Name
0	Smart Car Production: Embedded Systems
1	C++ Language Programming Advanced (Spring 2019)
2	Embedded System Design
3	Java Programming by Doing (Autonomy Model)
4	College Computer Fundamentals (Spring 2019)
5	Java Programming Advanced (Spring 2019)
6	ARM Microcontrollers and Embedded Systems (Spring 2019)
7	Operating system (autonomous mode)
8	Java Programming (Autonomy Model)
9	Test system integration technology (autonomous mode)
10	Internet architecture (autonomous model)
11	Internet Architecture

Table 5. Hyperparameters of various methods on the MOOCCube dataset.

Method	Important Hyperparameters
NMF	k = 50
Metapath2vec	The meta-path random walk step size is 20,000, the dimension is 300, the window size is 15, the learning rate is 0.01, and the number of iterations is 20.
Metapath-GloVe	The meta-path random walk step size is 20,000, the dimension is 300, the window size is 15, the learning rate is 0.01, and the number of iterations is 20.

Table 6. Evaluation results of various methods on the MOOC course dataset.

Evaluation Metric	NMF	Metapath2vec	Metapath-GloVe
S	−0.029	0.3629	0.6411
ARI	0.6975	0.7570	0.8999
NMI	0.8334	0.8325	0.9021
AMI	0.8249	0.8236	0.8981
FMI	0.7393	0.7754	0.9156

Table 7. The example of user–video link prediction results.

Node Pair Type	(User, Video) Pair	Metapath2vec	Metapath-GloVe
Test Set Positive Node Pair	(437, 10335)	1	1
	(998, 10329)	0	1
	(1005, 10343)	0	0
	(1014, 10329)	1	1
	(1027, 10344)	1	1
Test Set Negative Node Pair	(64, 10344)	0	0
	(754, 10345)	1	1
	(357, 10348)	0	0
	(209, 10343)	1	0
	(1031, 10336)	0	0

Table 8. Model experimental results.

Method	Metric	Bagging	Stacking	MLP
Metapath–GloVe	accuracy	0.9387	0.9377	0.9125
	precision	0.9211	0.9116	0.9011
	recall	0.9574	0.9659	0.9249
	F1 value	0.9391	0.9381	0.9128
Metapath2vec	accuracy	0.9107	0.8982	0.8864
	precision	0.9117	0.9024	0.8913
	recall	0.9097	0.8941	0.8819
	F1 value	0.9108	0.8981	0.8865

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
