1. Introduction
The e-commerce sector continues to grow as Internet technology advances, particularly as enabling infrastructure such as 5G, mobile payment, and smartphones gradually matures. Online social networks have gained considerable traction in recent years and have become a vital component of many applications, including e-commerce platforms [1]. On e-commerce platforms built around online social networks [2], such as Amazon and eBay, users can review and comment on the products they have purchased.
Both researchers and policymakers rely on publicly available information to study the e-commerce platform market. However, many studies on e-commerce platforms are still hampered by the issue of data monopoly [3]. Despite widespread interest in e-commerce platforms, the companies that operate them retain control of their data, are often opaque, and are hesitant to share it with researchers. As a result, researchers are usually limited to experiments based on publicly available data.
Product prices are the most visible public information. For instance, Chevalier and Goolsbee [4] studied the price sensitivity of online consumers, Baye et al. [5] revealed the relationship between online search cost and price, and Hollenbeck et al. [6] studied how prices change when online reputation mechanisms dramatically increase the information available to consumers. One significant limitation of these studies is that, while price data are abundant and readily available, corresponding quantity data are not. Amazon, the world’s largest e-commerce platform, illustrates why: it does not directly display product sales volumes. To give new products more room to grow and to direct users’ attention to product reviews, Amazon weakens the influence of sales volume on purchase decisions and instead shows users a real-time “best seller rank” [7]. This rank is a key indicator of how well a product sells and stands in for the number of units sold in a given period. However, acquiring it requires continuously tracking changes on the Amazon platform, which is difficult and has a long acquisition cycle [3]. Review databases, by contrast, are comparatively easy to obtain from various e-commerce platforms. We therefore consider how to evaluate the very large-scale review data of major e-commerce platforms with the lowest possible time complexity in order to find high-impact products.
The following is a summary of the important contributions of this paper.
We propose a new book influence evaluation method based on user ratings of the e-commerce platform (URBI), whose computational cost is only O(n), where n is the total number of comments.
We ran experiments on a review dataset from Amazon, a real, large-scale e-commerce platform. We confirmed the effectiveness of the URBI method by designing two experiments with different time spans and comparing it with five other node influence assessment methods.
The remainder of this article is structured as follows. Section 2 discusses related work. Section 3 describes the preliminaries. Section 4 presents our URBI approach. In Section 5, we examine the efficacy of the proposed approach through two experiments with different time spans, comparing it with five other network analysis measures. Section 6 concludes the article.
2. Related Work
Recent research has found that a variety of real-world systems take the form of networks, including relational networks in social systems [8,9], protein networks [10,11], and collaborative networks [12]. Identifying key network nodes is a critical topic. As a result, screening for high-influence nodes, and in particular evaluating the importance of nodes in complex networks, has become a focal point of attention [13].
Network structure and operation are more affected by key nodes than by other nodes. Many fields [14] can benefit from key node identification, including community structure networks [15,16], disease [17], social network service [18,19,20], resource allocation [21], biological information [22], and so on [23]. For instance, Wang et al. [24] presented a new approach to extract community structure from the network by employing co-inversion of the original community whose degree of adjacent vertices is less than its degree. They believe that some significant nodes play a central role in a community. Cui et al. [25] discovered that different communities overlap in some real-world situations and proposed an ACC algorithm to detect overlapping community structures in complex networks based on the clustering coefficients of two adjacent maximal subgraphs. Li et al. [26] proposed a new method for detecting overlapping communities in a weighted network by using seed communities. Through simulation experiment results, Nian et al. [27] demonstrated that node activity will also affect immunity, with the strongest immunological effect depending on node activity.
The importance of nodes in complex networks can be assessed using a variety of techniques, such as degree centrality (DC) [28], eigenvector centrality (EC) [29], closeness centrality (CC) [30], betweenness centrality (BC) [31], PageRank (PC) [32], H-index [33], and so on [34].
In network analysis, DC is the most direct measure of node centrality: the higher the degree of a node, the more important it is in the network. According to EC, the importance of a node is determined by the number and importance of its neighbors. CC measures how close a node is to the other nodes in the network. CC is extremely sensitive to the network structure, and even minor changes will cause the node order to change. BC uses the number of shortest paths passing through a node to measure the importance of that node. Both CC and BC are therefore path-based measures of node importance. PC evaluates the impact of nodes by iterating over information about their neighbors; it is stable on scale-free networks but very sensitive on random networks.
In summary, the aforementioned approaches ignore the information that the nodes themselves carry when assessing node importance in complex networks. In particular, they do not consider users’ ratings of a book when evaluating the importance of book nodes in the user–book heterogeneous graph network built from the review dataset of an e-commerce platform. Consequently, there is room for improvement in the results provided by the current approaches.
3. Preliminary
Let G = (V, E) be a graph, where V is the set of nodes and E is the set of edges. In this section, we describe degree centrality (DC), eigenvector centrality (EC), closeness centrality (CC), betweenness centrality (BC), and PageRank (PC).
3.1. Degree Centrality (DC)
DC is the most direct measure of node centrality in network analysis. The higher the degree of a node, the higher its degree centrality and the more important the node is in the network. DC is defined as follows:
$$DC(i) = k_i = \sum_{j} a_{ij}$$
In this equation, $k_i$ stands for the degree of node i, $a_{ij}$ indicates the link from node i to node j (1 if the link exists and 0 otherwise), and DC(i) is the centrality score of node i.
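As an illustration, degree centrality on an undirected, unweighted network reduces to counting the edges incident to each node. The sketch below assumes the network is given as a plain edge list; the node names and the helper `degree_centrality` are hypothetical and not part of the original method description.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree} for an undirected, unweighted edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        # Each edge contributes one to the degree of both endpoints.
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

# Hypothetical toy user-book network: three users, two books.
edges = [("u1", "b1"), ("u2", "b1"), ("u3", "b1"), ("u3", "b2")]
dc = degree_centrality(edges)
```

In a user–book network, the degree of a book node is simply the number of users who reviewed it.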
3.2. Eigenvector Centrality (EC)
The basic idea behind EC is that the importance of a node is determined by both the number and the importance of its neighbors. The centrality of a node is a function of the centralities of its neighboring nodes; in other words, the more important the neighbors are, the more important the node is. Given the n × n adjacency matrix A = (a_ij) of the graph, $x_i$ denotes the value of the $i$-th entry of the normalized eigenvector belonging to the maximum eigenvalue. EC is defined as follows:
$$EC(i) = x_i = \frac{1}{\lambda} \sum_{j=1}^{n} a_{ij}\, EC(j)$$
where $\lambda$ is the maximum eigenvalue of matrix A, and EC(j) is the centrality score of node j.
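One common way to obtain the leading eigenvector is power iteration. The following is a minimal sketch under stated assumptions: the graph is connected and not bipartite (otherwise plain power iteration can oscillate), the vector is normalized so that its largest entry is 1, and the function name, the iteration count, and the toy adjacency matrix are all illustrative choices.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration on a symmetric 0/1 adjacency matrix (list of lists).

    Returns (lam, x), where lam approximates the largest eigenvalue and x
    the corresponding eigenvector scaled so that max(x) == 1.
    """
    n = len(adj)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        # y = A x: each node collects the scores of its neighbors.
        y = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(y)               # eigenvalue estimate under max-normalization
        x = [v / lam for v in y]   # rescale so the largest entry is 1
    return lam, x

# Toy graph (assumption): a triangle 0-1-2 with a pendant node 3 attached to 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
lam, x = eigenvector_centrality(adj)
```

Node 0 receives the highest score because it has the most neighbors and two of them are themselves well connected.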
3.3. Closeness Centrality (CC)
CC measures the proximity of a node to the other nodes in the network. The closer a node is to the other nodes, the larger its closeness centrality. CC is used to discover nodes that can efficiently spread information through the graph. For each node, the closeness centrality algorithm computes the shortest paths to all other nodes, sums these distances, and takes the reciprocal of the sum as the closeness centrality score of the node:
$$CC(j) = \frac{1}{\sum_{i \neq j} d_{ji}}$$
where $d_{ji}$ represents the shortest distance between node j and node i, that is, the smallest number of links on any path between them. The centrality score of node j is represented by CC(j).
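For an unweighted network, the shortest distances needed by closeness centrality can be obtained with a breadth-first search from each node. The sketch below assumes an adjacency-list representation and sums distances only over nodes reachable from the source; the function name and toy graph are hypothetical.

```python
from collections import deque

def closeness_centrality(adj, source):
    """Reciprocal of the summed BFS distances from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return 1.0 / total if total > 0 else 0.0

# Toy path graph a - b - c (assumption for illustration).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
cc = {v: closeness_centrality(adj, v) for v in adj}
```

Running one search per node costs O(n(n + m)) overall, which is the main reason path-based measures are expensive on large review networks.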
3.4. Betweenness Centrality (BC)
BC describes the importance of a node by the number of shortest paths that pass through it. The more shortest paths pass through a node, the greater its betweenness centrality:
$$BC(j) = \sum_{i \neq j \neq k} \frac{g_{ik}(j)}{g_{ik}}$$
where $g_{ik}(j)$ is the number of shortest paths from node i to node k that pass through node j, and $g_{ik}$ is the total number of shortest paths from node i to node k. The centrality score of node j is given by BC(j).
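The definition can be checked on small graphs with a direct, unoptimized computation. The sketch below is an assumption-laden illustration rather than an efficient algorithm: it counts shortest paths with one breadth-first search per node, treats the graph as undirected, and counts each unordered pair (i, k) once. All names are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Return (dist, sigma): BFS distances and shortest-path counts from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u is a predecessor of v
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness_centrality(adj):
    """Sum over unordered pairs (i, k) of the fraction of shortest paths via j."""
    info = {s: bfs_counts(adj, s) for s in adj}
    bc = {j: 0.0 for j in adj}
    for i, k in combinations(adj, 2):
        dist_i, sigma_i = info[i]
        if k not in dist_i:
            continue                      # i and k are not connected
        for j in adj:
            if j == i or j == k or j not in dist_i:
                continue
            dist_j, sigma_j = info[j]
            # j lies on a shortest i-k path iff the distances add up exactly.
            if k in dist_j and dist_i[j] + dist_j[k] == dist_i[k]:
                bc[j] += sigma_i[j] * sigma_j[k] / sigma_i[k]
    return bc

# Toy star graph (assumption): every leaf-to-leaf path passes through "c".
star = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}
bc = betweenness_centrality(star)
```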
3.5. PageRank (PC)
The basic idea of the PC algorithm is to define a random walk model, that is, a first-order Markov chain, on a directed graph. It describes a random walker who moves from node to node along the directed edges at random. In the limit, the probability of visiting each node converges to the stationary distribution, and the stationary probability of each node is its PC value, which represents the importance of the node. PC is defined recursively and can be calculated using an iterative process:
$$PC_i(t+1) = \sum_{j} a_{ji}\, \frac{PC_j(t)}{k_j^{out}}$$
where $k_j^{out}$ is the out-degree of node j, $a_{ji}$ denotes the link from node j to node i, $PC_j(t)$ is the importance of node j at step t, and PC(j) denotes the centrality score of node j after the iteration has converged.
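The iterative process can be sketched as repeated redistribution of each node's score over its out-links. The version below follows the random-walk description in the text and therefore uses no damping factor, which is a simplification relative to the commonly used PageRank formulation; it also assumes every node has at least one out-link and that the graph is strongly connected and aperiodic so that the iteration converges. The function name and toy graph are hypothetical.

```python
def pagerank(out_links, iterations=200):
    """Undamped PageRank iteration on a dict {node: [successor, ...]}."""
    nodes = list(out_links)
    score = {v: 1.0 / len(nodes) for v in nodes}   # uniform start
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for j, targets in out_links.items():
            share = score[j] / len(targets)        # split score over out-links
            for i in targets:
                nxt[i] += share
        score = nxt
    return score

# Toy directed graph (assumption): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
pc = pagerank({0: [1, 2], 1: [2], 2: [0]})
```

For this toy graph the stationary distribution can be solved by hand as (0.4, 0.2, 0.4), which the iteration approaches.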
4. Proposed Method
This study proposes a book influence evaluation method based on user ratings of the e-commerce platform (URBI) to estimate the influence ranking of each book on the platform. The method aims to uncover book influence information hidden in the platform’s book review datasets. From the review dataset, a user–book heterogeneous graph network is built, and the relationship between user rating information and book influence is evaluated with very low time complexity.
STEP 1: Construct network.
Each review in the book review dataset of the e-commerce platform identifies the reviewing user and the reviewed book. By traversing the review dataset, a user–book heterogeneous graph network can therefore be built. The user–book network is represented by the graph G = (V, E), where V is the set of nodes and E is the set of edges.
STEP 2: Tag nodes.
On an e-commerce platform, the number of user comments far exceeds the number of books. Calculating the influence of all nodes in the user–book heterogeneous graph network would take a long time. We therefore tag book nodes and user nodes separately and calculate the influence of book nodes only.
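Steps 1 and 2 can be sketched in a single pass over the reviews. The example assumes each review is a `(user_id, book_id, rating)` tuple and prefixes node identifiers with their type so that a user ID and a book ID can never collide; this record layout and the helper name are assumptions for illustration, not the format of the actual dataset.

```python
def build_user_book_network(reviews):
    """Build a tagged user-book network from (user_id, book_id, rating) reviews."""
    node_type = {}   # node -> "user" or "book"
    edges = {}       # (user node, book node) -> rating carried by the edge
    for user_id, book_id, rating in reviews:
        user_node = ("user", user_id)
        book_node = ("book", book_id)
        node_type[user_node] = "user"     # STEP 2: tag while constructing
        node_type[book_node] = "book"
        edges[(user_node, book_node)] = rating
    return node_type, edges

# Hypothetical toy review set.
reviews = [("u1", "b1", 5), ("u2", "b1", 3), ("u2", "b2", 4)]
node_type, edges = build_user_book_network(reviews)
book_nodes = [n for n, t in node_type.items() if t == "book"]
```

Only `book_nodes` needs to be scored afterwards, which keeps the later influence calculation proportional to the number of books rather than the number of users.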
STEP 3: Calculate influence of nodes.
A user’s comment on a book on the e-commerce platform reflects both the user’s purchase and the user’s appraisal of the book. However, we do not know how many copies the customer purchased, and many people do not leave a comment after purchasing a book. As a result, the number of reviews only partly reflects the impact of a book. To better quantify the influence of books without knowing the quantities purchased, we use user ratings to estimate the probability that a user will purchase the book again. On Amazon, for example, customers can rate a purchased book from 1 to 5, with higher ratings indicating more favorable reviews. We assign each rating a rating coefficient that reflects the probability that a user who gave that rating will buy the book again. The URBI measure is defined as follows:
$$URBI(j) = \sum_{i \in \Gamma(j)} a_{ij}\, r_{ij}, \quad j \in C$$
where C represents the set of all book nodes, $\Gamma(j)$ represents all neighbor nodes of book node j, $a_{ij}$ represents the connection between node j and node i, $r_{ij}$ represents the rating coefficient of user i on book j, and URBI(j) represents the influence score of node j.
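A minimal sketch of the scoring step follows. It assumes that the influence of a book is the sum of the rating coefficients of its reviews, which is one reading of the definition and should be treated as an assumption; the coefficient values are the ones reported for Exp-1 in Section 5.3.1, and the review tuples and function name are hypothetical.

```python
from collections import defaultdict

# Rating coefficients used in Exp-1 (rating -> coefficient).
RATING_COEFF = {1: 0.0, 2: 0.0, 3: 0.1, 4: 0.1, 5: 0.8}

def urbi_scores(reviews, coeff=RATING_COEFF):
    """Accumulate one coefficient per (user_id, book_id, rating) review.

    Assumption: a book's score is the plain sum of its reviews' coefficients.
    """
    scores = defaultdict(float)
    for _user_id, book_id, rating in reviews:
        scores[book_id] += coeff[rating]   # single pass over the reviews
    return dict(scores)

# Hypothetical toy review set.
reviews = [("u1", "b1", 5), ("u2", "b1", 3), ("u3", "b2", 1), ("u1", "b2", 5)]
scores = urbi_scores(reviews)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Both books have two reviews, so a purely degree-based measure would tie them, whereas the rating-weighted score separates them.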
COMPLEXITY ANALYSIS
Assume the user–book network has n nodes, of which n1 are book nodes and n2 are user nodes, together with m comment edges and k rating levels. The time complexity of the method is divided into three parts:
1. Mark the type of every node (book node or user node). The time complexity is O(n);
2. Count the number of each rating for each book node. The time complexity is O(km);
3. Calculate the influence of each book. The time complexity is O(n1).
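As a result, the overall time complexity of URBI is O(n + n1 + km). Because the network is built from reviews, every node is incident to at least one comment edge, so n1 ≤ m, n2 ≤ m, and n ≤ 2m; in a real user–book network the number of book nodes is in fact far smaller than the number of comments. The number of rating levels k is a small constant (five on Amazon) and is negligible compared with the number of comments. Therefore, the time complexity of URBI is O(m).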
5. Experiments
We compared the proposed URBI approach against DC, EC, CC, BC, and PC in our experiments. The six methods are described in Table 1. To validate the effectiveness of the proposed method, we examined the accuracy of the six methods at Top10, Top50, Top100, Top200, Top300, Top400, Top500, and Top1000.
5.1. DataSet
We used the Amazon Review Data (2018) dataset [35], collated and published by Ni et al., which contains 233.1 million reviews from May 1996 to October 2018. The dataset includes reviews (rating, text, and vote), product metadata (description, category information, price, brand, and image features), and links (also viewed/also bought graphs).
The dataset is divided by product category. For the experiment, we chose the Books category, which has the most reviews and the most items.
5.2. Ground Truth
Amazon publishes a measure called “best seller rank”. Its exact formula is a trade secret, but it converts actual sales over a specific period into an ordinal ranking of products [36]. Amazon provides this “sales ranking” attribute for each book on the site; it reflects the sales of that book relative to the sales of other books on the site. Note that the smaller the sales ranking value, the higher the sales volume of the item in the period. Chevalier and Goolsbee reported the following: according to Amazon, the top 10,000 books are ranked based on the previous 24 h and are updated hourly; the rankings of books in positions 10,000–100,000 are updated every day; and the rankings of books beyond position 100,000 are updated every month [4]. Books that have not been purchased within the last month are not ranked by this procedure. Nevertheless, thousands of books that almost certainly sell fewer than one copy per month still have a rank. Clay et al. [37] claimed that for these rarely purchased books, Amazon’s ranking is based on total sales since the inception of Amazon. So, except for books with very large rank values (that is, books that sell very little), the rankings represent a snapshot of a book’s current sales. In other words, the “sales ranking” is real-time ranking information that shows how well a book has sold compared with other books in a given period.
We used a public dataset from the Kaggle community, Amazon sales rank data for print and kindle books, https://www.kaggle.com/c/asap-aes/data (accessed on 9 June 2022). The authors of the dataset collected, through NovelRank.com, sales rankings of books published on Amazon.com worldwide. The data collection period was from 1 January 2017 to 29 June 2018, with sampling intervals ranging from one hour to 24 h. We take the average “sales ranking” of each book over a given period as the ground truth.
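The ground-truth construction can be illustrated with a short sketch. It assumes the rank observations are available as `(book_id, day, rank)` tuples, which is a hypothetical layout and not the actual schema of the Kaggle dataset, and it orders books by ascending average rank because a smaller rank value means higher sales.

```python
from collections import defaultdict
from datetime import date

def average_sales_rank(records, start, end):
    """Average rank per book over the inclusive window [start, end]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for book_id, day, rank in records:
        if start <= day <= end:
            totals[book_id] += rank
            counts[book_id] += 1
    return {book: totals[book] / counts[book] for book in totals}

def ground_truth_order(avg_rank):
    """Books sorted from best (smallest average rank) to worst."""
    return sorted(avg_rank, key=avg_rank.get)

# Hypothetical observations; the last one falls outside the 7-day window.
records = [
    ("b1", date(2017, 1, 1), 120), ("b1", date(2017, 1, 2), 80),
    ("b2", date(2017, 1, 1), 40), ("b2", date(2017, 1, 9), 9000),
]
avg = average_sales_rank(records, date(2017, 1, 1), date(2017, 1, 7))
order = ground_truth_order(avg)
```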
5.3. Experimental Results and Analyses
5.3.1. Exp-1: Effectiveness (The Average of Sales Rank Every 7 Days Is Taken as the Ground Truth)
From the Amazon sales rank data for print and kindle books dataset, we selected the “sales rank” data for the first 7 days of each of the 12 months of 2017 and took the average value as the ground truth. Because product reviews lag behind product sales, user reviews from the three months following the sales period were selected from the Amazon Review Data (2018) dataset as the experimental dataset. The specific information of the experimental dataset is shown in Table 2.
We measured accuracy by comparing the books ranked Top10, Top50, Top100, Top200, Top300, Top400, Top500, and Top1000 retrieved by each method with the books in the corresponding ground truth.
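One plausible reading of this comparison is the fraction of a method's Top-N books that also appear in the ground-truth Top-N. The sketch below implements that overlap measure; treating accuracy as set overlap, as well as the function name and toy rankings, are assumptions for illustration.

```python
def top_n_accuracy(predicted, truth, n):
    """Share of the first n predicted books that are in the first n true books."""
    hits = len(set(predicted[:n]) & set(truth[:n]))
    return hits / n

# Hypothetical rankings, best book first.
predicted = ["b1", "b2", "b3", "b4"]
truth = ["b2", "b1", "b4", "b3"]
acc = {n: top_n_accuracy(predicted, truth, n) for n in (1, 2, 3, 4)}
```

Because the measure ignores order within the Top-N, two rankings that contain the same N books in different positions score 1.0.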
In the proposed URBI method, we set $r_1 = 0$, $r_2 = 0$, $r_3 = 0.1$, $r_4 = 0.1$, and $r_5 = 0.8$, where $r_s$ is the rating coefficient for a user score of s. That is, we assume that a user who gives a book a score of 5 is very likely to buy the book again and is also more likely to buy multiple copies in one transaction, that users who give a score of 4 or 3 are less likely to buy the book again, and that users who give a score of 2 or 1 are unlikely to buy it again. A total of 12 experiments were conducted. The experimental results are shown in Figures 1–12, where the horizontal axis represents the Top-N books and the vertical axis represents the accuracy with respect to the ground truth.
According to the experimental results shown in Figures 1–12, as N in Top-N gradually increases, the DC, PC, and URBI methods are clearly superior to the other three methods, so we focus on these three. As described in Section 4, we build a user–book heterogeneous graph network from the Amazon review set, and all algorithms run on this network. We mark book nodes and user nodes separately. In the experiment, we focus on the book nodes, while the user nodes are the neighbors of the book nodes and provide them with influence information that differs from algorithm to algorithm.
DC measures the degree of a node in the graph. In the graph network constructed in this paper, the influence of a book node calculated by DC is equivalent to the number of users who have commented on the book. PC evaluates the influence of nodes by iterating over neighbor information; that is, it obtains the influence of a book node by iterating over the information of the users who commented on it. Across the 12 groups of experiments, the accuracy of the DC and PC methods is very close at every Top-N, mainly because the degree of a book node, that is, its number of comments, plays a dominant role in the user–book network.
The URBI method differs from these two methods. DC and PC can indeed represent the influence of a book to some extent through the number of user comments, but because both rely only on that number, they can neither reflect how many copies a user bought nor predict whether a user will buy again. URBI considers not only the number of user reviews of a book but also the ratings, and uses the existing ratings to predict users’ purchase behavior. Across the 12 groups of experiments, when N in Top-N is small, the results of the URBI method are similar to those of the DC and PC methods; as N gradually increases, the accuracy of the URBI method is consistently higher than that of the DC and PC methods.
5.3.2. Exp-2: Effectiveness (The Average of Sales Rank for Each of the Three Months Is Regarded as the Ground Truth)
In Exp-1, we demonstrated the effectiveness of the URBI method in predicting the book influence ranking over a short time horizon, that is, using three months of review data to predict the book influence ranking for seven consecutive days. In the second experiment, we use six months of review data to predict the book influence ranking for three consecutive months, to show that the URBI method is also suitable for a long time horizon. We divide the 2017 “sales rank” data of the Amazon sales rank data for print and kindle books dataset into four three-month parts and take the average value of each part as the ground truth. Meanwhile, user reviews from the six months following the sales period are selected from the Amazon Review Data (2018) dataset as the experimental dataset; its specific information is shown in Table 3.
Exp-1 shows that the EC, CC, and BC methods perform poorly, so only the DC, PC, and URBI methods are compared in this experiment. As before, we measure accuracy by comparing the books ranked Top10, Top50, Top100, Top200, Top300, Top400, Top500, and Top1000 retrieved by each method with the books in the corresponding ground truth. The user rating coefficients in the URBI method are the same as in Exp-1. The experimental results are shown in Figures 13–16.
As shown in Figures 13–16, we used six months of review data to predict the book influence ranking for three consecutive months. Compared with Exp-1, Exp-2 had more book nodes, and the number of user nodes and edges nearly doubled. In this setting, the behavior of the URBI method is consistent with Exp-1: when N in Top-N is small, the results of the URBI method are similar to those of the DC and PC methods, and as N gradually increases, the accuracy of the URBI method is consistently higher than that of the DC and PC methods.
6. Conclusions
In this paper, we proposed a book influence evaluation method based on user ratings of an e-commerce platform (URBI). To verify its effectiveness, we designed two experiments with different time spans and compared the method with five other node influence evaluation methods. The experimental results show the effectiveness of the method.
In short, the URBI approach evaluates the influence of each book from users’ ratings of the book with very low time complexity. Experiments on real-world Amazon book review data show that URBI outperforms the other five methods. We conclude that the number of comments on a book indicates its influence to some extent, but that the influence is better captured when user ratings are taken into account as well.