1. Introduction
With the rapid increase in online scientific literature, publication ranking has become increasingly important for scholarly information retrieval and recommendation, aiming to enhance the efficiency of research. Generally speaking, more influential articles should be ranked higher. Citation analysis combined with graph mining is a commonly used bibliometric method for performing this task. In contrast with the traditional method of simple citation counting [1], PageRank [2] is a global ranking algorithm based on the intuition that links from important pages are more significant than links from other pages. It is an effective means of ranking web pages and scientific publications to address user information needs. The relative “importance” of a node is calculated from back-links that transfer authority from other nodes, thereby determining which nodes are more important in the network.
However, the original PageRank algorithm is based solely on links, independent of any particular search query, and disregards other metadata in the content, e.g., title, keywords, abstract, and link location. Because of this limitation, some previous studies have tried to improve PageRank by employing topic-sensitive methods [3], which take the topic into account when determining the rank. Such methods can then be widely used for discovering important papers [4].
As one contribution of this paper, we used an optimized topic-sensitive PageRank algorithm, PageRank with Priors (PRP), to determine the topical ranking scores of scientific publications and authors. This method was proposed in our previous work [5], which combined PageRank with a supervised topic modeling algorithm, Labeled-LDA [6], via full-text extraction, using two priors: a publication topic probability and a citation transition probability; publication and author ranking results were then produced by the PRP algorithm.
On the other hand, the idea behind AuthorRank [7] is that content created by more popular authors should rank higher than content created by less popular authors. It was proposed in Google’s Agent Rank patent in 2005, but has not been implemented for ranking web pages, because it is difficult and expensive to assess the topical reputation or importance of the many creators of Web pages. It is, however, much easier and more economical in the scientific literature domain. For scientific literature, the creator, as an author, is a common and important metadata item, easy to extract from a digital library, and widely studied in previous research [8]. So, the other contribution of this paper is to generate a topical AuthorRank score, based on full-text extraction and the Labeled-LDA model, according to topic-sensitive author ranking results obtained by four methods: First Author, Last Author, Max Author (the most famous author), and Average Author.
Finally, to test whether topical AuthorRank can replace PageRank, or how it can make publication ranking results more accurate [9], we compare publication ranking results obtained by topical AuthorRank combined with PRP. To confirm that AuthorRank positively influences publication ranking, two combination methods are validated: the linear method and the Cobb–Douglas method.
In the remainder of this paper, we: (1) review relevant literature on the methodology of publication ranking and AuthorRank; (2) introduce our novel methods for publication and AuthorRank, as well as how to achieve publication ranking with AuthorRank; (3) describe the experimental setting and evaluation results; and (4) discuss the findings and limitations of the study and identify subsequent research steps.
3. Methodology
3.1. Topic Modeling of LLDA
Labeled Latent Dirichlet Allocation (Labeled-LDA or LLDA) was proposed for training the labeled topic model. It employs a generative probabilistic model in the hierarchical Bayesian framework and assumes that topic labels are available and that each topic is characterized by a multinomial distribution over all vocabulary words.
In Labeled-LDA, W = {w_1, w_2, …, w_n} is the set of words chosen from a document in the training text, where n is the number of words. D = {d_1, d_2, …, d_d} is the set of documents in this article, where d is the number of documents. KEY is a set of labels; we used keywords as labels from the training text, so KEY = {key_1, key_2, …, key_m}, where m is the number of labels. Words are chosen in proportion to a label’s preference for the word, P(w|key), and the publication’s preference for the associated label, P(key|d).
The two matrices can be constructed as P(w|key) and P(key|d), representing the co-occurrence probability of label (key) and word (w), and the co-occurrence probability of paper (d) and label (key), respectively.
Stemming from the LDA method, Labeled-LDA is a supervised topic modeling algorithm, which employs existing topics from scientific metadata. Therefore, in this paper, topic labels were assigned based on author-assigned keywords, and each scientific publication was treated as a mixture of its author-assigned topics (keywords). As a result, both topic labels and topic numbers (the total number of keywords in the metadata repository) are given.
We can therefore infer a possible topic distribution for each paper by LLDA.
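For intuition, the two LLDA matrices can be approximated by simple label-word counting. The sketch below is only a counting-based stand-in for the Gibbs-sampled LLDA inference, not the model itself; the toy documents and keyword labels are invented for illustration:

```python
from collections import Counter, defaultdict

def label_word_matrices(docs):
    """Counting-based estimates of the two LLDA matrices:
    P(w|key) from label-word co-occurrence, and a uniform preference
    over a paper's own labels as a stand-in for P(key|d)."""
    label_word = defaultdict(Counter)   # key -> word counts
    doc_label = []                      # per-document label weights
    for words, labels in docs:
        for key in labels:
            label_word[key].update(words)
        doc_label.append({key: 1.0 / len(labels) for key in labels})
    p_w_given_key = {}
    for key, counts in label_word.items():
        total = sum(counts.values())
        p_w_given_key[key] = {w: c / total for w, c in counts.items()}
    return p_w_given_key, doc_label

# Toy corpus: (tokenized content, author-assigned keyword labels)
docs = [(["fpga", "speed", "density"], ["FPGA"]),
        (["replay", "memory", "race"], ["multi core processor"])]
phi, theta = label_word_matrices(docs)
```

Each row of `phi` sums to 1, mirroring the multinomial distribution of a topic over vocabulary words.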
Figure 1 shows an example of two topic distributions; for clarity, we randomly sampled 50 topics. The horizontal axis shows the possible topics (author-assigned keywords), and the vertical axis shows the topic probability. The content of a paper includes its title, abstract, and full text or citation context. The topic probability distribution varies for different content, while the sum of all topic probabilities for each paper is equal to 1.
The blue line is a conference paper, “Using architectural ‘families’ to increase FPGA speed and density”, which is about narrowing the speed and density gap between FPGAs and MPGAs. So, the highest topic in this paper is “FPGA”. The orange line is a paper named “Rerun: Exploiting Episodes for Lightweight Memory Race Recording”, published in “the 35th Annual International Symposium on Computer Architecture”, which mainly focuses on the field of Multiprocessor deterministic replay. So, the highest topic in this paper is “multi core processor”.
In this method, the labels are keywords from the sources. These keywords may be author-provided or derived by greedy matching. Greedy matching means loading all possible keywords into memory, and then searching each keyword from the paper title and abstract by using fast matching, with the purpose of expanding the paper topic space.
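Greedy matching, as described above, can be sketched as a case-insensitive substring scan of the title and abstract against the in-memory keyword list (a real system might use a trie or Aho–Corasick automaton for speed; the keywords and abstract below are invented):

```python
def greedy_match(keywords, text):
    """Expand a paper's topic space: return every known keyword that
    occurs as a substring of the title/abstract (case-insensitive)."""
    lowered = text.lower()
    return [k for k in keywords if k.lower() in lowered]

keywords = ["information retrieval", "FPGA", "topic model"]
abstract = "We study topic models for ad hoc information retrieval."
matched = greedy_match(keywords, abstract)
```

Note that substring matching also catches plural forms ("topic models" matches "topic model"), which widens the topic space at the cost of occasional false positives.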
3.2. PageRank with Priors
There are many different ranking methods (e.g., citation counts, publication counts, h-index, and PageRank) in the field of scientific literature. In this paper, we employ a topic-dependent ranking method based on the combination of Labeled-LDA and PageRank with Priors (PRP), an optimized PageRank algorithm for full-text extraction [5].
In bibliometrics, most previous studies on citation network analysis are based on the simple assumption that p_i and p_j are connected whenever p_i cites p_j. In this paper, we introduce two kinds of prior knowledge into the citation graph via LLDA: a publication topic probability and a citation transition probability.
In this paper, we first create an academic publication network, in which each vertex is an academic publication, p, with V denoting the set of all publications, and each edge is a citation relationship between papers, e, with E denoting the set of all citations used in the network.
For each vertex, p ∈ V, the publication topic prior vector is prior(p) = {P(key_1|p), …, P(key_m|p)}, where P(key_z|p) is the co-occurrence probability of paper p and label key_z. The prior probability of vertex p for topic key_z is trained from publication metadata (title, abstract, and full text), normalized so that the priors for each topic sum to 1 over all vertices.
Each edge, e = (p_i, p_j) ∈ E, represents a citation connecting p_i and p_j (p_i cites p_j). The citation topic transition probability vector for each edge is trans(p_i → p_j) = {trans_1(p_i → p_j), …, trans_m(p_i → p_j)}, where trans_z(p_i → p_j) is the probability of transitioning from vertex p_i to p_j for topic key_z.
For each topic key_z, the scores of vertices and edges are calculated by Labeled-LDA from author-assigned keywords plus greedy matching results. Hence, the graphs for different topics may be different. If topic key_z does not belong to paper p, the publication topic prior is 0. If neither the citing nor the cited paper includes topic key_z, we assign a small score, ε, where 0 < ε ≪ 1, because we did not want to remove this citation from the academic publication network entirely.
The PageRank with Priors algorithm takes these two kinds of priors, prior_z(p) and trans_z(p_i → p_j), into account to calculate the relative importance of vertices in the citation graph. A citation of a publication counts as a vote of support. For example, if a publication cites only 3 papers and, for a specific topic, the transition probabilities to these 3 papers are 0.1, 0.1, and 0.8, then most of the paper’s credit on this topic (its topic authority) goes to the third paper. The vertex’s topic-relative importance can therefore be calculated as:

R_z(p) = (1 − β) Σ_{q : q cites p} trans_z(q → p) · R_z(q) + β · prior_z(p)

where β is the back probability of jumping to the prior distribution.
The output, for each vertex (publication), p, is an authority vector R(p) = {R_1(p), …, R_m(p)}. Each authority score in the vector indicates the publication’s topical importance with respect to both the paper topic prior and the full-text citation prior. As a result, we obtain m topical ranking lists, one per topic.
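The PRP computation above can be sketched as a power iteration. The code below is a minimal illustration under stated assumptions: node names, the toy prior and transition values, and β = 0.3 are invented, and dangling-node mass is ignored for brevity, so ranks need not sum to 1:

```python
from collections import defaultdict

def pagerank_with_priors(nodes, edges, prior, trans, beta=0.3, iters=50):
    """Power-iteration sketch of PageRank with Priors: with probability
    beta jump back to the topic prior; otherwise follow an out-edge in
    proportion to its topic transition probability.
    nodes: iterable of ids; edges: [(src, dst)];
    prior: {node: prob} summing to 1; trans: {(src, dst): prob}."""
    rank = {v: prior.get(v, 0.0) for v in nodes}
    out = defaultdict(float)            # total out-weight per source
    for (s, _), p in trans.items():
        out[s] += p
    for _ in range(iters):
        new = {v: beta * prior.get(v, 0.0) for v in nodes}
        for s, d in edges:
            if out[s] > 0:
                new[d] += (1 - beta) * rank[s] * trans[(s, d)] / out[s]
        rank = new
    return rank

nodes = ["p1", "p2", "p3"]
edges = [("p1", "p2"), ("p1", "p3")]
prior = {"p1": 1.0}                      # only p1 matches the topic
trans = {("p1", "p2"): 0.2, ("p1", "p3"): 0.8}
rank = pagerank_with_priors(nodes, edges, prior, trans)
```

As in the 3-citation example above, most of p1's topical credit flows to the citation with the larger transition probability (p3).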
3.3. Author Ranking by PRP
As described before, we used PRP for publication ranking by combining Labeled-LDA-based topic information and full-text publication information. We also applied this method to author ranking, based on the assumption that if author1’s paper cites author2’s paper, then author1 and author2 are somehow related. The relation can be characterized by a directed graph with authors as vertices and citations as edges. In the author graph, G_A = (V_A, E_A), V_A is the set of vertices representing all the authors, and E_A is the set of edges representing author relationships generated from the citation network.
In most cases, one author has multiple publications, so the vertex for a given author, a ∈ V_A, is a set of papers, i.e., a = {p_1, p_2, …}. Similarly, there may be more than one citation between two vertices; an edge, e = (a_i, a_j) ∈ E_A, exists only when p_x ∈ a_i, p_y ∈ a_j, and p_x cites p_y.
Therefore, the publication topic prior for an author vertex can be formulated as:

prior_z(a) = Σ_{p ∈ a} prior_z(p)

where the resulting priors are normalized over all authors, and the transition probability score for an edge can be calculated as:

trans_z(a_i → a_j) = Σ_{p_x ∈ a_i, p_y ∈ a_j, p_x cites p_y} trans_z(p_x → p_y)

where trans_z(p_x → p_y) is inferred from the citation context from p_x to p_y.
So, the topic-relative importance of an author (vertex) can be calculated by the same PRP formula, with authors in place of publications. For output, we obtain a ranking list of authors for each specific topic.
Author topical ranking by PRP may be more accurate than paper ranking, as each vertex is represented by a list of papers from an author, and more textual information results in more accurate topic inference.
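Collapsing the paper citation graph into the author graph described above can be sketched as follows; the paper and author identifiers are invented, and parallel citations between two authors simply accumulate on the same edge:

```python
from collections import defaultdict

def build_author_graph(paper_authors, citations):
    """Collapse a paper citation graph into an author graph: an edge
    (a_i, a_j) exists whenever some paper of a_i cites some paper of
    a_j. paper_authors: {paper_id: [author_ids]};
    citations: [(citing_paper, cited_paper)]."""
    vertices = {a for authors in paper_authors.values() for a in authors}
    edges = defaultdict(int)            # (a_i, a_j) -> citation count
    for citing, cited in citations:
        for ai in paper_authors[citing]:
            for aj in paper_authors[cited]:
                edges[(ai, aj)] += 1
    return vertices, dict(edges)

paper_authors = {"p1": ["a1", "a2"], "p2": ["a2", "a3"]}
citations = [("p1", "p2")]
vertices, edges = build_author_graph(paper_authors, citations)
```

Note that self-loops (an author citing their own work, e.g. a2 → a2 here) arise naturally; whether to keep them is a design choice the sketch leaves open.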
3.4. AuthorRank by Author Ranking
We introduced the concept of AuthorRank and its underlying assumptions in the previous section. Generally speaking, author ranking may influence publication ranking. Thus, a famous or important author on a specific topic will have his or her publications ranked higher, which will thereby have a greater chance to be recommended for the topic. To validate this hypothesis, in this section we propose to use AuthorRank with a topical author ranking score.
Since AuthorRank is a query-independent criterion similar to PageRank, it can be calculated offline, with content authority measured by the authority accumulated from links, regardless of the query. The PRP-based author ranking method was proposed above; it effectively allows an author to have a different rank for each topic. So, in this section, we propose a method for paper ranking (AuthorRank) via author ranking, which is also an offline and topic-based method.
The following formula defines the relationship between a paper and its authors:

p = {a_1, a_2, …, a_k}

where p is the paper, always created by at least one author, a_i. How the authors affect the publication ranking score, AR_z(p), can be calculated in at least four different ways: First Author, Last Author, Max Author, and Average Author.
First Author: The ranking score of a publication (AuthorRank) depends only on the first author’s ranking. In other words, the publication’s topical importance, AR_z(p), is based on the score of only the first author: AR_z(p) = R_z(a_1).
Last Author: The AuthorRank score of a publication, AR_z(p) = R_z(a_k), depends only on the last author’s ranking, where a_k is the last author if p is co-authored by multiple authors; otherwise, the score is decided by the unique author.
Max Author: The AuthorRank score of a publication is decided by the highest score (the most popular author) among all the authors: AR_z(p) = max_i R_z(a_i).
Average Author: The AuthorRank score of a publication is determined by the average score of all the publication’s authors: AR_z(p) = (1/k) Σ_i R_z(a_i).
Example: To understand these methods better, we provide a simple example. Assume that there are two papers, p_1 and p_2, where p_1 is authored by a_1 and a_2, as p_1 = {a_1, a_2}; and p_2 is authored by a_2, a_3, and a_4, as p_2 = {a_2, a_3, a_4}. The importance scores of these four authors are shown in Table 1.
From the four methods mentioned above, we obtain the two papers’ ranking scores as shown in Table 2.
In this example, the four methods yield different AuthorRank scores for the same papers. We verify the effectiveness of publication ranking by AuthorRank in the following sections.
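The four strategies can be sketched directly; the author scores below are hypothetical stand-ins for the topical PRP scores of Table 1, and the two paper author lists mirror the example above:

```python
def author_rank(authors, score, method):
    """AuthorRank of a paper from its ordered author list, under the
    four strategies described above."""
    s = [score[a] for a in authors]
    return {"first": s[0],
            "last": s[-1],              # the unique author when len == 1
            "max": max(s),
            "average": sum(s) / len(s)}[method]

score = {"a1": 0.4, "a2": 0.1, "a3": 0.3, "a4": 0.2}   # hypothetical
p1, p2 = ["a1", "a2"], ["a2", "a3", "a4"]
```

For instance, under these scores p1 wins under First Author (0.4) while p2 wins under Max Author (0.3 vs. p1's first-author-independent max of 0.4 — so the winner depends entirely on the strategy chosen).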
3.5. AuthorRank Combined with PRP
We have introduced two methods for publication ranking, PRP and AuthorRank. The former, PRP, depends on links in the graph to calculate the authority of each node. The latter, AuthorRank, is decided by author ranking. Ideally, we would recommend publications that have both a high AuthorRank and a high PageRank, meaning that they are really important for the topic. In contrast, papers with a low AuthorRank and a low PageRank have little importance for the topic.
In this section, we use two methods to combine the results of PRP and AuthorRank, to verify whether publication ranking performance can be improved. In both methods, AuthorRank is used to enhance the publication prior probability via publication prior probability smoothing.
Linear Combination:

S_z(p) = λ · R_z(p) + (1 − λ) · AR_z(p)

where λ is a parameter between 0 and 1, which controls the weights of R_z(p) and AR_z(p). R_z(p) is the relative importance score for p on topic key_z calculated by the PRP algorithm, whereas AR_z(p) is the importance ranking score for p on key_z generated by the AuthorRank algorithm.
The linear form assumes that the total score is a linear combination of the two scores, with each score’s contribution controlled by the parameter λ.
Cobb–Douglas Form:

S_z(p) = R_z(p)^λ · AR_z(p)^(1 − λ)

This form assumes that the total score is the product of the two weighted scores, rendering it insensitive to small scores but sensitive to large scores.
As we already know, one limitation of the topic-based PageRank algorithm is that the importance score of a paper, R_z(p), is 0 if the paper is not related to the specific topic (key_z ∉ p). This may limit the performance of the Cobb–Douglas form, whose combined score degenerates whenever R_z(p) is 0. One smoothing method improves the score of R_z(p) as follows:

R′_z(p) = R_z(p) + μ · AR_z(p)

where R′_z(p) is the score after smoothing; since AR_z(p) is always larger than 0, the smoothed paper topic score is also always positive. The parameter μ controls the amount of smoothing; in this research, we used a tentative value for μ.
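The two combination methods can be sketched as follows. The additive smoothing step and the value μ = 0.1 are assumptions for illustration (the paper's tentative μ is not reproduced here), and the input scores are invented:

```python
def linear_combine(prp, ar, lam):
    """Linear form: S = lam * PRP + (1 - lam) * AuthorRank."""
    return lam * prp + (1 - lam) * ar

def cobb_douglas_combine(prp, ar, lam, mu=0.1):
    """Cobb-Douglas form S = PRP'^lam * AR^(1-lam), applied after
    smoothing the PRP score so that a topic-unrelated paper
    (PRP = 0) does not zero out the product.
    The smoothing PRP' = PRP + mu * AR and mu = 0.1 are assumed."""
    smoothed = prp + mu * ar
    return (smoothed ** lam) * (ar ** (1 - lam))
```

With λ = 0.9, the linear form gives PRP nine times the weight of AuthorRank, matching the best-performing setting reported later in the evaluation.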
3.6. Evaluation Methods
In order to verify whether AuthorRank can replace PRP, that is, whether AuthorRank can improve on PRP’s performance, and to determine which methods yield the best performance, we compared the results with several baseline approaches. The original, topic-insensitive PageRank algorithm was the first baseline; the other was based on PageRank with Priors (PRP). Two indicators were used in this paper to measure algorithm performance: mean average precision (MAP) and normalized discounted cumulative gain (nDCG) [30].
We recognize that it is difficult to obtain a “ground truth” for this experimental dataset, so we used review or survey papers to identify the most important publications for a specific scientific keyword. To achieve this goal, we collected a list of review or survey papers along with their cited papers, and screened the collected reviews so that each focused on only one topic (keyword). We assumed that if a publication was cited by a review paper, and if this review paper concentrated on a keyword, key_z, then this publication was important for key_z. Since the degree of importance of cited papers may differ, we used the number of citations (by a review paper) to characterize importance. Thus, if a review paper for keyword key_z cited p_1 twice and p_2 once, then importance(p_1|key_z) = 2 and importance(p_2|key_z) = 1. If a paper was not cited by the target review paper, then its importance for the target topic was 0. If a paper was cited more than four times by the review paper, we capped its importance at 4.
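The two evaluation indicators can be sketched in their textbook forms; the paper's exact cutoff handling may differ slightly, and the ranked lists and gains below (graded review-citation counts, capped at 4) are invented:

```python
import math

def average_precision(ranked, relevant):
    """AP for one topic: precision accumulated at each relevant hit,
    averaged over the number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, gain, k):
    """nDCG@k with graded gains (here: capped review-citation counts)."""
    dcg = sum(gain.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], 1))
    ideal = sorted(gain.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, 1))
    return dcg / idcg if idcg > 0 else 0.0
```

MAP then averages `average_precision` over all evaluated topics, while nDCG@n rewards placing the most heavily review-cited papers near the top of each topical ranking.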
4. Experiments
4.1. Experimental Data
The experimental data for this paper were derived mainly from the ACM digital library. We used 41,370 publications and 223,810 citations, where full text and citations were extracted from XML files. These publications involve 63,323 authors. Among them, 49,101 authors (77.54% of all authors) have only one publication. The selected dataset is a sub-graph in the database, which is reasonable for graph mining.
In this graph, we extracted 28,013 publications’ text, including titles, abstracts, and full text. For the other 9879 publications, whose full texts were not available in our database, we used the title and abstract from a metadata repository to represent the content of the paper. For the remaining 3479 publications, only the title was available.
We then wrote a list of regular expression rules to extract all possible citations from each paper’s full text. A text window surrounding the target citation, (−n words, +n words), was used to infer the citation topic distribution via LLDA. Intuitively, n should be small, as nearby words provide more accurate citation information; however, n should not be too small, to minimize randomness. In this experiment, we used an arbitrary parameter setting of n = 150. Of a total of 223,810 references, we successfully identified 94,051 references. Table 3 shows possible citation formats in publications.
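The extraction step can be sketched as follows. The two patterns below (numeric brackets and author-year parentheses) are merely illustrative; the paper's actual rule list in Table 3 is not reproduced here, and the sample text is invented:

```python
import re

# Illustrative citation formats: "[12]" / "[3, 4]" and "(Page, 1998)".
CITATION_RE = re.compile(
    r"\[\d+(?:,\s*\d+)*\]"
    r"|\([A-Z][A-Za-z]+(?: et al\.)?,\s*\d{4}\)")

def citation_contexts(full_text, n=150):
    """Return the (-n words, +n words) window around each citation
    match, for inferring the citation topic distribution via LLDA."""
    contexts = []
    for m in CITATION_RE.finditer(full_text):
        before = full_text[:m.start()].split()[-n:]
        after = full_text[m.end():].split()[:n]
        contexts.append(" ".join(before + [m.group()] + after))
    return contexts

text = "Ranking builds on PageRank [2] and citation analysis (Garfield, 1972) here."
ctx = citation_contexts(text, n=3)
```

Real scholarly full text mixes many more formats (superscripts, abbreviated journal styles), which is why only 94,051 of the 223,810 references were successfully identified.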
For training the Labeled-LDA topic model, we first sampled 10,000 publications (with full text) and used author-provided keywords as topic labels. For instance, this paper has six author-provided keywords. Thus, our LLDA training would have assumed that this paper is a multinomial distribution over these six topics.
For the sampled publications, we first used tokenization to extract words from the title, abstract, and publication full text, and then employed Snowball stemming to extract the root of each target word. If a keyword appeared fewer than 10 times in the selected publications, we removed it from the training topic space. After that, we trained an LLDA model with 3910 topics (keywords) over a vocabulary of 46,010 single words (bag of words). These topics were used to infer the publication and citation topic distributions.
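The topic-space filtering step above can be sketched with a simple frequency count; the sample keyword lists are invented, and stemming is omitted here for brevity:

```python
from collections import Counter

def prune_topic_space(paper_keywords, min_count=10):
    """Drop keywords appearing fewer than min_count times across the
    sampled publications, mirroring the filtering step described above.
    paper_keywords: list of per-paper keyword lists."""
    freq = Counter(k for kws in paper_keywords for k in kws)
    kept = {k for k, c in freq.items() if c >= min_count}
    return kept, [[k for k in kws if k in kept] for kws in paper_keywords]

sample = [["lda"]] * 10 + [["lda", "rare keyword"]]
kept, pruned = prune_topic_space(sample)
```

Filtering rare keywords keeps the label space (3910 topics in the experiment) small enough for stable LLDA training.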
4.2. Experimental Result
As proposed in the Methodology section, PRP combines the Labeled-LDA topic model with full-text citation analysis to measure the relative importance of vertices (papers) in the publication network. The vertices (41,370 publications) and edges (223,810 citations) carry topic distributions over 3910 topics. For each topic, the resulting graph may be totally different from the graphs for other topics.
In Figure 2, the first graph is the complete graph with 41,370 vertices and 223,810 edges, while the second is a sub-graph with 580 vertices and 1148 edges based on the topic “Information Retrieval” (i.e., the publications used “Information Retrieval” as a keyword, or “Information Retrieval” was found in the paper abstract by greedy matching). The last graph, with 3356 vertices and 6671 edges, is an extended version of the “Information Retrieval” graph, in which each node is directly or indirectly (cited by a directly relevant paper) related to “Information Retrieval”. The biggest node in these two graphs is “R_291008: A language modeling approach to information retrieval”, meaning it has the highest citation count in our dataset.
In order to compare the different methods for publication ranking results, we took the topic “Information Retrieval” as our example and listed all the results by seven methods, as shown in
Table 4 (top five results shown). PageRank is the original PageRank algorithm, where the damping factor is 0.85. PRP_1 is the PageRank with priors algorithm with only one prior: publication topic probability. PRP_1 neglects the other prior, the citation transition probability, thereby treating the scores of all the edges as equal. PRP_2 brings two kinds of prior knowledge into the citation graph by LLDA: publication topic probability and citation transition probability. First Author, Last Author, Max Author, and Average Author were proposed before as methods for calculating an AuthorRank.
In this table, we found that the ranking lists differ substantially across methods. For example, the paper “R_321035: On Relevance, Probabilistic Indexing and Information Retrieval”, authored by M.E. Maron and published in the “Journal of the ACM” in 1960, was ranked high according to both PageRank and AuthorRank, but there also exist papers with high AuthorRank but low PageRank or, conversely, low AuthorRank and high PageRank. The former may indicate a new publication, whose importance has not yet been widely recognized. The latter might be an important paper by a young or relatively unknown author.
4.3. Experimental Evaluation
- (1)
Whether AuthorRank can replace PageRank for publication ranking
Based on our previous work [5], the PRP algorithm is better than other classical methods for publication ranking. In this paper, we tried to improve publication ranking by using AuthorRank to inform a topic-based PageRank algorithm. AuthorRank scores were computed by author ranking via four different methods: First Author, Last Author, Max Author, and Average Author. For evaluation, 104 topics associated with review or survey papers were selected. For each topic, we calculated each publication’s ranking score in the dataset by seven methods: PageRank, PRP_1, PRP_2, First Author, Last Author, Max Author, and Average Author.
In
Table 5 and
Table 6, we found that the best result was generated by PRP_2 for both MAP and nDCG, which verified that AuthorRank cannot replace the PageRank algorithm for publication ranking. We also found that topic-based PageRank (PRP_1) can significantly improve on the original PageRank algorithm, and that PRP_2 (which included the citation topic distribution) outperformed PRP_1 (which used the publication topic distribution only), especially for nDCG. Among the AuthorRank results, we found the Average Author method to be the best: the average contribution from all authors of an article is a more accurate guide to a paper’s AuthorRank than the other methods. However, the evaluation shows that AuthorRank cannot simply replace PageRank.
- (2)
Whether AuthorRank can improve publication ranking results
To test whether the AuthorRank results can improve publication ranking results (as calculated by topic-based PageRank), we compared the evaluation results of the combination methods with those of PRP_2 (the best-performing method without AuthorRank).
Linear combination:
For each AuthorRank method (First Author, Last Author, Max Author, or Average Author), the parameter λ, which controls the relative contributions of PRP and AuthorRank to a publication’s topical importance score, was varied from 0 to 1 in steps of 0.1.
Table 7 and
Table 8 display results by the linear combination method for AuthorRank in informing publication ranking. The best result in the training results for each method is shown in the tables.
As shown in the above tables, when λ = 0.9 in each method, AuthorRank combined with PageRank achieved the best result, meaning that AuthorRank can improve ranking results but cannot replace the traditional link-analysis ranking algorithm (PageRank with Priors). For MAP@n, the First Author method was better than the others when n < 30, but for n > 50, the Average Author method was better than all of the others.
nDCG@n is a more important indicator in this research, as it reflects the degree of (publication topic) importance. If the nDCG score is large, the target algorithm places the most important publications at the top of the ranking list. In the tables, it is clear that the First Author method is better than the others when n < 30, but for n > 1000, the Average Author method is the best.
We also used significance testing to compare each method with the baseline PRP_2; the differences were significant at p < 0.001.
Cobb–Douglas Form:
This combination method is insensitive to small scores and sensitive to large scores. The parameter λ was varied from 0 to 1 in steps of 0.1. The best results for each method are shown in the following tables.
For MAP@n in Table 9, the Max Author method was better than the others when n ≤ 10, and Average Author was the best when 30 < n < 50 and n > 1000, while for 100 < n < 1000, the First Author method was better than the others.
It is clear in Table 10 that Average Author is consistently better than all the other methods, especially on the nDCG indicator.
5. Conclusions
In this paper, we aimed to test whether and how topical AuthorRank can replace or enhance classical PageRank for publication ranking. From our experimental results, we conclude the following:
(1) AuthorRank cannot replace PageRank for publication ranking. This conclusion is supported by the results from
Table 5 and
Table 6. We also found that PRP with two priors, publication topic probability and citation transition probability, significantly outperforms the original topic-insensitive PageRank algorithm, and is better than PRP with only the publication topic probability.
(2) AuthorRank can improve publication ranking results, which also suggests that articles written by influential authors often deserve a higher ranking in information retrieval. When we combined the results of PRP and AuthorRank by the linear combination method and the Cobb–Douglas combination method, the results as measured by MAP and nDCG were better than those of PRP without AuthorRank.
(3) By comparing the linear combination method with the Cobb–Douglas combination method, we found that calculating AuthorRank results by Average Author is the best method for improving publication ranking. This conclusion is supported by
Table 7,
Table 8,
Table 9 and
Table 10. Although the Cobb–Douglas combination method for AuthorRank and PRP is better than the linear combination method, this advantage is not significant. We also found that AuthorRank is effective for assessing the importance of publications where content or citation metadata is missing or partially missing. When we do not have publication content information, we cannot use topic modeling to infer the topic distribution, but AuthorRank can still help us to estimate the prior probability of these papers.