4.3.1. Validity Verification of Word Frequency, Position, Co-Occurrence Frequency, and Similarity
In extracting keywords based on the TextRank algorithm, this paper introduces word frequency and word position in the document to distinguish the importance of words, and distributes a word's score non-uniformly to its co-occurring words based on their co-occurrence frequency and similarity. This section verifies the validity of these four features using the AF1 metric.
- (1) Validity verification of word frequency
TF-TextRank: Based on the TextRank algorithm, word frequency is introduced to weight the words in the word graph, as shown in Equation (18), where tf(Vi) is defined in Equation (2).
The AF1 comparison of TextRank and TF-TextRank in extracting keywords from the test document set is shown in Figure 2. TF-TextRank is 0.47% higher than TextRank on average in terms of AF1, indicating that word frequency has a certain influence on measuring word importance.
As for why TF-TextRank is not significantly better than TextRank, we believe the main reason is that words with a higher frequency in a document are more likely to co-occur with other words and therefore have more edges connected to them in the word graph. In other words, the influence of word frequency on word importance is already partly reflected in the TextRank algorithm itself, so weighting words by frequency a second time yields only a modest gain.
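To make this concrete, the following is a minimal Python sketch of a frequency-weighted TextRank iteration. Equation (18) is not reproduced in this section, so the update rule, S(Vi) = (1 - d) * tf(Vi) + d * sum over neighbors Vj of S(Vj)/|Adj(Vj)|, and the normalized term frequency standing in for Equation (2) are assumptions.

```python
from collections import Counter, defaultdict

def tf_textrank(tokens, window=3, d=0.85, iters=50):
    """Sketch of TF-TextRank: TextRank with each node weighted by its
    normalized term frequency (assumed form of Eq. (18))."""
    counts = Counter(tokens)
    tf = {w: c / len(tokens) for w, c in counts.items()}  # assumed Eq. (2)

    # undirected co-occurrence graph over a sliding window
    nbrs = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                nbrs[w].add(v)
                nbrs[v].add(w)

    score = {w: 1.0 for w in counts}
    for _ in range(iters):
        score = {w: (1 - d) * tf[w]
                 + d * sum(score[v] / len(nbrs[v]) for v in nbrs[w])
                 for w in counts}
    return sorted(score, key=score.get, reverse=True)
```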
- (2) Validity verification of word position
Pos-TextRank: Based on the TextRank algorithm, word position is introduced to weight the words in the word graph, as shown in Equation (19), where pos(Vi) is defined in Equation (3) and β is set to 0.4, as determined in Section 4.3.2.
The AF1 comparison of TextRank and Pos-TextRank in extracting keywords from the test document set is shown in Figure 3. Pos-TextRank is 8.30% higher than TextRank on average in terms of AF1, indicating that word position has a strong influence on measuring word importance and can significantly improve the performance of TextRank in extracting keywords.
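As described in Section 4.3.2, Equation (3) assigns position importance 1 to abstract words that also appear in the title and β to all other abstract words; a one-line sketch of that rule:

```python
def pos_weight(word, title_words, beta=0.4):
    """Positional importance pos(Vi) per the description of Eq. (3):
    1 for abstract words that also appear in the title, beta otherwise."""
    return 1.0 if word in title_words else beta
```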
- (3) Validity verification of word co-occurrence frequency
Co-occur-TextRank: Based on the TextRank algorithm, the co-occurrence frequency between words is used as the probability of transitioning from one word to another, as shown in Equation (20), where Co(Vj, Vi) is defined in Equation (4).
The AF1 comparison of TextRank and Co-occur-TextRank in extracting keywords from the test document set is shown in Figure 4. Co-occur-TextRank is slightly better than TextRank, with an average increase of 0.64% in AF1, indicating that assigning the importance score of the target word to its co-occurring words in proportion to their co-occurrence frequency can highlight the importance of topic-related words to a certain extent.
As for why Co-occur-TextRank is not significantly better than TextRank, we believe the main reason is that, in this paper, keywords are extracted from the document abstract, which is usually short and concise, so word co-occurrences in the abstract cannot fully reflect co-occurrences in the full text of the document. When the sliding window size is set to 3, the mean, median, and mode of the number of co-occurrences between words in the 10,000 experimental documents are 1.1294, 1, and 1, respectively; that is, most word pairs co-occur exactly once. This makes the transition probability from word Vj to Vi obtained by Equation (4) close to the uniform probability over the words adjacent to Vj used by TextRank (i.e., 1/|Out(Vj)|), so Co-occur-TextRank is not significantly superior to TextRank.
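A minimal sketch of this transition probability, assuming Equation (4) row-normalizes sliding-window co-occurrence counts (the exact equation is not reproduced here):

```python
from collections import Counter

def cooccur_transition(tokens, window=3):
    """Sketch of the co-occurrence-based transition used by
    Co-occur-TextRank: p(Vj -> Vi) is Vj's co-occurrence count with Vi
    divided by Vj's total co-occurrence count (assumed form of Eq. (4))."""
    co = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                co[(w, v)] += 1
                co[(v, w)] += 1
    totals = Counter()
    for (w, _), c in co.items():
        totals[w] += c
    return {(w, v): c / totals[w] for (w, v), c in co.items()}
```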
- (4) Validity verification of the similarity between words
CoGlo-TextRank: Based on the Co-occur-TextRank algorithm, GloVe-based similarity is used to transfer the word importance score to co-occurring words, as shown in Equation (21), where ω represents the proportion of the co-occurrence frequency of Vj and Vi in the transition probability from Vj to Vi, and Co(Vj, Vi) and Glo(Vj, Vi) are defined in Equation (4) and Equation (8), respectively.
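Given the description of Equation (21), the transition probability is a convex mix of the two shares; a minimal sketch, where co_p and glo_p are hypothetical edge-to-probability mappings such as the outputs of Equations (4) and (8):

```python
def coglo_transition(co_p, glo_p, omega=0.8):
    """Assumed form of Eq. (21): mix the co-occurrence share Co(Vj, Vi)
    and the GloVe-similarity share Glo(Vj, Vi) with weight omega."""
    edges = set(co_p) | set(glo_p)
    return {e: omega * co_p.get(e, 0.0) + (1 - omega) * glo_p.get(e, 0.0)
            for e in edges}
```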
To determine the value of ω, this paper tests the changes in AF1 when using the CoGlo-TextRank algorithm to extract the four to nine highest-scoring words from each document in the training set as keywords, with ω varying from 0 to 1 in steps of 0.1, as shown in Figure 5. AF1 is optimal at ω = 0.8 when k = 5, 6, 7, 8, or 9, and third-best at ω = 0.8 when k = 4. Therefore, when using the CoGlo-TextRank algorithm to extract keywords from documents, ω is set to 0.8.
The AF1 comparison of Co-occur-TextRank and CoGlo-TextRank in extracting keywords from the test document set is shown in Figure 6. There is no significant difference between the two algorithms in terms of AF1: CoGlo-TextRank is only 0.03% higher than Co-occur-TextRank on average, indicating that introducing knowledge external to the documents brings only a small improvement in keyword extraction.
As for why CoGlo-TextRank is not significantly better than Co-occur-TextRank, we believe the main reason lies in the normalization of similarities. When assigning the importance score of a word to its co-occurring words according to similarity in the iterative process, this paper applies min-max normalization to guarantee the similarity is non-negative, which shrinks the similarity differences between words. For example, suppose V1 has two co-occurring words, V2 and V3, in the sliding window, and the similarities between V1 and V2 and between V1 and V3 according to Equation (5) are 0.05 and 0.5, respectively. Normalized by Equation (7), the two similarities become 0.525 and 0.75. The share of V1 and V2 thus increases from 9.09% (0.05/(0.05 + 0.5) = 9.09%) to 41.18% (0.525/(0.525 + 0.75) = 41.18%), so the gap between the V1/V2 and V1/V3 similarities narrows, which is why introducing word similarity brings no significant improvement to the keyword extraction performance of CoGlo-TextRank.
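A short numeric check of this effect, assuming Equation (7) min-max normalizes a cosine similarity from [-1, 1] to [0, 1], which is consistent with the values above:

```python
def minmax_norm(x, lo=-1.0, hi=1.0):
    """Min-max normalization from [lo, hi] to [0, 1]; with cosine
    similarity this is (x + 1) / 2 (assumed form of Eq. (7))."""
    return (x - lo) / (hi - lo)

s12, s13 = 0.05, 0.5                           # raw similarities (Eq. (5))
n12, n13 = minmax_norm(s12), minmax_norm(s13)  # 0.525 and 0.75
print(round(s12 / (s12 + s13), 4))             # 0.0909 -> 9.09% before
print(round(n12 / (n12 + n13), 4))             # 0.4118 -> 41.18% after
```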
4.3.2. Determination of Model Parameters
Since AF1 takes both AP and AR into account, we mainly use the AF1 metric to determine the values of the parameters β, γ, and λ in this paper.
- (1) Determination of the parameter β of word position importance
According to Equation (3), the position importance of abstract words that appear in the title is 1, and that of other abstract words is β (0 < β ≤ 1). To determine the value of β, this paper tests the changes in AF1 when using the Pos-TextRank algorithm (Equation (19)) to extract the four to nine highest-scoring words from each training document as keywords and combining adjacent keywords in the original text into phrases, with β varying from 0 to 1 in steps of 0.1, as shown in Figure 7.
The k in Figure 7 represents the number of words extracted from each training document. As can be seen from Figure 7, the variation trend of AF1 with β is basically the same when extracting the top four to nine highest-scoring words as keywords: as β increases, AF1 rises slightly, reaches its maximum, and then decreases. When k = 4, 5, 6, 7, 8, or 9, AF1 is optimal at β = 0.4. Therefore, when using the TP-CoGlo-TextRank algorithm to extract keywords from documents, the position importance parameter β is set to 0.4.
When β = 1, title words and abstract words are weighted equally; that is, word position is not taken into account, and Pos-TextRank reduces to TextRank. As Figure 7 shows, Pos-TextRank extracts keywords from the training documents better with 0 < β < 1 than with β = 1. Consistent with Section 4.3.1, this also shows that word position can help improve the performance of the TextRank algorithm in extracting keywords.
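The grid search over β can be sketched as follows; extract_keywords(doc, beta, k) and af1_score(doc, keywords) are hypothetical stand-ins for the Pos-TextRank extractor and the AF1 evaluation against each document's reference keywords.

```python
def tune_beta(train_docs, extract_keywords, af1_score, ks=range(4, 10)):
    """Grid search mirroring Figure 7: average AF1 over k = 4..9 for each
    beta in {0.0, 0.1, ..., 1.0} and keep the best beta."""
    best_beta, best_af1 = None, -1.0
    for step in range(11):
        beta = step / 10
        scores = [af1_score(doc, extract_keywords(doc, beta, k))
                  for doc in train_docs for k in ks]
        af1 = sum(scores) / len(scores)
        if af1 > best_af1:
            best_beta, best_af1 = beta, af1
    return best_beta  # the paper selects beta = 0.4
```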
- (2) Determination of the parameter γ in the transition probability and the parameter λ in the overall weighting
In Equation (10), λ (0 ≤ λ ≤ 1) represents the proportion of the position importance in the overall weighting of word Vi, and in Equation (11), γ (0 ≤ γ ≤ 1) represents the proportion of the co-occurrence frequency in the total transition probability from word Vj to word Vi. To determine the values of λ and γ, this paper tests the changes in AF1 when using the TP-CoGlo-TextRank algorithm to extract the four to nine highest-scoring words from each document in the training set as keywords and combining adjacent keywords in the original text into phrases, with λ and γ each varying from 0 to 1 in steps of 0.1, as shown in Figure 8.
The three-dimensional coordinate system is shown in the lower part of Figure 8, where the x-axis refers to γ, the y-axis to λ, and the z-axis to AF1. In Figure 8, the dots represent the AF1 values of the TP-CoGlo-TextRank algorithm as λ and γ change; the triangles on the zx-plane are the projections of the AF1 values along the λ-axis; and the stars on the zy-plane are the projections of the AF1 values along the γ-axis. The orange, pink, green, blue, and gray dots represent the AF1 values at λ = 0; λ = 1; γ = 0; γ = 1; and 0 < λ < 1 and 0 < γ < 1, respectively. Correspondingly, the orange, pink, green, blue, and gray triangles on the zx-plane represent the projections of the AF1 values at λ = 0; λ = 1; 0 < λ < 1 and γ = 0; 0 < λ < 1 and γ = 1; and 0 < λ < 1 and 0 < γ < 1, respectively. The orange, pink, green, blue, and gray stars on the zy-plane represent the projections of the AF1 values at λ = 0 and 0 ≤ γ ≤ 1; λ = 1 and 0 ≤ γ ≤ 1; γ = 0 and 0 < λ < 1; γ = 1 and 0 < λ < 1; and 0 < γ < 1 and 0 < λ < 1, respectively.
It can be seen that when λ = 0 (i.e., only word frequency is considered, regardless of word position), the AF1 values (orange dots) are projected at the lowest level (orange triangles), far below the projections (gray or pink triangles) for which position importance is considered. When λ = 1 (i.e., only position importance is considered, regardless of word frequency), the projections (pink triangles) of the AF1 values (pink dots) mostly lie at the lower-middle level compared with the projections (gray triangles) at 0 < λ < 1. This illustrates that both word position and word frequency help distinguish the importance of words, as described in Section 4.3.1. It can also be seen that when γ = 0 (i.e., only the similarity-based transition probability is considered), the AF1 values (green dots) are mostly projected at the middle or upper-middle level (green stars) compared with the projections (gray stars) at 0 < γ < 1, and when γ = 1 (i.e., only the co-occurrence-based transition probability is considered), the projections (blue stars) of the AF1 values (blue dots) mostly lie at the middle level. This illustrates that the word importance score is distributed to co-occurring words more effectively when co-occurrence frequency and similarity are combined, as described in Section 4.3.1.
When λ = 0.3 and γ = 0.2, the TP-CoGlo-TextRank algorithm with adjacent keywords combined into phrases achieves the best average AF1 in extracting four to nine words from the training documents. Therefore, considering the influence of γ and λ on the performance and stability of the TP-CoGlo-TextRank algorithm, this paper sets λ = 0.3 and γ = 0.2.
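Based on the stated roles of λ and γ, the two weightings can be sketched as convex combinations; the exact forms of Equations (10) and (11) are not reproduced here, so these are assumptions.

```python
LAMBDA, GAMMA = 0.3, 0.2  # values selected in this section

def overall_weight(pos_i, tf_i, lam=LAMBDA):
    """Assumed form of Eq. (10): lam is the share of position importance
    in word Vi's overall weight; the remainder comes from term frequency."""
    return lam * pos_i + (1 - lam) * tf_i

def total_transition(co_ji, glo_ji, gamma=GAMMA):
    """Assumed form of Eq. (11): gamma is the share of the co-occurrence
    part in the transition probability from Vj to Vi; the remainder is
    the GloVe-similarity part."""
    return gamma * co_ji + (1 - gamma) * glo_ji
```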
4.3.3. Comparative Experiment
To verify the effectiveness of the proposed TP-CoGlo-TextRank algorithm in keyword extraction, six different algorithms are compared in the experiment.
- (1) M1: the TF-IDF method. The TF score is calculated by Equation (2), and the smoothed IDF is calculated by Equation (22), where N represents the size of the document set and df(w) represents the document frequency of word w (a sketch of one common smoothed form is given after this list).
- (2) M2: the LDA topic model-based method. In addition to the 10,000 documents in the experimental dataset, another 38,365 documents are selected from the DBLP-Citation-network V14 dataset, and a total of 48,365 documents are used to train the LDA model. The trained LDA model is then used to compute the influence of words in each test document, and the words with the highest influence are extracted as keywords (see the sketch after this list).
- (3) M3: the TextRank algorithm [6].
- (4) M4: the modified TextRank algorithm proposed in [32], which uses word frequency, position, and word co-occurrence relationships to compute the transition probability matrix. The optimal weights of the three components of the transition probability are 0, 0.9, and 0.1, respectively.
- (5) M5: the modified TextRank algorithm proposed in [37], which uses word similarities and co-occurrence relationships, each weighted 0.5, to compute the transition probability matrix, and takes the sum of similarities between a word and all of its co-occurring words as the word's initial score. The same word embedding model and word similarity calculation method as in M6 are used here.
- (6) M6: the TP-CoGlo-TextRank algorithm proposed in this paper. The glove.42B.300d [31] word embedding model, trained on 42 billion tokens from Common Crawl (http://commoncrawl.org, accessed on 1 March 2024), is used to obtain the word vectors (see the loading sketch after this list).
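For M1, Equation (22) is not reproduced in this section; the sketch below uses one common smoothed IDF, idf(w) = log((1 + N)/(1 + df(w))) + 1, as an assumption.

```python
import math

def smoothed_tfidf(tf_w, df_w, N):
    """TF-IDF with a smoothed IDF (an assumed stand-in for Eq. (22));
    the +1 terms avoid division by zero and a zero IDF."""
    return tf_w * (math.log((1 + N) / (1 + df_w)) + 1)
```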
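For M2, the paper's exact definition of word influence is not reproduced; the sketch below assumes it to be the sum over topics of p(topic | doc) * p(word | topic), computed with gensim.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def lda_word_influence(train_texts, doc_tokens, num_topics=15, topk=7):
    """Sketch of M2: train an LDA model, then rank a document's words by
    an assumed influence score sum_t p(t | doc) * p(w | t)."""
    dictionary = Dictionary(train_texts)
    corpus = [dictionary.doc2bow(t) for t in train_texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)

    topic_word = lda.get_topics()  # p(w | t), one row per topic
    doc_topics = dict(lda.get_document_topics(dictionary.doc2bow(doc_tokens)))

    influence = {}
    for w in set(doc_tokens):
        if w in dictionary.token2id:
            wid = dictionary.token2id[w]
            influence[w] = sum(p * topic_word[t][wid]
                               for t, p in doc_topics.items())
    return sorted(influence, key=influence.get, reverse=True)[:topk]
```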
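For M5 and M6, the GloVe vectors can be loaded from the standard glove.42B.300d text release, and Equation (5) is assumed to be cosine similarity (consistent with the [-1, 1] range discussed in Section 4.3.1):

```python
import numpy as np

def load_glove(path="glove.42B.300d.txt"):
    """Load GloVe vectors from the standard text format:
    each line is a word followed by its vector components."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def glove_similarity(vecs, w1, w2):
    """Cosine similarity between two words (assumed form of Eq. (5))."""
    a, b = vecs[w1], vecs[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```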
The above six algorithms are used to extract four to nine words from each document in the test set as keywords, and AP, AR, and AF1 are employed to evaluate their performance.
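The evaluation can be sketched as follows; the exact definitions of AP, AR, and AF1 are given earlier in the paper, and this sketch assumes they are per-document precision, recall, and F1 averaged over the test set.

```python
def ap_ar_af1(extracted, gold):
    """Assumed AP/AR/AF1: precision, recall, and F1 computed per document
    and averaged. `extracted` and `gold` are parallel lists of keyword sets."""
    ps, rs, fs = [], [], []
    for ext, ref in zip(extracted, gold):
        tp = len(set(ext) & set(ref))
        p = tp / len(ext) if ext else 0.0
        r = tp / len(ref) if ref else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(ps)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```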
Figure 9 shows the comparison of the six algorithms when extracted keywords that are adjacent in the original text are not combined into phrases. The following points can be determined from Figure 9.
- (1) The M2 algorithm has the worst keyword extraction performance. We believe this is mainly because M2 depends heavily on how well the LDA model is trained. Although the 48,365 documents used to train the LDA model are all computer-related, their topic distribution is relatively scattered. As shown in Figure 10, the optimal number of topics is 15 and the highest topic coherence score is 0.5250, indicating that the semantic coherence of the words within topics is weak; as a result, the keywords extracted by M2 cannot express the document topics well.
- (2) As the number of extracted keywords k increases, the AP of the six algorithms gradually decreases, indicating that the number of correctly extracted keywords grows more slowly than k. AR gradually increases, mainly because the number of reference keywords provided by each document is fixed, so every additional correctly extracted keyword raises recall, albeit by a small amount. With the increase in k, the AF1 of the five algorithms other than M2 first increases, reaches a peak, and then gradually decreases; the AF1 of M2 peaks at k = 16 before trending downward. The late peak of M2 arises because the topic coherence of the trained LDA model is not high: when only a few words are extracted as keywords, they cannot express the document topics well, but as k increases, more high-quality words are extracted, and M2 gradually reaches its peak.
- (3) The graph-based algorithms M3, M4, M5, and M6 outperform the statistical-feature-based algorithm M1 in keyword extraction: their average AP values are higher by 1.40%, 10.42%, 1.65%, and 12.14%, respectively; their average AR values by 1.73%, 6.60%, 1.74%, and 8.09%; and their average AF1 values by 1.57%, 8.95%, 1.74%, and 10.61%. This shows that graph-based keyword extraction algorithms can better measure the importance of words by exploiting the relationships between them.
- (4) Compared with the five baseline algorithms M1 to M5, the M6 algorithm achieves significant improvements in AP, AR, and AF1: the average AP increases by 12.14%, 23.34%, 10.74%, 1.72%, and 10.49%, respectively; the average AR by 8.09%, 24.31%, 6.36%, 1.48%, and 6.34%; and the average AF1 by 10.61%, 23.65%, 9.04%, 1.66%, and 8.87%. This shows that the M6 algorithm performs well in keyword extraction when adjacent keywords in the original text are not combined into phrases.
Figure 11 shows the comparison of the six algorithms when extracted keywords that are adjacent in the original text are combined into phrases. The M6 algorithm remains superior to the other five algorithms in terms of AP, AR, and AF1.
Figure 12 shows the AF1 comparison of the six algorithms with and without combining adjacent keywords in the original text into a single phrase keyword. The suffix "(Combined)" or "(Uncombined)" after an algorithm name indicates whether adjacent keywords in the original text are combined. The AF1 value of each algorithm with adjacent keywords combined is higher than without: on average, M1 (Combined) improves on M1 (Uncombined) by 3.83%, M2 by 1.67%, M3 by 4.46%, M4 by 6.00%, M5 by 4.46%, and M6 by 6.54%. This is mainly because, for each test document, combining adjacent keywords into a phrase greatly reduces the number of keywords the algorithm outputs while barely changing the number of correctly extracted keywords. The precision of keyword extraction on most test documents therefore improves significantly, and so does AP. Since the number of reference keywords provided by each document is unchanged, the recall on most test documents rises only slightly, and correspondingly AR increases slightly. As AF1 combines AP and AR, the AF1 of each algorithm increases significantly after adjacent keywords are combined into phrases. It can also be seen from Figure 12 that the AF1 value of the M6 (Combined) algorithm peaks at k = 7 (i.e., extracting the seven highest-scoring words from each test document).
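A minimal sketch of this phrase-combination step, merging extracted keywords that appear as adjacent words in the original token sequence into a single phrase keyword:

```python
def combine_adjacent(keywords, tokens):
    """Merge runs of extracted keywords that are adjacent in the original
    text into one phrase keyword; non-keyword tokens break a run."""
    kw, phrases, i = set(keywords), [], 0
    while i < len(tokens):
        if tokens[i] in kw:
            j = i
            while j + 1 < len(tokens) and tokens[j + 1] in kw:
                j += 1
            phrases.append(" ".join(tokens[i:j + 1]))
            i = j + 1
        else:
            i += 1
    return list(dict.fromkeys(phrases))  # deduplicate, keep order
```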
Table 5 compares the six algorithms at k = 7 with and without combining adjacent keywords into a single phrase keyword. When extracting the seven highest-scoring words from each test document, the M6 (TP-CoGlo-TextRank) algorithm proposed in this paper performs best against the five baseline algorithms (M1 to M5), regardless of whether adjacent keywords in the original text are combined. Without combining, the AF1 value of M6 is 10.65%, 23.16%, 9.25%, 1.62%, and 9.13% higher than that of M1 to M5, respectively; with combining, it is 13.65%, 28.45%, 11.58%, 2.34%, and 11.50% higher. Therefore, building on the TextRank algorithm, combining word frequency and position to measure word importance, and using word co-occurrence frequency together with the GloVe model to achieve non-uniform transfer of word scores, yields keywords that are more consistent with the document topic.
Therefore, when constructing the academic literature knowledge graph, this paper uses the TP-CoGlo-TextRank algorithm to extract keywords from documents without keywords, selects the seven highest-scoring words as keywords, and combines keywords that are adjacent in the original text into phrases; the combined results serve as the final keywords of the document.