1. Introduction
In recent years, as global environmental pollution and resource shortages have worsened, green technology and related measures have been proposed to address these problems [1]. The Chinese government has issued policies to encourage and support green industries, promoting the rapid development of green technology and environmental protection industries [2,3]. Existing studies use patents as an important indicator of innovation. China has been the world’s largest patent-filing country since 2011, but patent quality has become extremely uneven, and how to identify significant patents early is a topic worthy of research [4]. Scholars have built patent citation networks from citation data and created a series of centrality metrics to identify the impact of patents [5]. However, those networks are homogeneous information networks that contain only patent nodes and ignore other patent information. More seriously, most scholars have assessed the impact or significance of patents using citation-based ranking metrics, and few have analyzed the age bias of these metrics or its effect on the performance evaluation indicators [6]. This paper therefore addresses two issues. First, it uses the multidimensional information of patents to construct a heterogeneous information network that evaluates patent importance more comprehensively. Second, when using citation-based ranking metrics, it analyzes the age bias so that significant patents can be identified earlier.
Analyzing only a homogeneous information network generally misses information that is useful for exploring the nature and laws of the research target. Most real-world information networks are heterogeneous and contain richer information, so a heterogeneous information network is more comprehensive and closer to reality [7]. Patent information includes applicants and inventors; applicants are the subjects who can apply for and obtain patent rights, so they are of great significance to patents. For example, in a given technological domain, if an authoritative applicant holds a technical monopoly or has a certain influence in the domain, the quality of its patent applications is generally higher than that of other applicants’ patents [8]. Therefore, this paper combines the applicant and citation information of patents to build a CNGP heterogeneous information network, which not only alleviates the sparsity problem of the homogeneous information network but also reflects the actual situation of patents more faithfully.
When citation-based ranking metrics are used to evaluate node importance in a network, the age bias induced by those metrics must be considered [9,10]. Because the constructed datasets are truncated and citations accumulate over time, old patents have had a longer time span in which to collect citations than young patents [10,11]. Consequently, classic ranking metrics (e.g., the citation count, PageRank) suffer from age bias [9,12]. To suppress age bias, scholars have proposed methods such as CiteRank [13], Time Weighted PageRank [14], and the rescaling method [10]. The rescaling method proposed by Mariani et al. [10] was shown to be effective in suppressing the age bias of ranking metrics; it adjusts the ranking metrics to achieve relative fairness in ranking old and young patents.
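As a rough illustration of the rescaling idea, the sketch below compares each patent’s score with the mean and standard deviation of the scores of patents of similar age. The z-score form, the window size `window`, the function name, and the toy citation counts are all assumptions made for illustration; the exact procedure is the one given by Mariani et al. [10].

```python
from statistics import mean, pstdev


def rescale_scores(scores, window=4):
    """Age-rescale scores listed in order of increasing application date.

    ASSUMED form: R_i = (s_i - mu_i) / sigma_i, where mu_i and sigma_i are
    computed over the `window` patents closest in age to patent i.
    """
    n = len(scores)
    rescaled = []
    for i in range(n):
        # Centre a window on patent i, clipped to the bounds of the list.
        lo = max(0, min(i - window // 2, n - window))
        hi = min(n, lo + window)
        peers = scores[lo:hi]
        mu, sigma = mean(peers), pstdev(peers)
        rescaled.append((scores[i] - mu) / sigma if sigma > 0 else 0.0)
    return rescaled


# Toy citation counts, oldest patent first: old patents have more citations.
citations = [40, 30, 35, 25, 6, 4, 8, 2]
rescaled = rescale_scores(citations, window=4)
```

In this toy example the young patent with 8 citations receives the same rescaled score as the old patent with 40 citations, because each stands out equally among its age peers.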
In this paper, based on the Chinese green patent (CNGP) dataset from 1985 to 2020, we construct a CNGP heterogeneous applicant-citation network and use the Chinese Patent Award (CPA) data as the expert-selected significant patents to be identified. In the model analysis, the rescaling method and a normalization procedure are used to address age bias. The results show that the proposed analysis method identifies significant patents earlier and that the constructed heterogeneous information network achieves better performance. The analysis method proposed in this paper can therefore reasonably evaluate patent quality and provides a new way to measure it.
Evaluating patent quality by building heterogeneous information networks is of considerable significance. This study makes the following contributions. (1) To the best of our knowledge, this paper is the first study to build a dataset of Chinese green patents and conduct patent importance analysis on it, laying a solid foundation for research on green innovation in China. (2) Combining the applicant information of patents with the citation information is a way to analyze patents using multidimensional information, which provides scholars with new perspectives for studying patent quality. (3) In the heterogeneous information network, we consider the effect of age bias and use the rescaling method and a normalization procedure to address it. Together, these steps form a complete analysis method and provide a new approach for scholars to study the importance of patents.
The paper is organized as follows. In Section 2, we present the work related to our research. In Section 3, we describe and analyze the dataset used in this study, including the steps for building the CNGP dataset and the set of expert-selected significant patents, and then analyze the obtained data in depth. In Section 4, we introduce the heterogeneous applicant-citation networks and present the ranking algorithms and evaluation indicators considered. In Section 5, we evaluate and analyze the results. In Section 6, we offer some discussion. Finally, in Section 7, we draw the conclusions of this paper.
3. Data and Analysis
We collected the Chinese green patents (CNGP) dataset, which spans the years 1985 to 2020. The dataset contains patent citations and applicant information, and patents appear gradually over time. Nodes are patents and applicants; directed links represent patent citation relationships, and undirected links represent patent-applicant relationships. A corresponding set of expert-selected high-impact patents is referred to as the significant patents. Table 2 summarizes the basic characteristics of the analyzed dataset, including the time span, patent nodes, applicant nodes, patent citation edges, patent-applicant edges, and the corresponding set of significant patents.
This section has three parts. First, we describe how the CNGP applicant-citation dataset is built. Then, we explain the source of the expert-selected significant patents and match them to the CNGP dataset. Finally, we analyze the obtained data in depth, laying the foundation for this study.
3.1. CNGP Applicant-Citation Dataset
As an important means of environmentally sustainable development, green innovation has attracted growing attention from society and government. However, although China has been the world’s largest patent-filing country since 2011, no corresponding dataset has yet existed for studying its green innovation. As the staged achievements of technological innovation, patents are especially valuable to analyze. Hence, establishing a Chinese green patent dataset helps to explore China’s green innovation process in depth and to discover significant patents earlier. The flowchart for building the CNGP applicant-citation dataset is shown in Figure 1.
The procedure for building the CNGP applicant-citation dataset is as follows:
- (1) We use crawler technology to collect invention patent applications from the “China National Intellectual Property Administration” (CNIPA) website (http://epub.cnipa.gov.cn/ (accessed on 1 July 2022)). Previous studies show that the data from this website are valuable for studying patent quality and can be used as a proxy variable for studying innovation [43,44]. After collecting and sorting, we obtained a total of 12,814,946 invention patents from 1985 to 2020.
- (2) To identify Chinese green patents, we match each patent’s International Patent Classification (IPC) number against the IPC green inventory published by the World Intellectual Property Organization (WIPO) (https://www.wipo.int/classifications/ipc/green-inventory/home (accessed on 1 July 2022)), which leaves a total of 1,670,450 green patents in China.
- (3) The CNIPA database does not include patents’ citation information. Fortunately, Google Patent (https://patents.google.com/ (accessed on 1 July 2022)) provides citation data for all Chinese patent applications. Moreover, Google Patent updates the forward citation data of each patent according to the backward citation information. We link those data with the CNGP dataset through the patent’s publication number [45,46].
- (4) Finally, we further process the obtained data. Following Kogan et al. [21], we retain the citation relationships from 1985 to 2020 and ensure that the cited patents belong to the CNGP dataset. In addition, we keep the information about each corresponding applicant and link it to the patent. The resulting CNGP applicant-citation dataset comprises 878,007 patent nodes, 202,764 applicant nodes, 1,676,458 directed citation edges, and 516,201 undirected applicant-patent edges.
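The filtering and linking steps above can be sketched on toy data as follows. This is a minimal sketch, not the pipeline actually used: the record layout, the publication numbers, and the two IPC prefixes standing in for the WIPO green inventory are all hypothetical, and requiring both endpoints of a citation to be green patents is an assumption.

```python
def build_network(patents, citations, green_prefixes):
    """Build a toy applicant-citation network.

    patents: {publication number: {"ipc": str, "year": int, "applicant": str}}
    citations: iterable of (citing, cited) publication-number pairs
    green_prefixes: IPC prefixes treated as green (hypothetical stand-in
                    for the WIPO IPC green inventory)
    """
    # Step (2): keep patents in the time span whose IPC number matches a green prefix.
    green = {
        pid: rec
        for pid, rec in patents.items()
        if rec["ipc"].startswith(tuple(green_prefixes)) and 1985 <= rec["year"] <= 2020
    }
    # Steps (3)-(4): keep citations whose endpoints are both retained green patents
    # (ASSUMPTION: the source only states that cited patents must be in the dataset).
    citation_edges = [(a, b) for a, b in citations if a in green and b in green]
    # Step (4): link each retained patent to its applicant (undirected edge).
    applicant_edges = [(pid, rec["applicant"]) for pid, rec in green.items()]
    applicants = {rec["applicant"] for rec in green.values()}
    return green, applicants, citation_edges, applicant_edges


patents = {
    "CN1": {"ipc": "H02J3/38", "year": 2005, "applicant": "A"},
    "CN2": {"ipc": "H02J3/38", "year": 2010, "applicant": "A"},
    "CN3": {"ipc": "A47B1/00", "year": 2012, "applicant": "B"},
    "CN4": {"ipc": "C02F1/00", "year": 2015, "applicant": "C"},
}
citations = [("CN2", "CN1"), ("CN3", "CN1"), ("CN4", "CN2"), ("CN4", "CN9")]
green, applicants, citation_edges, applicant_edges = build_network(
    patents, citations, green_prefixes=["H02J", "C02F"]  # illustrative prefixes only
)
```

Patents form one node type and applicants another, so the directed citation edges and the undirected applicant-patent edges together give the heterogeneous network described in this section.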
3.2. Expert-Selected Significant Patents
The Chinese Patent Award (CPA) is the highest award in the field of Chinese patents and has been jointly issued by the CNIPA and the WIPO since 1989. It is the only government award in China that specifically rewards granted patents, and it has a certain international influence. The evaluation criteria cover not only the legal, technical, and market dimensions of a patent but also its social benefits and development prospects. Therefore, it is scientifically sound and feasible to treat patents that have won this award as high-quality patents.
In addition, Moser and Nicholas’s [47] study of US patents found that the use of incentives attracts more innovators and leads to more patents and better innovation at a later stage. The CPA is a government award set up by the Chinese government to find and reward high-quality Chinese patents; it reflects the technological quality and economic benefits of Chinese patents, focuses on their value, and plays a role in leading innovation [48]. Some scholars hold that the award-winning patents selected by the CPA are characterized by high creation quality, strong patent protection, and good patent application, and thus that the intellectual property rights and innovations represented by the CPA make a great contribution to society [49,50]. Other scholars directly treat patents that won the Chinese Patent Gold Award as high-value patents and analyze the differences in patent quality among different types of patent owners and different regions [51,52]. Based on CPA data, some researchers have constructed a patent quality assessment indicator framework; these studies find that the CPA award results are relatively fair and that awarded patents have a higher value than non-awarded patents [48,53,54]. Therefore, it is reasonable to adopt the CPA as the label of significant patents in this paper.
The award-winning patent dataset is collected as follows. First, we visit the CNIPA website (https://www.cnipa.gov.cn/col/col41/index.html (accessed on 1 July 2022)), collect and sort all the award information for invention patent applications, and standardize the dataset, obtaining 6169 award-winning patents in total. Then, we restrict our analysis to the patents issued within our dataset’s temporal span and match them against the green patents in the CNGP applicant-citation dataset, which leaves 839 award-winning green patents. From 1989 to 2020, there were a total of 22 sessions of the Chinese Patent Award. The number of green patents awarded in the CPA by session is shown in Figure 2. Before 2007, the award was held biennially; after that, it was held annually. Figure 2 shows that the number of invention patents awarded is not fixed in each session, but the number of awards for green patents has increased markedly since 2015. This suggests that, in pursuit of sustainable development, research on green patents has gradually received attention from society and the government.
The Chinese Patent Award has three categories: gold, silver, and excellent awards. We do not distinguish between them, for the following reasons. (1) The award-winning patents selected in this paper are invention patents, not utility model patents; invention patents are more valuable than utility model patents and better reflect innovation. (2) Different award categories imply different patent values, but neither the categories awarded nor their numbers are fixed across sessions. For example, the silver award was first awarded in 2018, so its total volume is minimal, while the number of gold awards has increased from 10 to 30 per session. The number of patents receiving excellent awards far exceeds the number of gold and silver awards. (3) The number of green patents studied in this paper is 878,007, and the numbers of green patents awarded gold, silver, and excellent awards are 42, 27, and 770, respectively. Awarded green patents account for 0.95‰ of all green patents, and the gold and silver awards each account for less than one ten-thousandth. Therefore, it is not meaningful to distinguish between categories of awarded patents. In addition, the purpose of this paper is to construct a heterogeneous patent network to identify significant patents earlier, so the effect of different CPA categories on identification performance is not considered.
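The shares quoted in reason (3) follow directly from the counts given above. The worked arithmetic below uses rounded values; the overall share of about 0.956‰ matches the quoted 0.95‰ when truncated to two decimals.

```latex
\frac{42 + 27 + 770}{878{,}007} = \frac{839}{878{,}007} \approx 0.956 \times 10^{-3},
\qquad
\frac{42}{878{,}007} \approx 4.8 \times 10^{-5} < 10^{-4},
\qquad
\frac{27}{878{,}007} \approx 3.1 \times 10^{-5} < 10^{-4}.
```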
3.3. Data Analysis
The previous subsections constructed the CNGP applicant-citation dataset; we now analyze it in depth. The dataset contains not only the citation information of patents but also their applicant information, so its heterogeneous character can be exploited to identify significant patents.
First, we analyze the distribution characteristics of patents, as shown in Figure 3. Figure 3a shows that the number of green patents on a log scale increases almost linearly over time (the points for the last two years can be ignored, since the data may be incomplete owing to data lag). This indicates a continued focus on green technology innovation and development in China, with more patented products being used in the environment and resource areas. In addition, we analyze the distribution of the number of patents by citation count; the result is shown in Figure 3b. Most “normal” patents have few citations, while a few “seminal” patents have many. Such a network is called a scale-free network. Moreover, the average path length of this dataset is 4.34, which indicates that the network is a small-world network (Six Degrees of Separation). Hence, metrics applicable to complex networks can be applied to our dataset as well.
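The two quantities discussed above, the citation-count distribution and the average path length, can be computed as in the following sketch. The function names and the star-shaped toy network are illustrative, and treating citation edges as undirected when measuring path length is an assumption, since the text does not say how direction was handled.

```python
from collections import Counter, deque


def indegree_distribution(nodes, edges):
    """Return {citation count: number of patents} for directed (citing, cited) edges."""
    indeg = Counter(cited for _, cited in edges)
    return Counter(indeg.get(n, 0) for n in nodes)


def average_path_length(nodes, edges):
    """Mean shortest-path length over connected node pairs, ignoring edge direction."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    total, pairs = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy network: three patents all cite one "seminal" patent p1.
nodes = ["p1", "p2", "p3", "p4"]
edges = [("p2", "p1"), ("p3", "p1"), ("p4", "p1")]
distribution = indegree_distribution(nodes, edges)
apl = average_path_length(nodes, edges)
```

A heavy-tailed `distribution` (many patents with few citations, a few with many) is the pattern described as scale-free, and a small `apl` relative to the network size is the small-world property.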
Many scholars have used citation counts to assess the quality of patents [19,23]. Applying the same method to our dataset, Table 3 lists the top 10 patents ranked by citation count. It includes each patent’s rank, application number, title, application year, and applicant; “count” refers to the patent’s citation count. The patent titles clearly indicate that they are green patents, and the application years show that the top 10 patents are relatively old. This illustrates that, in the CNGP dataset, old patents have had a longer time span in which to be cited and therefore have an advantage over their younger peers in accumulating citations. A citation-count ranking equates more citations with higher patent value. This conclusion is consistent with the results of other literature [10,11].
We divided patent applicants into four types: enterprise, university, research institute, and individual. Figure 4 shows the proportion of each applicant type in our dataset. Enterprises constitute 65.03% of the Chinese green patent applicants and are the largest component. The second-largest component is individuals, who account for 31.82% of applicants. Research institutes and universities constitute 2.01% and 1.15% of applicants, respectively. These findings show that, in Chinese green patents, enterprises are dominant, followed by individuals, while the participation of research institutes and universities is relatively small. The high proportion of enterprises and individuals suggests that they are more inclined to turn technological innovation into patents and can benefit from their own patents, for example by enhancing their core competitiveness. By contrast, the proportion of research institutes and universities is smaller because their number is limited; however, their R&D capabilities cannot be ignored, so the analysis of applicants is indispensable.
Many studies rank applicants according to the number of patents they own [33,55,56]. We therefore analyze the number of Chinese green patents owned by different applicants, which reveals differences in their technological innovation status; we do not distinguish between an applicant’s current and former names. The result is shown in Table 4, where “number” refers to the number of patents owned by an applicant and “average citations” is the average number of citations of all the patents filed by that applicant in this dataset. Table 4 shows that, in the Chinese green patent dataset, the State Grid Corporation of China has the largest number of patents, indicating a high level of R&D and a strong awareness of intellectual property protection. China Petroleum & Chemical Corporation follows and also has an important status in Chinese green patent innovation. There are 3 enterprises, 5 universities, and 2 research institutes among the top 10 applicants. No individual ranks in the top 10, implying that, although individuals make up 31.82% of applicants, their influence is weaker than that of organizations. All five universities belong to “Project 211” (a national key universities program), indicating that most green technological innovation activities take place in well-known universities. Only larger state-owned enterprises actively participate in green innovation, perhaps because green innovation is a new field and green patents are difficult to convert into market value, so other enterprises do not pay enough attention to green patents. By applying for patents, universities and research institutes can not only improve their innovation capabilities but also obtain considerable income through patent transfer. Universities and research institutes have therefore become the main force in green invention patents. Moreover, the average number of patent citations over all applicants in this dataset is 1.62. Among the top 10 applicants, only the State Grid Corporation of China falls below this overall average, because the company changed its name only in 2017 and its patents have not had enough time to accumulate citations.
In general, patent applicants with influence or authority in the field, such as the China Electric Power Research Institute and Southeast University, generally have the strength of technological monopoly, and their average patent citations are much higher than those of common applicants. Thus, the patents filed by these applicants are likely to embody high technological innovation and value.
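The two columns of Table 4, the number of patents per applicant and the average citations per patent, can be derived as in the sketch below. The applicant names and counts are invented toy values, and the sketch assumes one applicant per patent for simplicity.

```python
from collections import defaultdict


def applicant_stats(patent_applicant, citation_counts):
    """Return {applicant: (number of patents, average citations per patent)}."""
    owned = defaultdict(list)
    for patent, applicant in patent_applicant.items():
        # Patents with no recorded citations count as zero.
        owned[applicant].append(citation_counts.get(patent, 0))
    return {a: (len(c), sum(c) / len(c)) for a, c in owned.items()}


# Toy data (hypothetical applicants and citation counts).
stats = applicant_stats(
    {"p1": "Grid Co", "p2": "Grid Co", "p3": "Univ X", "p4": "Grid Co"},
    {"p1": 1, "p2": 0, "p3": 5, "p4": 2},
)
# Rank applicants by the number of patents they own, largest first.
top = sorted(stats.items(), key=lambda kv: kv[1][0], reverse=True)
```

In the toy data the applicant with the most patents has the lower average citation count, which mirrors the observation that patent volume and average citations can rank applicants differently.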
Previous literature has generally analyzed the importance of patents using citation networks alone, without considering the influence of applicant information. On the one hand, a homogeneous information network does not conform to the actual situation; on the other hand, the results obtained from it are biased. Therefore, this paper builds a heterogeneous information network, which contains richer information (e.g., applicants and citations) and is better suited to studying the importance of patents.
5. Results
5.1. Metrics’ Performance on the Complete Heterogeneous Networks
We start by measuring the identification rate (IR) and the average precision (AP) of all the metrics on the complete heterogeneous applicant-citation networks, which are used to single out the significant patents. The colors of the bars distinguish the original ranking metrics (red) from their age-rescaled counterparts (dark). Figure 6 shows that the original metrics perform similarly to one another in identifying the significant patents on the complete heterogeneous networks, as do the rescaled metrics. However, the IR and AP scores are not very high, possibly because significant patents are difficult to separate from the other patents in our dataset. From the identification rate in Figure 6a, we can see that: (1) ID is the best-performing metric, with a small margin over the other original metrics and a large margin over all rescaled metrics. (2) The ratio between the best and the worst metric’s IR score is 2.05. (3) All rescaled metrics perform significantly worse than their non-rescaled counterparts. In Figure 6b, the AP scores of all the metrics are quite small. The reason is that the top zN patents contain only a small proportion of significant patents (as shown in Figure 3b), and the ranking position of each significant patent also matters, so the resulting score is very small. In addition, PR and R_ID achieve the best AP scores among the original and the rescaled metrics, respectively.
The above analysis shows that no single metric is clearly best on the complete heterogeneous networks. The original metrics outperform the rescaled metrics, but they are age-biased. It is therefore crucial to analyze the effect of patent age by dissecting the complete networks.
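For readers who want a concrete picture of the two indicators, the sketch below computes an identification rate and an average precision for one ranking. The definitions used here are assumptions: IR is taken as the fraction of significant patents found among the top zN ranked patents, and AP as the usual information-retrieval average precision restricted to that top list. The paper’s own definitions are those given in Section 4.

```python
import math


def identification_rate(ranking, significant, z):
    """ASSUMED IR: fraction of significant patents ranked in the top z*N."""
    top = ranking[:math.ceil(z * len(ranking))]
    hits = sum(1 for p in top if p in significant)
    return hits / len(significant)


def average_precision(ranking, significant, z):
    """ASSUMED AP: mean of precision-at-k over hits in the top z*N."""
    top = ranking[:math.ceil(z * len(ranking))]
    hits, total = 0, 0.0
    for k, p in enumerate(top, start=1):
        if p in significant:
            hits += 1
            total += hits / k  # precision at position k
    return total / len(significant)


# Toy ranking of ten patents (best first) with three significant patents.
ranking = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
significant = {"a", "c", "j"}
ir = identification_rate(ranking, significant, z=0.5)
ap = average_precision(ranking, significant, z=0.5)
```

Under these assumed definitions, AP is never larger than IR, because each hit contributes at most one, and it drops further when hits sit low in the top list. This is consistent with the small AP scores reported above.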
5.2. Metrics’ Performance with Patents Age
Although the analysis in the previous section reveals important differences among the ranking metrics, the main objective of this paper is to reveal how the metrics’ performance depends on patent age and to evaluate their ability to identify significant patents earlier. We therefore dissect the network evolution by constructing network snapshots at the end of each calendar year and ranking all the nodes on each snapshot; the detailed steps are given in Section 4.2.1. Figure 7 shows the metrics’ performance as a function of the age of the significant patents, which reveals how performance evolves over time. To facilitate comparison, the performance of each metric was normalized to that of the best metric in each age bin, so that the best metric in each age bin receives a score of one. For instance, a metric with zero IR obtains a score of zero, while a metric that achieves the best IR for a given significant-patent age obtains a score of one. The same process is used for AP.
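The per-age-bin normalization described above amounts to dividing each metric’s score by the best score in the same bin. A minimal sketch, with made-up scores for two metrics over three age bins:

```python
def normalize_by_best(scores_by_metric):
    """Scale each age bin so that the best metric in that bin scores one.

    scores_by_metric: {metric name: [score per age bin]}
    """
    n_bins = len(next(iter(scores_by_metric.values())))
    best = [max(s[b] for s in scores_by_metric.values()) for b in range(n_bins)]
    return {
        m: [s[b] / best[b] if best[b] > 0 else 0.0 for b in range(n_bins)]
        for m, s in scores_by_metric.items()
    }


# Illustrative IR scores only; these are not values from the paper.
raw = {"ID": [0.02, 0.10, 0.30], "R_ID": [0.08, 0.05, 0.00]}
norm = normalize_by_best(raw)
```

After normalization, each bin shows only which metric leads and by what relative margin, which makes curves for different ages comparable even when the absolute scores differ by an order of magnitude.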
As shown in Figure 7a, the relative performance of the metrics changes dramatically with the age of the significant patents. The rescaled metrics, which work well shortly after the application date, lose their advantage as the significant patents become older, after which the original metrics perform better. In our study, no single metric performs well for most age values. Rescaled indegree (R_ID) is best until age 3, indegree (ID) is best until age 10, LeaderRank (LR) is best until age 13, and ID and PR then alternate as the best metric until age 20. Figure 7b, however, indicates that R_ID is best until age 2, NGSC is best until age 9, and ID is best from then until age 20. The rescaled metrics perform worse once the significant patents are more than 12 years old.
The above analysis shows that the rescaled metrics identify the significant green patents in the dataset earlier. As patent age increases, the rescaled metrics’ scores fall far below those of the original metrics. This supports the validity of the rescaling procedure and illustrates how the age bias of patents affects the ranking results.
5.3. Further Explanation about the Results
To further understand the differences between the ranking metrics, we used Spearman’s rank correlation between the rankings of all patents to assess their pairwise similarity. In statistics, Spearman’s $\rho$ is a non-parametric measure of rank correlation. It is based on the differences between the ranks of patents in two ranked lists and provides a quantitative measure of how similar these lists are. Its value lies in $[-1, 1]$: the higher the absolute value, the more similar the two rankings, and a value of 0 denotes no correlation. It can be computed with the well-known formula:
$$\rho = 1 - \frac{6 \sum_{i=1}^{N} (r_i - r_i')^2}{N (N^2 - 1)}$$
where $N$ denotes the number of patents, and $r_i$ and $r_i'$ are the positions of patent $i$ in the two ranked lists.
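As a small worked example, Spearman’s rank correlation for two tie-free rankings can be computed from the squared rank differences with the standard formula, one minus six times the sum of squared differences divided by N(N² − 1). The function name and the toy rankings are illustrative; rankings with ties would need the general (Pearson-on-ranks) form instead.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings given as {patent: position}."""
    n = len(rank_a)
    d2 = sum((rank_a[p] - rank_b[p]) ** 2 for p in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy rankings of four patents under two hypothetical metrics.
by_metric_1 = {"p1": 1, "p2": 2, "p3": 3, "p4": 4}
by_metric_2 = {"p1": 4, "p2": 3, "p3": 2, "p4": 1}  # exact reverse order

same = spearman_rho(by_metric_1, by_metric_1)
reverse = spearman_rho(by_metric_1, by_metric_2)
```

Identical rankings give a coefficient of 1 and exactly reversed rankings give −1, the two ends of the range.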
In addition, we use hierarchical clustering, an algorithm that groups similar objects together, to cluster the results. The hierarchical clustering of the metrics is obtained with the unweighted pair-group method with arithmetic mean (UPGMA). The result is shown in Figure 8, together with the metric clustering based on the obtained correlation coefficients.
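To make the clustering step concrete, the following is a minimal pure-Python sketch of UPGMA (average-linkage) clustering. Using one minus the correlation as the distance between two metrics is an assumption, and the correlation values are invented for illustration rather than taken from Figure 8.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering.

    names: list of labels; dist: {frozenset({a, b}): distance} for all pairs.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of labels.
    """
    clusters = [(n,) for n in names]
    merges = []

    def avg(c1, c2):
        # UPGMA distance: unweighted mean over all cross-cluster pairs.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (avg(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        ]
        d, i, j = min(pairs)  # closest pair of clusters
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented correlations between four metrics (not values from the paper).
rho = {
    ("ID", "PR"): 0.9, ("R_ID", "R_PR"): 0.8,
    ("ID", "R_ID"): 0.3, ("ID", "R_PR"): 0.3,
    ("PR", "R_ID"): 0.3, ("PR", "R_PR"): 0.3,
}
dist = {frozenset(pair): 1 - r for pair, r in rho.items()}  # ASSUMED distance
merges = upgma(["ID", "PR", "R_ID", "R_PR"], dist)
```

For real data, SciPy’s `scipy.cluster.hierarchy.linkage` with `method='average'` performs the same average-linkage clustering and is the more practical choice.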
From Figure 8, the following points should be noted. (1) The clustering of metrics is extremely stable in the CNGP dataset. (2) The hierarchical clustering reveals two groups of metrics whose rankings are similar to each other. The larger group includes four ranking metrics: PR, LR, ID, and NGSC. The smaller group includes some of their rescaled variants: R_ID, R_PR, and R_LR. (3) However, R_NGSC is not clustered with the other rescaled metrics, probably because the rescaling procedure has no effect on the NGSC metric. (4) Within each of the two clusters, the pairwise Spearman’s rank correlation coefficients are rather high (above 0.78 in our dataset), which indicates a high degree of similarity among the respective metrics.
5.4. Caveats of the Evaluation Indicators
As shown in Figure 6 above, the ranking metrics do not score well on the IR and AP evaluation indicators. To understand why, we explore the age distribution of the significant patents in our dataset. First, we examine the distribution of the significant green patents by application date. Second, we sort all green patents by application date, divide them into 40 equally-sized age groups, and count the significant patents in each group. The results are shown in Figure 9.
In Figure 9a, we count the significant patents in five-year spans; their application dates fall mostly in 2010–2014. Although the period before 2004 spans as long as 20 years, it accounts for only 10.96% of the significant patents, and the average application year is 2010. In addition, we sorted the complete dataset by application date and split it into 40 equally-sized age groups (with groups 1 and 40 containing the oldest and the most recent patents, respectively). Figure 9b shows that the significant patents are distributed unevenly among the age groups. Most of them fall in the older age groups (about 73% in age groups 1 to 10), with fewer than 10 significant patents in age groups 24 to 40. Comparing Figure 9a,b, the difference between them is explained by the exponential increase in the number of new patent applications each year (as shown in Figure 3a). Recent patents are so numerous that they “push” the significant patents into the earlier age groups, resulting in the uneven distribution shown in Figure 9b.
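The grouping behind Figure 9b can be sketched as follows: patents are sorted by application date, cut into equally-sized groups, and the significant patents in each group are counted. The function name and the toy dates are illustrative, and the toy example uses 4 groups instead of 40.

```python
def count_by_age_group(app_dates, significant, n_groups=40):
    """Count significant patents per equally-sized age group.

    app_dates: {patent: application year}; group 1 (index 0) holds the
    oldest patents and the last group the most recent ones.
    """
    ordered = sorted(app_dates, key=app_dates.get)
    n = len(ordered)
    counts = [0] * n_groups
    for pos, patent in enumerate(ordered):
        group = pos * n_groups // n  # 0-based group index
        if patent in significant:
            counts[group] += 1
    return counts


# Toy data: most patents are recent, so equal-sized groups are dominated by them.
app_dates = {
    "p1": 1990, "p2": 2000, "p3": 2010, "p4": 2015,
    "p5": 2018, "p6": 2019, "p7": 2019, "p8": 2020,
}
significant = {"p2", "p3"}
counts = count_by_age_group(app_dates, significant, n_groups=4)
```

Even though one significant patent dates from 2010, it lands in the second-oldest group because half of the toy patents were filed in 2018 or later, which is the “push” effect described above.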
According to the above analysis, the strong temporal non-uniformity of the significant patents can have decisive consequences. First, it disadvantages the age-rescaled metrics, because the rescaling procedure strives for a uniform representation of all age groups among the top-ranked patents. In addition, for the dataset in this paper, patents from age groups 20 to 40 contribute only marginally to the IR and AP scores, since there are only a few significant patents among them. By contrast, the original non-rescaled metrics are generally biased towards older patents, and such metrics have an advantage when the given set of significant patents has the same bias towards older patents.
The age bias of the significant patents in the CNGP dataset is so strong that a simple ranking of nodes by age (following Xu et al. [40], we refer to this metric as AgeR; old nodes are placed at the top of the ranking) can outperform all other metrics. In this paper, we compare indegree (ID), PageRank (PR), and their rescaled counterparts with AgeR in identifying the significant patents. The metrics’ performance as a function of the significant patents’ age is shown in Figure 10.
Figure 10a shows the selected metrics’ IR performance as a function of the significant patents’ age. The rescaled metrics (R_ID and R_PR) perform better when the significant patents are younger than 3 years. When the significant patents’ age is from 4 to 10, ID and PR perform better. AgeR receives an IR score of zero when the significant patents are young; this follows from the mechanism of the metric, which simply puts older patents at the top of the ranking. For example, Figure 9a shows that significant patents from before 2000 are particularly rare, numbering only 23. However, as the network grows, the advantage of this metric emerges: significant patents begin to be identified continuously at age 9, and AgeR becomes the best metric after age 11. Moreover, if the significant patents are more than 13 years old, AgeR can identify 100% of them. This suggests that evaluating ranking metrics by how they rank the significant patents is of limited relevance: AgeR, a metric that completely ignores the actual influence of the patents, is ultimately able to outperform the other ranking metrics. In addition, Figure 10b shows the results for the AP evaluation indicator. The rescaled metrics perform better when the significant patents are younger than 3 years, after which the original metrics outperform the rescaled ones. Once the significant patents are more than 9 years old, AgeR becomes the best metric by a clear margin.
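The mechanism behind this result is easy to reproduce on toy data. In the sketch below, `age_rank` is a hypothetical stand-in for AgeR that orders patents by application year alone, and it is compared with a plain citation-count ranking on an invented set of significant patents that is biased towards old patents.

```python
def age_rank(app_year):
    """AgeR-style ranking: oldest patents first, ignoring citations entirely."""
    return sorted(app_year, key=app_year.get)


def top_hits(ranking, significant, k):
    """Number of significant patents among the top k ranked patents."""
    return sum(1 for p in ranking[:k] if p in significant)


# Toy data: the two significant patents are the oldest but are rarely cited.
app_year = {"p1": 1990, "p2": 1995, "p3": 2005, "p4": 2012, "p5": 2018, "p6": 2019}
citations = {"p1": 1, "p2": 0, "p3": 9, "p4": 7, "p5": 3, "p6": 0}
significant = {"p1", "p2"}

by_age = age_rank(app_year)
by_citations = sorted(citations, key=citations.get, reverse=True)

hits_age = top_hits(by_age, significant, k=2)
hits_citations = top_hits(by_citations, significant, k=2)
```

The age-only ranking finds both significant patents in its top two while the citation ranking finds none, even though the age-only ranking carries no information about impact.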
5.5. Penalizing Age-Biased Metrics
The above analysis shows that an age bias is implicit in the set of significant patents selected by experts. In this section, we apply an additional penalty for biased metrics when we use the
IR and
AP indicators to evaluate the performance of ranking metrics. We adopt the normalization method applied by Mariani et al. [
10], which imposes a penalty on metrics that are age-biased. A detailed description of the procedure can be found in Xu et al. [
40]. In this paper, we measure the
NIR and
NAP evaluation indicators both for the metrics’ performance on the complete dataset and for the metrics’ performance as a function of the age of the significant patents.
To define the
NIR of a metric, at each computation time
we divide the
patents into 40 groups according to their age. The
NIR of metric
m is defined as:
where the above variables are denoted the same as those described in Equation (7), and
is a decreasing function of the fraction
of patents that belong to the same age group of patent
i and are ranked among the top
z by metrics. Denoting by
the expected value of
for an unbiased ranking, which is defined as follows:
A similar operation is used for NAP.
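The normalization described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the equal-sized age groups, and in particular the penalty `min(1, expected / observed)` are hypothetical choices for "a decreasing function of the fraction of same-age patents ranked in the top z", not the exact definition used in this paper or in the cited works.

```python
def normalized_identification_rate(scores, ages, significant, z, n_groups=40):
    """Identification rate with a penalty for age-group over-representation.

    ASSUMPTION: the penalty min(1, expected / observed) is an illustrative
    decreasing function, not necessarily the one used in the paper.
    """
    n = len(scores)
    n_top = max(1, int(z * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])

    # Split patents into (nearly) equal-sized groups by age.
    by_age = sorted(range(n), key=lambda i: ages[i])
    group_of = {i: pos * n_groups // n for pos, i in enumerate(by_age)}

    group_size = [0] * n_groups
    group_top = [0] * n_groups
    for i in range(n):
        group_size[group_of[i]] += 1
        if i in top:
            group_top[group_of[i]] += 1

    # For an unbiased ranking, each age group places a share z in the top.
    expected = n_top / n
    total = 0.0
    for i in significant:
        if i in top:
            observed = group_top[group_of[i]] / group_size[group_of[i]]
            total += min(1.0, expected / observed)
    return total / len(significant)


# Toy data (hypothetical): ten patents, index 0 is the oldest.
ages = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
age_rank = ages                            # AgeR: age used as the score
spread = [5, 0, 0, 0, 0, 0, 0, 0, 4, 0]    # a score that is not tied to age
significant = [0, 8]

for name, scores in {"AgeR": age_rank, "spread": spread}.items():
    print(name, normalized_identification_rate(scores, ages, significant, 0.2, n_groups=5))
```

The default of 40 groups follows the text; the toy call uses 5 groups only because it contains ten patents. Because AgeR fills the top of the ranking from a single age group, each of its hits is discounted more heavily than hits from a ranking that draws from several age groups.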
Figure 11 shows the performance of the selected metrics, evaluated with the normalized method, as a function of the age of the significant patents. The results show that this method indeed solves the problem encountered with the ordinary evaluation indicators, such as the AgeR metric ultimately outperforming the other selected metrics in
Figure 10. When age-biased metrics are penalized,
AgeR becomes the worst metric regardless of the age of the significant patents, as should be the case for ranking metrics that ignore the actual impact of patents in the network. Across the ages of the significant patents, the rescaled metrics mostly perform better than the original metrics. This suggests that the normalized procedure weakens the mutually reinforcing link between age-biased ranking metrics and age-biased sets of significant patents.
The reason for applying the normalized procedure is that the expert-selected significant patents are just the tip of the iceberg of high-quality patents; other significant patents that were not identified by experts are inevitably ignored. In summary, the normalized method corresponds to the task of ranking the best patents for each age group, where the given significant patents are a potentially biased sample. We then assess the performance of the metrics using the normalized procedure on the complete heterogeneous networks, as shown in
Figure 12.
Figure 12 shows the ranking metrics evaluated by their
NIR and
NAP on the complete heterogeneous networks. We observe the following from
Figure 12a. (1) The rescaled metrics generally perform better than their original counterparts. (2) The
NIR scores are much lower than the previously reported
IR scores (see
Figure 6a). This is a direct effect of the penalty introduced by
NIR, which severely penalizes biased ranking metrics, while unbiased ranking metrics are not good at identifying the biased significant patents.
Figure 12b shows that
R_ID performs better than the other metrics. However, the
NAP scores of
PR,
LR, and
NGSC are no better than those of their rescaled counterparts. Comparing with the unnormalized results, we find that for the original metrics the
NAP values are generally lower than the
AP values, while for the rescaled metrics the
NAP values are much higher than the
AP values. A possible reason is that the rescaled metrics reduce the influence of age bias in our dataset.
The above analysis shows that combining rescaled metrics with the normalized evaluation procedure can effectively suppress both the age bias of patents and the age bias caused by the expert-selected significant patents, which is conducive to identifying significant patents at an earlier stage. From
Figure 12 we know that
R_ID performs better on both the normalized
IR and
AP evaluation indicators; hence we report the top 10 patents as ranked by
R_ID score in
Table 5.
Table 5 shows that the top 10 patents by
R_ID span a wider temporal range (2000–2019) than the top 10 by
ID (2000–2016) in
Table 3, which is a direct result of removing the age bias. This method can also identify significant patents earlier, thereby contributing to sustainable development.
6. Discussion
To verify the validity of the heterogeneous applicant-citation networks built in this paper, we also analyze the homogeneous patent citation network. This homogeneous information network contains only the citation relationships between patents and no applicant information. We use the same ranking metrics as described in
Section 4.2 to evaluate the complete network. The performance of the metrics in identifying the significant patents on the heterogeneous and homogeneous information networks is compared in
Table 6.
From
Table 6 we can conclude that the heterogeneous network performs better than the homogeneous network under the same conditions. For instance, the
IR scores of the heterogeneous network are at least 12% larger than the
IR scores of the homogeneous network. In addition, the conclusion remains the same after using the normalized method to penalize age bias. Therefore, adding applicant information to the patent citation network yields a heterogeneous information network that performs better than the homogeneous information network. A likely explanation is that influential applicants attract more attention in the network, so significant patents can be better identified by analyzing the heterogeneous information network.
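To make the effect of applicant information concrete, the following Python sketch compares PageRank on a toy citation network with and without applicant nodes. It assumes, as described for the network in this paper, that each applicant-patent relationship is an unweighted bidirectional link. The toy data, the function names, the damping factor of 0.85, and the uniform treatment of dangling nodes are illustrative assumptions, not the settings used in this paper.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out = {v: [] for v in nodes}
    for source, target in edges:
        out[source].append(target)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Nodes without out-links spread their score uniformly.
        dangling = sum(score[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out[v]:
                new[target] += damping * score[v] / len(out[v])
        score = new
    return score


def heterogeneous_edges(citations, applicant_of):
    """Add a bidirectional, unweighted link between each patent and its applicant."""
    edges = list(citations)
    for patent, applicant in applicant_of.items():
        edges.append((applicant, patent))
        edges.append((patent, applicant))
    return edges


# Toy data (hypothetical): (citing, cited) pairs and one applicant per patent.
citations = [("P2", "P1"), ("P3", "P1"), ("P4", "P1"), ("P5", "P1"), ("P5", "P3")]
applicant_of = {"P1": "A", "P2": "A", "P3": "B", "P4": "B", "P5": "C"}

homogeneous = pagerank(citations)
heterogeneous = pagerank(heterogeneous_edges(citations, applicant_of))
for patent in ("P2", "P4"):
    print(patent, round(homogeneous[patent], 4), round(heterogeneous[patent], 4))
```

In the citation-only network the uncited patents P2 and P4 are indistinguishable. Once applicants are added, P2 scores higher than P4 because it shares applicant A with the highly cited patent P1.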
Most of the previous literature identified significant patents by constructing patent citation networks. For example, Mariani et al. [
31] and Xu et al. [
40] analyzed US patents and found that the
NIR scores of rescaled_PageRank and rescaled_LeaderRank in the static network were about 38% in the top 1% of the ranking, whereas our best
NIR score is 15% in the top 5% of the ranking, a surprisingly large difference. We found two possible reasons: (1) citation disclosure is not compulsory for Chinese patents, so the constructed patent network is sparser than the US patent network and significant patents are harder to identify with centrality metrics; (2) we use more expert-selected significant patents than those studies, and the method of selecting them also differs. Since their dataset does not contain information on patent applicants, the model proposed in this paper cannot be used to construct a heterogeneous information network for it, so the performance on the Chinese and US patent datasets cannot be compared directly. In conclusion, when analyzing a problem, we need to consider not only the algorithm but also the impact of the original data on the results.
All the abbreviations and variables used in this paper are listed below.
7. Conclusions
In this study, based on the Chinese green patent dataset from 1985 to 2020, we construct a CNGP heterogeneous applicant-citation network to identify expert-selected significant patents earlier. We use the rescaled method to suppress the age bias in citation-based ranking metrics, and construct static and dynamic citation networks to analyze the impact of patent age more comprehensively. To understand the reasons for the poor model performance, we examine the source data in depth and find a strong age bias in the distribution of the expert-selected significant patents, so we use the normalized method to penalize the age bias in the evaluation indicators and obtain a reasonable performance evaluation. The experimental results show that the R_ID metric performs best and identifies significant patents earlier. In addition, compared with the patent citation network, the heterogeneous information network constructed by adding patent applicants improves the performance of identifying significant patents. Therefore, the analysis method in this paper not only evaluates patent quality reasonably but also identifies significant patents earlier, which provides scientists with a new method for measuring the importance of patents.
There are three major directions for extending this research. (1) When building heterogeneous applicant-citation networks, the citation layer excludes patents that have no citation relationship. These patents may be old patents or newly filed patents that form isolated nodes. Other literature generally takes the lack of citations to indicate that the quality of those patents is very low. Excluding them may bias the network analysis. Therefore, these patents need to be taken into account when building the heterogeneous information network in further research. (2) In the analysis of the heterogeneous applicant-citation networks, we transform each applicant-citation relationship into bidirectional links and treat them as simple citation links, and we analyze the network with unweighted links. However, the actual situation is likely more complicated; a set of weight distribution principles could be designed to calculate the weights of applicant-citation links and patent citation links. In addition, other ranking metrics could be used to analyze the network. (3)
WIPO divides green patents into seven categories, so we should consider both the age bias and the category bias of the analyzed ranking metrics, as done by Vaccario et al. [
64]. Since the number of patents varies greatly across categories (for example, waste management accounts for 25% and nuclear power generation for only 1%), it is important to add the green patent category bias to the analysis.