On the Potential of Taxonomic Graphs to Improve Applicability and Performance for the Classification of Biomedical Patents
Abstract
1. Introduction
2. Related Work
2.1. Official Patent Classification Systems
2.2. Automated Text Categorization using Patents
2.3. Ensemble Classification
2.4. Feature Extraction from Graphs
3. Materials and Methods
3.1. Patent Data
3.2. Feature Transformation
3.2.1. Textual Data
3.2.2. Taxonomic Data
3.2.3. Tree Creation
3.2.4. Vector Space Embedding
3.2.5. Prototype Selection Methods
3.3. Experimental Settings
3.3.1. Classifier Selection and Hyperparameter Tuning
3.3.2. Fusion Methods
3.3.3. Experimental Design
4. Results and Discussion
4.1. Basic Evaluation
4.2. Ensemble Evaluation
4.3. Boosting
4.4. Outlook
4.5. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Classifier | Optimized Hyperparameters | CV (BPR) | Test (BPR) |
---|---|---|---|
Feature: Text | |||
ANN | hidden_layer_sizes = (200,12), alpha = 1, max_iter = 10,000, n-gram = (1,1), norm = none, smooth_idf = true, sublinear_tf = true, use_idf = true | 84.1% (2) | 62.8% (3) |
kNN | k = 40, n-gram = (1,1), norm = L2, smooth_idf = false, sublinear_tf = true, use_idf = true | 79.3% (4) | 70.2% (1) |
LogReg * | c = 10, solver = saga, multi-class = multinomial, max_iter = 10,000, n-gram = (1,1), norm = none, smooth_idf = false, sublinear_tf = true, use_idf = true | 84.6% (1) | 60.6% (4) |
SVM ** | nu = 0.6, kernel = RBF, n-gram = (1,1), norm = none, smooth_idf = false, sublinear_tf = true, use_idf = true | 82.9% (3) | 69.1% (2) |
Feature: IPC Taxonomy | |||
ANN | hidden_layer_sizes = (100,12), alpha = 1, max_iter = 10,000, n_prototype = 200, prototype_selection = classwise_random | 79.8% (2) | 66.0% (2) |
kNN * | k = 5, n_prototype = 100, prototype_selection = classwise_spanning | 69.3% (4) | 54.3% (4) |
LogReg ** | c = 1, solver = saga, multi-class = multinomial, max_iter = 10,000, n_prototype = 200, prototype_selection = classwise_random | 80.1% (1) | 68.1% (1) |
SVM | nu = 0.2, kernel = RBF, n_prototype = 200, prototype_selection = classwise_random | 77.1% (3) | 64.9% (3) |
Feature: CPC Taxonomy | |||
ANN ** | hidden_layer_sizes = (50,12), alpha = 1, max_iter = 10,000, n_prototype = 200, prototype_selection = classwise_spanning | 74.4% (2) | 68.1% (1) |
kNN * | k = 5, n_prototype = 100, prototype_selection = classwise_spanning | 60.4% (4) | 53.2% (4) |
LogReg | c = 1, solver = saga, multi-class = multinomial, max_iter = 10,000, n_prototype = 100, prototype_selection = classwise_random | 75.1% (1) | 62.8% (3) |
SVM | nu = 0.2, kernel = RBF, n_prototype = 100, prototype_selection = classwise_spanning | 72.0% (3) | 64.9% (2) |
Taxonomy | Taxonomy BPR | Taxonomy Base Classifier | Fusion (Stacking) | F1 (Text BPRbottom: LogReg) | Hyperparameter (Text BPRbottom: LogReg) | F1 (Text BPRtop: SVM) | Hyperparameter (Text BPRtop: SVM) |
---|---|---|---|---|---|---|---|
IPC | BPRbottom | kNN | SVM | 0.846 | nu = 0.85 | 0.841 | nu = 0.9 |
IPC | BPRbottom | kNN | kNN | 0.843 | k = 50 | 0.841 | k = 20 |
IPC | BPRbottom | kNN | LogReg | 0.846 | c = 1.0 | 0.846 | c = 0.01 |
IPC | BPRbottom | kNN | ANN | 0.847 | alpha = 1, hl = (100,12) | 0.843 | alpha = 10, hl = (10,-) |
CPC | BPRbottom | kNN | SVM | 0.837 | nu = 0.85 | 0.831 | nu = 0.7 |
CPC | BPRbottom | kNN | kNN | 0.843 | k = 25 | 0.830 | k = 50 |
CPC | BPRbottom | kNN | LogReg | 0.838 | c = 0.01 | 0.828 | c = 0.1 |
CPC | BPRbottom | kNN | ANN | 0.841 | alpha = 10, hl = (50,-) | 0.831 | alpha = 0.1, hl = (10,12) |
IPC | BPRtop | LogReg | SVM | 0.859 | nu = 0.9 | 0.845 | nu = 0.25 |
IPC | BPRtop | LogReg | kNN | 0.859 | k = 50 | 0.854 | k = 30 |
IPC | BPRtop | LogReg | LogReg | 0.864 | c = 0.001 | 0.853 | c = 0.01 |
IPC | BPRtop | LogReg | ANN | 0.863 | alpha = 10, hl = (10,-) | 0.853 | alpha = 10, hl = (100,12) |
CPC | BPRtop | ANN | SVM | 0.844 | nu = 0.55 | 0.852 | nu = 0.25 |
CPC | BPRtop | ANN | kNN | 0.846 | k = 15 | 0.843 | k = 25 |
CPC | BPRtop | ANN | LogReg | 0.849 | c = 0.1 | 0.842 | c = 1.0 |
CPC | BPRtop | ANN | ANN | 0.850 | alpha = 0.1, hl = (10,-) | 0.844 | alpha = 10, hl = (10,-) |
Class Name | # of Patents (Training) | # of Patents (Testing) | # of CPC Codes (Training) | # of CPC Codes (Testing) | # of IPC Codes (Training) | # of IPC Codes (Testing) |
---|---|---|---|---|---|---|
Imaging | 205 | 30 | 1380 | 129 | 838 | 84 |
Implants and prostheses | 205 | 15 | 1659 | 78 | 732 | 33 |
Telemedicine | 205 | 2 | 1064 | 5 | 827 | 6 |
Surgical intervention | 205 | 23 | 1351 | 98 | 805 | 59 |
In-vitro diagnostics | 205 | 6 | 922 | 18 | 1175 | 18 |
Special therapy systems | 205 | 18 | 1190 | 67 | 1024 | 50 |
Total | 1230 | 94 | 7566 | 395 | 5401 | 250 |
Level | Symbol | Classification and Description |
---|---|---|
Section | A | Human necessities |
Class | A61 | Medical or veterinary science; hygiene |
Subclass | A61B | Diagnosis; surgery; identification |
Main Group | A61B 5/00 | Detecting, measuring or recording for diagnostic purposes; identification of persons |
Subgroup | (A61B 5/00 48) | • Detecting, measuring or recording by applying mechanical forces or stimuli |
Subgroup | A61B 5/00 55 | • • by applying suction |
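The hierarchy in the table above can be read directly off a classification symbol, which is one plausible starting point for building a taxonomic tree from a patent's codes. The following is a minimal sketch, not the authors' tree-creation procedure: `symbol_levels` and `root_path` are hypothetical helpers, and the sketch assumes symbols in the compact form `A61B 5/0055` (a spaced rendering such as the one in the table would need normalizing first). The dot-level nesting among subgroups is not encoded in the symbol itself and would have to come from the scheme definitions.

```python
import re

# Hypothetical pattern for compact IPC/CPC symbols such as "A61B 5/00":
# section letter, two-digit class, subclass letter, group "/" subgroup.
SYMBOL_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d+)/(\d+)$")


def symbol_levels(symbol: str) -> dict:
    """Split a classification symbol into its hierarchy levels."""
    match = SYMBOL_RE.match(symbol.strip())
    if match is None:
        raise ValueError(f"unrecognized symbol: {symbol!r}")
    section, cls, subclass, group, sub = match.groups()
    prefix = f"{section}{cls}{subclass}"
    levels = {
        "section": section,
        "class": section + cls,
        "subclass": prefix,
        "main_group": f"{prefix} {group}/00",
    }
    # Any suffix other than "00" denotes a subgroup below the main group.
    if sub != "00":
        levels["subgroup"] = f"{prefix} {group}/{sub}"
    return levels


def root_path(symbol: str) -> list:
    """Return the chain of ancestors from the section down to the symbol."""
    return list(symbol_levels(symbol).values())
```

Calling `root_path("A61B 5/00")` yields the first four rows of the table as a path; merging such paths for all codes of one patent would give a small tree per patent, under the assumptions stated above.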
Classifier | Hyperparameter | |
---|---|---|
Feature: Text | ||
SVM | ||
kNN | ||
LogReg | ||
ANN | ||
Feature: Taxonomy | ||
SVM | ||
kNN | ||
LogReg | ||
ANN | ||
Factor | Levels |
---|---|
Feature Source | Text; Taxonomy (IPC or CPC) |
Base Performance Ranking | Top; Bottom |
Fusion Method | Stacking (SVM, kNN, LogReg, ANN); Fixed Combination Rules (sum, product, min, max) |
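The four fixed combination rules listed above (sum, product, min, max) can be illustrated with a short sketch. It assumes that each base classifier, for example one trained on text and one on a taxonomy, outputs one support value per class, and that the fused decision is the class with the highest combined support. The function name `fuse` and the example numbers are hypothetical; any score normalization used in the study is not shown in this section.

```python
from math import prod

# Each rule reduces the supports that all classifiers give to one class
# into a single combined support value.
RULES = {
    "sum": sum,
    "product": prod,
    "min": min,
    "max": max,
}


def fuse(prob_vectors, rule):
    """Return the index of the class favored by a fixed combination rule.

    prob_vectors: one list of per-class supports per base classifier,
    all of the same length.
    """
    combine = RULES[rule]
    # zip(*...) regroups the inputs so that each tuple holds every
    # classifier's support for one class.
    scores = [combine(per_class) for per_class in zip(*prob_vectors)]
    # Ties are resolved in favor of the lowest class index.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical supports for three classes from two base classifiers.
text_probs = [0.7, 0.3, 0.0]
taxonomy_probs = [0.05, 0.5, 0.45]
decisions = {rule: fuse([text_probs, taxonomy_probs], rule) for rule in RULES}
```

In this toy case the max rule follows the single most confident classifier, while sum, product, and min favor the class both classifiers support moderately, which is why the rules can disagree. Stacking differs in that a trained classifier replaces the fixed rule.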
Taxonomy | Taxonomy BPR | Taxonomy Base Classifier | Taxonomy Base F1 | Stacking Fusion | Stacking F1 (Text BPRbottom: LogReg, F1 = 0.606) | Stacking F1 (Text BPRtop: SVM, F1 = 0.691) | Fixed Combination Rule | Fixed Combination F1 (Text BPRbottom: LogReg, F1 = 0.606) | Fixed Combination F1 (Text BPRtop: SVM, F1 = 0.691) |
---|---|---|---|---|---|---|---|---|---|
IPC | BPRbottom | kNN | 0.543 | SVM | 0.691 | 0.702 | sum | 0.681 | 0.702 |
IPC | BPRbottom | kNN | 0.543 | kNN | 0.670 | 0.681 | product | 0.660 | 0.713 |
IPC | BPRbottom | kNN | 0.543 | LogReg | 0.660 | 0.702 | min | 0.649 | 0.681 |
IPC | BPRbottom | kNN | 0.543 | ANN | 0.691 | 0.745 | max | 0.681 | 0.702 |
IPC | BPRtop | LogReg | 0.681 | SVM | 0.713 | 0.755 | sum | 0.734 | 0.745 |
IPC | BPRtop | LogReg | 0.681 | kNN | 0.745 | 0.745 | product | 0.734 | 0.745 |
IPC | BPRtop | LogReg | 0.681 | LogReg | 0.755 | 0.755 | min | 0.702 | 0.713 |
IPC | BPRtop | LogReg | 0.681 | ANN | 0.766 | 0.787 | max | 0.723 | 0.723 |
CPC | BPRbottom | kNN | 0.532 | SVM | 0.638 | 0.723 | sum | 0.681 | 0.702 |
CPC | BPRbottom | kNN | 0.532 | kNN | 0.691 | 0.713 | product | 0.660 | 0.681 |
CPC | BPRbottom | kNN | 0.532 | LogReg | 0.670 | 0.745 | min | 0.660 | 0.681 |
CPC | BPRbottom | kNN | 0.532 | ANN | 0.670 | 0.745 | max | 0.638 | 0.691 |
CPC | BPRtop | ANN | 0.681 | SVM | 0.681 | 0.766 | sum | 0.766 | 0.777 |
CPC | BPRtop | ANN | 0.681 | kNN | 0.766 | 0.745 | product | 0.734 | 0.777 |
CPC | BPRtop | ANN | 0.681 | LogReg | 0.745 | 0.777 | min | 0.713 | 0.745 |
CPC | BPRtop | ANN | 0.681 | ANN | 0.702 | 0.787 | max | 0.734 | 0.755 |
Taxonomy | Taxonomy BPR | avg-F1, Stacking (Text BPRbottom) | avg-F1, Stacking (Text BPRtop) | avg-F1, Fixed Combination (Text BPRbottom) | avg-F1, Fixed Combination (Text BPRtop) |
---|---|---|---|---|---|
IPC | BPRbottom | 0.678 | 0.708 | 0.668 | 0.700 |
IPC | BPRtop | 0.745 | 0.761 | 0.723 | 0.732 |
CPC | BPRbottom | 0.667 | 0.732 | 0.660 | 0.689 |
CPC | BPRtop | 0.724 | 0.769 | 0.737 | 0.764 |
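The avg-F1 values appear to be the unweighted mean of the four per-variant F1 scores in the fusion results table (four stacking classifiers or four fixed rules). The sketch below checks this reading for the IPC BPRbottom row only. The dictionary layout and key names are illustrative, the scores are copied from the fusion results table, and the averaging rule itself is an assumption that happens to reproduce the reported numbers.

```python
from statistics import mean

# F1 scores for the IPC BPRbottom taxonomy classifier (kNN), copied from
# the fusion results table. Key names are illustrative only.
stacking = {
    "text_bottom": [0.691, 0.670, 0.660, 0.691],  # SVM, kNN, LogReg, ANN
    "text_top": [0.702, 0.681, 0.702, 0.745],
}
fixed = {
    "text_bottom": [0.681, 0.660, 0.649, 0.681],  # sum, product, min, max
    "text_top": [0.702, 0.713, 0.681, 0.702],
}

# Assumed definition: avg-F1 is the plain mean over the four variants.
avg_f1 = {("stacking", key): mean(vals) for key, vals in stacking.items()}
avg_f1.update({("fixed", key): mean(vals) for key, vals in fixed.items()})

for (method, text_bpr), value in avg_f1.items():
    print(f"{method:8s} {text_bpr:12s} {value:.4f}")
```

Under that assumption the four means agree with the reported 0.678, 0.708, 0.668, and 0.700 to within rounding in the third decimal place.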
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Frerich, K.; Bukowski, M.; Geisler, S.; Farkas, R. On the Potential of Taxonomic Graphs to Improve Applicability and Performance for the Classification of Biomedical Patents. Appl. Sci. 2021, 11, 690. https://doi.org/10.3390/app11020690