Normalization of Web of Science Institution Names Based on Deep Learning
Abstract
1. Introduction
- Feature Extraction and Fusion: Institution data features fall into two main types: text features and semantic relationship features. Text features measure the literal similarity of institution names and are effective at identifying names that are close in literal form, but they perform poorly on institution aliases and abbreviations. Semantic relationship features instead analyze co-occurrence relationships and hierarchical similarity between institutions, which better identifies aliases and abbreviations but may incorrectly merge structurally similar yet distinct institutions. Current research typically extracts text features with techniques such as term frequency, TF-IDF, string similarity, and the longest common substring, extracts semantic features with deep learning models such as Word2Vec, and then combines the two through rules or weighted fusion (a minimal sketch of such a pipeline follows this list). This approach severs the association between text and semantic features and introduces uncertainty and subjectivity into feature combination and the allocation of fusion weights.
- Utilizing Multiple Contextual Information of Institution Entities: The current research often relies on single-context matching, where only the most similar context containing the institution entity is considered during the institution matching process. This approach fails to fully leverage the multiple contextual information that an institution may appear in, thereby limiting the recognition accuracy.
- Addressing the deficiencies in feature extraction and fusion for institution name standardization: This paper proposes an embedding layer that extracts and fuses both types of features within a unified model. Because correlations and dependencies may exist between the feature categories, extracting them jointly allows the model to share learned knowledge and representations, thereby improving generalization and effectiveness.
- Solving the issue of underutilizing multiple contextual information of institution entities: This paper introduces a method based on bidirectional matching and multi-context fusion. This approach effectively leverages the multiple contexts in which institution entities may appear. By considering and integrating information from different contexts, the model achieves a more comprehensive understanding of institution entities, leading to improved accuracy in recognition.
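To make the first limitation above concrete, the following minimal Python sketch reproduces the conventional two-stage pipeline criticized there: a hand-crafted text similarity and a Word2Vec-based semantic similarity are computed separately and then combined with a hand-picked fusion weight. The toy corpus, the 0.5/0.5 blend inside the text score, and the `w_text` weight are illustrative assumptions, not values from this paper; gensim ≥ 4 is assumed to be installed.

```python
# Minimal sketch of the conventional two-stage pipeline: hand-crafted text
# similarity plus Word2Vec semantic similarity, combined with fixed weights.
# The toy corpus, weights, and function names are illustrative assumptions.
from difflib import SequenceMatcher

import numpy as np
from gensim.models import Word2Vec  # assumes gensim >= 4.x is installed


def text_similarity(a: str, b: str) -> float:
    """Literal similarity: character-level ratio blended with word-level Jaccard."""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    wa, wb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(wa & wb) / len(wa | wb) if wa | wb else 0.0
    return 0.5 * char_sim + 0.5 * jaccard


def semantic_similarity(a: str, b: str, wv) -> float:
    """Semantic similarity: cosine between averaged Word2Vec token vectors."""
    def embed(name):
        vecs = [wv[t] for t in name.lower().split() if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)
    va, vb = embed(a), embed(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0


def fused_similarity(a: str, b: str, wv, w_text: float = 0.6) -> float:
    """Weighted fusion; w_text is set by hand, which is exactly the subjective
    step that a unified embedding layer is meant to avoid."""
    return w_text * text_similarity(a, b) + (1.0 - w_text) * semantic_similarity(a, b, wv)


if __name__ == "__main__":
    # Toy address corpus standing in for WoS address fields.
    corpus = [
        ["soochow", "univ", "suzhou", "jiangsu", "china"],
        ["suzhou", "univ", "suzhou", "jiangsu", "china"],
        ["soochow", "univ", "affiliated", "hosp", "suzhou", "china"],
    ]
    wv = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=100).wv
    print(fused_similarity("Soochow Univ", "Suzhou Univ", wv))
```

The hand-set `w_text` is precisely the subjective fusion step that the unified embedding layer proposed in this paper is intended to remove.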
2. Related Works
- String similarity-based methods: Common algorithms, such as the edit distance, Jaccard coefficient, and TF-IDF, are used to measure the similarity between institution names. The edit distance is the minimum number of edit operations (insertion, deletion, or substitution) required to transform one string into another. French [9] proposed the relative edit distance, which divides the edit distance by the minimum length of the two institution names to measure their similarity. To address syntactic variations in institution names, French also introduced the word-based edit distance, which splits institution names into words and computes the edit distance over approximately matching words (a sketch of both measures follows this list).
- Statistical-based methods: These methods leverage statistical characteristics of institution name occurrences, such as word frequency, co-occurrence relationships, and contextual features, to differentiate institutions. Onodera [22] assigned different weights to words based on their frequency and measured the similarity between two institution names by summing the weights of matching words. Jiang [16] proposed a clustering method using the Normalized Compression Distance (NCD) to match institution documents; the NCD uses data compression to measure the similarity between two texts, on the assumption that semantically similar texts compress together with high redundancy (see the NCD sketch after this list). Cuxac [23] addressed naming ambiguities, spelling errors, OCR errors, abbreviations, and omissions with two strategies: a Naive Bayes model when training data are available, and a semi-supervised approach combining soft clustering and Bayesian learning when no labeled resources exist.
- Rule-based methods: These methods construct rule libraries from features derived from institution names (e.g., string similarity, substrings, word length, word order, and institution type) and additional features from the literature data (e.g., country, city, postal code, and author names), and then merge matching institution names using feature-based rules. Huang [5] proposed a rule-based and edit distance-based approach for institution name standardization: they first constructed an institution–author table and used the author, country, postal code, and other features to find candidate institution name matches, then combined the Jaccard word similarity, substring matching, and the edit distance to identify institution name variants. Researchers from Bielefeld University developed over 50,000 pattern-matching rules utilizing features such as the institution name, start and end dates, URL, postal code, sectors (name, URL, and sub-classification), and relationships between institutions to disambiguate author addresses in WOS and Scopus.
- Entity linking-based methods: These methods resolve ambiguity by linking institution names in the literature to corresponding institutions in knowledge bases. Shao [20] proposed the ELAD framework, which utilizes knowledge graphs for entity linking, generating a candidate set of institution entities, and then selecting the most probable institution entity based on string similarity. Wang [19] introduced a framework that utilizes open data resources to assist institution name standardization and attribute enrichment. It involves normalizing institution names and enriching attributes using open data resources, constructing a data linking model for multidimensional attribute alignment, and proposing a dynamic management approach for open data.
- Deep learning-based methods: These methods utilize word embedding models to obtain distributed vectors containing rich semantic information from raw data. These vectors are then used in subsequent deep learning models or for vector similarity comparison. Sun [24] applied the Word2Vec word embedding model to semantically learn the SCI address field and disambiguate institution names based on the similarity of institution word vectors. Chen et al. [21] utilized the GloVe model to learn institution vector representations and applied DBSCAN clustering to institution names based on vector similarity and matching rules.
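As referenced in the string-similarity item above, the sketch below illustrates the relative edit distance and the word-based edit distance attributed to French [9]. The dynamic-programming Levenshtein routine is standard; the 0.8 threshold used to decide when two words match approximately is an assumption for illustration, not a value taken from [9].

```python
# Sketch of the string-similarity measures attributed to French [9]: the
# relative edit distance, and a word-based edit distance that treats
# approximately matching words as equal. The 0.8 threshold is illustrative.
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def relative_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter name."""
    return edit_distance(a, b) / min(len(a), len(b))


def words_match(wa: str, wb: str, threshold: float = 0.8) -> bool:
    """Approximate word match: normalized similarity above a threshold."""
    longest = max(len(wa), len(wb))
    return longest > 0 and 1 - edit_distance(wa, wb) / longest >= threshold


def word_based_edit_distance(a: str, b: str) -> int:
    """Edit distance over word tokens, where 'equal' means an approximate match."""
    ta, tb = a.lower().split(), b.lower().split()
    prev = list(range(len(tb) + 1))
    for i, wa in enumerate(ta, 1):
        cur = [i]
        for j, wb in enumerate(tb, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (not words_match(wa, wb))))
        prev = cur
    return prev[-1]


print(relative_edit_distance("soochow univ", "suzhou univ"))
print(word_based_edit_distance("Univ Soochow", "Soochow Univers"))
```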
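The next sketch illustrates the Normalized Compression Distance used by Jiang [16] to match institution documents. The formula is the standard NCD; the choice of zlib as the compressor and the example affiliation strings are assumptions made here for illustration.

```python
# Sketch of the Normalized Compression Distance used by Jiang [16]. zlib is the
# compressor chosen purely for illustration; the original work may have used a
# different compression scheme.
import zlib


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, level=9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); lower is more similar."""
    cx = compressed_size(x.encode("utf-8"))
    cy = compressed_size(y.encode("utf-8"))
    cxy = compressed_size((x + y).encode("utf-8"))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Similar affiliation strings compress well together and yield a smaller distance.
doc_a = "Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Jiangsu, Peoples R China"
doc_b = "Suzhou Univ, Sch Comp Sci & Technol, Suzhou, Jiangsu, Peoples R China"
doc_c = "Univ Helsinki, Dept Phys, Helsinki, Finland"
print(ncd(doc_a, doc_b), ncd(doc_a, doc_c))
```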
3. Institutional Synonym Recognition Model
3.1. Overview of the Proposed Model
3.2. Address Retriever
3.3. Multi-Granularity Feature Embedding Layer
3.4. The Multi-Context Fusion Layer Based on Bidirectional Matching
3.5. Training Objectives
4. Experiments
4.1. Evaluation Metrics
4.2. Datasets
4.3. Baselines
- Huang’s Method [5]: This method is considered representative in rule-based institution synonym recognition due to its emphasis on knowledge and rule completeness and generality. In the following sections, we refer to this method as “Huang’s method” for simplicity.
- Word2vec [28]: This method is commonly used in deep learning-based institution synonym recognition and serves as a baseline model in our comparison.
- SRN [29]: SRN is a character-level model that encodes an entity as a sequence of characters with a BiLSTM; the hidden states are averaged to obtain the entity representation, and cosine similarity is used in the training objective (a minimal sketch of such a Siamese encoder follows this list).
- MaLSTM [30]: MaLSTM is a word-level model that takes word sequences as input. Unlike SRN, which uses a BiLSTM, MaLSTM employs a unidirectional LSTM and measures the distance between two entity representations with the Manhattan (L1) norm.
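For readers unfamiliar with the Siamese baselines, the following PyTorch sketch shows an SRN-style architecture as described above: a character-level BiLSTM encodes each institution name, the hidden states are mean-pooled, and cosine similarity scores the pair. The character vocabulary, embedding and hidden sizes, and maximum length are illustrative assumptions and do not reproduce the exact configuration used in the experiments.

```python
# Minimal PyTorch sketch of an SRN-style Siamese baseline [29]: a character-level
# BiLSTM encodes each name, hidden states are mean-pooled, and cosine similarity
# scores the pair. All dimensions and the character vocabulary are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789 &-.,"
CHAR2ID = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding


def encode_chars(name: str, max_len: int = 60) -> torch.Tensor:
    ids = [CHAR2ID.get(c, 0) for c in name.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids, dtype=torch.long)


class SiameseCharBiLSTM(nn.Module):
    def __init__(self, vocab_size: int, char_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)

    def encode(self, char_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.bilstm(self.embed(char_ids))  # (batch, len, 2 * hidden)
        return states.mean(dim=1)                      # mean-pool over characters

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return F.cosine_similarity(self.encode(a), self.encode(b))


if __name__ == "__main__":
    model = SiameseCharBiLSTM(vocab_size=len(CHARS) + 1)
    a = encode_chars("Soochow Univ").unsqueeze(0)
    b = encode_chars("Suzhou Univ").unsqueeze(0)
    # Untrained output; in training, a contrastive (Siamese) loss with a margin
    # would pull synonym pairs together and push non-synonym pairs apart.
    print(model(a, b).item())
```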
4.4. Results
4.4.1. Results Analysis
- Dependency on a single context: These models rely solely on single-context information, which makes them susceptible to absorbing excessive noise during the learning process and limits their ability to fully utilize additional information provided by other relevant contexts. This approach struggles to effectively differentiate between complex scenarios with multiple similar institution names.
- Emphasis on sentence encoding: These models use the encoding of the entire sentence as the final embedding, without specifically highlighting the institution entity itself. For institution synonym recognition, the focus should be on the encoding of the institution entity rather than generic information from the whole sentence (the sketch after this list contrasts the two choices).
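The following sketch is purely illustrative and is not the architecture proposed in this paper: it contrasts whole-sentence pooling with pooling restricted to the entity mention span, and shows one simple way to fuse a mention representation across several contexts. The random tensors stand in for the output of any sequence encoder, and averaging is used for fusion only for brevity.

```python
# Illustrative contrast between (1) pooling encoder states over the entity
# mention span rather than the whole sentence, and (2) fusing the mention
# representation across multiple contexts. Not the paper's exact architecture.
import torch


def entity_representation(hidden: torch.Tensor, span: tuple[int, int]) -> torch.Tensor:
    """Mean-pool only the tokens of the institution mention (span is [start, end))."""
    start, end = span
    return hidden[start:end].mean(dim=0)


def sentence_representation(hidden: torch.Tensor) -> torch.Tensor:
    """Whole-sentence pooling, which dilutes the mention with generic context."""
    return hidden.mean(dim=0)


def fuse_contexts(context_hidden: list[torch.Tensor], spans: list[tuple[int, int]]) -> torch.Tensor:
    """Average the mention representation over every context it appears in."""
    reps = [entity_representation(h, s) for h, s in zip(context_hidden, spans)]
    return torch.stack(reps).mean(dim=0)


# Three contexts mentioning the same institution, each encoded into (seq_len, dim).
contexts = [torch.randn(12, 128), torch.randn(20, 128), torch.randn(8, 128)]
mention_spans = [(0, 2), (5, 7), (3, 5)]
fused = fuse_contexts(contexts, mention_spans)
print(fused.shape)  # torch.Size([128])
```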
4.4.2. Error Analysis
4.4.3. Hyperparameters
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Zeini, N.T.; Okasha, A.E.; Soliman, A.S. A review on social segregation research: Insights from bibliometric analysis. Kybernetes 2023, 1, 1–8.
2. Kaur, A.; Bhatia, M. Scientometric Analysis of Smart Learning. IEEE Trans. Eng. Manag. 2021, 71, 400–413.
3. Dodevska, Z. An Expanded Bibliometric Study of Articles on Emerging Markets. Management 2022, 29, 11–20.
4. Cervi, L.; Simelio, N.; Tejedor Calvo, S. Analysis of Journalism and Communication Studies in Europe’s Top Ranked Universities: Competencies, Aims and Courses. J. Pract. 2021, 15, 1033–1053.
5. Huang, S.; Yang, B.; Yan, S.; Rousseau, R. Institution name disambiguation for research assessment. Scientometrics 2014, 99, 823–838.
6. Zhang, S.; Xinhua, E.; Pan, T. A multi-level author name disambiguation algorithm. IEEE Access 2019, 7, 104250–104257.
7. Ding, X.; Zhang, H.; Guo, X. An unsupervised framework for author-paper linking in bibliographic retrieval system. In Proceedings of the 2018 14th International Conference on Semantics, Knowledge and Grids (SKG), Guangzhou, China, 12–14 September 2018; pp. 152–159.
8. Falahati Qadimi Fumani, M.R.; Goltaji, M.; Parto, P. Inconsistent transliteration of Iranian university names: A hazard to Iran’s ranking in ISI Web of Science. Scientometrics 2013, 95, 371–384.
9. French, J.C.; Powell, A.L.; Schulman, E. Using clustering strategies for creating authority files. J. Am. Soc. Inf. Sci. 2000, 51, 774–786.
10. French, J.C.; Powell, A.L.; Schulman, E.; Pfaltz, J.L. Automating the construction of authority files in digital libraries: A case study. In Proceedings of the Research and Advanced Technology for Digital Libraries: First European Conference, ECDL’97, Pisa, Italy, 1–3 September 1997; Springer: Berlin/Heidelberg, Germany, 1997; pp. 55–71.
11. Jonnalagadda, S.; Topham, P. NEMO: Extraction and normalization of organization names from PubMed affiliation strings. J. Biomed. Discov. Collab. 2010, 5, 50.
12. Backes, T.; Hienert, D.; Dietze, S. Towards hierarchical affiliation resolution: Framework, baselines, dataset. Int. J. Digit. Libr. 2022, 23, 267–288.
13. Backes, T.; Dietze, S. Connected Components for Scaling Partial-order Blocking to Billion Entities. ACM J. Data Inf. Qual. 2024, 16, 1–29.
14. Jacob, F.; Javed, F.; Zhao, M.; Mcnair, M. sCooL: A system for academic institution name normalization. In Proceedings of the 2014 International Conference on Collaboration Technologies and Systems (CTS), Minneapolis, MN, USA, 19–23 May 2014; pp. 86–93.
15. Kronman, U.; Gunnarsson, M.; Karlsson, S. The Bibliometric Database at the Swedish Research Council-Contents, Methods and Indicators; Swedish Research Council: Stockholm, Sweden, 2010.
16. Jiang, Y.; Zheng, H.T.; Wang, X.; Lu, B.; Wu, K. Affiliation disambiguation for constructing semantic digital libraries. J. Am. Soc. Inf. Sci. Technol. 2011, 62, 1029–1041.
17. Abramo, G.; Cicero, T.; D’Angelo, C.A. A field-standardized application of DEA to national-scale research assessment of universities. J. Inf. 2011, 5, 618–628.
18. Huang, Y.; Li, J.; Sun, T.; Xian, G. Institution information specification and correlation based on institutional PIDs and IND tool. Scientometrics 2020, 122, 381–396.
19. Wang, L.; Hu, J.; Wang, Q.; Yang, Y.; Lou, P.; Fang, A. Big Open Data Aided Institutions’ Name Normalization and Attribute Enrichment. In Proceedings of the 2022 3rd Information Communication Technologies Conference (ICTC), Nanjing, China, 6–8 May 2022; pp. 173–177.
20. Shao, Z.; Cao, X.; Yuan, S.; Wang, Y. ELAD: An entity linking based affiliation disambiguation framework. IEEE Access 2020, 8, 70519–70526.
21. Chen, Y.; Li, X.; Li, A.; Li, Y.; Yang, X.; Lin, Z.; Yu, S.; Tang, X. A Deep Learning Model for the Normalization of Institution Names by Multisource Literature Feature Fusion: Algorithm Development Study. JMIR Form. Res. 2023, 7, e47434.
22. Onodera, N.; Iwasawa, M.; Midorikawa, N.; Yoshikane, F.; Amano, K.; Ootani, Y.; Kodama, T.; Kiyama, Y.; Tsunoda, H.; Yamazaki, S. A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search. J. Am. Soc. Inf. Sci. Technol. 2011, 62, 677–690.
23. Cuxac, P.; Lamirel, J.C.; Bonvallot, V. Efficient supervised and semi-supervised approaches for affiliations disambiguation. Scientometrics 2013, 97, 47–58.
24. Sun, Y. Research on SCI Address Field Data Cleaning Method Based on Word2Vec. J. Intell. 2019, 38, 195–200.
25. Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A. Character-aware neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
26. Zhang, C.; Li, Y.; Du, N.; Fan, W.; Yu, P.S. Entity synonym discovery via multipiece bilateral context matching. arXiv 2018, arXiv:1901.00056.
27. Zhang, J.; Cao, Y.; Hou, L.; Li, J.Z.; Zheng, H. XLink: An Unsupervised Bilingual Entity Linking System. In Proceedings of the China National Conference on Chinese Computational Linguistics, Nanjing, China, 13–15 October 2017.
28. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
29. Neculoiu, P.; Versteegh, M.; Rotaru, M. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, 11 August 2016; pp. 148–157.
30. Mueller, J.; Thyagarajan, A. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
| Official Name | Variant Name | Description |
|---|---|---|
| Soochow Univ | Suzhou Univ | alias |
| | Soochow Univers | word abbreviation |
| | SUDA | acronym |
| | SoChow Univ | spelling errors |
| | SooChow | missing agency identifiers |
| | SooChow Univ | uppercase/lowercase variation |
| | University Soochow | syntactic arrangements |
| | Soochow Univ affiliated Hosp | nested entities |
| Method | Advantages | Disadvantages |
|---|---|---|
| String similarity-based | Simple and easy to implement; effective in handling spelling errors or minor variations. | Less effective for semantically similar but structurally different names, or for distinct institutions with high literal similarity. |
| Statistical-based | Can adapt to various fields and languages of bibliographic data because they rely on contextual information and statistical features. | Depend on the selection and construction of statistical features, require substantial data support, and may be affected by data quality. |
| Rule-based | Simple and intuitive; effective in handling obvious cases. | Rule formulation can be complex and requires human involvement; less effective in handling complex cases. |
| Entity linking-based | Can leverage rich information in knowledge bases; effective in handling complex cases; enables automated construction of institution standard files. | Require high-quality knowledge base support; less effective for new institutions not present in the knowledge base. |
| Deep learning-based | Can automatically learn, extract, and combine features; effective in handling complex cases. | Require a significant amount of annotated data; training and fine-tuning the models can be complex. |
| Count | Anchor Sample | Positive Sample | Negative Sample |
|---|---|---|---|
| Entity | 1400 | 1400 | 1400 |
| Context | 7,586,714 | 861,632 | 3,577,596 |
| Vocab | 217,436 | 58,129 | 108,021 |
| Methods | Precision | Recall | F1 |
|---|---|---|---|
| Huang’s method | 69.53 | 77.20 | 73.17 |
| Word2vec | 56.34 | 92.20 | 69.95 |
| SRN | 46.07 | 60.29 | 52.23 |
| MaLSTM | 57.72 | 52.20 | 54.82 |
| Ours | 77.55 | 88.37 | 82.61 |
| No Highway | 73.88 | 89.92 | 81.12 |
| No Char-CNN | 74.65 | 82.17 | 78.23 |
| No Word2vec | 55.67 | 88.60 | 68.07 |
| No bidirectional matching | 64.60 | 80.62 | 71.72 |
| Error Type | Reason | Improvement Direction |
|---|---|---|
| Different institutions that are similar both literally and semantically. | The two institutions Universtil Ben Turin and Universtil Ben Turku have similar names and share a large number of identical sub-institutions, such as Compters, Czes, Depatment, and Deputklinxczens. | Introduce better features, such as author name features and geographic attribute features. |
| Ambiguous institution aliases in authority files. | The alias of the "Pirbright Institute" is "Institute Animal Health", which is identical to the aliases of several secondary institutions, causing the model to mismatch. | Construct hierarchical relationships and identify synonymous institutions from top to bottom to avoid identical or similar aliases. |
| Unreasonable choice of institutional context. | The contextual selection method is imperfect, which degrades model matching. | Further refine the contextual selection method. |
| Hyperparameter | Value |
|---|---|
| epoch | 10 |
| lstm_size | 512 |
| word_size | 200 |
| context_number | |
| context length | |
| filter_size | |
| filter_amount | |
| m (margin) | |
| optimizer | |
| loss function | Siamese loss |
| batch size | 2 |
| learning rate | |