Article

Research on the Automatic Subject-Indexing Method of Academic Papers Based on Climate Change Domain Ontology

Chinese Academy of Sciences, Northwest Institute of Eco-Environment and Resources, Lanzhou 730000, China
*
Author to whom correspondence should be addressed.
Sustainability 2023, 15(5), 3919; https://doi.org/10.3390/su15053919
Submission received: 29 December 2022 / Revised: 3 February 2023 / Accepted: 20 February 2023 / Published: 21 February 2023

Abstract

Classifying academic papers in a fine-grained manner uncovers the deeper implicit themes and semantics of papers, supporting better semantic retrieval, paper recommendation, research trend prediction, topic analysis, and a series of other functions. Based on an ontology of the climate change domain, this study used an unsupervised approach that combines two methods, syntactic structure and semantic modeling, to build a framework of subject-indexing techniques for academic papers in the climate change domain. Taking the titles, abstracts, and keywords of papers as input, the framework automatically indexes a set of conceptual terms from the domain ontology as research topics, using natural language processing techniques such as syntactic dependencies, text similarity calculation, pre-trained language models, and semantic similarity calculation, together with weighting factors such as word frequency statistics and graph path calculation. Finally, we evaluated the proposed method against a gold standard of manually annotated articles and demonstrated significant improvements over five alternative methods in terms of precision, recall, and F1-score. Overall, the proposed method identifies the research topics of academic papers more accurately and provides useful references for the application of domain ontologies and unsupervised data annotation.

1. Introduction

In recent years, the rapid growth in the number of academic studies and the continuous refinement of scientific research fields have raised the requirements for the accuracy of automatic subject indexing of scientific and technological literature [1,2]. Automatic indexing of literature topics is an effective means of organizing digital resources: indexing quality directly affects the quality and utilization of digital resources, and it is a key problem to be solved in knowledge-based services, with important research significance and high practical value [3,4].
Fine-grained indexing of academic resources by means of research themes offers numerous benefits, including the enrichment of academic resource metadata, increased accuracy of semantic retrieval, and more precise content recommendations. For the general public, fine-grained theme indexing allows for a quicker comprehension of a paper’s content and a clearer understanding of its key information, facilitating its evaluation and comprehension. For researchers, this type of indexing facilitates the speedy extraction of crucial information from a paper, improves the evaluation of its quality, and enhances the efficiency of research work. It also helps researchers easily reference relevant information in the course of their research, thereby improving the readability and credibility of the research paper. Most scholars use topic detection methods. Blei et al. (2003) developed the latent Dirichlet allocation (LDA) algorithm, which can identify topics hidden in large amounts of unstructured text [5]. LDA topic modeling has been widely used in research topic extraction, research trend analysis, text classification, and other tasks [6,7,8]. However, LDA tends to produce noisy and difficult-to-interpret results, and specifying its parameters is also a significant problem. Other state-of-the-art approaches classify papers in a top-down fashion, taking advantage of pre-existing categories from domain vocabularies such as Medical Subject Headings (MeSH) and Library of Congress Subject Headings (LCSH) [9,10]. These methods rely on a set of manually predefined research subject vocabularies and use trained machine learning classifiers to turn the labels of the nearest neighbors into predicted terms. Note that such machine learning approaches rely on a large amount of labeled data to predict new abstracts.
To address these limitations, this study proposes a method for fine-grained topic classification of academic papers based on a climate change ontology. Ontology is an ideal choice for network information organization and retrieval: the purpose of information organization is to facilitate information retrieval and utilization, and the introduction of ontology promotes the transformation from information organization to knowledge organization, enabling better information retrieval and utilization services. “A Handbook of Climate Change Domain Ontology” is a large-scale and fine-grained research domain ontology which systematically organizes the knowledge system and domain terminology of the climate change domain, including 2028 research topics and 2386 semantic relationships after our supplementation [11]. On this basis, we processed it electronically to form a climate change ontology (CCO) that can be easily annotated and utilized.
In this paper, subject indexing of academic papers in the field of climate change using CCO is a new approach. Since the CCO is not yet routinely used by researchers, it is not possible to adopt supervised machine learning algorithms that would require a good number of examples for all the relevant categories. For this reason, we focus instead on unsupervised solutions.
The present study takes the title, abstract and keywords of CCO and academic papers as input, and returns the concept words extracted from the CCO as the research topic of the paper through three steps of word similarity calculation, semantic similarity calculation and ranking model calculation. Finally, we evaluate the results produced by our algorithm on human-annotated research papers and demonstrate a significant improvement over alternative methods.
The major contributions of this paper are as follows:
  • Create an ontology in the field of climate change. We manually digitized and formatted “A Handbook of Climate Change Domain Ontology”, and supplemented its different levels according to emerging research directions, climate change assessment reports, and online public datasets in recent years.
  • Construct a dataset of academic papers in the field of climate change. We collected 306,211 pieces of data from WOS, and formed a standard climate change field database after data cleaning, filtering and sorting.
  • A set of algorithmic process frameworks is proposed, which uses the CCO to label academic papers with fine-grained topics in the climate change dataset, and the results are verified.

2. Literature Review

2.1. Automatic Subject Indexing

Topic modeling is a family of statistical approaches for discovering the topics that occur in a collection of documents. One of the most acclaimed approaches is latent Dirichlet allocation (LDA), developed by Blei et al. (2003) [5]. This technique can be used as an unsupervised model for extracting underlying or hidden topics from large amounts of unstructured text. Since its introduction, LDA topic modeling has been widely used in many fields such as software engineering, political science, medicine, and language science [12]. Jacobi et al. (2016) proposed the use of LDA topic modeling for the quantitative analysis of large amounts of journalistic text [13]. Fang et al. (2018) used LDA models to identify research themes and trend evolution from a large number of research abstracts [6]. Qiang et al. (2022) proposed a short-text topic modeling technique that analyzes short texts using a Dirichlet multinomial mixture-based approach [14]. Other approaches in the topic modeling category are latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA) [15,16]. The advantage of these methods is that, being unsupervised, they require no training data. However, the resulting topics usually require validation by domain experts and contain a large amount of noise. As a result, the number of topics is small and the classification is not fine-grained enough.
The second category of approaches aims to develop a multi-classification model in which each class deals with a research topic. Mai et al. (2018) propose a method for topic classification based on deep learning techniques and apply it to large-scale datasets in the fields of medicine (PubMed) and economics (EconBiz) [9]. Wartena and Franke-Maier (2018) studied the possibility of assigning LCSH automatically by training classifiers for terms used frequently in a large collection of abstracts of the literature on hand and by extracting headings from those abstracts by combining both methods [17]. Kazi et al. (2021) use web-scraping techniques to extract keywords from article sets in the Literature Analysis and Metrics Portal, and map these keywords to biologically relevant LCSH names to develop a gold standard dataset that demonstrates the feasibility of this approach for predicting LCSH of scholarly articles [10].
Another approach to literature classification uses citation networks, most of which are based on the principle of clustering scientific literature through co-citation analysis. Boyack and Klavans (2014) used co-citation technology to create an article-level scientific model and map to address planning-related problems such as the identification of emerging topics and determining which areas of science and technology are innovative and which are simply persisting [18]. Shiau et al. (2017) used the methods of co-citation analysis and cluster analysis to analyze the literature on social network-related topics published between January 1996 and December 2014, and verified seven factors about social networks [19]. Hou et al. (2018) analyzed emerging trends and new developments that emerged from 7574 articles published in 10 IS journals between 2009 and 2016 using a document co-citation approach [20]. The main drawback of the citation-based approach is that only one topic can be assigned to each document, and the documents are rarely monolithic.

2.2. Climate Change Ontology

Domain ontology can establish the knowledge structure of a specific field, make information knowledgeable and practical, and finally provide accurate and reliable knowledge and information for researchers, decision makers and general audiences. Climate change is a complex phenomenon, and it is impossible to fully define it using a standard or a single ontology, which needs to be formulated using multiple methods from different disciplines [21]. Over the years, researchers have developed ontologies covering different aspects of climate change. Chang et al. (2005) proposed a climate change seed ontology that uses data mining techniques to extend and refine the ontology system semi-automatically [22]. Kontopoulos et al. (2018) provide a more comprehensive lightweight ontology that focuses on climate crisis management and provides decision support [23]. Pileggi and Lamia (2020) developed a knowledge base of climate-change-related facts organized chronologically using their CCTL ontology [24]. Fonou et al. (2021) built a knowledge-based ontology for the Climate Smart Agriculture domain with a logic-based ontology representation [25].
In addition, Brugger and Crimmins (2013) use ontology to model the relationship between climate change adaptation and everyday life, particularly in a rural setting in the southwestern United States [26]. Kontopoulos et al. (2016) developed an ontology-based decision support tool to facilitate the use of domestic solar water-heating systems in buildings [27]. Bonacin et al. (2016) used ontologies to study the impact of agriculture and climate change on water resources [28]. However, ontologies related to climate change are mostly used in daily life and industry applications, and how to carry out subject indexing on climate change literature is seldom studied.

2.3. The Application of Automatic Subject Indexing in Climate Change Research

In practical applications, topic indexing is similar to topic extraction, text classification, and other tasks: the goal is to label an article with words or phrases that express its research topics. Dahal et al. (2019) used data analysis and text-mining techniques, such as topic modeling and sentiment analysis, on a large dataset of geotagged tweets containing certain keywords related to climate change to compare and contrast the nature of climate change discussions between countries and over time [29]. Li et al. (2020) analyzed bibliometrics, co-word biclustering, and strategy diagrams based on two decades of PubMed data to assess global scientific production, hot spots, and trends regarding climate change and infectious diseases [30]. Zeng (2022) studied the Chinese public’s discussion about climate change on the social media platform Weibo during the last six years through data mining and text analysis [31]. The analyses include volume analysis, keyword extraction, topic modeling, and sentiment analysis. In addition, Coro et al. (2016) used the AquaMaps model to automatically classify the effects of climate change on the distribution of marine species in 2050 [32]. Piaser and Villa (2022) compared machine learning techniques for aquatic vegetation classification using Sentinel-2 data [33].

3. Climate Change Ontology

In this paper, we use CCO for experiments and validation due to the accumulated data and modeling methods in the climate change domain. The ontology from “A Handbook of Climate Change Domain Ontology” is authoritative and provides a sufficient number of topics and hierarchical structure for fine-grained topic representation of academic papers. Future work will involve experiments with data from other domains to test the transferability of our findings.

3.1. Source of Ontology

This study uses the climate change domain ontology from the second edition of “A Handbook of Climate Change Domain Ontology”, edited by Wang Qinglin and Zhang Jiutian and published by the Beijing Institute of Technology Press. The book is based on the IPCC research report and other authoritative sources, and uses open data from the internet, such as encyclopedias, websites, news, and scientific papers, as the corpus for term and relationship analysis [11]. The handbook contains 1907 climate change-related terms, including superordinate, subordinate, antonym, and correlative terms, organized in a tree structure. The root node is “climate change,” which is divided into five categories: the physical science basis; impact, adaptation and vulnerability; mitigation of climate change; negotiations on climate change; capacity building.
To facilitate data transfer and computation in later experiments, we processed the book to create two versions of formatted CCO data in both Chinese and English, comparing the Chinese and English domain knowledge structures. We used the ontology editing and knowledge acquisition software Protégé 5.2.0 for storage and visualization of the ontology [34]. Protégé uses the OWL language for semantic description, which is user-friendly and supports efficient conceptual reasoning. The OWL format is also easily convertible to other formats [35,36]. Figure 1 shows a partial structure of the CCO stored in Protégé.

3.2. Ontology Extension

Since the publication of the second edition of “A Handbook of Climate Change Domain Ontology” in 2015, new conceptual terms have emerged in the field of climate change. In addition, related work carried out earlier in the field, such as the “Climate Change Scientific Data Knowledge Integration and Sharing Platform” (http://dakcc.llas.ac.cn, accessed on 13 July 2022), the “Integrated Service Portal in Climate Change Domain” (http://gcip.llas.ac.cn, accessed on 20 August 2022), the “Climate Change Disaster Management and Service System” (http://gcip.llas.ac.cn/disaster, accessed on 15 July 2022), and the “Encyclopedia-based Knowledge Map of Climate Change”, has accumulated a significant amount of public news data, policy data, and scientific data in the field. From these sources, domain terms that do not exist in the handbook were selected for supplementation, e.g., carbon peaking, ecological anxiety, climate refugees, net zero emissions, net negative emissions, hurricanes, tornadoes, and avalanches. Specifically, we added 21 new terms from the 2021 IPCC report (https://www.ipcc.ch/report/ar6/wg3/downloads/report/IPCC_AR6_WGIII_Full_Report.pdf, accessed on 25 May 2022), 36 terms from the public dataset in OpenKG (http://openkg.cn/, accessed on 4 May 2022), and 64 terms from our self-built database, resulting in a total of 2028 conceptual terms. This updated ontology provides a more accurate semantic representation of recent academic papers in the field of climate change [37].

4. Ontology-Based Automatic Indexing of Topics

Ontology indexing is a technique that maps conceptual terms in an ontology to academic resources such as academic papers. For academic papers, the title, abstract, and keywords are usually used as the input, and a set of related conceptual terms extracted from the ontology is returned as the output. As the CCO covers thousands of research topics, there may be further updates and extensions in the future. A supervised machine learning algorithm would require a large amount of data annotation, which is labor-intensive and costly, and each time a new topic is added, the data would need to be re-labeled and the model retrained. Therefore, in this study, we propose an unsupervised approach, which only needs pretrained word embeddings and does not require annotating a corpus in advance, to automate the entire indexing process. When the ontology is updated or expanded, retraining the model and modifying the algorithm are not required. As the automatic indexing processes for the Chinese and English ontologies in this experiment are similar, we describe only the automatic indexing of the English ontology. The entire indexing process includes three modules: the structural module, the semantic module, and the selection module. The algorithm flow framework is shown in Figure 2.
To illustrate the algorithm flow, we use the study conducted by Huq (2011) as example data [38].

4.1. Structural Module

Words or phrases similar to ontology concepts are extracted from the paper and their similarity to the ontology is calculated to obtain a tailored set of ontology concepts for the paper.

4.1.1. Data Preprocessing

The text is preprocessed to keep relevant words/phrases only. Syntactic dependencies are identified and organized in a syntax tree with the root node (often an irrelevant verb) removed. Stop words are then removed, and the text is divided into phrases using stop words as separators.
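As a minimal sketch of the segmentation step, the following uses a small, hypothetical stop-word list (the paper's actual list, and the preceding dependency-parsing step, are not shown):

```python
# Illustrative stop-word list; the paper's actual list is larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "is", "are", "with"}

def split_into_phrases(text):
    """Split text into candidate phrases, using stop words as separators."""
    phrases, current = [], []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(split_into_phrases("Adaptation to climate risk in the least developed countries"))
# → ['adaptation', 'climate risk', 'least developed countries']
```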

4.1.2. Text Similarity Calculation

The obtained lists may include words/phrases or short sentences. An n-gram language model is used to compare them against the ontology concepts (unigrams, bigrams, trigrams) [39]. Levenshtein distance, also known as edit distance, which computes the minimum number of editing operations needed to convert one string into the other, was chosen as the basis of the similarity measure [40]. Based on actual tests, the similarity threshold was set to 0.92: when the calculated similarity is greater than or equal to 0.92, the phrase is considered similar to the current ontology concept term, and that concept term is used as a topic candidate for the structural module.
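A small self-contained sketch of this matching step, using one common way of normalizing edit distance into a [0, 1] similarity (the paper does not state which normalization it uses):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """One common normalization of edit distance into a [0, 1] score."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

THRESHOLD = 0.92  # value used in the paper, determined by actual tests

for phrase in ["climate risks", "vulnerability", "economic growth"]:
    for concept in ["climate risk", "vulnerability"]:
        if similarity(phrase.lower(), concept.lower()) >= THRESHOLD:
            print(phrase, "->", concept)
```

Here "climate risks" still maps to the concept "climate risk" (similarity ≈ 0.923), while "economic growth" maps to nothing.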
The structure module generates the candidate topic phrases: “Adaptation”, “Climate Risk”, “Vulnerability”, “Economic Development Model”, “Infrastructure”, “Investment”.
We observed many instances of the same topic appearing at different levels in the ontology. As shown in Figure 3, “Infrastructure” is a hyponym of “Rural Areas” and “Networked infrastructure, including transportation, energy, water, and sanitation”. The structural module can only determine the similarity between phrases based on syntactic structure and cannot verify their hierarchy. Therefore, if there is a word with multiple topics, the results of the structure module need to be further verified with the semantic module.

4.2. Semantic Module

The paper’s content is analyzed for terms similar to those in the ontology, and their semantic similarity is calculated. This results in a selected set of ontology conceptual terms as topics for the paper.

4.2.1. Concept Extraction

The preprocessed text is filtered to create a list of candidate words. Based on statistical analysis, the conceptual terms in the ontology typically consist of combinations of nouns, adjectives and nouns, verbs and nouns, and some combinations with conjunctions and punctuation. To filter out irrelevant words, the parts of speech in the preprocessed text are annotated and only words or phrases that conform to the rule are kept to form a set of candidate phrases [41].
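The part-of-speech filter can be sketched as below. In practice the (word, tag) pairs would come from a POS tagger; here they are supplied by hand for illustration, and the patterns paraphrase the combinations described above:

```python
import re

# POS sequences kept as candidate phrases (illustrative subset of the rules).
PATTERNS = [
    r"NOUN( NOUN)*",       # noun or noun compound
    r"ADJ( ADJ)* NOUN",    # adjective(s) + noun
    r"VERB NOUN( NOUN)*",  # verb + noun
]

def keep(tagged_phrase):
    """Return True if the phrase's POS sequence matches an allowed pattern."""
    tags = " ".join(tag for _, tag in tagged_phrase)
    return any(re.fullmatch(p, tags) for p in PATTERNS)

candidates = [
    [("climate", "NOUN"), ("risk", "NOUN")],
    [("extreme", "ADJ"), ("weather", "NOUN")],
    [("mitigating", "VERB"), ("emissions", "NOUN")],
    [("rapidly", "ADV"), ("growing", "VERB")],
]
kept = [" ".join(w for w, _ in p) for p in candidates if keep(p)]
print(kept)  # → ['climate risk', 'extreme weather', 'mitigating emissions']
```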

4.2.2. Pretraining Model

The goal of utilizing a pre-trained language model is to transform text into vector representations for semantic similarity calculation. We chose two models, Word2Vec and Phrase–BERT, and fine-tuned them for better performance [42,43].
  • Word2Vec
Word2Vec was selected as the first method for the experiment due to its low dependence on system hardware and fast training and prediction speeds. The training data was collected from 306,211 climate-change-related academic papers published by WOS between 2005 and 2021, using keywords as the retrieval method. After removing duplicates and articles without abstracts, 293,205 data points were used to train the Word2Vec model. Both paper titles and abstracts were used as training data since there were no restrictions on text length for pretraining the model. The algorithm and parameters used are listed in Table 1.
  • Phrase–BERT
The experiment also tested the BERT pretraining model, currently among the most effective tools. However, the native BERT model struggles to represent phrase semantics accurately. Thus, the Phrase–BERT model, a BERT variant with better phrase semantic representation, was used. The Phrase–BERT pretrained model released with the original study was pretrained on a general corpus and performs only moderately when applied to a specific domain. Following the principle of continued training, the abovementioned 293,205 academic papers in the English climate change domain were introduced to improve it. The model was trained using only the abstracts of these papers, as they contain much of the information about a paper’s topic and the model’s maximum sequence length is 512 tokens. Based on our statistics, as shown in Figure 4, 99.6% of the training data met this length requirement, and the rest was truncated [44]. The model was trained on a Linux (Ubuntu 20.04.4) server with two NVIDIA TITAN RTX 24GB GPUs connected by NVLink and took 8.1 h [45]. The training parameters are listed in Table 2.
The continuous training process transformed the generic Phrase–BERT model into the Climate Change Phrase–BERT (CC–Phrase–BERT), which has better representation of phrase diversity and enhanced semantic understanding of phrases in the climate change domain. Further optimization of the pretrained model can improve model predictions, as reported by Sun et al. (2019) [46].
Table 3 shows the performance of the Word2Vec model, Phrase–BERT model, and CC–Phrase–BERT model in calculating phrase similarity.
The CC–Phrase–BERT model surpasses the other two models and delivers the desired results in the experiments, as seen in Table 3. CC–Phrase–BERT identifies not only synonymous phrases but also implicit, discipline-specific correlations between phrases that share only a few relevant words. However, the similarity threshold needs to be raised when using the CC–Phrase–BERT model, as only article-related words are needed in this study.

4.2.3. Semantic Similarity Calculation

The CC–Phrase–BERT model is utilized for vectorizing the semantic representation of each phrase in the set of extracted phrases and ontology concept terms. The cosine similarity algorithm is then applied to calculate the similarity between the two phrases. After manual validation, the algorithm sets the threshold value to 0.95, and if the result is greater than or equal to 0.95, it is considered similar to the current ontology’s concept term and included in the final set of topic candidates. In case there are multiple topics for the same phrase, the similarity between the hypernym and the candidate set is re-calculated, and the final determination of the topic is based on the highest similarity. The candidate topics for the sample papers, calculated by the semantic module, include “Climate Risk”, “Economics of adaptation”, “Addressing Climate Change”, “Impact and vulnerability”, “Adaptation decision”, “Infrastructure”, “industrialization development”, “Urbanization challenges and opportunities for climate change mitigation”, “Mitigation and adaptation, sustainable development”, “Climate Policy”, “Economic Development Model”, “Natural Calamities”.
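The thresholded cosine-similarity step can be sketched as follows. In the pipeline the vectors come from CC–Phrase–BERT; the 3-dimensional toy vectors here are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

THRESHOLD = 0.95  # value set after manual validation in the paper

# Toy embeddings standing in for CC-Phrase-BERT output.
phrase_vec = [0.9, 0.1, 0.4]
concept_vecs = {
    "Climate Risk": [0.88, 0.12, 0.42],
    "Infrastructure": [0.1, 0.9, 0.2],
}
candidates = [c for c, v in concept_vecs.items()
              if cosine(phrase_vec, v) >= THRESHOLD]
print(candidates)  # → ['Climate Risk']
```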

4.3. Topic Ranking

Aggregating the results derived from the structural and semantic modules often yields a large number of topics, but some of these topics may be only marginally relevant to the article, or may not sit at the most accurate level of the topic hierarchy. Therefore, in this study, we construct several filtering strategies using weighting formulas to calculate and rank the weight of each ontology concept term as a topic of the paper. The filtering strategies are as follows:
  • If a topic is directly mentioned in the text, its weight is set to 1;
  • If a topic is identified multiple times, its weight is set to the probability of the word plus 0.1n, where n is the number of times identified;
  • If a topic is identified by both the structural module and the semantic module, its weight is set to 1;
  • A path calculation is performed according to the hierarchical structure of the ontology; topics that are farther apart, i.e., with longer paths, tend to be less relevant. Therefore, path calculation and weighting are performed on all topic words in pairs, and each weight is set to the reciprocal of the path length: the farther apart the topics, the lower the weight.
After calculating and sorting according to the above rules, the topic collection obtained using the sample data is “Mitigation and adaptation, sustainable development”, “Climate Policy”, “Climate Risk”, “Impact and vulnerability”, “Adaptation decision”, “Economic Development Model”.
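The weighting rules can be sketched as a single scoring function. The paper does not fully specify how the four rules combine, so the order of application below is one plausible reading, and the example inputs are hypothetical:

```python
def topic_weight(topic, text, times_identified=1, in_both_modules=False,
                 word_prob=0.0, path_weight=1.0):
    """Combine the four filtering rules in one plausible order (a sketch)."""
    if topic.lower() in text.lower() or in_both_modules:
        base = 1.0                                  # rules 1 and 3
    elif times_identified > 1:
        base = word_prob + 0.1 * times_identified   # rule 2
    else:
        base = word_prob
    # Rule 4: path_weight is the reciprocal of the ontology path length,
    # so more distant topics receive lower weights.
    return base * path_weight

text = "adaptation to climate risk in urban infrastructure"
print(topic_weight("Climate Risk", text))            # directly mentioned -> 1.0
print(topic_weight("Mitigation", text,
                   times_identified=3, word_prob=0.2))  # ≈ 0.5
```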

5. Experimental Verification and Analysis

5.1. Creation of the Gold Standard

Since there is no dataset that uses the CCO to label the literature, we invited nine experts in the field of climate change to manually annotate 150 papers according to the topics in the CCO, thus establishing a gold standard. From the climate change dataset we constructed, we selected 150 papers published in 15 fields, including mitigation and adaptation, sectoral reduction, international agreements, and extreme weather. Each article was annotated by three different experts; that is, each of the nine experts annotated fifty papers. If a topic was considered relevant by at least two experts, it was added to the gold standard.
In order to simplify the process of data labeling, we used doccano (https://github.com/doccano/doccano, accessed on 10 August 2022) to support the domain experts. Doccano is an open-source text annotation tool that provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks. We organized the papers and the CCO in doccano and loaded the processed data into the tool. Doccano displays each paper’s title, abstract, keywords, and suggested tags to the annotators. Each paper was tagged with three to eight topics based on expert judgment, and experts could also add missing CCO topics to the tags if needed.

5.2. Experimental Setup

We evaluate the method proposed in this study against five common alternative methods for the task of annotating papers in the gold standard according to the topics in CCO. The results of the experiment are shown in Table 4. All experiments were implemented using Python 3.8.
The TF-IDF algorithm is a classic text keyword extraction algorithm. A TF-IDF score is calculated for each paper, and a sorted word list is returned. The IDF component is calculated on the dataset of roughly 290,000 climate change domain papers we constructed.
When training an LDA model, the chosen number of topics directly affects the recognition performance of the model. Therefore, we trained two versions of LDA on the same corpus, differing only in the number of topics: LDA-50 was trained with 50 topics and LDA-100 with 100 topics. The two models were then tested in practice and compared with the gold standard to determine the optimal threshold for selecting the number of topics for LDA and the number of words in the prediction result.
Word2vec is the method introduced in Section 4.2.2, which provides a fixed vector representation for each word, with the aim of establishing semantic relations between words. Similar to the method using CC–Phrase–BERT, the retrained Word2vec model is used to vectorize the predicted text and the CCO, respectively, and similar concept phrases are then identified from the word embeddings through the cosine similarity algorithm. TF-IDF, LDA and Word2vec were implemented using Gensim, which can be installed from PyPI (https://pypi.org/, accessed on 11 September 2022) using the following command: pip install gensim.
Bertopic is a deep learning topic modeling technique that combines bert embeddings and c-TF-IDF. The advantage of Bertopic is that it uses a pre-trained transformer-based language model to generate document embeddings with contextual semantic relations [47]. Bertopic can be installed using the command: pip install bertopic.
Finally, CCO-A is the default implementation of the CCO annotator presented in this paper.
Since the results returned by TF-IDF, LDA and Bertopic use vocabulary from the paper itself, it is necessary to map these terms to the CCO. We calculate the Levenshtein similarity between each returned term and the CCO topics, and a CCO topic with a score higher than 0.75 is taken as a subject of the paper. Levenshtein can be installed using the command: pip install python-Levenshtein.
We assessed the performance of these six approaches by means of precision, recall and F1-score (F1). When indexing a given paper p, the precision pr(p) and recall re(p) are computed as shown in Equation (1):
pr(p) = |M(p) ∩ N(p)| / |M(p)|,   re(p) = |M(p) ∩ N(p)| / |N(p)|
where M identifies the topic returned by automatic indexing using the algorithmic model, and N is the gold standard obtained for that paper. The F1 is the harmonic mean of precision and recall.
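Equation (1) and the F1 score can be computed over topic sets as in the following sketch (the example sets are illustrative, not actual results):

```python
def precision_recall_f1(predicted, gold):
    """pr = |M∩N|/|M|, re = |M∩N|/|N|; F1 is their harmonic mean."""
    overlap = len(predicted & gold)
    pr = overlap / len(predicted) if predicted else 0.0
    re = overlap / len(gold) if gold else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

# M: topics returned by the indexing model; N: gold-standard topics.
M = {"Climate Risk", "Adaptation decision", "Climate Policy", "Infrastructure"}
N = {"Climate Risk", "Adaptation decision", "Climate Policy",
     "Economic Development Model"}
print(precision_recall_f1(M, N))  # (0.75, 0.75, 0.75)
```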

5.3. Analysis of Results

As shown in Table 4, CCO-A obtained the highest precision, recall and F1 score in the subject-indexing task, producing the best results among all the models. The methods based on TF-IDF and LDA perform poorly. Although higher precision can be obtained by raising the Levenshtein similarity threshold used to map terms to CCO topics, the recall rate is reduced accordingly. Analysis of the topics returned by TF-IDF and LDA showed that most of these subject terms were domain-related words but could rarely express the research topic of the paper; they also contained many generic terms. The results returned by Word2vec indicate that it cannot solve the problem of words with multiple meanings. Bertopic mostly does not consider multiple topics in a single document; it only considers the contextual representation of the document, and the subject words are still drawn from the document itself, so there is a certain redundancy due to the possibly high similarity of words within a topic. Despite a minor drop in computational efficiency, CCO-A delivers improved outcomes. This is achieved through its integration of expertise from the climate change domain, which enables it to efficiently condense contextual information in texts, grasp domain-specific language features and semantics, and overcome challenges faced by the alternative approaches.

6. Conclusions

At present, the output of scientific research papers on the Internet is growing explosively, and acquiring and analyzing these data involves a large amount of noise. Filtering out information irrelevant to the topic as far as possible, so as to enable accurate analysis of research trends, is therefore an important task. In this paper, we build a climate change ontology based on the "Climate Change Domain Ontology Handbook", an authoritative and fine-grained ontology in the research field. We then propose a framework for automatic subject indexing of academic papers using the refined subject vocabulary in the ontology. Under this framework, we create a climate change dataset for model training and integrate NLP techniques such as syntactic dependencies, pre-trained language models, and similarity computation into an unsupervised method for automatically indexing academic papers against the ontology. Using the ontology for topic labeling enriches the metadata of academic papers, making it easier for users and researchers to quickly filter information and conduct multi-dimensional topic analysis. The method was evaluated against a gold standard of 150 human-annotated papers and achieved improved precision, recall, and F1-scores compared with the other algorithms. This study also has limitations. First, the accuracy and performance of CCO indexing can still be improved; we will continue to explore natural language processing and deep learning methods to propose better solutions. Second, we used only the field of climate change as the experimental object and did not test the feasibility of the approach in other fields. We plan to extend this method to other fields with mature ontology structures, such as medicine and economics.

Author Contributions

Conceptualization, H.Y. and S.W.; data curation, H.Y. and N.W.; formal analysis, H.Y. and L.Y.; methodology, H.Y. and W.L.; validation, H.Y. and N.W.; writing—original draft preparation, H.Y. and S.W.; writing—review and editing, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Youth Project of Gansu Provincial Social Science Planning (No. 2021QN050) and the General Project of Gansu Provincial Social Science Planning (No. 2021YB158).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Ontology part structure in the field of climate change.
Figure 2. Ontology-based topic automatic indexing method.
Figure 3. Same topic at different levels.
Figure 4. Paper abstract length.
Table 1. Word2vec model training parameters.

Method     Embedding Size   Window Size   Min Count Cutoff
Skipgram   128              10            10
Table 2. Phrase–Bert model training parameters.

Hyperparameter      CC–Phrase–Bert
max_seq_length      512
learning_rate       2 × 10⁻⁵
train_batch_size    64
num_train_epochs    3
eval_batch_size     64
Table 3. Model accuracy comparison.

Phrase Pair                                              Word2vec   Phrase–Bert   CC–Phrase–Bert
"Natural Disasters" and "Natural Calamities"             0.5674     0.7128        0.9707
"Drought" and "Water Saving Irrigation"                  0.6433     0.6278        0.9207
"Precipitation" and "Flood Season Rainfall"              0.8473     0.7781        0.9552
"climate change projections" and "Near-term climate
change: projections and predictability"                  0.7283     0.8114        0.9016
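The scores in Table 3 compare phrase embeddings; a generic way to score such a pair is cosine similarity between the two vectors. A sketch follows, where the short vectors stand in for real model embeddings and are purely illustrative:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Illustrative 3-dimensional stand-ins for phrase embeddings.
v_disasters  = [0.9, 0.1, 0.30]
v_calamities = [0.8, 0.2, 0.35]
print(round(cosine_similarity(v_disasters, v_calamities), 4))
```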
Table 4. Precision, recall and f-score values of different methods. In bold are the best results.

Method     Description                Prec.    Rec.     F1
TF-IDF     TF-IDF                     28.5%    25.0%    26.6%
LDA-50     LDA with 50 topics         11.7%    14.6%    13.0%
LDA-100    LDA with 100 topics        16.0%    21.3%    18.3%
WORD2VEC   Word2vec (Section 4.2.2)   47.5%    39.5%    43.1%
BERTOPIC   BERTopic                   60.1%    63.2%    61.6%
CCO-A      The CCO Annotator          74.5%    76.1%    75.3%

Yang, H.; Wang, N.; Yang, L.; Liu, W.; Wang, S. Research on the Automatic Subject-Indexing Method of Academic Papers Based on Climate Change Domain Ontology. Sustainability 2023, 15, 3919. https://doi.org/10.3390/su15053919
