Extracting Geoscientific Dataset Names from the Literature Based on the Hierarchical Temporal Memory Model

Wu, Kai; Chen, Zugang; Wu, Xinqian; Li, Guoqing; Li, Jing; Wang, Shaohua; Wang, Haodong; Feng, Hang

doi:10.3390/ijgi13070260


Open Access Article


by Kai Wu ¹, Zugang Chen ²,*, Xinqian Wu ¹, Guoqing Li ², Jing Li ², Shaohua Wang ², Haodong Wang ³ and Hang Feng ³

¹ School of Mathematics and Statistics, Henan University of Science and Technology, Luoyang 471023, China
² Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
³ School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2024, 13(7), 260; https://doi.org/10.3390/ijgi13070260 (registering DOI)

Submission received: 8 April 2024 / Revised: 18 July 2024 / Accepted: 19 July 2024 / Published: 21 July 2024

(This article belongs to the Topic Geocomputation and Artificial Intelligence for Mapping)


Abstract

Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, existing named-entity extraction methods have low accuracy in extracting geoscientific dataset names from unstructured text because these names are complex combinations of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, we propose a word-encoding method for the HTM model based on the Unicode values of characters. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We also devised a new classifier scheme for the HTM model that decodes the predictive vector into the encoding of the next word, so that the similarity between the encodings of the predicted next word and the actual next word can be computed. If the similarity is greater than a specified threshold, the actual next word is regarded as part of the name, and the resulting run of successive words forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming few-shot learning (FSL) methods based on GPT-4 and Claude-3, which achieved F1-scores of 0.698 and 0.72, respectively.

Keywords: geoscientific dataset; named-entity recognition; hierarchical temporal memory; word encoding
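
The encoding and thresholding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how Unicode values are mapped to bits, so `encode_word`, the constants `VECTOR_SIZE` and `BITS_PER_CHAR`, the overlap measure, and the threshold of 0.5 are all assumptions made for illustration. The `predict_next` callable is a hypothetical stand-in for the trained HTM model together with its classifier; here it is replaced by a toy lookup over one known name.

```python
from typing import Callable, List, Set

VECTOR_SIZE = 1024  # assumed width of the binary word vector
BITS_PER_CHAR = 4   # assumed number of active bits contributed per character


def encode_word(word: str) -> Set[int]:
    """Return the active-bit indices of a sparse binary vector for `word`.

    Assumption: each character switches on a few bits derived from its
    Unicode code point and its position. The paper's scheme may differ.
    """
    active: Set[int] = set()
    for position, char in enumerate(word.lower()):
        for k in range(BITS_PER_CHAR):
            active.add((ord(char) * 31 + position * 17 + k * 101) % VECTOR_SIZE)
    return active


def overlap_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared active bits divided by the size of the smaller encoding."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def extend_name(
    words: List[str],
    start: int,
    predict_next: Callable[[List[str]], Set[int]],
    threshold: float = 0.5,
) -> List[str]:
    """Grow a candidate name from words[start] while the predicted encoding
    of the next word is similar enough to the actual next word."""
    name = [words[start]]
    for i in range(start + 1, len(words)):
        predicted = predict_next(name)
        if overlap_similarity(predicted, encode_word(words[i])) <= threshold:
            break
        name.append(words[i])
    return name


# Toy stand-in for the trained model: it "knows" exactly one dataset name.
KNOWN_NAME = ["global", "land", "cover", "dataset"]


def toy_predictor(prefix: List[str]) -> Set[int]:
    nxt = len(prefix)
    return encode_word(KNOWN_NAME[nxt]) if nxt < len(KNOWN_NAME) else set()


sentence = "we used the global land cover dataset for validation".split()
extracted = extend_name(sentence, 3, toy_predictor)
print(extracted)  # ['global', 'land', 'cover', 'dataset']
```

In this sketch the name stops growing at "for" because the toy predictor has nothing left to predict, so the similarity falls to 0 and the threshold test fails. A real predictor would return a decoded vector from the HTM model at every step, and the threshold would need to be tuned on labelled text.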

