Natural Language Processing and Text Mining

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Systems".

Deadline for manuscript submissions: closed (31 May 2019) | Viewed by 54663

Special Issue Editors


Dr. Pablo Gamallo
Guest Editor
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Galiza, Spain
Interests: natural language processing; distributional semantics; information extraction; dependency parsing

Dr. Marcos Garcia
Guest Editor
LyS (Language and Information Society) Group, University of A Coruña, A Coruña, Spain
Interests: natural language processing; computational linguistics; linguistics

Special Issue Information

Dear Colleagues,

Natural language processing (NLP) encompasses a set of linguistically motivated strategies for building interpretable representations from free text. NLP typically relies on linguistic tasks such as lemmatization, PoS tagging, syntactic analysis, anaphora resolution, semantic role labeling, and so on. Text mining, on the other hand, is a set of Text2Data techniques for discovering and extracting relevant and salient knowledge from large amounts of unstructured text. Its main objective is typically not to understand all, or even a large part, of what a given speaker or writer has uttered, but rather to extract items of knowledge or regular patterns across a large number of documents, especially Web content and social media. Following recent advances in NLP, machine learning, neural deep learning, and big data, text mining has become an even more valuable method for connecting linguistic theories with real-world NLP applications aimed at building organized data from unstructured text. Both hidden and new knowledge can be discovered by combining NLP techniques and text mining methods with supervised or unsupervised learning strategies within big data environments.
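As a rough illustration of the kind of linguistic pipeline mentioned above (lemmatization, PoS tagging, syntactic analysis), the following Python sketch turns free text into simple structured records. It assumes spaCy and its small English model are installed; the example sentence is ours, not taken from the special issue.

```python
# Minimal NLP pipeline sketch: free text in, structured records out.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, lemmatizer, parser, NER

doc = nlp("Text mining extracts salient knowledge from large document collections.")

# One record per token: surface form, lemma, PoS tag, and syntactic head.
records = [
    {"text": t.text, "lemma": t.lemma_, "pos": t.pos_, "head": t.head.text}
    for t in doc
]
for r in records:
    print(r)

# Named entities found by the statistical NER component.
print([(ent.text, ent.label_) for ent in doc.ents])
```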

Authors are invited to submit their papers on any of the following topics (or other related topics):

  • relation extraction (including approaches to open information extraction)
  • named entity recognition
  • entity linking
  • analysis of opinions, emotions and sentiments
  • text clustering, topic modelling, and classification
  • summarization and text simplification
  • co-reference resolution
  • distributional models and semantics
  • multiword and/or terminological extraction
  • entailment and paraphrases
  • discourse analysis
  • question-answering applications

Particular emphasis will be placed on work that makes use of new technologies and innovative linguistic models, or that carries out studies with cross-lingual approaches and minority languages.

Dr. Pablo Gamallo
Dr. Marcos Garcia
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Natural Language Processing
  • Text Mining
  • Information Extraction
  • Mining Web and Social Media Contents
  • Text2Data
  • Language Technologies

Published Papers (9 papers)


Editorial

Jump to: Research

3 pages, 149 KiB  
Editorial
Editorial for the Special Issue on “Natural Language Processing and Text Mining”
by Pablo Gamallo and Marcos Garcia
Information 2019, 10(9), 279; https://doi.org/10.3390/info10090279 - 06 Sep 2019
Cited by 1 | Viewed by 2405
Abstract
Natural language processing (NLP) and Text Mining (TM) are a set of overlapping strategies working on unstructured text [...] Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)

Research

Jump to: Editorial

17 pages, 944 KiB  
Article
Transfer Learning for Named Entity Recognition in Financial and Biomedical Documents
by Sumam Francis, Jordy Van Landeghem and Marie-Francine Moens
Information 2019, 10(8), 248; https://doi.org/10.3390/info10080248 - 26 Jul 2019
Cited by 33 | Viewed by 9528
Abstract
Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is scarce. In this paper we consider two use cases: generic entity extraction from financial and from biomedical documents. First, we developed a character-based model for NER in financial documents and a word- and character-based model with attention for NER in biomedical documents. Further, we analyzed how transfer learning addresses the problem of limited training data in a target domain. We demonstrate through experiments that NER models trained on labeled data from a source domain can be used as base models and then fine-tuned with little labeled data to recognize different named entity classes in a target domain. We also observe growing interest in language models as a way of coping with limited labeled data. The currently most successful language model is BERT. Because of its success in state-of-the-art models, we integrate BERT-based representations into our biomedical NER model along with word and character information. The results are compared with a state-of-the-art model applied to a benchmark biomedical corpus. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
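A minimal sketch of the general transfer-learning idea discussed in the abstract, i.e., starting from a pretrained BERT encoder and fine-tuning a token classifier on a small target-domain set. It is not the authors' financial or biomedical models; the model name, label set, and dummy batch are illustrative assumptions.

```python
# Sketch of transfer learning for NER with a pretrained BERT encoder
# (Hugging Face Transformers + PyTorch). Model name, label set, and the
# dummy batch below are illustrative, not the article's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ENT", "I-ENT"]  # hypothetical target-domain tag set
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

# Transfer step: freeze the embeddings and lower encoder layers, then
# fine-tune the upper layers and the new classification head on the
# small labeled target-domain corpus.
for module in [model.bert.embeddings] + list(model.bert.encoder.layer[:8]):
    for p in module.parameters():
        p.requires_grad = False

enc = tok(["Aspirin lowers fever ."], return_tensors="pt", truncation=True)
gold = torch.zeros_like(enc["input_ids"])  # dummy all-"O" labels for the sketch

optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-5
)
loss = model(**enc, labels=gold).loss
loss.backward()
optim.step()
print(float(loss))
```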

15 pages, 7236 KiB  
Article
Assisting Forensic Identification through Unsupervised Information Extraction of Free Text Autopsy Reports: The Disappearances Cases during the Brazilian Military Dictatorship
by Patricia Martin-Rodilla, Marcia L. Hattori and Cesar Gonzalez-Perez
Information 2019, 10(7), 231; https://doi.org/10.3390/info10070231 - 05 Jul 2019
Cited by 3 | Viewed by 4054
Abstract
Anthropological, archaeological, and forensic studies situate enforced disappearance as a strategy associated with the Brazilian military dictatorship (1964–1985), leaving hundreds of persons with neither their identity nor their cause of death established. Their forensic reports are the only existing clue for identifying these people and detecting possible crimes associated with them. The exchange of information among institutions about the identities of disappeared people was not a common practice. Thus, the analysis of these reports requires unsupervised techniques, mainly because their contextual annotation is extremely time-consuming, difficult to obtain, and highly dependent on the annotator. The use of these techniques allows researchers to assist in identification and analysis in four areas: common causes of death, relevant body locations, personal belongings terminology, and correlations between actors such as doctors and police officers involved in the disappearances. This paper analyzes almost 3000 textual reports of missing persons in São Paulo city during the Brazilian dictatorship through unsupervised information extraction algorithms for Portuguese, identifying named entities and relevant terminology associated with these four criteria. The analysis allowed us to observe terminological patterns relevant for identification (e.g., the presence of rings or similar personal belongings) and to automate the study of correlations between actors. The proposed system acts as a first classification and indexing middleware for the reports and represents a feasible system that can assist researchers searching for patterns among autopsy reports. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
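A hedged sketch of one unsupervised building block in this spirit, ranking candidate domain terminology by TF-IDF over a small report collection. It is not the authors' system; the toy Portuguese snippets and the scikit-learn setup are our own assumptions.

```python
# Unsupervised terminology ranking over a small report collection using
# TF-IDF (scikit-learn). The toy Portuguese snippets are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

reports = [
    "corpo encontrado com anel de ouro na mao direita",
    "causa da morte ferimento por arma de fogo",
    "vestigios de anel e relogio entre os pertences",
]

# Rank unigrams and bigrams by aggregate TF-IDF weight as candidate terms.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(reports)
scores = X.sum(axis=0).A1
terms = sorted(zip(vec.get_feature_names_out(), scores),
               key=lambda kv: kv[1], reverse=True)
print(terms[:5])
```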

25 pages, 415 KiB  
Article
Multilingual Open Information Extraction: Challenges and Opportunities
by Daniela Barreiro Claro, Marlo Souza, Clarissa Castellã Xavier and Leandro Oliveira
Information 2019, 10(7), 228; https://doi.org/10.3390/info10070228 - 02 Jul 2019
Cited by 27 | Viewed by 6240
Abstract
The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, highlighting the importance of research into Open Information Extraction (OIE) techniques. Different OIE methods have dealt with features from a single language; however, few approaches tackle multilingual aspects. In those approaches, multilingualism is restricted to processing text in different languages rather than exploring cross-linguistic resources, which results in low precision due to the use of general rules. Multilingual methods have been applied to numerous problems in natural language processing, achieving satisfactory results and demonstrating that knowledge acquired for one language can be transferred to other languages to improve the quality of the facts extracted. We argue that a multilingual approach can enhance OIE methods, as it is well suited to evaluating and comparing OIE systems and can therefore be applied to the collected facts. In this work, we discuss how transferring knowledge between languages can increase acquisition in multilingual approaches. We provide a roadmap of the multilingual Open IE area with respect to the state of the art. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus for evaluating and comparing multilingual systems. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
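For readers new to OIE, the following is a deliberately simple, language-specific sketch of rule-based triple extraction over a dependency parse; the multilingual systems surveyed in the article are far more sophisticated. spaCy and its small English model are assumed.

```python
# Deliberately simple rule-based Open IE: (subject, relation, object) triples
# from a dependency parse. English-only and far simpler than the systems
# surveyed in the article. Assumes spaCy + en_core_web_sm.
import spacy

nlp = spacy.load("en_core_web_sm")

def triples(text):
    out = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in tok.children if c.dep_ in ("dobj", "obj", "attr")]
                if subj and obj:
                    out.append((subj[0].text, tok.lemma_, obj[0].text))
    return out

print(triples("The company acquired a startup. Researchers published the results."))
```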

21 pages, 988 KiB  
Article
Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case
by Joseba Fernandez de Landa, Rodrigo Agerri and Iñaki Alegria
Information 2019, 10(6), 212; https://doi.org/10.3390/info10060212 - 13 Jun 2019
Cited by 3 | Viewed by 6405
Abstract
Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced languages, as it allows current natural language processing techniques to be applied to large amounts of unstructured data. In this work, we study the linguistic and social aspects of young and adult people's behaviour based on the contents of their tweets and the social relations that arise from them. With this objective in mind, we gathered over 10 million tweets from more than 8000 users. First, we classified each user's life stage (young/adult) according to the writing style of their tweets. Second, we applied topic modelling techniques to the personal tweets to find the most popular topics by life stage. Third, we established the relations and communities that emerge from retweets. We conclude that using the large amounts of unstructured data provided by Twitter facilitates social research with computational techniques such as natural language processing, making it possible both to segment communities based on demographic characteristics and to discover how they interact or relate to each other. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
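A minimal sketch of the topic-modelling step in this spirit, using LDA from scikit-learn over a toy tweet collection. The corpus, preprocessing, and number of topics are illustrative assumptions, not the authors' Basque setup.

```python
# Topic modelling sketch with LDA (scikit-learn) over a toy tweet collection.
# Corpus, vocabulary handling, and the number of topics are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "great match tonight football league",
    "new album out today love this song",
    "football transfer rumours all over the news",
    "concert tickets on sale for the new tour",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [words[i] for i in comp.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```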

15 pages, 422 KiB  
Article
Event Extraction and Representation: A Case Study for the Portuguese Language
by Paulo Quaresma, Vítor Beires Nogueira, Kashyap Raiyani and Roy Bayot
Information 2019, 10(6), 205; https://doi.org/10.3390/info10060205 - 08 Jun 2019
Cited by 6 | Viewed by 5805
Abstract
Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and represented. The extracted information can be represented by specialized ontologies, supporting knowledge-based reasoning and inference processes. In this work, we describe, in detail, our proposal for event extraction from Portuguese documents. The proposed approach is based on a pipeline of specialized natural language processing tools; namely, a part-of-speech tagger, a named entity recognizer, a dependency parser, a semantic role labeler, and a knowledge extraction module. The architecture is language-independent, but its modules are language-dependent and can be built using adequate AI (i.e., rule-based or machine learning) methodologies. The developed system was evaluated with a corpus of Portuguese texts, and the obtained results are presented and analysed. The current limitations and future work are discussed in detail. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
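A hedged sketch of the final event-assembly step of such a pipeline, combining dependencies and named entities into a simple event record. It omits semantic role labeling and uses spaCy's English model for brevity, so it is only an approximation of the Portuguese pipeline described above.

```python
# Simplified event assembly from a parsed sentence: action, agent, object,
# place, and time. Uses spaCy's English model and skips semantic role
# labeling, so it only approximates the pipeline described above.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_event(sentence):
    doc = nlp(sentence)
    event = {"action": None, "agent": None, "object": None, "place": None, "time": None}
    for tok in doc:
        if tok.dep_ == "ROOT" and tok.pos_ == "VERB":
            event["action"] = tok.lemma_
        elif tok.dep_ in ("nsubj", "nsubjpass"):
            event["agent"] = tok.text
        elif tok.dep_ in ("dobj", "obj"):
            event["object"] = tok.text
    for ent in doc.ents:
        if ent.label_ in ("GPE", "LOC"):
            event["place"] = ent.text
        elif ent.label_ in ("DATE", "TIME"):
            event["time"] = ent.text
    return event

print(extract_event("The minister visited Lisbon on Monday."))
```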

9 pages, 738 KiB  
Article
Spelling Correction of Non-Word Errors in Uyghur–Chinese Machine Translation
by Rui Dong, Yating Yang and Tonghai Jiang
Information 2019, 10(6), 202; https://doi.org/10.3390/info10060202 - 06 Jun 2019
Cited by 4 | Viewed by 7034
Abstract
This research was conducted to solve the out-of-vocabulary problem caused by Uyghur spelling errors in Uyghur–Chinese machine translation and thereby improve translation quality. This paper assesses three spelling correction methods based on machine translation: 1. using a Bilingual Evaluation Understudy (BLEU) score; 2. using a Chinese language model; 3. using a bilingual language model. The best results were achieved in both the spelling correction task and the machine translation task by using the BLEU score for spelling correction. A maximum F1 score of 0.72 was reached for spelling correction, and the translation result increased the BLEU score by 1.97 points relative to the baseline system. However, the method of using a BLEU score for spelling correction requires the support of a bilingual parallel corpus; it is therefore a supervised method, suitable for corpus pre-processing. Unsupervised spelling correction can be performed by using either a Chinese language model or a bilingual language model. These two methods can be easily extended to other languages, such as Arabic. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
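A minimal sketch of the BLEU-based selection idea: among candidate corrections of a source sentence, keep the one whose translation scores highest against the reference from the parallel corpus. The translate function is a stand-in for a real Uyghur–Chinese MT system, and the toy strings are illustrative.

```python
# BLEU-based selection among candidate spelling corrections: translate each
# candidate and keep the one scoring highest against the reference from the
# parallel corpus. `translate` is a placeholder for a real Uyghur-Chinese
# MT system; the toy strings are illustrative.
from sacrebleu.metrics import BLEU

def translate(sentence: str) -> str:
    # Placeholder MT: a real system would be called here.
    fake_mt = {"kitab oqudum": "我读了书", "kitap oqudum": "我读了书本"}
    return fake_mt.get(sentence, sentence)

def best_correction(candidates, reference_zh):
    bleu = BLEU(tokenize="zh", effective_order=True)  # sentence-level BLEU
    scored = [(bleu.sentence_score(translate(c), [reference_zh]).score, c)
              for c in candidates]
    return max(scored)

print(best_correction(["kitab oqudum", "kitap oqudum"], "我读了书"))
```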

20 pages, 627 KiB  
Article
An Improved Word Representation for Deep Learning Based NER in Indian Languages
by Ajees A P, Manju K and Sumam Mary Idicula
Information 2019, 10(6), 186; https://doi.org/10.3390/info10060186 - 30 May 2019
Cited by 6 | Viewed by 5216
Abstract
Named Entity Recognition (NER) is the process of identifying the elementary units in a text document and classifying them into predefined categories such as person, location, organization, and so forth. NER plays an important role in many natural language processing applications like information retrieval, question answering, and machine translation. Resolving the ambiguities of the lexical items involved in a text document is a challenging task. NER in Indian languages is always a complex task due to their morphological richness and agglutinative nature. Even though different solutions have been proposed for NER, it is still an unsolved problem. Traditional approaches to named entity recognition were based on applying hand-crafted features to classical machine learning techniques such as Hidden Markov Models (HMM), Support Vector Machines (SVM), and Conditional Random Fields (CRF). The introduction of deep learning techniques to the NER problem changed the scenario, with state-of-the-art results now achieved using deep learning architectures. In this paper, we address the problem of effective word representation for NER in Indian languages by capturing syntactic, semantic, and morphological information. We propose a deep learning based entity extraction system for Indian languages using a novel combined word representation, including character-level, word-level, and affix-level embeddings. We used the ‘ARNEKT-IECSIL 2018’ shared data for training and testing. Our results highlight the improvement obtained over existing pre-trained word representations. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
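A hedged PyTorch sketch of a combined word representation of this kind, concatenating word, character-level, and suffix ("affix") embeddings. Vocabulary sizes, dimensions, and the toy batch are illustrative assumptions, not the authors' configuration.

```python
# Combined word representation for NER: concatenate word, character-level,
# and suffix ("affix") embeddings (PyTorch). Sizes and the toy batch are
# illustrative, not the article's configuration.
import torch
import torch.nn as nn

class CombinedEmbedding(nn.Module):
    def __init__(self, n_words=5000, n_chars=100, n_affixes=300,
                 d_word=100, d_char=25, d_affix=20):
        super().__init__()
        self.word = nn.Embedding(n_words, d_word)
        self.char = nn.Embedding(n_chars, d_char)
        self.char_lstm = nn.LSTM(d_char, d_char, batch_first=True, bidirectional=True)
        self.affix = nn.Embedding(n_affixes, d_affix)

    def forward(self, word_ids, char_ids, affix_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, chars); affix_ids: (batch, seq)
        b, s, c = char_ids.shape
        chars = self.char(char_ids).view(b * s, c, -1)
        _, (h, _) = self.char_lstm(chars)              # final forward/backward states
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        return torch.cat([self.word(word_ids), char_repr, self.affix(affix_ids)], dim=-1)

emb = CombinedEmbedding()
out = emb(torch.zeros(2, 7, dtype=torch.long),
          torch.zeros(2, 7, 12, dtype=torch.long),
          torch.zeros(2, 7, dtype=torch.long))
print(out.shape)  # torch.Size([2, 7, 170]) = 100 word + 50 char + 20 affix
```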

17 pages, 514 KiB  
Article
Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities
by Denis Maurel, Enza Morale, Nicolas Thouvenin, Patrice Ringot and Angel Turri
Information 2019, 10(5), 178; https://doi.org/10.3390/info10050178 - 22 May 2019
Cited by 3 | Viewed by 4809
Abstract
Istex is a database of twenty million full-text scientific papers bought by the French Government for the use of academic libraries. Papers are usually searched for by title, authors, keywords or, possibly, the abstract. To enable new types of queries on Istex, we implemented named entity recognition on all papers and offer users the possibility to run searches on these entities. After presenting the French Istex project, we detail in this paper the named entity recognition with CasEN, a cascade of graphs implemented on the Unitex software. CasEN exists in French, but not in English, so the first challenge was to build a new cascade in a short time. Its evaluation showed good precision, even if recall was not very high; high precision was very important for this project, to ensure that queries did not return unwanted papers. The second challenge was deploying Unitex to parse around twenty million documents, for which we used a dockerized application. Finally, we also explain how to query the resulting named entities on the Istex website. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
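A generic sketch of batch entity tagging over a large document collection, with results written out as records ready to be indexed for entity search. It is not the authors' dockerized Unitex/CasEN pipeline; the tagging function and file layout are illustrative assumptions.

```python
# Generic batch entity tagging over a large collection, writing JSON Lines
# records that a search index can consume. The tagging function is a toy
# stand-in, not the CasEN/Unitex cascade used in the article.
import json
from multiprocessing import Pool

def tag_entities(doc):
    # Toy tagger: flag capitalised tokens as entity candidates.
    ents = [t.strip(".,") for t in doc["text"].split() if t[:1].isupper()]
    return {"id": doc["id"], "entities": ents}

def run(docs, out_path="entities.jsonl", workers=4):
    with Pool(workers) as pool, open(out_path, "w", encoding="utf-8") as out:
        for rec in pool.imap_unordered(tag_entities, docs, chunksize=100):
            out.write(json.dumps(rec, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    corpus = [{"id": i, "text": f"Document {i} mentions Paris and the CNRS."} for i in range(1000)]
    run(corpus)
```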
