AI for Text Understanding

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (15 January 2023) | Viewed by 16322

Special Issue Editors


Dr. Els Lefever
Guest Editor
LT³ (Language and Translation Technology Team), Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Interests: natural language processing (NLP); machine learning of natural language and multilingual natural language processing with a special interest in computational semantics, cross-lingual transfer learning, and multilingual terminology extraction; sentiment analysis, hate speech detection, argumentation mining in social media, and automatic detection of irony in online text

Prof. Dr. Veronique Hoste
Guest Editor
LT³ (Language and Translation Technology Team), Department of Translation, Interpreting and Communication, Ghent University, Ghent, Belgium
Interests: natural language processing (NLP); machine learning of natural language and multilingual natural language processing with a special interest in cross-lingual transfer learning, classifier optimization, and multilingual terminology extraction; sentiment analysis, emotion detection, event detection, coreference resolution, and the automatic detection of irony in online text

Special Issue Information

Dear Colleagues, 

This Special Issue on “Artificial Intelligence for Text Understanding” focuses on the different aspects of AI involved when free, unstructured text is automatically converted into structured data, including, but not limited to, commonsense reasoning, automated reasoning and inference, ethical AI, heuristic search, knowledge representation, machine learning, and natural language processing.

In the field of natural language processing, the recent introduction of the Transformer model has truly changed the way we model textual data and has advanced the state of the art across a wide range of NLP tasks. These transformer models, as well as other deep learning approaches, however, also raise a number of issues and concerns within the NLP community. Although these machine learning models often lead to very exciting results, they function as black boxes, i.e., the resulting models and their output are hard to interpret. In addition, NLP has evolved from an academic research field into a widely adopted technology in industry, marketing, recruiting, media, and politics over the last decade. As a result, researchers have started asking questions about possible harmful applications and about bias both in the training data and in the application of machine learning models in and outside of academia. The most recent generation of machine learning approaches is also extremely data-greedy, often making it hard for low(er)-resourced languages to keep pace with NLP advancements for English and other well-resourced languages. Finally, we observe the limitations of purely lexical language modeling for more complex linguistic tasks such as irony detection, event detection, or coreference resolution, which clearly require more advanced commonsense or deep linguistic knowledge.

For this Special Issue, we welcome papers describing state-of-the-art approaches to all of the different aspects of AI for text understanding, with a special focus on approaches addressing the following topics:

  • Overview papers of the current state-of-the-art of NLP tasks;
  • Explainable machine learning approaches;
  • Approaches integrating linguistic information or commonsense reasoning into deep learning networks;
  • Cross-lingual approaches for transfer of knowledge from well-resourced to low(er)-resourced languages, including code-mixing;
  • Transfer approaches for domain-specific data and applications;
  • Handling bias in AI methodologies.

Dr. Els Lefever
Prof. Dr. Veronique Hoste
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • cross-lingual NLP
  • transfer learning in NLP
  • explainable machine learning for NLP
  • ethics in NLP
  • commonsense reasoning
  • integration of linguistic and commonsense information in deep learning for NLP

Published Papers (7 papers)

Research

16 pages, 417 KiB  
Article
Fuzzy Rough Nearest Neighbour Methods for Aspect-Based Sentiment Analysis
by Olha Kaminska, Chris Cornelis and Veronique Hoste
Electronics 2023, 12(5), 1088; https://doi.org/10.3390/electronics12051088 - 22 Feb 2023
Cited by 4 | Viewed by 1494
Abstract
Fine-grained sentiment analysis, known as Aspect-Based Sentiment Analysis (ABSA), establishes the polarity of a section of text concerning a particular aspect. Aspect, sentiment, and emotion categorisation are the three steps that make up the configuration of ABSA, which we looked into for the dataset of English reviews. In this work, due to the fuzzy nature of textual data, we investigated machine learning methods based on fuzzy rough sets, which we believe are more interpretable than complex state-of-the-art models. The novelty of this paper is the use of a pipeline that incorporates all three mentioned steps and applies Fuzzy-Rough Nearest Neighbour classification techniques with their extension based on ordered weighted average operators (FRNN-OWA), combined with text embeddings based on transformers. After some improvements in the pipeline’s stages, such as using two separate models for emotion detection, we obtain the correct results for the majority of test instances (up to 81.4%) for all three classification tasks. We consider three different options for the pipeline. In two of them, all three classification tasks are performed consecutively, reducing data at each step to retain only correct predictions, while the third option performs each step independently. This solution allows us to examine the prediction results after each step and spot certain patterns. We used it for an error analysis that enables us, for each test instance, to identify the neighbouring training samples and demonstrate that our methods can extract useful patterns from the data. Finally, we compare our results with another paper that performed the same ABSA classification for the Dutch version of the dataset and conclude that our results are in line with theirs or even slightly better.
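The FRNN-OWA classifier itself is not spelled out in the abstract, but its core idea is compact enough to sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity over generic sentence embeddings, a fixed neighbourhood size, and linearly decreasing OWA weights are all illustrative choices.

```python
import numpy as np

def owa_weights(k: int) -> np.ndarray:
    """Linearly decreasing ordered weighted average (OWA) weights."""
    w = np.arange(k, 0, -1, dtype=float)
    return w / w.sum()

def cosine_sims(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def frnn_owa_predict(X_train, y_train, X_test, k=5):
    """For each test embedding, score every class by OWA-aggregating the
    similarities of its k most similar training examples from that class,
    then predict the highest-scoring class."""
    y_train = np.asarray(y_train)
    sims = cosine_sims(np.asarray(X_test), np.asarray(X_train))
    classes = np.unique(y_train)
    predictions = []
    for row in sims:
        scores = []
        for c in classes:
            top = np.sort(row[y_train == c])[::-1][:k]
            w = owa_weights(len(top))
            scores.append(float(top @ w))
        predictions.append(classes[int(np.argmax(scores))])
    return np.array(predictions)
```

Each of the three ABSA steps (aspect, sentiment, emotion) would run such a scheme over transformer-based sentence embeddings; the OWA aggregation softens the winner-takes-all behaviour of plain k-NN and is part of what makes the neighbour-based predictions easy to inspect during error analysis.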

17 pages, 781 KiB  
Article
Distilling Monolingual Models from Large Multilingual Transformers
by Pranaydeep Singh, Orphée De Clercq and Els Lefever
Electronics 2023, 12(4), 1022; https://doi.org/10.3390/electronics12041022 - 18 Feb 2023
Cited by 2 | Viewed by 2203
Abstract
Although language modeling has been trending upwards steadily, models available for low-resourced languages are limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant overheads for deployment vis-à-vis their model size, inference speeds, etc. We attempt to tackle this problem by proposing a novel methodology to apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks each for six languages. We further dive into the possible modifications to the basic setup for low-resourced languages by exploring ideas to tune the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to understand the different components of the setup better and find out what works best for the two under-resourced languages, Swahili and Slovene.
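The distillation recipe is not detailed in this abstract, so the following is only a hedged sketch of the general mechanism: a frozen multilingual teacher guides a small monolingual student on masked-language-modelling batches. The temperature, the loss-mixing weight alpha, the Hugging-Face-style .logits and labels conventions, and the assumption of a shared teacher-student vocabulary (the paper also explores tuning the student vocabulary, which this sketch ignores) are all assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: KL divergence between the temperature-scaled
    teacher and student distributions over a shared vocabulary."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t

def distillation_step(student, teacher, batch, optimizer, alpha=0.5):
    """One hypothetical MLM distillation step. `student` and `teacher` are
    assumed to be Hugging-Face-style models returning `.logits`; `batch`
    holds `input_ids`, `attention_mask`, and MLM `labels` with -100 on
    positions that were not masked."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits   # frozen multilingual teacher
    student_logits = student(**batch).logits       # small monolingual student
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    loss = alpha * mlm_loss + (1 - alpha) * distillation_loss(
        student_logits, teacher_logits
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```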

20 pages, 729 KiB  
Article
A Benchmark for Dutch End-to-End Cross-Document Event Coreference Resolution
by Loic De Langhe, Thierry Desot, Orphée De Clercq and Veronique Hoste
Electronics 2023, 12(4), 850; https://doi.org/10.3390/electronics12040850 - 8 Feb 2023
Cited by 2 | Viewed by 1051
Abstract
In this paper, we present a benchmark result for end-to-end cross-document event coreference resolution in Dutch. First, the state of the art of this task in other languages is introduced, as well as currently existing resources and commonly used evaluation metrics. We then build on recently published work to fully explore end-to-end event coreference resolution for the first time in the Dutch language domain. For this purpose, two well-performing transformer-based algorithms for the respective detection and coreference resolution of Dutch textual events are combined in a pipeline architecture and compared to baseline scores relying on feature-based methods. The results are promising and comparable to similar studies in higher-resourced languages; however, they also reveal that in this specific NLP domain, much work remains to be done. In order to gain more insights, an in-depth analysis of the two pipeline components is carried out to highlight and overcome possible shortcomings of the current approach and to provide suggestions for future work.
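The two transformer components are the paper's contribution and are not reproduced here; the pipeline glue around them, however, is generic enough to sketch. In the code below, detect and score are hypothetical placeholders for the mention-detection and pairwise coreference models, and greedy union-find merging is one simple way (not necessarily the authors' way) to turn pairwise decisions into cross-document clusters.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class EventMention:
    doc_id: str
    text: str

def resolve_events(documents, detect, score, threshold=0.5):
    """Schematic end-to-end pipeline: detect event mentions per document,
    score every cross-document mention pair, and merge pairs scoring above
    the threshold into coreference clusters via union-find."""
    mentions = [m for doc_id, text in documents for m in detect(doc_id, text)]
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if score(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(i), []).append(m)
    return list(clusters.values())
```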

17 pages, 1516 KiB  
Article
Extractive Arabic Text Summarization-Graph-Based Approach
by Yazan Alaya AL-Khassawneh and Essam Said Hanandeh
Electronics 2023, 12(2), 437; https://doi.org/10.3390/electronics12020437 - 14 Jan 2023
Cited by 5 | Viewed by 2957
Abstract
With the noteworthy expansion of textual data sources in recent years, easy, quick, and precise text processing has become a key challenge. Automatic text summarization is the process of condensing text documents into shorter summaries to facilitate verification of their basic contents, which must be completed without losing vital information and features. Text summarization is a particularly difficult information retrieval task, especially for Arabic. In this research, we offer an automatic, general, and extractive Arabic single-document summarizing approach with the goal of delivering a sufficiently informative summary. The proposed model is based on a textual graph to generate a coherent summary. Firstly, the original text is converted to a textual graph using a novel formulation that takes into account sentence relevance, coverage, and diversity to evaluate each sentence using a mix of statistical and semantic criteria. Next, a sub-graph is built to reduce the size of the original text. Finally, unwanted and less weighted phrases are removed from the summarized sentences to generate a final summary. We used Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as an evaluation metric to review our proposed technique and compare it with the most advanced methods. A trial on the Essex Arabic Summaries Corpus (EASC) using the ROUGE index showed promising results compared with the currently available methods.
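As a rough illustration of the graph-based family this approach belongs to, the sketch below builds a sentence graph and ranks nodes by centrality. TF-IDF cosine similarity as the edge weight and PageRank as the ranking function are generic stand-ins, not the paper's novel statistical-plus-semantic formulation, and the sub-graph reduction and phrase-removal steps are omitted.

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summarize(sentences, n_keep=3):
    """Generic graph-based extractive summarization: sentences are nodes,
    TF-IDF cosine similarities are edge weights, PageRank centrality ranks
    the nodes, and the top-ranked sentences form the summary."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)                   # drop self-loops
    ranks = nx.pagerank(nx.from_numpy_array(sim), weight="weight")
    best = sorted(ranks, key=ranks.get, reverse=True)[:n_keep]
    return [sentences[i] for i in sorted(best)]  # restore document order
```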

15 pages, 2345 KiB  
Article
User OCEAN Personality Model Construction Method Using a BP Neural Network
by Xiaomei Qin, Zhixin Liu, Yuwei Liu, Shan Liu, Bo Yang, Lirong Yin, Mingzhe Liu and Wenfeng Zheng
Electronics 2022, 11(19), 3022; https://doi.org/10.3390/electronics11193022 - 23 Sep 2022
Cited by 88 | Viewed by 4136
Abstract
In the era of big data, the Internet is enmeshed in people’s lives and brings convenience to their work and daily lives. Analyzing user preferences and predicting user behavior from user data can provide references for optimizing information structure and improving service accuracy. According to the present research, users’ behavior on social networking sites is strongly correlated with their personality, and the five characteristics of the OCEAN (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) personality model can cover all aspects of a user’s personality. To identify a user’s OCEAN personality model, it is important to analyze the digital footprints they leave on social networking sites, extract the rules of their behavior, and then make predictions about that behavior. In this paper, the Latent Dirichlet Allocation (LDA) topic model is first used to extract the user’s text features. Second, the extracted features are used as sample input for a BP neural network. The results of the user’s OCEAN personality model obtained by a questionnaire are used as sample output for the BP neural network. Finally, the neural network is trained. A mapping model between the probability of the user’s text topics and their OCEAN personality model is established to predict the latter. The results show that the present approach improves the efficiency and accuracy of such a prediction.
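A minimal end-to-end sketch of this LDA-plus-backpropagation pipeline is given below using scikit-learn. The topic count, the hidden-layer size, the English stop-word list, and the use of MLPRegressor as the backpropagation-trained network are assumptions for illustration; the paper's actual architecture and hyperparameters may differ.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPRegressor

def fit_ocean_model(posts, ocean_scores, n_topics=20):
    """posts: one text per user; ocean_scores: (n_users, 5) questionnaire
    results for Openness, Conscientiousness, Extraversion, Agreeableness,
    and Neuroticism. Returns (vectorizer, lda, net)."""
    vec = CountVectorizer(max_features=5000, stop_words="english")
    counts = vec.fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(counts)           # per-user topic proportions
    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
    net.fit(topics, ocean_scores)                # backpropagation-trained mapping
    return vec, lda, net

def predict_ocean(model, posts):
    """Map unseen user texts to predicted OCEAN scores."""
    vec, lda, net = model
    return net.predict(lda.transform(vec.transform(posts)))
```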

28 pages, 3622 KiB  
Article
Estimation of Demographic Traits of the Deputies through Parliamentary Debates Using Machine Learning
by Huseyin Polat and Mesut Korpe
Electronics 2022, 11(15), 2374; https://doi.org/10.3390/electronics11152374 - 29 Jul 2022
Cited by 1 | Viewed by 1669
Abstract
One of the most impressive applications of the combined use of natural language processing (NLP), classical machine learning, and deep learning (DL) approaches is the estimation of demographic traits from text. Author Profiling (AP) is the analysis of a text to identify the demographics or characteristics of its author. So far, most researchers in this field have focused on using social media data in the English language. This article aims to expand the predictive potential of demographic traits by focusing on a more diverse dataset and language. Knowing the background of deputies is essential for citizens, political scientists, and policymakers. In this study, we present the application of NLP and machine learning (ML) approaches to Turkish parliamentary debates to estimate the demographic traits of the deputies. Seven traits were considered: gender, age, education, occupation, election region, party, and party status. As a first step, a corpus was compiled from Turkish parliamentary debates between 2012 and 2020. Document representations (feature extraction) were built using various NLP techniques. Then, we created sub-datasets containing the extracted features from the corpus. These sub-datasets were used by different ML classification algorithms. The best classification accuracy rates exceeded the majority baseline by more than 31%, 27%, 35%, 41%, 29%, 59%, and 32% for gender, age, education, occupation, election region, party, and party status, respectively. The experimental results show that the demographics of deputies can be estimated effectively using NLP, classical ML, and DL approaches.
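A hedged sketch of the per-trait setup follows. TF-IDF word and bigram features with logistic regression are generic stand-ins for the paper's various document representations and classifier comparisons; only the seven trait names are taken from the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

TRAITS = ["gender", "age", "education", "occupation",
          "election_region", "party", "party_status"]

def train_trait_classifiers(speeches, labels_by_trait):
    """One independent text classifier per demographic trait.
    speeches: list of debate transcripts; labels_by_trait: dict mapping
    each trait name to its per-speech label list."""
    models = {}
    for trait in TRAITS:
        clf = make_pipeline(
            TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        )
        clf.fit(speeches, labels_by_trait[trait])
        models[trait] = clf
    return models
```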

22 pages, 821 KiB  
Article
Corpus Statistics Empowered Document Classification
by Farid Uddin, Yibo Chen, Zuping Zhang and Xin Huang
Electronics 2022, 11(14), 2168; https://doi.org/10.3390/electronics11142168 - 11 Jul 2022
Viewed by 1380
Abstract
In natural language processing (NLP), document classification is an important task that relies on the proper thematic representation of the documents. Gaussian mixture-based clustering is widespread for capturing rich thematic semantics but neglects to emphasize potential terms in the corpus. Moreover, the soft clustering approach causes long-tail noise by putting every word into every cluster, which affects the natural thematic representation of documents and their proper classification. It is more challenging to capture semantic insights when dealing with short documents, where word co-occurrence information is limited. In this context, for long texts, we propose the Weighted Sparse Document Vector (WSDV), which performs clustering on weighted data that emphasizes vital terms and moderates the soft clustering by removing outliers from the converged clusters. Besides the removal of outliers, WSDV utilizes corpus statistics in different steps of the vectorial representation of the document. For short texts, we propose the Weighted Compact Document Vector (WCDV), which captures better semantic insights in building document vectors by emphasizing potential terms and capturing uncertainty information while measuring the affinity between distributions of words. Using available corpus statistics, WCDV sufficiently handles the data sparsity of short texts without depending on external knowledge sources. To evaluate the proposed models, we performed multiclass document classification using standard performance measures (precision, recall, F1-score, and accuracy) on three long-text and two short-text benchmark datasets, outperforming some state-of-the-art models. The experimental results demonstrate that in long-text classification, WSDV reached 97.83% accuracy on the AgNews dataset, 86.05% accuracy on the 20Newsgroup dataset, and 98.67% accuracy on the R8 dataset. In short-text classification, WCDV reached 72.7% accuracy on the SearchSnippets dataset and 89.4% accuracy on the Twitter dataset.
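The WSDV construction is only outlined above, so the sketch below captures the general idea under loudly labelled assumptions: the function name, the use of IDF as the corpus-statistics weight, and the fixed probability cutoff for pruning soft assignments are all invented for illustration rather than taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def weighted_sparse_doc_vector(word_vecs, idf_weights, gmm, min_prob=0.2):
    """Sketch of a WSDV-style representation: each word votes for every
    Gaussian cluster in proportion to its soft-assignment probability and
    its corpus weight (IDF here); assignments below min_prob are zeroed
    out to suppress long-tail noise, yielding a sparse per-cluster vector."""
    probs = gmm.predict_proba(word_vecs)         # (n_words, n_clusters)
    probs[probs < min_prob] = 0.0                # moderate the soft clustering
    return (idf_weights[:, None] * probs).sum(axis=0)

# Hypothetical usage: fit the mixture on IDF-weighted word embeddings pooled
# over the corpus, then build one vector per document.
# gmm = GaussianMixture(n_components=50).fit(all_weighted_word_vecs)
# doc_vec = weighted_sparse_doc_vector(doc_word_vecs, doc_idf, gmm)
```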
