Natural Language Processing for Social Media

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (31 December 2020) | Viewed by 50314

Special Issue Editor


Dr. Yannis Korkontzelos
Guest Editor
Department of Computer Science, Edge Hill University, St Helens Road, Ormskirk L39 4QP, Lancashire, UK
Interests: natural language processing; text mining; text mining for open source software; data science; data analysis; machine learning for text analysis; multiword expression recognition

Special Issue Information

Dear Colleagues,

Social media has revolutionised the way users express themselves, discuss, and exchange information online, generating vast amounts of textual data. Not only researchers in natural language processing, text analysis, information science, psychology, and social science, but also companies and organisations worldwide have shown keen interest in structuring information, processing text, and extracting trends from social media data.
Natural language in social media differs from traditional resources such as newspapers, books, and scientific articles: it typically contains abbreviations, emoticons, neologisms, connotations, humour, and sarcasm. At the same time, companies and organisations are interested in analysing social media text because of the value of public opinion and its impact on profits.
This Special Issue invites original contributions that discuss challenges, novel tasks, approaches, and evaluation methods for social media analysis.

Topics of interest include, but are not limited to, the following:

  • Spam and noise detection;
  • Language/dialect identification;
  • NLP tools: text normalisation, tokenisation, part-of-speech tagging, chunking, and parsing;
  • Machine translation;
  • Automatic summarisation;
  • Sentiment analysis;
  • Emotion detection;
  • Data collection and annotation;
  • Natural language visualisation for social media;
  • Applied social media text analysis: healthcare, retail, marketing, finance, media, politics, disaster monitoring, security, and defence.

Dr. Yannis Korkontzelos
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Social media analysis
  • Natural language processing
  • Text analysis
  • Social media text
  • Machine/deep learning approaches for social media text

Published Papers (11 papers)

Research

20 pages, 878 KiB  
Article
Anonymity and Inhibition in Newspaper Comments
by Magnus Knustad and Christer Johansson
Information 2021, 12(3), 106; https://doi.org/10.3390/info12030106 - 3 Mar 2021
Cited by 2 | Viewed by 2712
Abstract
Newspaper comment sections allow readers to voice their opinions on a wide range of topics, provide feedback for journalists and editors, and may enable public debate. Comment sections have nevertheless been criticized as a medium for toxic comments, behavior that has been attributed to the effect of anonymity. Several studies have found a relationship between anonymity and toxic comments, based on laboratory conditions or on comparisons of comments from different sites or platforms. The current study uses real-world data sampled from The Washington Post and The New York Times, where anonymous and non-anonymous users comment on the same articles. This sampling strategy reduces the possibility of confounding variables, making it more plausible that any observed differences between the two groups can be explained by anonymity. A small but significant relationship between anonymity and toxic comments was found, though the effects of both the newspaper and the direction of the comment were stronger. While non-anonymous commenters do write fewer toxic comments, many of the toxic comments were directed at targets other than the original article or its author. This suggests a way to restrict toxic comments while allowing anonymity: restrict references to others, e.g., by requiring writers to stay on topic.
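
An association of this kind can be illustrated with a simple contingency-table test. The sketch below uses invented counts, not the study's data, and is only meant to show the shape of such an analysis:

```python
# Illustrative contingency-table test for an anonymity/toxicity association.
# The counts are hypothetical, not the study's data.
from scipy.stats import chi2_contingency

#           toxic  non-toxic
table = [[  320,   4680],    # anonymous commenters (hypothetical)
         [  210,   4790]]    # non-anonymous commenters (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p-value indicates an association, but says nothing by itself
# about effect size, direction, or confounders.
```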

15 pages, 3391 KiB  
Article
The Evolution of Language Models Applied to Emotion Analysis of Arabic Tweets
by Nora Al-Twairesh
Information 2021, 12(2), 84; https://doi.org/10.3390/info12020084 - 17 Feb 2021
Cited by 25 | Viewed by 3628
Abstract
The field of natural language processing (NLP) has witnessed a boom in language representation models with the introduction of pretrained language models, which are trained on massive textual data and then fine-tuned for downstream NLP tasks. In this paper, we study the evolution of language representation models by analyzing their effect on emotion analysis, an under-researched NLP task, for a low-resource language: Arabic. Most studies in the field of affect analysis have focused on sentiment analysis, i.e., classifying text by valence (positive, negative, neutral), while few go further to analyze finer-grained emotional states (happiness, sadness, anger, etc.). Emotion analysis is a text classification problem tackled with machine learning techniques, with different language representation models serving as the features these models learn from. We perform an empirical study on the evolution of language models, from the traditional term frequency–inverse document frequency (TF–IDF), to the more sophisticated word embedding word2vec, and finally to the recent state-of-the-art pretrained language model, bidirectional encoder representations from transformers (BERT). We observe and analyze how performance increases as the language model changes, and we also investigate different BERT models for Arabic. We find that the best performance is achieved with the ArabicBERT large model, a BERT model trained on a large dataset of Arabic text. The increase in F1-score was significant, at +7–21%.
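
To make the first stage of this evolution concrete, a TF–IDF baseline for emotion classification can be sketched as follows. This is a generic scikit-learn recipe with placeholder data, not the paper's pipeline:

```python
# Generic TF-IDF + linear classifier baseline for emotion classification.
# Texts and labels are placeholders, not the paper's Arabic tweet data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = [f"placeholder tweet number {i}" for i in range(60)]    # stand-in corpus
labels = ["joy" if i % 2 == 0 else "anger" for i in range(60)]  # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

vec = TfidfVectorizer(ngram_range=(1, 2))   # word unigrams and bigrams
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_tr), y_tr)

pred = clf.predict(vec.transform(X_te))
print("macro-F1:", f1_score(y_te, pred, average="macro"))
```

Each later stage of the evolution then swaps the feature extractor (word2vec averages, then a fine-tuned BERT encoder) while keeping the classification task fixed.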

22 pages, 4592 KiB  
Article
Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings
by Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha and Moussa Diallo
Information 2021, 12(2), 62; https://doi.org/10.3390/info12020062 - 3 Feb 2021
Cited by 14 | Viewed by 4215
Abstract
Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification (CTC) with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition (AASR) system using phoneme-based subword units. The algorithm inserts the epenthetic vowel እ[ɨ], which is not covered by our grapheme-to-phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to yield more accurate speech recognition systems than their character, phoneme, and character-based subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with a word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine and speech translation tasks.
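
For readers unfamiliar with BPE, the segmentation step can be sketched in a few lines: BPE repeatedly merges the most frequent adjacent pair of symbols. The toy below runs over invented phoneme sequences and is not the authors' implementation:

```python
# Toy BPE over phoneme sequences: repeatedly merge the most frequent
# adjacent symbol pair. The sample data is invented for illustration.
from collections import Counter

corpus = [tuple("s ɨ l a m".split()),   # hypothetical phonemized words
          tuple("s ɨ l k".split()),
          tuple("s ɨ l a s e".split())]

def learn_bpe(words, num_merges):
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:                   # apply the merge corpus-wide
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges, words

merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)     # learned pair merges, e.g. ('s', 'ɨ') first
print(segmented)  # phoneme-based subword units per word
```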

17 pages, 9497 KiB  
Article
Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya
by Awet Fesseha, Shengwu Xiong, Eshete Derb Emiru, Moussa Diallo and Abdelghani Dahou
Information 2021, 12(2), 52; https://doi.org/10.3390/info12020052 - 25 Jan 2021
Cited by 66 | Viewed by 6165
Abstract
This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), a Semitic language spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive, freely available data. Furthermore, like other Semitic languages, it is characterized as one of the most semantically and syntactically complex languages in the world. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding technique shown here. We investigate which word representation methods perform better for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Two datasets are used: a manually annotated one containing 30,000 Tigrinya news texts from various sources in six categories (“sport”, “agriculture”, “politics”, “religion”, “education”, and “health”), and an unannotated corpus of more than six million words. We explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of accuracy, precision, recall, and F1 score. The CBOW CNN with word2vec achieves the best accuracy, at 93.41%, significantly improving the accuracy of Tigrinya news classification.
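
A CNN text classifier over pretrained embeddings, in the spirit of the setup above, might look like the following PyTorch sketch. The architecture and dimensions are generic assumptions, not the authors' exact model, and `pretrained` stands in for a word2vec or FastText embedding matrix:

```python
# Generic CNN text classifier over pretrained word embeddings (PyTorch).
# `pretrained` is a placeholder for a word2vec/FastText embedding matrix.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, pretrained, num_classes=6, kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        dim = pretrained.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
        # Convolve, then max-pool over time for each kernel size.
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))    # (batch, num_classes)

pretrained = torch.randn(5000, 100)               # stand-in for trained embeddings
model = TextCNN(pretrained)
logits = model(torch.randint(0, 5000, (8, 50)))   # dummy batch of token ids
print(logits.shape)                               # torch.Size([8, 6])
```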

13 pages, 444 KiB  
Article
Narrative Construction of Product Reviews Reveals the Level of Post-Decisional Cognitive Dissonance
by Tibor Pólya, Gabriella Judith Kengyel and Tímea Budai
Information 2021, 12(1), 46; https://doi.org/10.3390/info12010046 - 19 Jan 2021
Cited by 3 | Viewed by 3154
Abstract
Social media platforms host an increasing amount of customer reviews on a wide range of products. While most studies on product reviews focus on the sentiments expressed, on helpfulness as judged by readers, and on their impact on subsequent buying, this study aims at uncovering the psychological state of the persons writing the reviews. More specifically, the study applies a narrative approach to the analysis of product reviews and addresses the question of what the narrative construction of product reviews reveals about the level of post-decisional cognitive dissonance experienced by reviewers. The study involved 94 participants, who were asked to write a product review of their recently bought cell phones. The level of cognitive dissonance was measured by a self-report scale, and the product reviews were analyzed with the Narrative Categorical Content Analytical Toolkit. The analysis revealed that agency, spatio-temporal perspective, and psychological perspective reflected the reviewers' level of cognitive dissonance. The results are interpreted by elaborating on the idea that narratives afford the expression of affect.
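
The kind of relationship reported here, between a narrative feature and a self-report measure, can be illustrated with a simple correlation. All values below are invented, and the toy feature is not one of the Toolkit's actual categories:

```python
# Toy correlation between a narrative feature (hypothetical "agency
# markers per 100 words") and a self-reported dissonance score.
# All numbers are invented for illustration.
from scipy.stats import pearsonr

agency_per_100_words = [2.1, 0.8, 3.5, 1.2, 2.9, 0.5, 1.8]  # hypothetical
dissonance_score     = [4.2, 2.1, 5.0, 2.8, 4.5, 1.9, 3.3]  # hypothetical

r, p = pearsonr(agency_per_100_words, dissonance_score)
print(f"r={r:.2f}, p={p:.3f}")
```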

9 pages, 1620 KiB  
Article
Combating Fake News in “Low-Resource” Languages: Amharic Fake News Detection Accompanied by Resource Crafting
by Fantahun Gereme, William Zhu, Tewodros Ayall and Dagmawi Alemu
Information 2021, 12(1), 20; https://doi.org/10.3390/info12010020 - 7 Jan 2021
Cited by 30 | Viewed by 6586
Abstract
The need to fight the progressive negative impact of fake news is escalating, as is evident in the drive to conduct research and develop tools for the job. However, a lack of adequate datasets and good word embeddings has made it challenging to build sufficiently accurate detection methods. These resources are entirely missing for “low-resource” African languages such as Amharic, and alleviating these critical problems should not be left for tomorrow. Deep learning methods and word embeddings have contributed greatly to devising automatic fake news detection mechanisms. Several contributions are presented here, including an Amharic fake news detection model, a general-purpose Amharic corpus (GPAC), a novel Amharic fake news detection dataset (ETH_FAKE), and an Amharic fastText word embedding (AMFTWE). Our Amharic fake news detection model, evaluated on the ETH_FAKE dataset using the AMFTWE, performed very well.
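
Training fastText-style subword embeddings on a general-purpose corpus is straightforward with gensim. The sketch below is a generic recipe with an assumed corpus path and hyperparameters, not the procedure used to build AMFTWE:

```python
# Generic recipe for training fastText subword embeddings with gensim.
# The corpus path and all hyperparameters are assumed placeholders.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("amharic_corpus.txt")  # one tokenized sentence per line
model = FastText(
    sentences,
    vector_size=300,    # embedding dimensionality (assumed)
    window=5,
    min_count=5,
    min_n=3, max_n=6,   # character n-gram range used for subwords
    epochs=5,
)
model.save("amharic_fasttext.model")

# Subword n-grams let the model produce vectors even for unseen words,
# which matters for a morphologically rich language like Amharic.
print(model.wv["ሰላም"][:5])
```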

16 pages, 416 KiB  
Article
Online Multilingual Hate Speech Detection: Experimenting with Hindi and English Social Media
by Neeraj Vashistha and Arkaitz Zubiaga
Information 2021, 12(1), 5; https://doi.org/10.3390/info12010005 - 22 Dec 2020
Cited by 47 | Viewed by 6070
Abstract
The last two decades have seen an exponential increase in the use of the Internet and social media, which has changed basic human interaction. This has led to many positive outcomes but has also brought risks and harms. The volume of harmful content online, such as hate speech, is not manageable by humans, and interest in the academic community in investigating automated means of hate speech detection has increased accordingly. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset. Having classified the posts into three classes (abusive, hateful, or neither), we create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real-time and uses the same feedback to re-train our model. We demonstrate the competitive performance of our multilingual model in two languages, English and Hindi, leading to comparable or superior performance to most monolingual models.
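
Combining heterogeneous corpora requires mapping each dataset's native labels onto one shared scheme. One way to do this with pandas is sketched below; the file names and label mappings are invented placeholders, not the six datasets used in the paper:

```python
# Harmonize several hate speech datasets into one three-class scheme:
# abusive / hateful / neither. File names and label maps are placeholders.
import pandas as pd

sources = {
    "dataset_a.csv": {"offensive": "abusive", "hate": "hateful", "none": "neither"},
    "dataset_b.csv": {"abusive": "abusive", "hateful": "hateful", "normal": "neither"},
}

frames = []
for path, label_map in sources.items():
    df = pd.read_csv(path)                # expects 'text' and 'label' columns
    df["label"] = df["label"].map(label_map)
    frames.append(df[["text", "label"]].dropna())  # drop unmappable labels

combined = pd.concat(frames, ignore_index=True)
print(combined["label"].value_counts())
```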

12 pages, 853 KiB  
Article
Determining the Age of the Author of the Text Based on Deep Neural Network Models
by Aleksandr Sergeevich Romanov, Anna Vladimirovna Kurtukova, Artem Alexandrovich Sobolev, Alexander Alexandrovich Shelupanov and Anastasia Mikhailovna Fedotova
Information 2020, 11(12), 589; https://doi.org/10.3390/info11120589 - 21 Dec 2020
Cited by 9 | Viewed by 3551
Abstract
This paper is devoted to determining the age of the author of a text using deep neural network models. The article presents an analysis of methods for determining the age of the author of a text, as well as approaches to determining a user's age from a photo; the latter could address the problem of inaccurate training data by filtering out incorrect user-specified ages. A detailed description of the authors' technique, based on deep neural network models, and an interpretation of the results are also presented. The study found that the proposed technique achieved 82% accuracy in determining the age of an author from Russian-language text, which makes it competitive with approaches for other languages.

17 pages, 860 KiB  
Article
Document Summarization Based on Coverage with Noise Injection and Word Association
by Heechan Kim and Soowon Lee
Information 2020, 11(11), 536; https://doi.org/10.3390/info11110536 - 19 Nov 2020
Cited by 1 | Viewed by 1809
Abstract
Automatic document summarization is a field of natural language processing that is rapidly improving with the development of end-to-end deep learning models. In this paper, we propose a novel summarization model that consists of three methods. The first is a coverage method based on noise injection that makes the attention mechanism select only important words by treating previous context information as noise; this alleviates the problem of the summarization model generating the same word sequence repeatedly. The second is a word association method that updates the information of each word by comparing the information of the current step with that of all previous decoding steps; this captures changes in the meaning of already-decoded words in light of the words that follow them. The third is a suppression loss function that explicitly minimizes the probabilities of non-answer words. The proposed summarization model showed good performance on several recall-oriented understudy for gisting evaluation (ROUGE) metrics compared with state-of-the-art models on the CNN/Daily Mail summarization task, and the results were achieved with very few learning steps compared with those models.
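
Of the three methods, the suppression loss is the simplest to sketch: alongside the usual cross-entropy on the answer word, it penalizes the probability mass assigned to non-answer words. The PyTorch formulation below is our reading of that idea, with an assumed weighting term `lam`, not the paper's exact loss:

```python
# Sketch of a suppression loss: cross-entropy on the target word plus an
# explicit penalty on probability mass given to non-target words.
# `lam` is an assumed weighting hyperparameter.
import torch
import torch.nn.functional as F

def suppression_loss(logits, targets, lam=1.0):
    # logits: (batch, vocab), targets: (batch,)
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets)

    probs = log_probs.exp()
    mask = torch.ones_like(probs)
    mask.scatter_(1, targets.unsqueeze(1), 0.0)  # zero out the answer word
    suppress = (probs * mask).sum(dim=1).mean()  # mass on non-answer words

    return ce + lam * suppress

logits = torch.randn(4, 1000)                    # dummy decoder outputs
targets = torch.randint(0, 1000, (4,))
print(suppression_loss(logits, targets))
```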

16 pages, 1058 KiB  
Article
Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
by Mubashar Mustafa, Feng Zeng, Hussain Ghulam and Hafiz Muhammad Arslan
Information 2020, 11(11), 518; https://doi.org/10.3390/info11110518 - 5 Nov 2020
Cited by 9 | Viewed by 5462
Abstract
Document clustering groups documents according to semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied for documents whose categories overlap, owing to the purely unsupervised nature of these models. To address this, semi-supervised models have been proposed for English; however, no such work is available for the low-resource language Urdu. Document clustering is therefore a challenging task in Urdu, which has its own morphology, syntax, and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, a seeded LDA model, and Gibbs sampling; we name it Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF), and K-means) give satisfying results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping, because on this dataset they find topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model, Seeded-ULDA, performs well on both datasets, because seeding is a straightforward and effective way to instruct topic models to find topics of specific interest. The semi-supervised model, Seeded-ULDA, thus provides significantly better results than the unsupervised algorithms.
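
The core idea of seeding is to bias topic assignments toward seed words. The sketch below shows only a seeded initialization step for a collapsed Gibbs sampler, with invented seed words; it is a simplified illustration, not the Seeded-ULDA implementation:

```python
# Seeded initialization for collapsed Gibbs LDA: seed words start in
# their designated topic; all other words start in a random topic.
# Seed lists here are invented placeholders (transliterated Urdu terms).
import random

num_topics = 4
seed_topics = {
    "kriket": 0, "khilari": 0,      # topic 0: sports (hypothetical seeds)
    "hukumat": 1, "intikhabat": 1,  # topic 1: politics (hypothetical seeds)
}

def init_assignments(docs):
    """docs: list of token lists -> per-token topic assignments."""
    assignments = []
    for doc in docs:
        z = [seed_topics.get(w, random.randrange(num_topics)) for w in doc]
        assignments.append(z)
    return assignments

docs = [["kriket", "match", "khilari"], ["hukumat", "bill", "intikhabat"]]
print(init_assignments(docs))
# A full sampler would then update count matrices and resample each token's
# topic, optionally keeping seed words pinned (or merely biased) to their topic.
```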

19 pages, 4282 KiB  
Article
Automated Seeded Latent Dirichlet Allocation for Social Media Based Event Detection and Mapping
by Cornelia Ferner, Clemens Havas, Elisabeth Birnbacher, Stefan Wegenkittl and Bernd Resch
Information 2020, 11(8), 376; https://doi.org/10.3390/info11080376 - 25 Jul 2020
Cited by 15 | Viewed by 5383
Abstract
In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damage and for supporting disaster management. Topic modeling can help detect disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words allows the corresponding disaster topic to be identified directly. To enable an automated end-to-end process, we generate seed words automatically from older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and Hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.
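
Seed words can be generated automatically by contrasting term frequencies during the event window against a historical baseline from the same area: terms that spike are candidate seeds. The smoothed-ratio scoring below is an illustrative simplification with invented Tweets, not necessarily the authors' scoring:

```python
# Candidate seed words: terms whose frequency during the event window
# spikes relative to a historical baseline from the same area.
# The Tweets below are invented; the smoothed ratio is one simple choice.
from collections import Counter

baseline = ["nice day in napa", "wine tasting this weekend"]
event    = ["earthquake damage downtown", "strong earthquake woke us",
            "damage to buildings everywhere"]

def term_freqs(tweets):
    counts = Counter(w for t in tweets for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

base_tf, event_tf = term_freqs(baseline), term_freqs(event)
eps = 1e-6  # smoothing for terms unseen in the baseline
scores = {w: tf / (base_tf.get(w, 0) + eps) for w, tf in event_tf.items()}

seed_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(seed_words)   # e.g. ['earthquake', 'damage', ...]
# In practice, stop-word filtering and a larger baseline would be needed
# before feeding the seeds into the topic model.
```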
