Natural Language Processing for Social Media

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (31 December 2020) | Viewed by 50314

Special Issue Editor


Dr. Yannis Korkontzelos
Guest Editor
Department of Computer Science, Edge Hill University, St Helens Road, Ormskirk L39 4QP, Lancashire, UK
Interests: natural language processing; text mining; text mining for open source software; data science; data analysis; machine learning for text analysis; multiword expression recognition

Special Issue Information

Dear Colleagues,

Social media has revolutionised the way users express themselves, discuss, and exchange information online, generating vast amounts of textual data. Not only researchers in natural language processing, text analysis, information science, psychology, and social science, but also companies and organisations worldwide have shown keen interest in structuring information, processing text, and extracting trends from social media data.
Natural language in social media differs from traditional resources such as newspapers, books, and scientific articles: it typically contains abbreviations, emoticons, neologisms, connotations, humour, and sarcasm. At the same time, companies and organisations are interested in analysing social media text because of the value of public opinion and its impact on profits.
This Special Issue invites original contributions that discuss challenges, novel tasks, approaches, and evaluation methods for social media analysis.

Topics of interest include, but are not limited to, the following:

  • Spam and noise detection;
  • Language/dialect identification;
  • NLP tools: text normalisation, tokenisation, part-of-speech tagging, chunking, and parsing;
  • Machine translation;
  • Automatic summarisation;
  • Sentiment analysis;
  • Emotion detection;
  • Data collection and annotation;
  • Natural language visualisation for social media;
  • Applied social media text analysis: healthcare, retail, marketing, finance, media, politics, disaster monitoring, security, and defence.

Dr. Yannis Korkontzelos
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Social media analysis
  • Natural language processing
  • Text analysis
  • Social media text
  • Machine/deep learning approaches for social media text

Published Papers (11 papers)

Research

20 pages, 878 KiB  
Article
Anonymity and Inhibition in Newspaper Comments
by Magnus Knustad and Christer Johansson
Information 2021, 12(3), 106; https://doi.org/10.3390/info12030106 - 3 Mar 2021
Cited by 2 | Viewed by 2712
Abstract
Newspaper comment sections allow readers to voice their opinions on a wide range of topics, provide feedback for journalists and editors, and may enable public debate. Comment sections have nevertheless been criticized as a medium for toxic comments, behavior that has been attributed to the effect of anonymity. Several studies have found a relationship between anonymity and toxic comments, based on laboratory conditions or on comparisons of comments from different sites or platforms. The current study uses real-world data sampled from The Washington Post and The New York Times, where anonymous and non-anonymous users comment on the same articles. This sampling strategy reduces the possibility of confounding variables, making it more plausible that any observed differences between the two groups can be explained by anonymity. A small but significant relationship between anonymity and toxic comments was found, though the effects of both the newspaper and the direction of the comment were stronger. While non-anonymous commenters do write fewer toxic comments, many of the toxic comments were directed at targets other than the original article or its author. This suggests a way to restrict toxic comments while allowing anonymity: restrict references to others, e.g., by requiring writers to stay on topic.
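
An association of this kind can be illustrated with a simple contingency-table test. The sketch below uses invented counts, not the study's data, and is only meant to show the shape of such an analysis:

```python
# Illustrative contingency-table test for an anonymity/toxicity association.
# The counts are hypothetical, not the study's data.
from scipy.stats import chi2_contingency

#           toxic  non-toxic
table = [[  320,   4680],    # anonymous commenters (hypothetical)
         [  210,   4790]]    # non-anonymous commenters (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p-value indicates an association, but says nothing by itself
# about effect size, direction, or confounders.
```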

15 pages, 3391 KiB  
Article
The Evolution of Language Models Applied to Emotion Analysis of Arabic Tweets
by Nora Al-Twairesh
Information 2021, 12(2), 84; https://doi.org/10.3390/info12020084 - 17 Feb 2021
Cited by 25 | Viewed by 3628
Abstract
The field of natural language processing (NLP) has witnessed a boom in language representation models with the introduction of pretrained language models, which are trained on massive textual data and then fine-tuned for downstream NLP tasks. In this paper, we study the evolution of language representation models by analyzing their effect on emotion analysis, an under-researched NLP task, for a low-resource language: Arabic. Most studies in the field of affect analysis have focused on sentiment analysis, i.e., classifying text by valence (positive, negative, neutral), while few go further to analyze finer-grained emotional states (happiness, sadness, anger, etc.). Emotion analysis is a text classification problem tackled with machine learning techniques, with different language representation models serving as the features these models learn from. We perform an empirical study on the evolution of language models, from the traditional term frequency–inverse document frequency (TF–IDF), to the more sophisticated word embedding word2vec, and finally to the recent state-of-the-art pretrained language model, bidirectional encoder representations from transformers (BERT). We observe and analyze how performance increases as the language model changes, and we also investigate different BERT models for Arabic. We find that the best performance is achieved with the ArabicBERT large model, a BERT model trained on a large dataset of Arabic text. The increase in F1-score was significant, at +7–21%.
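
To make the first stage of this evolution concrete, a TF–IDF baseline for emotion classification can be sketched as follows. This is a generic scikit-learn recipe with placeholder data, not the paper's pipeline:

```python
# Generic TF-IDF + linear classifier baseline for emotion classification.
# Texts and labels are placeholders, not the paper's Arabic tweet data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = [f"placeholder tweet number {i}" for i in range(60)]    # stand-in corpus
labels = ["joy" if i % 2 == 0 else "anger" for i in range(60)]  # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

vec = TfidfVectorizer(ngram_range=(1, 2))   # word unigrams and bigrams
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_tr), y_tr)

pred = clf.predict(vec.transform(X_te))
print("macro-F1:", f1_score(y_te, pred, average="macro"))
```

Each later stage of the evolution then swaps the feature extractor (word2vec averages, then a fine-tuned BERT encoder) while keeping the classification task fixed.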

22 pages, 4592 KiB  
Article
Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings
by Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha and Moussa Diallo
Information 2021, 12(2), 62; https://doi.org/10.3390/info12020062 - 3 Feb 2021
Cited by 14 | Viewed by 4215
Abstract
Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification (CTC) with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition (AASR) system using phoneme-based subword units. The algorithm inserts the epenthetic vowel እ[ɨ], which is not covered by our grapheme-to-phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to yield more accurate speech recognition systems than their character, phoneme, and character-based subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with a word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine and speech translation tasks.
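
For readers unfamiliar with BPE, the segmentation step can be sketched in a few lines: BPE repeatedly merges the most frequent adjacent pair of symbols. The toy below runs over invented phoneme sequences and is not the authors' implementation:

```python
# Toy BPE over phoneme sequences: repeatedly merge the most frequent
# adjacent symbol pair. The sample data is invented for illustration.
from collections import Counter

corpus = [tuple("s ɨ l a m".split()),   # hypothetical phonemized words
          tuple("s ɨ l k".split()),
          tuple("s ɨ l a s e".split())]

def learn_bpe(words, num_merges):
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:                   # apply the merge corpus-wide
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges, words

merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)     # learned pair merges, e.g. ('s', 'ɨ') first
print(segmented)  # phoneme-based subword units per word
```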

17 pages, 9497 KiB  
Article
Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya
by Awet Fesseha, Shengwu Xiong, Eshete Derb Emiru, Moussa Diallo and Abdelghani Dahou
Information 2021, 12(2), 52; https://doi.org/10.3390/info12020052 - 25 Jan 2021
Cited by 66 | Viewed by 6165
Abstract
This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), a Semitic language spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive, freely available data. Furthermore, like other Semitic languages, it is characterized as one of the most semantically and syntactically complex languages in the world. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding technique shown here. We investigate which word representation methods perform better for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Two datasets are used: a manually annotated one containing 30,000 Tigrinya news texts from various sources in six categories (“sport”, “agriculture”, “politics”, “religion”, “education”, and “health”), and an unannotated corpus of more than six million words. We explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of accuracy, precision, recall, and F1 score. The CBOW CNN with word2vec achieves the best accuracy, at 93.41%, significantly improving the accuracy of Tigrinya news classification.
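
A CNN text classifier over pretrained embeddings, in the spirit of the setup above, might look like the following PyTorch sketch. The architecture and dimensions are generic assumptions, not the authors' exact model, and `pretrained` stands in for a word2vec or FastText embedding matrix:

```python
# Generic CNN text classifier over pretrained word embeddings (PyTorch).
# `pretrained` is a placeholder for a word2vec/FastText embedding matrix.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, pretrained, num_classes=6, kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        dim = pretrained.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
        # Convolve, then max-pool over time for each kernel size.
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))    # (batch, num_classes)

pretrained = torch.randn(5000, 100)               # stand-in for trained embeddings
model = TextCNN(pretrained)
logits = model(torch.randint(0, 5000, (8, 50)))   # dummy batch of token ids
print(logits.shape)                               # torch.Size([8, 6])
```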

13 pages, 444 KiB  
Article
Narrative Construction of Product Reviews Reveals the Level of Post-Decisional Cognitive Dissonance
by Tibor Pólya, Gabriella Judith Kengyel and Tímea Budai
Information 2021, 12(1), 46; https://doi.org/10.3390/info12010046 - 19 Jan 2021
Cited by 3 | Viewed by 3154
Abstract
Social media platforms host an increasing amount of customer reviews on a wide range of products. While most studies on product reviews focus on the sentiments expressed, on helpfulness as judged by readers, and on their impact on subsequent buying, this study aims at uncovering the psychological state of the persons writing the reviews. More specifically, the study applies a narrative approach to the analysis of product reviews and addresses the question of what the narrative construction of product reviews reveals about the level of post-decisional cognitive dissonance experienced by reviewers. The study involved 94 participants, who were asked to write a product review of their recently bought cell phones. The level of cognitive dissonance was measured by a self-report scale, and the product reviews were analyzed with the Narrative Categorical Content Analytical Toolkit. The analysis revealed that agency, spatio-temporal perspective, and psychological perspective reflected the reviewers' level of cognitive dissonance. The results are interpreted by elaborating on the idea that narratives afford the expression of affect.
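
The kind of relationship reported here, between a narrative feature and a self-report measure, can be illustrated with a simple correlation. All values below are invented, and the toy feature is not one of the Toolkit's actual categories:

```python
# Toy correlation between a narrative feature (hypothetical "agency
# markers per 100 words") and a self-reported dissonance score.
# All numbers are invented for illustration.
from scipy.stats import pearsonr

agency_per_100_words = [2.1, 0.8, 3.5, 1.2, 2.9, 0.5, 1.8]  # hypothetical
dissonance_score     = [4.2, 2.1, 5.0, 2.8, 4.5, 1.9, 3.3]  # hypothetical

r, p = pearsonr(agency_per_100_words, dissonance_score)
print(f"r={r:.2f}, p={p:.3f}")
```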

9 pages, 1620 KiB  
Article
Combating Fake News in “Low-Resource” Languages: Amharic Fake News Detection Accompanied by Resource Crafting
by Fantahun Gereme, William Zhu, Tewodros Ayall and Dagmawi Alemu
Information 2021, 12(1), 20; https://doi.org/10.3390/info12010020 - 7 Jan 2021
Cited by 30 | Viewed by 6586
Abstract
The need to fight the progressive negative impact of fake news is escalating, as is evident in the drive to conduct research and develop tools for the job. However, a lack of adequate datasets and good word embeddings has made it challenging to build sufficiently accurate detection methods. These resources are entirely missing for “low-resource” African languages such as Amharic, and alleviating these critical problems should not be left for tomorrow. Deep learning methods and word embeddings have contributed greatly to devising automatic fake news detection mechanisms. Several contributions are presented here, including an Amharic fake news detection model, a general-purpose Amharic corpus (GPAC), a novel Amharic fake news detection dataset (ETH_FAKE), and an Amharic fastText word embedding (AMFTWE). Our Amharic fake news detection model, evaluated on the ETH_FAKE dataset using the AMFTWE, performed very well.
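
Training fastText-style subword embeddings on a general-purpose corpus is straightforward with gensim. The sketch below is a generic recipe with an assumed corpus path and hyperparameters, not the procedure used to build AMFTWE:

```python
# Generic recipe for training fastText subword embeddings with gensim.
# The corpus path and all hyperparameters are assumed placeholders.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("amharic_corpus.txt")  # one tokenized sentence per line
model = FastText(
    sentences,
    vector_size=300,    # embedding dimensionality (assumed)
    window=5,
    min_count=5,
    min_n=3, max_n=6,   # character n-gram range used for subwords
    epochs=5,
)
model.save("amharic_fasttext.model")

# Subword n-grams let the model produce vectors even for unseen words,
# which matters for a morphologically rich language like Amharic.
print(model.wv["ሰላም"][:5])
```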

16 pages, 416 KiB  
Article
Online Multilingual Hate Speech Detection: Experimenting with Hindi and English Social Media
by Neeraj Vashistha and Arkaitz Zubiaga
Information 2021, 12(1), 5; https://doi.org/10.3390/info12010005 - 22 Dec 2020
Cited by 47 | Viewed by 6070
Abstract
The last two decades have seen an exponential increase in the use of the Internet and social media, which has changed basic human interaction. This has led to many positive outcomes but has also brought risks and harms. The volume of harmful content online, such as hate speech, is not manageable by humans, and interest in the academic community in investigating automated means of hate speech detection has increased accordingly. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset. Having classified the posts into three classes (abusive, hateful, or neither), we create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real-time and uses the same feedback to re-train our model. We demonstrate the competitive performance of our multilingual model in two languages, English and Hindi, leading to comparable or superior performance to most monolingual models.
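
Combining heterogeneous corpora requires mapping each dataset's native labels onto one shared scheme. One way to do this with pandas is sketched below; the file names and label mappings are invented placeholders, not the six datasets used in the paper:

```python
# Harmonize several hate speech datasets into one three-class scheme:
# abusive / hateful / neither. File names and label maps are placeholders.
import pandas as pd

sources = {
    "dataset_a.csv": {"offensive": "abusive", "hate": "hateful", "none": "neither"},
    "dataset_b.csv": {"abusive": "abusive", "hateful": "hateful", "normal": "neither"},
}

frames = []
for path, label_map in sources.items():
    df = pd.read_csv(path)                # expects 'text' and 'label' columns
    df["label"] = df["label"].map(label_map)
    frames.append(df[["text", "label"]].dropna())  # drop unmappable labels

combined = pd.concat(frames, ignore_index=True)
print(combined["label"].value_counts())
```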

12 pages, 853 KiB  
Article
Determining the Age of the Author of the Text Based on Deep Neural Network Models
by Aleksandr Sergeevich Romanov, Anna Vladimirovna Kurtukova, Artem Alexandrovich Sobolev, Alexander Alexandrovich Shelupanov and Anastasia Mikhailovna Fedotova
Information 2020, 11(12), 589; https://doi.org/10.3390/info11120589 - 21 Dec 2020
Cited by 9 | Viewed by 3551
Abstract
This paper is devoted to determining the age of the author of a text using deep neural network models. The article presents an analysis of methods for determining the age of the author of a text, as well as approaches to determining a user's age from a photo; the latter could address the problem of inaccurate training data by filtering out incorrect user-specified ages. A detailed description of the authors' technique, based on deep neural network models, and an interpretation of the results are also presented. The study found that the proposed technique achieved 82% accuracy in determining the age of an author from Russian-language text, which makes it competitive with approaches for other languages.

17 pages, 860 KiB  
Article
Document Summarization Based on Coverage with Noise Injection and Word Association
by Heechan Kim and Soowon Lee
Information 2020, 11(11), 536; https://doi.org/10.3390/info11110536 - 19 Nov 2020
Cited by 1 | Viewed by 1809
Abstract
Automatic document summarization is a field of natural language processing that is rapidly improving with the development of end-to-end deep learning models. In this paper, we propose a novel summarization model that consists of three methods. The first is a coverage method based on noise injection that makes the attention mechanism select only important words by treating previous context information as noise; this alleviates the problem of the summarization model generating the same word sequence repeatedly. The second is a word association method that updates the information of each word by comparing the information of the current step with that of all previous decoding steps; this captures changes in the meaning of already-decoded words in light of the words that follow them. The third is a suppression loss function that explicitly minimizes the probabilities of non-answer words. The proposed summarization model showed good performance on several recall-oriented understudy for gisting evaluation (ROUGE) metrics compared with state-of-the-art models on the CNN/Daily Mail summarization task, and the results were achieved with very few learning steps compared with those models.
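
Of the three methods, the suppression loss is the simplest to sketch: alongside the usual cross-entropy on the answer word, it penalizes the probability mass assigned to non-answer words. The PyTorch formulation below is our reading of that idea, with an assumed weighting term `lam`, not the paper's exact loss:

```python
# Sketch of a suppression loss: cross-entropy on the target word plus an
# explicit penalty on probability mass given to non-target words.
# `lam` is an assumed weighting hyperparameter.
import torch
import torch.nn.functional as F

def suppression_loss(logits, targets, lam=1.0):
    # logits: (batch, vocab), targets: (batch,)
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets)

    probs = log_probs.exp()
    mask = torch.ones_like(probs)
    mask.scatter_(1, targets.unsqueeze(1), 0.0)  # zero out the answer word
    suppress = (probs * mask).sum(dim=1).mean()  # mass on non-answer words

    return ce + lam * suppress

logits = torch.randn(4, 1000)                    # dummy decoder outputs
targets = torch.randint(0, 1000, (4,))
print(suppression_loss(logits, targets))
```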

16 pages, 1058 KiB  
Article
Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
by Mubashar Mustafa, Feng Zeng, Hussain Ghulam and Hafiz Muhammad Arslan
Information 2020, 11(11), 518; https://doi.org/10.3390/info11110518 - 5 Nov 2020
Cited by 9 | Viewed by 5462
Abstract
Document clustering groups documents according to semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied for documents whose categories overlap, owing to the purely unsupervised nature of these models. To address this, semi-supervised models have been proposed for English; however, no such work is available for the low-resource language Urdu. Document clustering is therefore a challenging task in Urdu, which has its own morphology, syntax, and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, a seeded LDA model, and Gibbs sampling; we name it Seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF), and K-means) give satisfying results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping, because on this dataset they find topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model, Seeded-ULDA, performs well on both datasets, because seeding is a straightforward and effective way to instruct topic models to find topics of specific interest. The semi-supervised model, Seeded-ULDA, thus provides significantly better results than the unsupervised algorithms.
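
The core idea of seeding is to bias topic assignments toward seed words. The sketch below shows only a seeded initialization step for a collapsed Gibbs sampler, with invented seed words; it is a simplified illustration, not the Seeded-ULDA implementation:

```python
# Seeded initialization for collapsed Gibbs LDA: seed words start in
# their designated topic; all other words start in a random topic.
# Seed lists here are invented placeholders (transliterated Urdu terms).
import random

num_topics = 4
seed_topics = {
    "kriket": 0, "khilari": 0,      # topic 0: sports (hypothetical seeds)
    "hukumat": 1, "intikhabat": 1,  # topic 1: politics (hypothetical seeds)
}

def init_assignments(docs):
    """docs: list of token lists -> per-token topic assignments."""
    assignments = []
    for doc in docs:
        z = [seed_topics.get(w, random.randrange(num_topics)) for w in doc]
        assignments.append(z)
    return assignments

docs = [["kriket", "match", "khilari"], ["hukumat", "bill", "intikhabat"]]
print(init_assignments(docs))
# A full sampler would then update count matrices and resample each token's
# topic, optionally keeping seed words pinned (or merely biased) to their topic.
```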

19 pages, 4282 KiB  
Article
Automated Seeded Latent Dirichlet Allocation for Social Media Based Event Detection and Mapping
by Cornelia Ferner, Clemens Havas, Elisabeth Birnbacher, Stefan Wegenkittl and Bernd Resch
Information 2020, 11(8), 376; https://doi.org/10.3390/info11080376 - 25 Jul 2020
Cited by 15 | Viewed by 5383
Abstract
In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damage and for supporting disaster management. Topic modeling can help detect disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words allows the corresponding disaster topic to be identified directly. To enable an automated end-to-end process, we generate seed words automatically from older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and Hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.
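
Seed words can be generated automatically by contrasting term frequencies during the event window against a historical baseline from the same area: terms that spike are candidate seeds. The smoothed-ratio scoring below is an illustrative simplification with invented Tweets, not necessarily the authors' scoring:

```python
# Candidate seed words: terms whose frequency during the event window
# spikes relative to a historical baseline from the same area.
# The Tweets below are invented; the smoothed ratio is one simple choice.
from collections import Counter

baseline = ["nice day in napa", "wine tasting this weekend"]
event    = ["earthquake damage downtown", "strong earthquake woke us",
            "damage to buildings everywhere"]

def term_freqs(tweets):
    counts = Counter(w for t in tweets for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

base_tf, event_tf = term_freqs(baseline), term_freqs(event)
eps = 1e-6  # smoothing for terms unseen in the baseline
scores = {w: tf / (base_tf.get(w, 0) + eps) for w, tf in event_tf.items()}

seed_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(seed_words)   # e.g. ['earthquake', 'damage', ...]
# In practice, stop-word filtering and a larger baseline would be needed
# before feeding the seeds into the topic model.
```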
