MDPI - Publisher of Open Access Journals

23 pages, 888 KB

Open AccessArticle

Explainable Deep Learning Model for ChatGPT-Rephrased Fake Review Detection Using DistilBERT

by Rania A. AlQadi, Shereen A. Taie, Amira M. Idrees and Esraa Elhariri

Big Data Cogn. Comput. 2025, 9(8), 205; https://doi.org/10.3390/bdcc9080205 - 11 Aug 2025

Viewed by 1031

Customers depend heavily on reviews for product information. Fake reviews may distort the perception of product quality, making online reviews less effective. The ability of ChatGPT (GPT-3.5 and GPT-4) to generate human-like reviews and responses to inquiries across several disciplines has increased recently. This has led to an increase in the number of reviewers and applications using ChatGPT to create fake reviews. Consequently, the detection of fake reviews generated or rephrased by ChatGPT has become essential. This paper proposes a new approach that distinguishes ChatGPT-rephrased reviews, considered fake, from real ones, utilizing a balanced dataset to analyze the sentiment and linguistic patterns that characterize both kinds of review. The proposed model further leverages Explainable Artificial Intelligence (XAI) techniques, including Local Interpretable Model-agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP), for deeper insights into the model’s predictions and the classification logic. The proposed model performs a pre-processing phase that includes part-of-speech (POS) tagging, word lemmatization, and tokenization, and then applies a fine-tuned Transformer-based Machine Learning (ML) model, DistilBERT, for predictions. The experimental results indicate that the proposed fine-tuned DistilBERT, utilizing the constructed balanced dataset along with a pre-processing phase, outperforms other state-of-the-art methods for detecting ChatGPT-rephrased reviews, achieving an accuracy of 97.25% and an F1-score of 97.56%. The use of LIME and SHAP techniques not only enhanced the model’s interpretability, but also offered valuable insights into the key factors that affect the differentiation of genuine reviews from ChatGPT-rephrased ones. According to XAI, ChatGPT’s writing style is polite, uses grammatical structure, lacks specific descriptions and information in reviews, uses fancy words, is impersonal, and has deficiencies in emotional expression. These findings emphasize the effectiveness and reliability of the proposed approach. Full article

(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)

23 pages, 978 KB

Open AccessArticle

Emotional Analysis in a Morphologically Rich Language: Enhancing Machine Learning with Psychological Feature Lexicons

by Ron Keinan, Efraim Margalit and Dan Bouhnik

Electronics 2025, 14(15), 3067; https://doi.org/10.3390/electronics14153067 - 31 Jul 2025

Viewed by 537

This paper explores emotional analysis in Hebrew texts, focusing on improving machine learning techniques for depression detection by integrating psychological feature lexicons. Hebrew’s complex morphology makes emotional analysis challenging, and this study seeks to address that by combining traditional machine learning methods with sentiment lexicons. The dataset consists of over 350,000 posts from 25,000 users on the health-focused social network “Camoni” from 2010 to 2021. Various machine learning models—SVM, Random Forest, Logistic Regression, and Multi-Layer Perceptron—were used, alongside ensemble techniques like Bagging, Boosting, and Stacking. TF-IDF was applied for feature selection, with word and character n-grams, and pre-processing steps like punctuation removal, stop word elimination, and lemmatization were performed to handle Hebrew’s linguistic complexity. The models were enriched with sentiment lexicons curated by professional psychologists. The study demonstrates that integrating sentiment lexicons significantly improves classification accuracy. Specific lexicons—such as those for negative and positive emojis, hostile words, anxiety words, and no-trust words—were particularly effective in enhancing model performance. Our best model classified depression with an accuracy of 84.1%. These findings offer insights into depression detection, suggesting that practitioners in mental health and social work can improve their machine learning models for detecting depression in online discourse by incorporating emotion-based lexicons. The societal impact of this work lies in its potential to improve the detection of depression in online Hebrew discourse, offering more accurate and efficient methods for mental health interventions in online communities. Full article

(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
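The feature set described in the abstract above (character n-grams combined with lexicon-based counts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the padding scheme, and the toy English lexicons are assumptions made for illustration only.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams; padding marks word boundaries."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def lexicon_counts(tokens, lexicons):
    """Count how many tokens fall into each named lexicon."""
    return {
        name: sum(token in words for token in tokens)
        for name, words in lexicons.items()
    }

# Hypothetical toy lexicons (English stands in for Hebrew).
LEXICONS = {
    "anxiety": {"afraid", "worried"},
    "hostile": {"hate", "angry"},
}

ngram_features = char_ngrams("worried")
lexicon_features = lexicon_counts(["worried", "and", "angry", "worried"], LEXICONS)
```

Character n-grams are often used for morphologically rich languages because inflected forms of the same stem share many n-grams even when the full tokens differ.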

22 pages, 3576 KB

Open AccessArticle

A Deep Learning Approach to Unveil Types of Mental Illness by Analyzing Social Media Posts

by Rajashree Dash, Spandan Udgata, Rupesh K. Mohapatra, Vishanka Dash and Ashrita Das

Math. Comput. Appl. 2025, 30(3), 49; https://doi.org/10.3390/mca30030049 - 3 May 2025

Viewed by 1459

Mental illness has emerged as a widespread global health concern, often unnoticed and unspoken. In this era of digitization, social media has provided a prominent space for people to express their feelings and find solutions faster. Thus, this area of study with a sheer amount of information, which refers to users’ behavioral attributes combined with the power of machine learning (ML), can be explored to make the entire diagnosis process smooth. In this study, an efficient ML model using Long Short-Term Memory (LSTM) is developed to determine the kind of mental illness a user may have using a random text made by the user on their social media. This study is based on natural language processing, where the prerequisites involve data collection from different social media sites and then pre-processing the collected data as per the requirements through stemming, lemmatization, stop word removal, etc. After examining the linguistic patterns of different social media posts, a reduced feature space is generated using appropriate feature engineering, which is further fed as input to the LSTM model to identify a type of mental illness. The performance of the proposed model is also compared with three other ML models, which includes using the full feature space and the reduced one. The optimal resulting model is selected by training and testing all of the models on the publicly available Reddit Mental Health Dataset. Overall, utilizing deep learning (DL) for mental health analysis can offer a promising avenue toward improved interventions, outcomes, and a better understanding of mental health issues at both the individual and population levels, aiding in decision-making processes. Full article

(This article belongs to the Section Engineering)

24 pages, 1078 KB

Open AccessArticle

ICT Adoption in Education: Unveiling Emergency Remote Teaching Challenges for Students with Functional Diversity Through Topic Identification in Modern Greek Data

by Katia Lida Kermanidis, Spyridon Tzimiris, Stefanos Nikiforos, Maria Nefeli Nikiforos and Despoina Mouratidis

Appl. Sci. 2025, 15(9), 4667; https://doi.org/10.3390/app15094667 - 23 Apr 2025

Cited by 1 | Viewed by 757

This study explores topic identification using text analysis techniques in Modern Greek interviews with parents of students with functional diversity during Emergency Remote Teaching. The analysis focused on identifying key educational themes and addressing challenges in processing Greek educational data. Machine learning models, combined with Natural Language Processing techniques, were applied for topic identification, utilizing cross-validation and data balancing methods to enhance reliability. The findings revealed the impact of linguistic complexity on topic modeling and highlighted the educational implications of analyzing qualitative data in this context. Among the models tested, the Naïve Bayes (Kernel) algorithm performed best when combined with lemmatization-based preprocessing, confirming that text normalization significantly enhances classification accuracy in Greek educational data. The proposed framework contributes to the analysis of qualitative educational data by identifying key parental concerns related to Emergency Remote Teaching. It demonstrates how text analysis techniques could support data-driven decision-making and help guide policy development for the inclusive and effective integration of Information and Communication Technology in education. Full article

(This article belongs to the Special Issue ICT in Education, 2nd Edition)

23 pages, 410 KB

Open AccessArticle

Towards AI-Generated Essay Classification Using Numerical Text Representation

by Natalia Krawczyk, Barbara Probierz and Jan Kozak

Appl. Sci. 2024, 14(21), 9795; https://doi.org/10.3390/app14219795 - 26 Oct 2024

Cited by 2 | Viewed by 1979

The detection of essays written by AI compared to those authored by students is increasingly becoming a significant issue in educational settings. This research examines various numerical text representation techniques to improve the classification of these essays. Utilizing a diverse dataset, we undertook several preprocessing steps, including data cleaning, tokenization, and lemmatization. Our system analyzes different text representation methods such as Bag of Words, TF-IDF, and fastText embeddings in conjunction with multiple classifiers. Our experiments showed that TF-IDF weights paired with logistic regression reached the highest accuracy of 99.82%. Methods like Bag of Words, TF-IDF, and fastText embeddings achieved accuracies exceeding 96.50% across all tested classifiers. Sentence embeddings, including MiniLM and distilBERT, yielded accuracies from 93.78% to 96.63%, indicating room for further refinement. Conversely, pre-trained fastText embeddings showed reduced performance, with a lowest accuracy of 89.88% in logistic regression. Remarkably, the XGBoost classifier delivered the highest minimum accuracy of 96.24%. Specificity and precision were above 99% for most methods, showcasing high capability in differentiating between student-created and AI-generated texts. This study underscores the vital role of choosing dataset-specific text representations to boost classification accuracy. Full article
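As a rough illustration of the TF-IDF representation mentioned in the abstract above, the following sketch computes TF-IDF weights with the standard library only. The weighting variant (relative term frequency times the natural log of inverse document frequency) and the toy documents are assumptions; the paper may use a different formulation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy corpus, already tokenized.
docs = [
    "the essay argues the point".split(),
    "the model writes the essay".split(),
    "students write drafts".split(),
]
weights = tfidf(docs)
```

In this variant a term that occurs in every document receives a weight of zero, so frequent function words contribute little to a downstream classifier such as logistic regression.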

24 pages, 8284 KB

Open AccessArticle

Hybrid Natural Language Processing Model for Sentiment Analysis during Natural Crisis

by Marko Horvat, Gordan Gledec and Fran Leontić

Electronics 2024, 13(10), 1991; https://doi.org/10.3390/electronics13101991 - 20 May 2024

Cited by 8 | Viewed by 3611

This paper introduces a novel natural language processing (NLP) model as an original approach to sentiment analysis, with a focus on understanding emotional responses during major disasters or conflicts. The model was created specifically for Croatian and is based on unigrams, but it can be used with any language that supports the n-gram model and can be expanded to multiple word sequences. The presented model generates a sentiment score aligned with discrete and dimensional emotion models, reliability metrics, and individual word scores using the affective datasets Extended ANEW and the NRC Word-Emotion Association Lexicon. The sentiment analysis model incorporates different methodologies, including lexicon-based, machine learning, and hybrid approaches. Preprocessing includes translation, lemmatization, and data refinement, utilizing automated translation services as well as the CLARIN Knowledge Centre for South Slavic languages (CLASSLA) library, with a particular emphasis on diacritical mark correction and tokenization. The presented model was experimentally evaluated on three simultaneous major natural crises that recently affected Croatia. The study’s findings reveal a significant shift in emotional dimensions during the COVID-19 pandemic, particularly a decrease in valence, arousal, and dominance, which corresponded with the two-month recovery period. Furthermore, the 2020 Croatian earthquakes elicited a wide range of negative discrete emotions, including anger, fear, and sadness, with a recuperation period much longer than in the case of COVID-19. This study represents an advancement in sentiment analysis, particularly in linguistically specific contexts, and provides insights into the emotional landscape shaped by major societal events. Full article

(This article belongs to the Special Issue Emerging Theory and Applications in Natural Language Processing)
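A unigram, lexicon-based score of the kind described in the abstract above might look like the sketch below. It is only an illustration: the lexicon entries and their values are invented, and using lexicon coverage as a reliability indicator is an assumption, not something the abstract specifies.

```python
# Hypothetical lexicon: word -> (valence, arousal, dominance) on a 1-9 scale.
LEXICON = {
    "earthquake": (2.0, 7.0, 3.0),
    "safe": (7.5, 3.5, 6.5),
    "fear": (2.0, 6.5, 3.0),
}

def score(tokens):
    """Average the dimensional scores of known unigrams.

    Returns (means, coverage), where coverage is the share of tokens
    found in the lexicon (an assumed, simple reliability proxy).
    """
    hits = [LEXICON[token] for token in tokens if token in LEXICON]
    if not hits:
        return None, 0.0
    means = tuple(sum(dimension) / len(hits) for dimension in zip(*hits))
    return means, len(hits) / len(tokens)

means, coverage = score(["fear", "after", "earthquake", "today"])
```

Low coverage signals that most words were not found in the lexicon, so the averaged scores rest on little evidence.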

17 pages, 6019 KB

Open AccessArticle

Digital Guardianship: Innovative Strategies in Preserving Armenian’s Epigraphic Legacy

by Hamest Tamrazyan and Gayane Hovhannisyan

Heritage 2024, 7(5), 2296-2312; https://doi.org/10.3390/heritage7050109 - 30 Apr 2024

Cited by 3 | Viewed by 2119

In the face of geopolitical threats in Artsakh, the preservation of Armenia’s epigraphic heritage has become a mission of both historical and cultural urgency. This project delves deep into Armenian inscriptions, employing advanced digital tools and strategies like the Oxygen text editor and EpiDoc guidelines to efficiently catalogue, analyze, and present these historical treasures. Amidst the adversities posed by Azerbaijan’s stance towards Armenian heritage in Artsakh, the digital documentation and preservation of these inscriptions have become a beacon of cultural resilience. The XML-based database ensures consistent data, promoting scholarly research and broadening accessibility. Integrating the Grabar Armenian dictionary addressed linguistic challenges, enhancing data accuracy. This initiative goes beyond merely preserving stone and text; it is a testament to the stories, hopes, and enduring spirit of the Armenian people in the face of external threats. Through a harmonious blend of technology and traditional knowledge, the project stands as a vanguard in the fight to ensure that Armenia’s rich epigraphic legacy, and the narratives they enshrine remain undiminished for future generations. Full article

(This article belongs to the Special Issue Cultural Heritage at Risk - Perspectives on Technologies, Materials, Modelling and Digitalization)

28 pages, 529 KB

Open AccessArticle

A Novel Approach to Semic Analysis: Extraction of Atoms of Meaning to Study Polysemy and Polyreferentiality

by Vanessa Bonato, Giorgio Maria Di Nunzio and Federica Vezzani

Languages 2024, 9(4), 121; https://doi.org/10.3390/languages9040121 - 27 Mar 2024

Cited by 3 | Viewed by 1932

Semic analysis is a linguistic technique aimed at methodically factorizing the meaning of terms into a collection of minimum non-decomposable atoms of meaning. In this study, we propose a methodology targeted at enhancing the systematicity of semic analysis of medical terminology in order to increase the quality of the creation of the set of atoms of meaning and improve the identification of concepts, as well as enhance specialized domain studies. Our approach is based on: (1) a semi-automatic domain-specific corpus-based extraction of semes, (2) the application of the property of termhood to address the diaphasic and the diastratic variations of language, (3) the automatic lemmatization of semes, and (4) seme weighting to establish the order of semes in the sememe. The paper explores the distinction between denotative and connotative semes, offering insights into polysemy and polyreferentiality in medical terminology. Full article

(This article belongs to the Special Issue Semantics and Meaning Representation)

9 pages, 1925 KB

Open AccessProceeding Paper

A New Approach for Carrying Out Sentiment Analysis of Social Media Comments Using Natural Language Processing

by Mritunjay Ranjan, Sanjay Tiwari, Arif Md Sattar and Nisha S. Tatkar

Eng. Proc. 2023, 59(1), 181; https://doi.org/10.3390/engproc2023059181 - 17 Jan 2024

Cited by 5 | Viewed by 7281

Business and science are using sentiment analysis to extract and assess subjective information from the web, social media, and other sources using NLP, computational linguistics, text analysis, image processing, audio processing, and video processing. It models polarity, attitudes, and urgency from positive, negative, or neutral inputs. Unstructured data make emotion assessment difficult. Unstructured consumer data allow businesses to market, engage, and connect with consumers on social media. Text data are instantly assessed for user sentiment. Opinion mining identifies a text’s positive, negative, or neutral opinions, attitudes, views, emotions, and sentiments. Text analytics uses machine learning to evaluate “unstructured” natural language text data. These data can help firms make money and decisions. Sentiment analysis shows how individuals feel about things, services, organizations, people, events, themes, and qualities. Reviews, forums, blogs, social media, and other articles use it. DD (data-driven) methods find complicated semantic representations of texts without feature engineering. Data-driven sentiment analysis is three-tiered: document-level sentiment analysis determines polarity and sentiment, aspect-based sentiment analysis assesses document segments for emotion and polarity, and data-driven (DD) sentiment analysis recognizes word polarity and writes positive and negative neutral sentiments. Our innovative method captures sentiments from text comments. The syntactic layer encompasses various processes such as sentence-level normalisation, identification of ambiguities at paragraph boundaries, part-of-speech (POS) tagging, text chunking, and lemmatization. Pragmatics include personality recognition, sarcasm detection, metaphor comprehension, aspect extraction, and polarity detection; semantics include word sense disambiguation, concept extraction, named entity recognition, anaphora resolution, and subjectivity detection. Full article

(This article belongs to the Proceedings of Eng. Proc., 2023, RAiSE-2023)

14 pages, 3126 KB

Open AccessArticle

ConLBS: An Attack Investigation Approach Using Contrastive Learning with Behavior Sequence

by Jiawei Li, Ru Zhang and Jianyi Liu

Sensors 2023, 23(24), 9881; https://doi.org/10.3390/s23249881 - 17 Dec 2023

Cited by 2 | Viewed by 1648

Attack investigation is an important research field in forensics analysis. Many existing supervised attack investigation methods rely on well-labeled data for effective training. While the unsupervised approach based on BERT can mitigate the issues, the high degree of similarity between certain real-world attacks and normal behaviors makes it challenging to accurately identify disguised attacks. This paper proposes ConLBS, an attack investigation approach that combines the contrastive learning framework and multi-layer transformer network to realize the classification of behavior sequences. Specifically, ConLBS constructs behavior sequences describing behavior patterns from audit logs, and a novel lemmatization strategy is proposed to map the semantics to the attack pattern layer. Four different augmentation strategies are explored to enhance the differentiation between attack and normal behavior sequences. Moreover, ConLBS can perform unsupervised representation learning on unlabeled sequences, and can be trained either supervised or unsupervised depending on the availability of labeled data. The performance of ConLBS is evaluated in two public datasets. The results show that ConLBS can effectively identify attack behavior sequences in the cases of unlabeled data or less labeled data to realize attack investigation, and can achieve superior effectiveness compared to existing methods and models. Full article

(This article belongs to the Section Sensor Networks)

19 pages, 886 KB

Open AccessArticle

Sentiment Analysis of Arabic Course Reviews of a Saudi University Using Support Vector Machine

by Ali Louati, Hassen Louati, Elham Kariri, Fahd Alaskar and Abdulaziz Alotaibi

Appl. Sci. 2023, 13(23), 12539; https://doi.org/10.3390/app132312539 - 21 Nov 2023

Cited by 12 | Viewed by 2731

This study presents the development of a sentiment analysis system for higher education students using Arabic text. There is a gap in the literature concerning the perceptions and opinions of students in Saudi Arabian universities regarding their education beyond COVID-19. The proposed SVM Sentimental Analysis for Arabic Students’ Course Reviews (SVM-SAA-SCR) algorithm is a general framework that involves collecting student reviews, preprocessing them, and using a machine learning model to classify them as positive, negative, or neutral. The suggested technique for preprocessing and classifying reviews includes steps such as collecting data, removing irrelevant information, tokenizing, removing stop words, stemming or lemmatization, and using pre-trained sentiment analysis models. The classifier is trained using the SVM algorithm, and performance is evaluated using metrics such as accuracy, precision, and recall. Fine-tuning is done by adjusting parameters such as kernel type and regularization strength to optimize performance. A real dataset provided by the deanship of quality at Prince Sattam bin Abdulaziz University (PSAU) is used; it contains students’ opinions on various aspects of their education. We also compared our algorithm with CAMeLBERT, a state-of-the-art Dialectal Arabic model. Our findings show that while the CAMeLBERT model classified 70.48% of the reviews as positive, our algorithm classified 69.62% as positive, which proves the efficiency of the suggested SVM-SAA-SCR. The results of the proposed model provide valuable insights into the challenges and obstacles faced by Arab universities post-COVID-19 and can help to improve their educational experience. Full article

(This article belongs to the Section Computing and Artificial Intelligence)

25 pages, 7032 KB

Open AccessArticle

COVID-19 Vaccine Hesitancy: A Global Public Health and Risk Modelling Framework Using an Environmental Deep Neural Network, Sentiment Classification with Text Mining and Emotional Reactions from COVID-19 Vaccination Tweets

by Miftahul Qorib, Timothy Oladunni, Max Denis, Esther Ososanya and Paul Cotae

Int. J. Environ. Res. Public Health 2023, 20(10), 5803; https://doi.org/10.3390/ijerph20105803 - 12 May 2023

Cited by 12 | Viewed by 3389

Popular social media platforms, such as Twitter, have become an excellent source of information with their swift information dissemination. Individuals with different backgrounds convey their opinions through social media platforms. Consequently, these platforms have become a profound instrument for collecting enormous datasets. We believe that compiling, organizing, exploring, and analyzing data from social media platforms, such as Twitter, can offer various perspectives to public health organizations and decision makers in identifying factors that contribute to vaccine hesitancy. In this study, public tweets were downloaded daily from Twitter using the Twitter API. Before performing computation, the tweets were preprocessed and labeled. Vocabulary normalization was based on stemming and lemmatization. The NRCLexicon technique was deployed to convert the tweets into ten classes: positive sentiment, negative sentiment, and eight basic emotions (joy, trust, fear, surprise, anticipation, anger, disgust, and sadness). A t-test was used to check the statistical significance of the relationships among the basic emotions. Our analysis shows that the p-values of joy–sadness, trust–disgust, fear–anger, surprise–anticipation, and negative–positive relations are close to zero. Finally, neural network architectures, including 1DCNN, LSTM, Multi-Layer Perceptron (MLP), and BERT, were trained and tested in a COVID-19 multi-classification of sentiments and emotions (positive, negative, joy, sadness, trust, disgust, fear, anger, surprise, and anticipation). Our experiment attained an accuracy of 88.6% for 1DCNN at 1744 s and 89.93% for LSTM at 27,597 s, while MLP achieved an accuracy of 84.78% at 203 s. The study results show that the BERT model performed the best, with an accuracy of 96.71% at 8429 s. Full article

(This article belongs to the Section Public Health Statistics and Risk Assessment)
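The abstract above does not say which form of t-test was applied to the emotion pairs. As one possible reading, the sketch below computes Welch's unequal-variance t statistic and its approximate degrees of freedom; the choice of Welch's test and the sample values are assumptions for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-day counts for two emotion classes.
joy = [1, 2, 3, 4, 5]
sadness = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(joy, sadness)
```

Turning the statistic into a p-value requires the CDF of Student's t distribution, which the standard library does not provide, so that step is left out here.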

14 pages, 2538 KB

Open AccessArticle

ERF-XGB: Ensemble Random Forest-Based XG Boost for Accurate Prediction and Classification of E-Commerce Product Review

by Daniyal M. Alghazzawi, Anser Ghazal Ali Alquraishee, Sahar K. Badri and Syed Hamid Hasan

Sustainability 2023, 15(9), 7076; https://doi.org/10.3390/su15097076 - 23 Apr 2023

Cited by 21 | Viewed by 4448

Recently, the concept of e-commerce product review evaluation has become a research topic of significant interest in sentiment analysis. The sentiment polarity estimation of product reviews is a great way to obtain a buyer’s opinion on products. It offers significant advantages for online shopping customers to evaluate the service and product qualities of the purchased products. However, the issues related to polysemy, disambiguation, and word dimension mapping create prediction problems in analyzing online reviews. In order to address such issues and enhance the sentiment polarity classification, this paper proposes a new sentiment analysis model, the Ensemble Random Forest-based XG boost (ERF-XGB) approach, for the accurate binary classification of online e-commerce product review sentiments. Two different Internet Movie Database (IMDB) datasets and the Chinese Emotional Corpus (ChnSentiCorp) dataset are used for estimating online reviews. First, the datasets are preprocessed through tokenization, lemmatization, and stemming operations. The Harris hawk optimization (HHO) algorithm selects two datasets’ corresponding features. Finally, the sentiments from online reviews are classified into positive and negative categories regarding the proposed ERF-XGB approach. Hyperparameter tuning is used to find the optimal parameter values that improve the performance of the proposed ERF-XGB algorithm. The performance of the proposed ERF-XGB approach is analyzed using evaluation indicators, namely accuracy, recall, precision, and F1-score, for different existing approaches. Compared with the existing method, the proposed ERF-XGB approach effectively predicts sentiments of online product reviews with an accuracy rate of about 98.7% for the ChnSentiCorp dataset and 98.2% for the IMDB dataset. Full article

13 pages, 415 KB

Open AccessArticle

Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach

by Saima Shaukat, Muhammad Asad and Asmara Akram

Appl. Sci. 2023, 13(8), 5103; https://doi.org/10.3390/app13085103 - 19 Apr 2023

Cited by 3 | Viewed by 3346

Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
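A dictionary lookup lemmatizer that optionally uses part-of-speech tags, as the abstract above outlines, can be sketched as follows. The table names, the fallback order, and the toy entries are hypothetical, and English words stand in for Urdu purely for readability.

```python
# Hypothetical toy dictionaries; a real system would load a large lexicon.
LEMMA_BY_WORD_AND_POS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("books", "NOUN"): "book",
}
LEMMA_BY_WORD = {"saw": "see", "books": "book", "went": "go"}

def lemmatize(word, pos=None):
    """Look up the lemma, preferring a PoS-specific entry when one exists.

    Falls back to the PoS-free dictionary, then to the word itself.
    """
    key = word.lower()
    if pos is not None and (key, pos) in LEMMA_BY_WORD_AND_POS:
        return LEMMA_BY_WORD_AND_POS[(key, pos)]
    return LEMMA_BY_WORD.get(key, key)
```

The PoS tag matters for ambiguous surface forms: here `lemmatize("saw", "NOUN")` and `lemmatize("saw", "VERB")` resolve to different lemmas, while unknown words pass through unchanged.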

26 pages, 2872 KB

Open AccessArticle

Grapharizer: A Graph-Based Technique for Extractive Multi-Document Summarization

by Zakia Jalil, Muhammad Nasir, Moutaz Alazab, Jamal Nasir, Tehmina Amjad and Abdullah Alqammaz

Electronics 2023, 12(8), 1895; https://doi.org/10.3390/electronics12081895 - 17 Apr 2023

Cited by 13 | Viewed by 3296

In the age of big data, there is increasing growth of data on the Internet. It becomes frustrating for users to locate the desired data. Therefore, text summarization emerges as a solution to this problem. It summarizes and presents the users with the gist of the provided documents. However, summarizer systems face challenges, such as poor grammaticality, missing important information, and redundancy, particularly in multi-document summarization. This study involves the development of a graph-based extractive generic MDS technique, named Grapharizer (GRAPH-based summARIZER), focusing on resolving these challenges. Grapharizer addresses the grammaticality problems of the summary using lemmatization during pre-processing. Furthermore, synonym mapping, multi-word expression mapping, and anaphora and cataphora resolution, contribute positively to improving the grammaticality of the generated summary. Challenges, such as redundancy and proper coverage of all topics, are dealt with to achieve informativity and representativeness. Grapharizer is a novel approach which can also be used in combination with different machine learning models. The system was tested on DUC 2004 and Recent News Article datasets against various state-of-the-art techniques. Use of Grapharizer with machine learning increased accuracy by up to 23.05% compared with different baseline techniques on ROUGE scores. Expert evaluation of the proposed system indicated the accuracy to be more than 55%. Full article

(This article belongs to the Special Issue Big Data Analytics and Artificial Intelligence in Electronics)

Search Results (35)
