Article

Leveraging Stacking Framework for Fake Review Detection in the Hospitality Sector

by Syed Abdullah Ashraf 1,2,*, Aariz Faizan Javed 1, Sreevatsa Bellary 1, Pradip Kumar Bala 1 and Prabin Kumar Panigrahi 3
1 Department of Information Systems & Business Analytics, Indian Institute of Management Ranchi, Prabandhan Nagar, Nayasarai Road, Ranchi 835303, Jharkhand, India
2 Department of Analytics & Operations, Delhi School of Business, Outer Ring Rd, AU Block, Jal Board Colony, Pitampura, New Delhi 110034, Delhi, India
3 Department of Information Systems, Indian Institute of Management Indore, Prabandh Shikhar, Rau-Pithampur Road, Indore 453556, Madhya Pradesh, India
* Author to whom correspondence should be addressed.
J. Theor. Appl. Electron. Commer. Res. 2024, 19(2), 1517-1558; https://doi.org/10.3390/jtaer19020075
Submission received: 14 September 2023 / Revised: 11 November 2023 / Accepted: 19 November 2023 / Published: 15 June 2024

Abstract

Driven by motives of profit and competition, fake reviews are increasingly used to manipulate product ratings. This trend has caught the attention of academic researchers and international regulatory bodies. Current methods for spotting fake reviews suffer from scalability and interpretability issues. This study focuses on identifying suspected fake reviews in the hospitality sector using a review aggregator platform. By combining features and leveraging various classifiers through a stacking architecture, we improve training outcomes. User-centric traits emerge as crucial in spotting fake reviews. Incorporating SHAP (Shapley Additive Explanations) enhances model interpretability. Our model consistently outperforms existing methods across diverse dataset sizes, proving its adaptable, explainable, and scalable nature. These findings hold implications for review platforms, decision-makers, and users, promoting transparency and reliability in reviews and decisions.

1. Introduction

Effective decision-making heavily hinges on information search processes. Within the realm of e-commerce, consumer choices are significantly influenced not only by the information disseminated by companies but also by the evaluations of fellow product purchasers [1]. Gradually, these product reviews have transitioned into an integral facet of consumers’ purchasing decisions, with a staggering 90% of customers consulting online reviews prior to making purchase-related choices. Remarkably, 88% of consumers place a level of trust in online reviews akin to personal recommendations [2]. This escalating reliance on such reviews, however, unravels a concern of manipulation in the decision-making process through the injection of fabricated reviews [3].
Managers have realized the potential of reviews on consumer engagement intention, leading some of them to engage in review manipulation [4]. Fabricated reviews encompass two distinct categories: destructive and deceptive. Destructive reviews often serve as mere promotional content that bears no relation to the actual product experience. On the other hand, deceptive reviews are particularly harmful as they spread false information that can seriously harm businesses and result in significant financial loss [5]. Notably, even renowned platforms like Tripadvisor have struggled to grapple with the pervasive issue of counterfeit reviews. This is demonstrated by their multiple shifts in slogans over the years, underscoring the complexity of distinguishing genuine reviews from fraudulent ones. This issue has inspired our research into the detection of counterfeit reviews within the hospitality domain [6,7,8].
The magnitude of this predicament is starkly quantified, with the World Economic Forum estimating annual losses of an astonishing USD 152 billion due to the proliferation of counterfeit reviews, an economic loss that merits serious consideration [9]. The gravity of the problem has attracted the attention of international regulatory bodies, spanning from the European Union and the United States to Australia and India [10]. These authorities have enacted stringent measures to combat the endorsement and dissemination of counterfeit reviews. Regulations now mandate that platforms verify the authenticity of consumer reviews or face prosecution and penalties. Despite these efforts, the ground reality remains different [11,12,13].
The menace of counterfeit reviews has particularly targeted review aggregator platforms like Yelp, Tripadvisor, and more, compelling scholars’ engagement for well over a decade [14,15]. These inauthentic reviews extend beyond external sites to influential platforms like Amazon, Walmart, and Flipkart. Notably, establishments are found to be more prone to receiving fake positive reviews on external sites like Tripadvisor, Yelp, MouthShut, etc. [16]. However, in cases where establishments feature on both internal and external platforms, the ratings on external platforms tend to be lower, owing to the deliberate injection of negative counterfeit reviews by competitors [17]. The lack of a robust purchase validation mechanism renders external platforms more susceptible to the infiltration of fake reviews.
While fake reviews span various sectors, they are notably concentrated in entertainment, hospitality, and e-commerce [18]. The initial response involved manual identification, but this approach was proven to be sluggish, imprecise, and resource-intensive. This paved the way for the pioneering work of Jindal and Liu [19], who introduced the concept of automated fake review detection. Subsequently, machine learning techniques encompassing Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes, and neural networks (NN) have gained significant prominence for detecting counterfeit reviews [20]. Notably, the scope of fake review detection extends beyond supervised learning methods, encompassing semi-supervised and unsupervised approaches [14].
The prevailing research landscape predominantly revolves around feature engineering and training classifiers to effectively distinguish between genuine and fraudulent reviews. Comparative analyses of classifiers, such as Naïve Bayes Classifier and Support Vector Machine (SVM), have garnered attention [18,21,22]. Paradoxically, despite the capabilities of machine and deep learning to handle large datasets, research often focuses on comparatively small datasets (as mentioned in Section 2) when reporting superior performance. This raises valid concerns about overfitting and lack of real-world adaptability in scenarios where millions of reviews are generated daily. These observations culminate in the formulation of the ensuing research questions.
RQ1. In the context of fake review detection in the hospitality sector, does combining the power of several base classifiers with a meta-classifier lead to a consistently better result?
RQ2. Does the performance of such a classifier vary with the scale of the input data?
To answer RQ1, we build upon the methods of [23,24] by creating an ensemble model from several classifiers well reported in the fake review detection domain (XGBoost, Random Forest, artificial neural network, etc.) [15,25]. To answer RQ2, we use the Yelp datasets curated by [26], which consist of hotel and restaurant reviews from the review aggregator site Yelp.com. The authors provide three databases of varying sizes, making them ideal for testing our classifier’s performance. The work offers several key theoretical contributions: the proposition of a state-of-the-art framework using the stacking ensemble technique for fake review detection; the demonstration that the framework performs comparably to existing benchmark models irrespective of the size of the input; and the introduction of several new features (average rating provided to a product, subjectivity, average rating received by a product, etc.) with greater distinguishing power. The findings demonstrate that classifier ensembling enhances classification performance, and this outcome remains invariant across varying dataset magnitudes. The empirical verification covers diverse evaluation metrics, including average precision, recall, area under the curve, F1 score, and accuracy. Comprehensive benchmarking establishes that our approach excels in cumulative performance, substantiating its efficacy in counterfeit review identification.
The remainder of this paper is organized as follows: Section 2 reviews the existing literature work. Section 3 highlights the proposed model along with its features, learning techniques, and evaluation. Section 4 delves into findings, implications, and future research, while Section 5 throws light on the conclusions.

2. Literature Review

2.1. Firms and Fake Reviews

It has been established that users base their opinions on reviews of a product or a service, a fact that manufacturers and service providers are well aware of [27,28]. Research indicates that service providers are more likely to engage in fake review activities when their reputation is at stake due to a limited number of reviews or negative feedback. Moreover, restaurants experiencing heightened competition are found to be more vulnerable to receiving negative reviews [29]. A study by [16] revealed that independent or single-unit hotels and restaurants are the primary beneficiaries of review manipulation and are, therefore, more susceptible to it. Conversely, [30] found that even a small number of fake reviews (50) can be sufficient to surpass the competitors in certain markets. Refs. [31] and [1] have reported that consumers associate themselves with the review website rather than other participants, and this relationship is moderated by homophily and tie strength, which foster source credibility in the context of electronic word-of-mouth (eWOM) on review websites.
There have been cases where prominent brands were prosecuted for availing services of a third party to promote or defend them online [32]. The pervasiveness of this issue has garnered attention not only from academic circles but also from major news outlets such as the BBC and the New York Times, which reported on a photography company for defaming its competitors by posting fake negative reviews [33]. Trends have shown that fake reviews are increasing day by day across platforms, thus causing a problem for online information accuracy and potential market regulation [34].

2.2. Fake Review Detection

The prevalence of fake reviews has garnered significant attention in academic circles. A range of studies have aimed to detect fake reviews; some have applied supervised learning techniques, while others have utilized alternative methods such as semi-supervised learning, unsupervised learning, probabilistic models, graph-based models, and deep learning [35]. Supervised learning is a machine learning technique where the model is trained on a labeled dataset, learning to map input data to corresponding output labels through iterative adjustments so that it can make predictions on new, unseen data. Semi-supervised learning trains the model on a combination of labeled and unlabeled data, leveraging both the provided answers and the patterns it discovers independently. Unsupervised learning is an approach where the model explores patterns and structures in unlabeled data without explicit guidance, uncovering hidden relationships or grouping similar items based on inherent similarities. Ref. [19] tackled this issue by developing a rule based on similarity: reviews with more than 90% similarity were deemed spam. They extracted 24 review-centric features and trained the model using logistic regression. Ref. [36] reported that deceptive reviews have greater lexical complexity, contain more frequent mentions of brands and first-person pronouns, and display a sentiment tone that is more positive towards the product or service.
Ref. [37] approached the problem of fake review detection by proposing a probability-based framework. Ref. [38] identified suspected spammers who consistently wrote either positive or negative reviews for a specific brand using a rule-based approach. In their seminal work, Ref. [39] subdivided the fake review identification problem into three parts, viz., text categorization, psycholinguistic feature extraction, and genre identification. They claim that deceptive reviews belong to imaginative writing, whereas truthful reviews belong to informative writing.
Ref. [40] explored the problem of identifying fake reviewer groups. They postulated that fake reviewer groups are more damaging than individual fake reviewers and used several behavioral models to detect such groups. They also concluded that although it is extremely difficult to label each review or reviewer as spam or truthful, it is relatively easy for reviewer groups. Ref. [41] found that part-of-speech features strengthened with unigram features and context-free grammar form the most effective combination for enhancing the algorithm’s performance.
Ref. [42] reported that behavioral cues are more distinctive than linguistic features in detecting fraudulence. They further mentioned that suspected fake reviewers could be identified from the psycholinguistic cues that they leave behind. Ref. [43] proposed a novel graph-based model optimized with joint probability via belief propagation to detect fake reviews and reviewers simultaneously.
Ref. [3] proposed a latent spam model, which is a generative model for clustering. Their findings suggest that, unlike traditional falsehoods, web-based lies tend to include more first-person pronouns. In a comprehensive study, Ref. [44] revealed that fake reviews are comparatively more negative than genuine ones, lack objectivity, are less related to the purchased items, and exhibit linguistic cues of deception. They showed that even without any apparent ulterior motivation, fake reviewers indulge in such activities. Ref. [45] showed that linguistic features based on review readability, review genre, and review writing style effectively differentiate between genuine and fake reviews. They established that in terms of readability, deceptive reviews are more readable. Furthermore, to make the reviews more believable, fake reviews often use more verbs and function words. Ref. [16] have studied the establishment involved in review manipulation. They found that small businesses and individual independent hotel owners are more prone to review-related fraud.
Ref. [46] extracted 83 linguistic features from the review dataset. Unlike others, they have made use of reviews as well as review titles for feature engineering. They have subdivided those features based on understandability, level of detail, writing style, and cognition indicators. Ref. [26] introduced an unsupervised and semi-supervised model for a fake review, reviewer, and targeted product identification. The model was based on relational network architecture that can exploit prior knowledge of class distribution.
Ref. [47] provided evidence that using semantics along with emotional features can improve classification accuracy. Ref. [48] provided the first deep learning-based framework for the problem. Deep learning is a subset of machine learning that employs artificial neural networks with multiple layers (deep neural networks) to automatically learn and represent complex patterns in data, enabling the model to make sophisticated decisions or predictions. It is like teaching a computer to think by simulating the intricate connections of the human brain through layers of virtual neurons. They emphasized that product-related features, as opposed to review-related features, are more effective for this purpose. Ref. [35] have shown that rather than verbal features, non-verbal features such as membership length, helpful votes, friend count, etc., are more effective in detecting fake online reviews that are present on Yelp’s social network. They maintained that manipulating verbal features is easy, whereas copying and manipulating non-verbal features is time-consuming and challenging for the review spammers.
Ref. [49] provided a graph-based, unsupervised learning solution to the problem. The main advantage of their method is the ability to achieve commendable results even without a training dataset. Ref. [21] categorized the features extracted from the review dataset based on variance, temporal aspect, rating, textual, and burst dimensions. They found that burst-related features are more relevant in identifying fake reviews. Ref. [50] proposed a bidirectional gated recurrent neural network, which helps in capturing global semantic information that standard discrete features fail to grasp. A bidirectional gated recurrent neural network is a type of artificial neural network designed for processing sequential data capable of analyzing information both forward and backward through time. It uses special gates to selectively remember and forget past information, enhancing its ability to understand context and relationships in the data bidirectionally.
In the experiment conducted by [51], they revealed that a weak brand suffers more when it excessively adds fake positive reviews, and this raises suspicion among the users, leading to a loss of credibility [52]. They further reported that deleting negative reviews is more subtle and leaves fewer manipulation cues. Ref. [22] illustrated the use of univariate and multivariate distributions in improving classifier accuracy.
Ref. [53] developed multiple deep learning-based solutions to address the issue of variable-length review texts. They proposed two approaches: one using multi-instance learning and the other employing a hierarchical architecture, both aimed at effectively handling reviews of varying lengths. Ref. [17] developed a supplementary method called trust measure that determines the genuineness of a review based on strongly positive and negative terms. They reported that fake reviews are more prevalent on open reviewing platforms than on closed platforms.
Ref. [54] derived several micro-linguistic cues using Linguistic Inquiry Word Count (LIWC) and Coh-Metrix to study their impact on positive and negative reviews being either fake or genuine. Their findings revealed that single posts, reiterating posts, and generic feedback are useful clues in identifying spam. Ref. [55] leveraged the capability of deep learning architectures and proposed a high-dimensional model conflating n-gram, skip-gram, and emotion-based features. Refs. [28,56] have postulated that fake reviews have more social and affective cues as compared to genuine reviews.
Ref. [57] examined the temporal impact on classifier performance. They proposed that algorithms dealing with text should have the capability to periodically update their vocabulary with words used in general parlance. Ref. [58] have explored an interesting concept of inconsistency and its potential to enhance classifier performance, defining it as a disparity between review content and star ratings, differing sentiments for the same rating, or a change in a reviewer’s writing style for the same rating. Ref. [59] have stressed the presence of emotional cues in fake reviews. Refs. [60,61] have relied on feature engineering along with word embedding to identify fake and genuine reviewers.
A quick look at Table 1 solidifies our argument that the majority of the high-performing work has been reported on relatively small datasets and often against a selective set of metrics, which can convey a false sense of performance. Furthermore, combining the power of multiple classifiers has received less attention in academia than warranted, in contrast to machine learning contests, where such architectures often feature among the best-performing models [62]. Additionally, training deep learning models is resource-intensive and time-consuming. To this end, we propose a stacking ensemble-based classifier that is faster and easier to train and performs well irrespective of the size of the input dataset.

3. Methodology

For this study, the Yelp datasets curated by [26] are selected. The reviews were collected over four years, between 2010 and 2014, from Yelp.com. The dataset comprises reviews given to restaurants and hotels in the US. The investigation encompasses three datasets, namely YelpChi, YelpNYC, and YelpZip. The YelpChi dataset has reviews of hotels and restaurants situated in Chicago, whereas YelpNYC comprises reviews of restaurants in New York City. YelpZip specifically gathers reviews from restaurants in New York City based on their zip codes. Yelp employs a filtering algorithm designed to identify and segregate potentially fake or suspicious reviews into a distinct filtered list. These filtered reviews are publicly accessible, with a business’ Yelp page showcasing recommended reviews, while a link at the page’s bottom allows users to peruse the filtered or unrecommended reviews. Although the Yelp anti-fraud filter is not infallible, it approximates a “near” ground truth and has demonstrated a propensity for accurate outcomes [63]. The datasets under consideration include both recommended and filtered reviews, denoted as genuine and potentially fraudulent, respectively. Table 2 presents the details of the datasets. The metadata includes features such as ‘user_id’, ‘prod_id’, ‘rating’, ‘label’, and ‘date’. These represent the encoded identifier of the user who submitted the review, the encoded identifier of the product being reviewed, the rating given by the user on a scale from 1 to 5, whether the review has been filtered out by the system, and the date on which the review was submitted. The ‘label’ feature has two values: ‘−1’ signifies that the review has been filtered, indicating that Yelp.com’s algorithm has marked it as fake or spam, while ‘1’ indicates that the review has not been filtered.
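As a minimal, illustrative sketch of working with this metadata in Python, the snippet below loads the fields listed above and inspects the label distribution. The file name, delimiter, and column order are assumptions made for illustration; the actual dataset releases may be laid out differently.

```python
import pandas as pd

# Hypothetical file name and layout; adjust to the actual YelpChi/YelpNYC/YelpZip release.
meta = pd.read_csv(
    "metadata.txt",
    sep=r"\s+",
    names=["user_id", "prod_id", "rating", "label", "date"],
    parse_dates=["date"],
)

# label = -1: filtered by Yelp's algorithm (treated as fake/spam); label = 1: not filtered.
print(meta["label"].value_counts(normalize=True))
```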
A series of preprocessing steps performed over the data for classification purposes are discussed in detail in Section 3.1. The proposed framework is shown in Figure 1. After the preprocessing was completed, feature engineering was performed, resulting in several features based on the previous literature. A total of six user-centric, six product-centric, and twenty-five review-centric features were derived. Details of these features are provided in Section 3.2.
As machine learning models cannot accept textual data directly, the review text was embedded. The most popular types of embeddings are BERT and its variants DistilBERT, RoBERTa, and ALBERT. These embeddings, along with the user/product/review-centric features, were used in the classification task. The evaluation was performed in terms of classification accuracy, AUC, F1 score, average precision, and recall.
Table 1. Review of the selected literature on fake review detection in hospitality.
Authors | Dataset Used | Distribution | Methodology | Features | Description | Performance Reported
[39] | Self | 400 positive, 400 negative deceptive reviews from Tripadvisor.com | Supervised machine learning | Psycholinguistic and n-gram | Parts-of-speech tagging, Linguistic Inquiry and Word Count (LIWC) 2007, uni-gram, bi-gram, tri-gram | Acc = 89.8%, P = 89.8%, R = 89.8%, F1 = 89.8%
[41] | [39], Self | [39]; 400 positive, 400 negative deceptive reviews from Tripadvisor.com; 400 positive, 400 negative deceptive reviews from Yelp.com | Supervised machine learning | Context-free grammar and linguistic features | Bag-of-words, parts-of-speech, CFG-based production rules | Acc = 91.2% [Ott], Acc = 76.6% [TA], Acc = 64.3% [Yelp]
[42] | Self | Yelp Hotel: 4876 genuine, 802 (14.1%) fake reviews; Yelp Restaurant: 50,149 genuine, 8368 (14.4%) fake reviews | Supervised machine learning | Behavioral and n-gram | Review length, review deviation, content similarity, activity window, etc. | Yelp Hotel: Acc = 85.1%, P = 86.9%, R = 82.2%, F1 = 84.8%; Yelp Restaurant: Acc = 86.5%, P = 84.5%, R = 87.8%, F1 = 86.1%
[64] | Self | 3523 fake and 6242 genuine reviews from Dianping.com | Supervised machine learning, collective classification models | Text features of reviews and behavioral features of users and IP addresses | Uni-gram, bi-gram | F1 = 74.5%
[35] | [42] | Yelp Restaurant: 50,149 genuine, 8368 (14.4%) fake reviews | Supervised machine learning | Verbal and non-verbal features | Review length, redundancy, capitalization, length of membership, average posting rate, etc. | Acc = 87.81%, P = 87.12%, R = 89.63%, F1 = 88.31%
[21] | [26] | YelpZip: 512,905 genuine, 78,937 (13.3%) fake reviews | Supervised machine learning | Textual, meta-features, reviewer-centric, temporal features | Number of words, ratio of capital letters, rating, rating deviation, density, rating entropy, max rating per day, etc. | YelpZip: Acc = 80.6%, P = 77.6%, R = 86.1%, F1 = 81.6%
[65] | [39] | 400 positive, 400 negative deceptive reviews from Tripadvisor.com | Semi-supervised deep learning | FakeGAN (Fake Generative Adversarial Network) | FakeGAN uses two discriminator models D, D’ and one generative model G | Acc = 89.1%, P = 98%, R = 81%
[66] | [26] | YelpChi: 67,392 genuine, 8919 (13.23%) fake reviews; YelpNYC: 359,052 genuine, 36,885 (10.27%) fake reviews; YelpZip: 512,905 genuine, 78,937 (13.3%) fake reviews | Deep learning | HFAN: Hierarchical Fusion Attention Network | Multi-attention unit to extract user (product)-related review information | YelpChi: AP = 48.87%, AUC = 83.24%; YelpNYC: AP = 53.82%, AUC = 84.78%; YelpZip: AP = 83.35%, AUC = 87.28%
[55] | [64,67] | Hotel: 400 positive, 400 negative deceptive reviews from Tripadvisor.com; Restaurant: 200 positive genuine, 200 positive deceptive reviews | Deep learning | Deep feed-forward neural network and convolutional neural network | Model captures complex features hidden in high-dimensional word, sentence, and emotion representations by integrating the three | Hotel: Acc = 89.56%, AUC = 95.1%, F1 = 89.6% [DFFNN]; Restaurant: Acc = 89.80%, AUC = 96.5%, F1 = 90.1% [CNN]
[58] | Self | Yelp: 11,641 genuine and 12,898 fake reviews | Machine learning | Content-based and language style | Noun count, verb count, review length, subjectivity, lexical diversity, sentiment, etc. | Acc = 92.1%, P = 93.1%, R = 90.9%, F1 = 92.0%
[60] | [26] | YelpZip: 512,905 genuine, 78,937 (13.3%) fake reviews | Deep learning | Deep learning model on behavioral and sentiment features | Review representation model based on behavioral and sentiment-dependent linguistic features (average rating, rating deviation, early time frame, etc.) | AUC = 91.6%, F1 = 83.0%
Note: Acc = accuracy, P = precision, R = recall, F1 = F measure, AUC = area under the curve, AP = average precision; ‘Self’ in the Dataset Used column indicates that the data were curated by the authors of that study.
Table 2. Details of the Yelp datasets.
Metric | YelpChi | YelpNYC | YelpZip
Number of users | 38,050 | 160,225 | 255,903
Number of products (hotels and restaurants) | 200 | 923 | 5002
Number of reviews | 67,392 | 359,052 | 591,842
Number of fake reviews | 8919 (13.23%) | 36,885 (10.27%) | 78,937 (13.33%)
Number of spammers | 7737 (20.33%) | 28,496 (17.78%) | 61,210 (23.91%)

3.1. Pre-Processing

3.1.1. Data Balancing

Table 2 shows that the dataset is highly imbalanced, which would cause a model trained on it to be biased towards the majority class. To balance the dataset, the ‘imbalanced-learn’ Python package by [68] is used. The ‘RandomUnderSampler’ technique brings the number of majority-class instances down to that of the minority class by randomly removing instances from the majority class. Undersampling is employed for two reasons. Firstly, oversampling would have increased the number of data points, making it difficult for the classifier to converge, especially when word embeddings were also used in the model. Secondly, synthetically created embeddings do not correspond to actual reviews and would therefore be meaningless.
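The following sketch shows how this balancing step could look with imbalanced-learn’s RandomUnderSampler; the feature matrix and labels below are synthetic placeholders standing in for the engineered features and the 0/1 labels described in this section.

```python
from collections import Counter

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Placeholder data roughly mimicking the fake/genuine skew reported in Table 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = np.array([0] * 130 + [1] * 870)  # 0 = fake (minority), 1 = genuine (majority)

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("before:", Counter(y))      # majority class dominates
print("after: ", Counter(y_res))  # both classes reduced to the minority-class count
```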

3.1.2. Text Pre-Processing

The standard pre-processing steps are followed: removal of stop words, removal of punctuation, removal of URLs, lowercasing of the text, and lemmatization. The label of filtered reviews is changed from −1 to 0, as this is a binary classification problem and some of the classifiers used expect classes to be labeled as 0 and 1 only. In this study, the NLTK package is used for preliminary processing, and lemmatization is performed with the spaCy package, which has been reported to give the best results in real-world settings [69].
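A compact sketch of these steps using NLTK stop words and spaCy lemmatization is shown below; it assumes the small English spaCy model (en_core_web_sm) is installed and is an illustrative pipeline rather than the authors’ exact code.

```python
import re

import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # keep tagging/lemmatization only

def preprocess(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # lowercase, drop punctuation and digits
    doc = nlp(text)
    tokens = [tok.lemma_ for tok in doc
              if tok.lemma_ not in STOP_WORDS and not tok.is_space]
    return " ".join(tokens)

print(preprocess("The rooms were AMAZING!!! See http://example.com"))
```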

3.2. Feature Engineering

Apart from the existing meta-features, a total of thirty-seven psycholinguistic features have been derived, which can be classified into user-centric, product-centric, and review-centric features. Besides the features already reported in the literature, we have engineered some new features intuitively and adopted others from other fields of the literature. User-centric features are those concerned with user behavior; they include the average rating provided by the user across all products, the total number of reviews written by a user, and the deviation of those ratings. Product-centric features describe the characteristics of the product from the reviewers’ point of view; they include the number of reviews received by the product and the average rating given by users to the product. Lastly, review-centric features are mainly concerned with the linguistic aspects of the reviews; they include, but are not limited to, the presence of exclamation marks and the counts of uppercase and lowercase letters, as well as various emotional variables such as sentiment scores and variables highlighting anger, joy, and trust in the review text. The complete list of features, along with their descriptions and categorization, is presented in Table 3.
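To make the feature families concrete, the sketch below derives a few user- and product-centric features from the metadata frame introduced earlier. The feature names (avg_Urating, day_Urating, etc.) mirror those shown in the figures, but the exact definitions here are illustrative assumptions rather than the authors’ implementation.

```python
# Assumes 'meta' is the metadata DataFrame loaded earlier, with 'date' parsed as datetime.
user_stats = meta.groupby("user_id").agg(
    avg_Urating=("rating", "mean"),    # average rating provided by the user
    num_Ureviews=("rating", "size"),   # total number of reviews written by the user
    Urating_dev=("rating", "std"),     # deviation of the user's ratings
)

# Maximum number of ratings a user posted on a single day (a behavioral, user-centric cue).
day_Urating = (
    meta.groupby(["user_id", meta["date"].dt.date])["rating"].size()
        .groupby("user_id").max()
        .rename("day_Urating")
)

prod_stats = meta.groupby("prod_id").agg(
    avg_Prating=("rating", "mean"),    # average rating received by the product
    num_Previews=("rating", "size"),   # number of reviews received by the product
)

features = (
    meta.join(user_stats, on="user_id")
        .join(day_Urating, on="user_id")
        .join(prod_stats, on="prod_id")
)
```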
Figure 2 shows the correlation matrix among the features of the YelpZip dataset. Most features show either a negative or negligible correlation (values below or closer to 0), suggesting that there is no multicollinearity issue with the acquired features. Figure 3 illustrates the cumulative distribution function for the engineered features from the YelpZip dataset. In this context, ‘Ham’ refers to genuine reviews and ‘Spam’ to fake reviews. Features like avg_Urating, day_Urating, Entropy, similarity, and day_entropy exhibit the greatest discriminatory power. For the correlation plot and CDF of the YelpChi and YelpNYC datasets, refer to Appendix B and Appendix C, respectively.

3.3. Text Embedding

We have considered the transformer-based BERT (Bidirectional Encoder Representations from Transformers) embedding and its popular variants, ALBERT, RoBERTa, and DistilBERT, to convert textual input into a machine-understandable numerical form. The transformer architecture utilizes attention mechanisms to process input data in parallel, capturing contextual information efficiently. BERT can produce meaningful embeddings because it is trained on large-scale real-world datasets. It leverages an attention mechanism that dynamically calculates the relationships between input words based on their context within a sentence, and it comprehends context by considering both preceding and succeeding words, capturing intricate relationships for better understanding. ALBERT optimizes BERT’s efficiency by sharing parameters among layers, offering similar performance with fewer parameters and making it computationally more efficient. RoBERTa refines BERT by modifying the training procedure, removing the next-sentence-prediction task, and using larger mini-batches for improved natural language understanding. DistilBERT retains the essential language representations with reduced complexity, enabling faster training and lower resource requirements while maintaining performance.
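A minimal sketch of producing such embeddings with the Hugging Face transformers library is shown below, using DistilBERT and mean pooling over token vectors; the pooling strategy and the checkpoint name are illustrative choices, and any of the other variants can be swapped in by changing the model name.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # swap for BERT, ALBERT, or RoBERTa checkpoints
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts, max_length=256):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool the token embeddings, ignoring padding positions.
    mask = enc["attention_mask"].unsqueeze(-1)
    return (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["The staff was friendly and the room was spotless."])
print(vectors.shape)  # torch.Size([1, 768])
```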

3.4. Fake Review Detection Model

Our approach utilizes a supervised learning framework known as the stacking-based ensemble technique. Table 1 lists various classifiers sourced from the literature, including the multi-layer perceptron classifier, the Random Forest classifier (bagging), logistic regression, the k-nearest neighbor classifier (k = 3), and the XGBoost classifier (boosting). The multi-layer perceptron classifier is a neural network with multiple layers of interconnected nodes that learns complex patterns, making it effective for tasks like image recognition and classification. The Random Forest classifier is an ensemble method that builds multiple decision trees during training and combines their predictions in classification tasks. Logistic regression is a statistical model used for binary classification that predicts the probability of an event occurring; it employs the logistic function to map input features into a probability range between 0 and 1. The k-nearest neighbor classifier classifies data points based on the majority class among their k (= 3) nearest neighbors, making it straightforward and adaptable to various datasets. XGBoost is an optimized gradient boosting algorithm that combines weak learners (usually decision trees) to create a strong predictive model; the boosting technique sequentially builds a series of models, with each new model focusing on correcting the errors made by the previous ones, enhancing overall predictive performance.
Classifiers can be combined via two approaches: voting or stacking. As the name suggests, voting adds an extra layer that decides the final outcome based on the majority rule over the base classifiers’ predictions. Stacking is a more complex approach in which the base classifiers’ predictions are used as input to another classifier, creating a layered architecture. Through experimentation, we chose the XGBoost classifier as the meta-classifier because it delivered the best results among the prospective classifiers. We use stacking over predicted probabilities rather than voting, as it provides better performance [77]. Furthermore, we report five-fold stratified cross-validation results across all the prominent metrics.
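The stacking architecture described above can be sketched with scikit-learn’s StackingClassifier, using the base learners named in the text, predicted probabilities as the meta-features, and XGBoost as the meta-classifier; the hyperparameters shown are illustrative defaults, not the tuned values used in the study.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

base_learners = [
    ("xgb", XGBClassifier(eval_metric="logloss")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("mlp", MLPClassifier(max_iter=500, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# The base learners' predicted probabilities become the input of the XGBoost meta-classifier.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=XGBClassifier(eval_metric="logloss"),
    stack_method="predict_proba",
    cv=5,
)

# X_res, y_res: balanced feature matrix and labels from the preprocessing step above.
stack.fit(X_res, y_res)
```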

3.5. Performance Evaluation

In order to evaluate the performance of the classifier, several metrics are used. Table 4 shows a binary confusion matrix.
Then, the metrics are defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Specificity = TN / (TN + FP)
The area under the curve (AUC) is computed from the curve plotted between sensitivity and 1 − specificity (the false positive rate) at different threshold values. The closer this value is to 1, the better the classification.
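A brief sketch of computing these metrics under five-fold stratified cross-validation with scikit-learn follows; the scorer names are scikit-learn’s conventions and correspond to the accuracy, average precision, recall, AUC, and F1 definitions above.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

scoring = ["accuracy", "average_precision", "recall", "roc_auc", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 'stack', X_res, and y_res follow from the model and balanced data set up earlier.
results = cross_validate(stack, X_res, y_res, cv=cv, scoring=scoring)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name}: mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```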

4. Findings and Discussion

4.1. Model Evaluation

To address Research Questions 1 (RQ1) and 2 (RQ2), a series of empirical experiments were conducted, involving the systematic manipulation of feature sets and diverse embedding styles and dataset sizes. A comprehensive total of five distinct experiments were executed, delineated by variations in both embedding and feature configurations.
Specifically, Experiment 1 was designed to investigate classification performance solely utilizing engineered features. Subsequent experiments, namely Experiments 2 through 5, incorporated the fusion of features alongside word embeddings. Experiment 2 explored the amalgamation of features with BERT embeddings, while Experiments 3 and 4 sequentially incorporated ALBERT and DistilBERT embeddings. Experiment 5 culminated with the utilization of RoBERTa embeddings. Within Experiment 1, a progressive strategy was adopted, commencing with the standalone utilization of each derived feature set and subsequently evaluating their combined effect. Detailed graphical representations of our model’s performance across varied evaluation metrics were generated for the YelpZip, YelpChi, and YelpNYC datasets (Figure 4, Figure 5 and Figure 6, respectively). Notably, the delineated feature sets encompass user-centric (FU), product-centric (FP), review-centric (FR), and the composite of all features (FA).
Figure 4 elucidates that the stacking technique consistently yielded optimal outcomes across diverse scenarios. Particularly, for the expansive YelpZip dataset, the model achieved notable performance metrics, with an accuracy of 83.89%, an average precision of 92.93%, a recall of 79.22%, an AUC of 91.46%, and an F1 score of 88.12%. Analogous trends were echoed in the context of YelpChi (Figure 5) and YelpNYC (Figure 6) datasets.
For a comprehensive assessment of the model’s performance, readers are encouraged to review Appendix A, Appendix B and Appendix C. Detailed examination of these sections indicates that feature engineering alone, as opposed to its combination with various large language models, is exceptionally effective. This underscores the idea that in the context of detecting counterfeit reviews, the use of feature engineering and the extraction of relevant features are fundamentally more appropriate than employing resource-heavy neural network-based large language models.
To explore the explainability aspect of our model, we plotted the feature importance using the full set of features. Since our model is based on stacking architecture, it is not advisable to depend upon the importance plot of individual classifiers such as XGBoost or Random Forest. Researchers propose various approaches, such as LIME [78] or DeepLIFT [79], to improve the interpretability of the model. For our study, we employed SHAP (Shapley Additive exPlanations), as recommended by [80]. SHAP is a comprehensive, model-agnostic method that amalgamates different feature importance techniques previously developed by researchers. In a nutshell, it performs sensitivity analysis based on accuracy. The SHAP values allow us to understand any prediction or classification as the sum effect of the features. Figure 7 shows the beeswarm plot of feature importance based on SHAP values.
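As a rough sketch of how such SHAP explanations can be produced for a stacked model (which exposes no native feature importances), the model-agnostic KernelExplainer below is applied to the predicted probability of the genuine class; the background sample size, the 1000-review slice, and the names X_test and feature_names are assumptions for illustration.

```python
import shap

# X_test: NumPy array of engineered features for held-out reviews; feature_names: their labels.
background = shap.sample(X_test, 100)  # small background set keeps KernelExplainer tractable

def predict_genuine(X):
    return stack.predict_proba(X)[:, 1]  # probability of the 'genuine' class

explainer = shap.KernelExplainer(predict_genuine, background)
shap_values = explainer.shap_values(X_test[:1000])

# Global summary (beeswarm-style plot, as in Figure 7).
shap.summary_plot(shap_values, X_test[:1000], feature_names=feature_names)

# Local explanation of a single review (force plot, as in Figures 8-10).
shap.force_plot(explainer.expected_value, shap_values[0], X_test[0],
                feature_names=feature_names, matplotlib=True)
```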
As shown in Figure 7, user-centric features dominate the top five features in all three datasets. For the YelpZip dataset, the plot suggests that an increased number of reviews written by a user in a single day can decrease the model’s accuracy by as much as 6%. This observation is consistent with the model performance, which indicates that user-centric features generally lead to improved model performance. The number of ratings provided by a user in a day, the average rating provided by the user, and the average number of words written by the user per review consistently rank among the top five features.
The feature importance plot (Figure 7) also shows that behavioral features are more helpful than linguistic features in distinguishing fake from genuine reviews. The input variables are arranged top to bottom by their mean absolute SHAP values for the first 1000 reviews in the test dataset. In Figure 7a, the values for ‘day_Urating’ can be interpreted as follows: the higher the number of ratings provided by a user in a day, the lower the chance that the user is truthful. Figure 8, Figure 9 and Figure 10 show the local explanation of a fake review record using the SHAP force plot for each of the datasets. The plots are interpreted with respect to the binary target: fake (label = 0) or truthful (label = 1). In each plot, the bold value (0.00 here) is the model’s score for the observation; higher scores lead the model to predict 1, and lower scores lead it to predict 0. The features that were most important in making the prediction are shown in red and blue, with red representing features that pushed the model score higher and blue representing features that pushed the score lower. Features with a greater impact on the score are located closer to the dividing boundary between red and blue, and the size of that impact is represented by the size of the bar. Again, behavioral features are significant in deciding whether a record is classified as fake or truthful.
Additionally, it can be noticed from Figure 4, Figure 5 and Figure 6 that the results of the stacking classifier were closest to the best-performing classifier. It was noted that user-related features contributed significantly to the classification performance, both when used alone and when combined with other features or embeddings.

4.2. Benchmarking

The proposed model is benchmarked with that of existing models and frameworks. The following popular works based on the Yelp dataset are considered:
  • SpEagle [26]: An unsupervised learning approach (Spam Eagle) capable of integrating and scaling with labeled data using metadata along with relational data. The authors were the ones who curated the datasets. The performance of their model was tested on all three datasets.
  • Ref. [21]: Proposed an effective multi-feature-based model. They suggested some new features and conducted a performance evaluation based on burst features. The authors have considered only the YelpZip dataset.
  • Ref. [22]: They have developed a novel hierarchical supervised learning approach that analyzes user features and their interactions in univariate and multivariate distributions. They have also used the YelpZip dataset for modeling purposes.
  • SPR2EP [81]: They proposed a semi-supervised framework (SPam Review REPresentation) for fake review detection, which uses the feature vectors extracted from reviews, reviewers, and products. After combining these vectors for detection purposes, they demonstrated the performance of their model on all three datasets.
  • HFAN [66]: Hierarchical Fusion Attention Network (HFAN) is a deep learning-based technique that automatically learns reviews’ semantics from the user and product information using a multi-attention unit.
  • Ref. [60]: Proposed a convolution neural network-based architecture connecting sentiment-dependent linguistic features and behavioral features via a fully connected layer to determine fake and genuine reviewers.
  • Ref. [82]: Proposed an integrated multi-view feature strategy, blending implicit and explicit features from review content, reviewer data, and product descriptions. They introduced a hybrid extraction method, combining word- and sentence-level techniques with attention. This extends to a classification framework with an ensemble classifier leveraging a convolutional neural network (CNN) for reviewer information, a deep neural network (DNN) for product-level analysis, and a Bidirectional-Long Short-Term Memory (Bi-LSTM) for review-level features. This comprehensive methodology aims to enhance analysis effectiveness across diverse dimensions.
  • Ref. [83]: Used the fine-tuned version of BERT to identify fake reviews just from review text.
As can be seen from Table 5, most of the work has been done on the YelpZip dataset; one reason for this could be the availability of a larger amount of labeled data compared with the other two datasets. Furthermore, most of these authors have not measured the performance of their models on other well-defined and accepted metrics, such as accuracy, recall, and F1; most have evaluated their models against the AUC metric alone.
As apparent from Table 5, our model has delivered better performance as compared to the other models on almost all the performance metrics across the three datasets. It is crucial to employ a model that exhibits robustness across various metrics, and our model successfully bridges this gap. The results also demonstrate the effectiveness of our approach of stacking the output of heterogeneous classifiers to reliably detect spam or fake reviews in both small and large datasets. Furthermore, our model is resource-efficient, unlike deep learning models that are complex to implement and scale, are resource-hungry, and require a heavy infrastructural investment. Our model’s performance is comparable to the advanced deep learning model proposed by [66]. These obvious advantages of our model over other models make our work more practical and applicable for real-world deployment.

5. Conclusions

This study is dedicated to addressing the pervasive challenge of detecting counterfeit reviews within the hospitality and restaurant domain. Our approach centers around the development of a stacking-based model, which ingeniously amalgamates the outputs of divergent base classifiers and harnesses a meta-classifier to discern the veracity of a given review—whether it is genuine or fabricated. The framework leverages a comprehensive suite of user-centric, product-centric, and review-centric features, collectively providing a multifaceted perspective on review authenticity.
The potency of our strategy resides in the stacking technique’s ability to consolidate the predictive capabilities of individual classifiers, culminating in an elevated overall efficiency. Evidently, the endeavor culminates in the establishment of a model proficient in discerning between fake and truthful reviews with commendable accuracy levels. Intriguingly, our model exhibits superior performance in comparison to well-established works across a spectrum of relevant performance metrics. Furthermore, the framework’s adaptability manifests in its ability to robustly operate across varying dataset sizes, rendering its utility independent of the dataset’s magnitude.
The ethical landscape pertaining to review authenticity is intricate, especially concerning the deployment of filtering algorithms designed to distinguish fraudulent or suspicious reviews. While these algorithms aim to uphold the quality and reliability of reviews, the potential for false positives—incorrectly classifying authentic reviews as fraudulent—poses a risk, potentially tarnishing the reputations of businesses or individuals. This gives rise to concerns about algorithmic bias and its implications for stakeholders. To mitigate this, we advocate for the labeling of reviews identified by the model as ‘not suggested reviews’ rather than outright deletion or labeling them fake/manipulated. Additionally, triangulating model findings with supplementary data sources, such as IP addresses, reviewer location, reviewer history, and businesses’ history, can further enhance accuracy and reduce false positives. Routine model retraining involving deliberate introduction of fake reviews contributes to increased accuracy and resilience against false detection.
Furthermore, ethical challenges extend to the transparency of the filtering process and the disclosure of filtered reviews. Users may lack full awareness of the criteria guiding review identification and filtering, hindering their ability to evaluate information reliability. Balancing platform integrity and user transparency emerges as a nuanced ethical consideration. Addressing this challenge, our model prioritizes interpretability, facilitating a clear understanding of features significantly contributing to fake review identification.
The implications of this investigation transcend the confines of the hospitality and restaurant sector, extending to domains like movie reviews and e-commerce. This affords a broader utility for our model’s insightful mechanisms. The hospitality and tourism review aggregators can embrace our model as an instrument to identify potentially inauthentic reviews, thus embarking on a path of enhanced credibility. Acknowledging the inherent uncertainty surrounding the authenticity of reviews, we advocate for a ‘not-suggested’ annotation for reviews flagged by our algorithm. This approach empowers website visitors to make more informed choices safeguarded against the ambiguity of potentially deceptive reviews.
As the reliance on reviews surges among consumers making purchasing decisions, it is imperative for stakeholders to proactively establish regulations and mechanisms to thwart the influence of counterfeit reviewers. Our recommendations extend beyond mere detection mechanisms. We propose a preemptive approach, suggesting the integration of a pop-up prompt before a review is submitted. This prompt would serve as a reminder to users, encouraging them to ensure their reviews adhere to the platform’s ethical guidelines. The wisdom advocated by [32] underscores the importance of consumer awareness and regulatory vigilance, effectively shielding prospective consumers from the deleterious impact of counterfeit reviews.

5.1. Theoretical Contributions

The proliferation of counterfeit reviews presents a dual menace, impacting both consumers and businesses. The increasing digital footprint of those who have grown up in the digital era heightens their vulnerability to misleading and deceptive information. This represents a significant concern as it has the capacity to profoundly skew decision-making processes by introducing cognitive biases. In light of this challenge, the imperative for a robust and scalable mechanism to counteract this influx of misinformation becomes strikingly evident.
Our contribution to the scholarly landscape is multifaceted. Firstly, our work extends the existing corpus of the deception detection literature through the development of an efficient system adept at identifying suspicious reviews. This accomplishment is underpinned by the formulation of a stacking-based framework adept at harnessing the capabilities of diverse underlying classifiers. The precedent established by [84], who validated the superior performance of this architecture through simulations using real-world data, highlights its effectiveness. Unlike deep learning-based architectures, these models demonstrate accelerated convergence and adaptability to varying data sizes. A notable shift from previous research emerges, affirming the pre-eminence of user-centric behavioral features over their linguistic counterparts. This underscores that the inherent characteristics of users exert a more profound influence on the classification process.
Secondly, the enhancement of model interpretability constitutes a pivotal facet of our approach. Rather than relying on conventional feature importance plots prevalent in the literature, our strategy leverages SHAP values. This choice enables a deeper level of insight, as SHAP-based importance plots not only unveil feature significance but also elucidate how variations in feature values impact the model’s outcomes. Furthermore, as corroborated by [85], the conventional feature importance plots are inherently sensitive to the chosen methodology, posing a concern over robustness.
Our endeavor encompasses the introduction of novel features, as underscored by the insights gleaned from Table 5. Importantly, the architectural framework we have devised operates as a modular entity, thereby accommodating the integration of more efficient classifiers as they emerge in the future research landscape. Such adaptability can be achieved through minimal code adjustments, rendering our model remarkably customizable at a negligible expense.
In summary, empirical tests of our model against recognized benchmarks confirm its effectiveness. It consistently surpasses standard models in various evaluation metrics, regardless of the size of the dataset. This empirical evidence highlights the strength, robustness, and dependability of our approach.

5.2. Managerial Implications

In terms of managerial implications, the pervasive issue of counterfeit reviews poses a significant threat to trust and credibility within the review ecosystem. Platforms like Yelp, reporting the filtration of nearly 25% of their reviews, underscore the gravity of the challenge [34]. Notable instances such as Oobah Butler’s experiment in 2017, where a fictitious profile became a top-rated restaurant on Tripadvisor, further accentuate the vulnerability of such platforms.
Our work provides a practical solution to the problem of counterfeit review identification. The proposed method distinguishes itself by its ease of implementation, circumventing the complexities often associated with academic solutions. Its lightweight nature, swift convergence, and ability to flag fraudulent reviews at an acceptable threshold align seamlessly with existing operational standards. Moreover, the ordered importance of features offers review aggregator platforms actionable insights for identifying cues in reviewer comments, augmenting their mechanisms to combat fraudulent activities.
Beyond its immediate application, the generalizability and scalability of our approach extend its potential to diverse domains grappling with the scourge of fake reviews. The far-reaching implications of our study empower review platforms to mitigate the incursion of counterfeit reviews. Such liberation from the influence of fraudulent reviewers engenders a greater degree of trust in the disseminated information, culminating in heightened user traffic and ultimately translating into enhanced revenue for businesses.
From the consumer perspective, our study holds the promise of furnishing them with credible and dependable information for decision-making purposes. This will help the consumer make a more informed purchase, leading to lesser post-purchase dissonance and satisfactory experience.
Our model has the potential to significantly enhance the trustworthiness of the internet, particularly in the context of user reviews. By effectively identifying and flagging suspicious reviews, it acts as a robust deterrent against the proliferation of counterfeit feedback, thus mitigating the distortion of online information. This not only empowers users to engage with more reliable content but also instills confidence in the integrity of the digital space. An additional layer of transparency allows users to discern between reviews that may lack authenticity. In turn, businesses that heavily depend on user reviews stand to benefit immensely. The model’s capability to distinguish between genuine and fraudulent reviews shields businesses from potential reputational harm caused by deceptive feedback. This safeguarding of reputations contributes to building and maintaining the credibility of businesses in the online sphere. Moreover, the adaptability of our approach across diverse dataset sizes ensures that businesses of varying scales can leverage the model to enhance their online presence.

5.3. Societal Implications

Our efforts to address counterfeit reviews wield profound societal impact by fostering a culture of authenticity and trust in the digital realm. In a broader context, our model not only benefits consumers and businesses but contributes to shaping responsible online behavior. By empowering users to make informed choices and safeguarding against misinformation, we align with societal goals of promoting consumer rights and digital literacy. The adaptability of our approach ensures that even smaller businesses integral to local economies can enjoy enhanced credibility. Ethical considerations embedded in our model, such as the ‘not-suggested’ annotation and transparency in the filtering process, reflect a commitment to responsible technology use. Our proactive approach, exemplified by the integration of a pop-up prompt, sets a precedent for ethical platform management, influencing the broader landscape. Moreover, our emphasis on interpretability contributes to increased transparency and accountability, enabling users to critically evaluate online information. In essence, our work goes beyond the technical realm, actively participating in the ongoing discourse on responsible technology development with the overarching aim of creating a digital landscape that positively impacts society at large.

5.4. Limitations

Our study encounters significant limitations originating from the choice of the foundational classifier. The configuration encompassing the count and nature of both the base classifiers and a meta-classifier wields substantial influence over the model’s performance. A subsequent limitation pertains to the temporal relevance of our database, which potentially renders it an imperfect representation of contemporary lexical usage. Notably, features such as punctuation count, lexical diversity, and lexical density are susceptible to this dynamic linguistic landscape. This phenomenon has been examined by [57], who unveiled its repercussions on classifier efficacy within a comparable context. An important limitation arises from the dataset’s balance or lack thereof. It is imperative to examine how our model will perform when confronted with an imbalanced dataset where the distribution of instances across different classes is uneven. Understanding the model’s behavior under such conditions is critical as it may impact its accuracy in real-world scenarios.

5.5. Future Work

In future work, we intend to assess our model on alternative datasets, a step that should bolster its adaptability and reinforce its empirical validity. The model also awaits validation against large language models, such as GPT-4 and ChatGPT, which can generate synthetic text dynamically. The caution raised by [86] against the uncritical adoption of artificial intelligence without vigilant scrutiny for inherent biases is salient here; to build a more resilient model, we intend to scrutinize our data for potential biases. Yet another direction is to extend the model to detect fake reviews posted by new users for whom it lacks behavioral data.
An integral aspect of our future research is the construction of a tangible regulatory framework grounded in practical principles, intended to empower review aggregators in their efforts to combat the disruptive and malicious conduct of fake reviewers. Moreover, our model primarily integrates features derived from the textual content of reviews; incorporating aspects of social networks, such as reviewer interactions, could add a new layer of information and potentially enhance its performance further.
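As a purely hypothetical illustration of that last direction (it is not part of the present model), reviewer interactions could be approximated by projecting a reviewer–product bipartite graph onto reviewers, so that two reviewers are linked whenever they have reviewed the same establishment; the co-review degree then becomes an additional behavioral feature. The networkx-based sketch below shows one such derivation; all identifiers are illustrative.

```python
# Hypothetical sketch of a social-network feature for future work: link
# reviewers who reviewed the same product and use the resulting co-review
# degree as an extra behavioural feature.
import networkx as nx
from networkx.algorithms import bipartite

# Placeholder (reviewer_id, product_id) pairs; replace with real review metadata.
edges = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p2"), ("u4", "p3")]

B = nx.Graph()
reviewers = {u for u, _ in edges}
products = {p for _, p in edges}
B.add_nodes_from(reviewers, bipartite=0)
B.add_nodes_from(products, bipartite=1)
B.add_edges_from(edges)

# Project onto reviewers: an edge means "reviewed at least one common product".
co_review = bipartite.weighted_projected_graph(B, reviewers)
co_review_degree = {u: co_review.degree(u) for u in reviewers}
print(co_review_degree)  # e.g. {'u1': 1, 'u2': 2, 'u3': 1, 'u4': 0}
```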

Author Contributions

Conceptualization, S.A.A. and A.F.J.; methodology, S.A.A.; software, S.A.A.; validation, A.F.J., P.K.B., P.K.P. and S.B.; formal analysis, S.A.A.; investigation, S.A.A.; writing—original draft preparation, S.A.A.; writing—review and editing, S.A.A. and P.K.B.; visualization, S.A.A.; supervision, P.K.B. and P.K.P.; project administration, P.K.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors did not curate or collect any data. The dataset is available at https://odds.cs.stonybrook.edu/ (accessed on 2 June 2020).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Classifier performance on the YelpZip dataset.
Features | Metric | XGBoost | MLP Classifier | KNN | Random Forest | Logistic Regression | Stacking
Experiment 1—Features Only
FAAcc0.7340.5390.6600.7350.6650.719
AP0.8330.5400.6680.8300.7280.845
Recall0.6790.3010.6260.6770.6490.769
AUC0.8240.5410.7070.8220.7220.836
F10.7080.2790.6420.7090.6520.719
Log loss−0.523−0.677−4.157−0.529−0.620−0.539
FPAcc0.6150.5120.5330.5860.5100.618
AP0.6560.5090.5270.6040.5210.660
Recall0.6590.2750.5330.5860.8250.654
AUC0.6700.5130.5430.6220.5370.673
F10.6280.2560.5330.5850.6170.627
Log loss−0.649−0.692−4.796−0.870−0.694−0.647
FPRAcc0.6790.5410.5620.6780.6020.673
AP0.7620.5590.5550.7540.6460.760
Recall0.6500.3270.5480.6310.4810.646
AUC0.7630.5570.5800.7560.6380.761
F10.6570.3080.5530.6500.5370.654
Log loss−0.589−0.679−4.707−0.594−0.665−0.594
FRAcc0.6590.6540.5710.6560.6150.664
AP0.7380.7280.5690.7270.6610.741
Recall0.5980.6010.5770.5950.4800.597
AUC0.7350.7270.5950.7270.6620.738
F10.6240.6240.5710.6240.5430.628
Log loss−0.605−0.611−4.582−0.612−0.657−0.602
FUAcc0.7540.5250.7470.7610.6920.839
AP0.8510.5260.7740.8710.7820.929
Recall0.6720.4630.8540.7540.6110.792
AUC0.8190.5250.8240.8410.7500.915
F10.7320.3580.7720.7590.6650.831
Log loss−0.501−0.684−3.976−0.575−0.590−0.359
FUPAcc0.7300.5240.6590.7640.6610.756
AP0.8270.5220.6650.8540.7210.862
Recall0.6740.2550.6280.7210.6630.673
AUC0.8190.5240.7070.8490.7180.856
F10.7040.2210.6420.7460.6550.717
Log loss−0.528−0.683−4.185−0.499−0.625−0.519
FURAcc0.7180.5190.7150.7130.6520.749
AP0.8170.5220.7360.8030.7170.853
Recall0.6650.2680.6700.6590.6510.725
AUC0.8080.5210.7720.7950.7140.846
F10.6930.2260.6940.6890.6440.729
Log loss−0.536−0.689−3.785−0.553−0.629−0.514
Experiment 2—BERT embedding along with features
BAcc0.6800.5320.6110.6950.5380.778
AP0.6590.6290.6110.6240.6140.659
Recall0.7370.6420.7070.5680.5910.810
AUC0.6230.6580.5890.5060.7130.717
F10.7320.6900.7040.5160.6980.776
Log loss−0.424−0.401−0.524−0.453−0.442−0.620
BUAcc0.5370.6170.5690.7050.5140.709
AP0.6080.5410.6470.7230.7140.788
Recall0.6770.5230.6310.5830.7430.667
AUC0.6200.6420.7200.6550.6120.802
F10.5680.7140.5970.5330.5750.652
Log loss−0.425−0.579−0.306−0.492−0.543−0.496
BPAcc0.5180.6060.5930.5120.5870.776
AP0.6140.6720.6500.6090.7300.765
Recall0.7350.6750.5610.6350.6640.769
AUC0.6600.5360.7000.7290.7400.724
F10.6870.7070.5490.5410.5740.718
Log loss−0.334−0.547−0.301−0.325−0.509−0.439
BRAcc0.5660.6060.6380.5980.5540.761
AP0.5750.5270.7370.5780.6030.857
Recall0.5770.5490.6960.5390.6720.679
AUC0.5170.7340.6670.6590.6220.847
F10.6350.6320.5710.5510.6630.733
Log loss−0.383−0.327−0.585−0.493−0.353−0.439
BUPAcc0.7050.5740.6720.5470.5750.680
AP0.6520.6110.6120.6000.6140.684
Recall0.7110.7440.5500.6920.6890.724
AUC0.6650.5770.5850.7280.5910.742
F10.6450.6670.6880.6250.7410.657
Log loss−0.495−0.458−0.472−0.510−0.441−0.352
BURAcc0.6890.7160.6140.5650.5470.650
AP0.6270.6110.6170.6200.6680.680
Recall0.5540.6220.6970.5280.6580.719
AUC0.6980.5830.7470.6040.5810.740
F10.6330.7060.5380.6120.6320.681
Log loss−0.367−0.591−0.394−0.608−0.528−0.277
BPRAcc0.6440.7190.6540.5340.6360.791
AP0.7130.5790.5240.5300.6450.790
Recall0.5240.6350.5980.6760.6810.671
AUC0.7490.7340.6440.6760.5630.774
F10.7060.6970.6620.5020.6820.716
Log loss−0.460−0.605−0.303−0.459−0.403−0.283
BAAcc0.5810.7120.5360.6400.7090.741
AP0.7440.7240.5880.5600.6740.690
Recall0.7350.6750.7410.5470.5380.762
AUC0.6590.5440.5550.7160.6020.775
F10.6010.7420.5100.7420.7330.795
Log loss−0.433−0.411−0.544−0.562−0.305−0.318
Experiment 3—ALBERT embedding along with features
AAcc0.4900.4860.4900.4890.4880.505
AP0.4890.4830.4920.4860.4860.509
Recall0.4820.4420.4930.4520.4390.448
AUC0.4850.4780.4860.4830.4810.509
F10.4850.4530.4910.4690.4580.472
Log loss−0.724−0.697−5.255−0.708−0.696−0.696
AUAcc0.7270.6450.7650.6790.6550.796
AP0.8300.6930.7920.7570.6860.882
Recall0.6800.5100.7140.6110.6020.747
AUC0.8160.7010.8230.7510.7030.875
F10.7070.5770.7470.6460.6270.779
Log loss−0.522−1.908−3.628−0.599−0.643−0.456
APAcc0.5990.5420.5330.5670.5100.585
AP0.6270.5560.5270.5790.5210.605
Recall0.6090.5270.5330.5600.8250.635
AUC0.6420.5650.5430.5990.5370.622
F10.5990.5140.5330.5620.6170.602
Log loss−0.670−0.903−4.791−0.678−0.694−0.677
ARAcc0.6440.6360.5710.6320.6170.645
AP0.7190.7050.5690.6990.6640.722
Recall0.5990.5180.5770.5450.4810.538
AUC0.7140.7010.5940.6940.6660.716
F10.6180.5720.5710.5850.5430.587
Log loss−0.626−0.630−4.585−0.632−0.655−0.620
AUPAcc0.7550.6440.6590.6950.6610.761
AP0.8570.6550.6660.7770.7210.857
Recall0.7090.5800.6280.6300.6640.679
AUC0.8470.6790.7070.7710.7180.847
F10.7370.6080.6420.6630.6550.733
Log loss−0.491−2.421−4.184−0.588−0.625−0.489
AURAcc0.7330.6530.7150.6880.6510.762
AP0.8370.7180.7360.7730.7180.861
Recall0.6880.4590.6690.6200.6550.691
AUC0.8240.7110.7720.7650.7140.851
F10.7140.5590.6940.6550.6450.734
Log loss−0.515−2.235−3.782−0.584−0.629−0.496
APRAcc0.6710.6340.5620.6550.6080.686
AP0.7490.6900.5550.7240.6510.767
Recall0.6470.5600.5470.5830.4870.694
AUC0.7490.6830.5800.7240.6460.762
F10.6540.5950.5530.6160.5430.684
Log loss−0.605−0.639−4.708−0.618−0.664−0.585
AAAcc0.7590.6490.6600.6990.6650.747
AP0.8640.7110.6680.7860.7270.845
Recall0.7140.4880.6260.6330.6470.665
AUC0.8530.7040.7070.7800.7210.833
F10.7400.5690.6420.6660.6510.720
Log loss−0.484−3.609−4.157−0.575−0.621−0.499
Experiment 4—DistilBERT embedding along with features
DAcc0.6440.6360.5710.6320.6170.645
AP0.7190.7050.5690.6990.6640.722
Recall0.5990.5180.5770.5450.4810.538
AUC0.7140.7010.5940.6940.6660.716
F10.6180.5720.5710.5850.5430.587
Log loss−0.626−0.630−4.585−0.632−0.655−0.620
DUAcc0.7330.6530.7150.6880.6510.762
AP0.8370.7180.7360.7730.7180.861
Recall0.6880.4590.6690.6200.6550.691
AUC0.8240.7110.7720.7650.7140.851
F10.7140.5590.6940.6550.6450.734
Log loss−0.515−2.235−3.782−0.584−0.629−0.496
DPAcc0.4900.4860.4900.4890.4880.505
AP0.4890.4830.4920.4860.4860.509
Recall0.4820.4420.4930.4520.4390.448
AUC0.4850.4780.4860.4830.4810.509
F10.4850.4530.4910.4690.4580.472
Log loss−0.724−0.697−5.255−0.708−0.696−0.696
DRAcc0.7550.6440.6590.6950.6610.761
AP0.8570.6550.6660.7770.7210.857
Recall0.7090.5800.6280.6300.6640.679
AUC0.8470.6790.7070.7710.7180.847
F10.7370.6080.6420.6630.6550.733
Log loss−0.491−2.421−4.184−0.588−0.625−0.489
DUPAcc0.7270.6450.7650.6790.6550.796
AP0.8300.6930.7920.7570.6860.882
Recall0.6800.5100.7140.6110.6020.747
AUC0.8160.7010.8230.7510.7030.875
F10.7070.5770.7470.6460.6270.779
Log loss−0.522−1.908−3.628−0.599−0.643−0.456
DURAcc0.5990.5420.5330.5670.5100.585
AP0.6270.5560.5270.5790.5210.605
Recall0.6090.5270.5330.5600.8250.635
AUC0.6420.5650.5430.5990.5370.622
F10.5990.5140.5330.5620.6170.602
Log loss−0.670−0.903−4.791−0.678−0.694−0.677
DPRAcc0.6710.6340.5620.6550.6080.686
AP0.7490.6900.5550.7240.6510.767
Recall0.6470.5600.5470.5830.4870.694
AUC0.7490.6830.5800.7240.6460.762
F10.6540.5950.5530.6160.5430.684
Log loss−0.605−0.639−4.708−0.618−0.664−0.585
DAAcc0.7590.6490.6600.6990.6650.747
AP0.8640.7110.6680.7860.7270.845
Recall0.7140.4880.6260.6330.6470.665
AUC0.8530.7040.7070.7800.7210.833
F10.7400.5690.6420.6660.6510.720
Log loss−0.484−3.609−4.157−0.575−0.621−0.499
Experiment 5—RoBERTa embedding along with features
RAcc0.5060.6220.5630.7110.6480.784
AP0.7140.5040.7230.6470.6940.726
Recall0.6190.5250.5520.5320.6400.745
AUC0.7000.6690.5620.5710.6950.732
F10.6260.5060.6170.6560.5060.714
Log loss−0.448−0.417−0.367−0.437−0.376−0.357
RUAcc0.7450.7050.6410.6160.6810.720
AP0.5170.6950.6660.5410.5120.738
Recall0.5760.6450.7440.6970.6800.788
AUC0.5000.6080.6910.6780.6160.717
F10.6640.5080.6070.5740.5390.692
Log loss−0.458−0.392−0.433−0.401−0.434−0.288
RPAcc0.6220.5530.5050.7350.5450.735
AP0.5940.5180.5200.7020.7360.770
Recall0.5230.5140.7140.6760.5050.712
AUC0.6520.5670.7410.6530.6030.689
F10.7030.5180.5830.7150.5670.788
Log loss−0.376−0.395−0.411−0.366−0.452−0.353
RRAcc0.6410.5110.5320.5370.7420.761
AP0.5710.7290.5160.5600.6090.857
Recall0.5210.6990.7250.7390.5450.679
AUC0.6070.7470.7140.6420.6340.847
F10.5370.6940.7400.7240.5170.733
Log loss−0.416−0.427−0.376−0.417−0.364−0.408
RUPAcc0.5200.5430.7110.5450.7130.765
AP0.7080.6160.5720.5200.5060.654
Recall0.6420.5360.7230.5600.5150.769
AUC0.6620.7340.7420.5710.6580.659
F10.5160.5260.5230.5340.6190.783
Log loss−0.409−0.410−0.439−0.443−0.425−0.415
RURAcc0.5590.7200.5250.5230.6390.704
AP0.6190.5280.7390.6540.7420.761
Recall0.5660.6700.7480.6060.6670.694
AUC0.5740.7360.7220.6320.5160.806
F10.5910.6470.7380.5630.6300.726
Log loss−0.380−0.446−0.370−0.450−0.452−0.347
RPRAcc0.6810.6130.6520.7360.6600.690
AP0.7030.7000.6620.6770.6880.737
Recall0.5480.5090.5680.6230.6550.742
AUC0.6990.6610.5590.5670.6460.690
F10.6250.5290.6790.6480.6760.754
Log loss−0.394−0.444−0.369−0.393−0.358−0.413
RAAcc0.6140.6810.6360.5060.7160.717
AP0.6320.6470.5330.5800.6710.704
Recall0.5470.5570.7230.6380.5560.798
AUC0.5710.5490.6380.6100.5950.682
F10.6670.7170.5920.5100.6470.677
Log loss−0.380−0.417−0.425−0.458−0.420−0.309
where
FA: all features
FP: product-based features only
FPR: product- and review-based features only
FR: review-based features only
FU: user-based features only
FUP: user- and product-based features only
FUR: user- and review-based features only
B/D/A/R: BERT/DistilBERT/ALBERT/RoBERTa embeddings only
B/D/A/R U: embedding type with user-based features only
B/D/A/R P: embedding type with product-based features only
B/D/A/R R: embedding type with review-based features only
B/D/A/R UP: embedding type with user- and product-based features only
B/D/A/R UR: embedding type with user- and review-based features only
B/D/A/R PR: embedding type with product- and review-based features only
B/D/A/R A: embedding type along with all features
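For readers who wish to reproduce the spirit of these experiments, the sketch below shows one way the configurations reported in Tables A1–A3 could be assembled: a transformer embedding of each review is concatenated with the handcrafted features and passed to a stacking ensemble whose base learners mirror the table columns. The sentence-transformers checkpoint, the logistic-regression meta-learner, and all hyperparameters shown are assumptions for illustration; the appendix tables do not restate our exact pipeline.

```python
# Minimal sketch (not the exact configuration behind Tables A1-A3): combine
# transformer embeddings with handcrafted features and train a stacking
# ensemble over the base learners listed in the table columns.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Placeholder inputs; replace with real review texts, engineered features, labels.
rng = np.random.default_rng(0)
n = 200
reviews = ["placeholder review text"] * n
handcrafted = rng.random((n, 3))
labels = rng.integers(0, 2, n)

# 1. Review-level embeddings (BERT-based checkpoint; assumed, not necessarily ours).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(reviews)

# 2. Concatenate embeddings with scaled handcrafted features ("BU"/"BA"-style inputs).
X = np.hstack([embeddings, StandardScaler().fit_transform(handcrafted)])

# 3. Stacking ensemble with the base learners from the appendix tables.
base_learners = [
    ("xgb", XGBClassifier(eval_metric="logloss")),
    ("mlp", MLPClassifier(max_iter=500)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("lr", LogisticRegression(max_iter=1000)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X, labels)
print(stack.predict_proba(X[:3]))
```

Out-of-fold predicted probabilities from the base learners (via the internal cross-validation of StackingClassifier) form the meta-learner's input, which is the standard way to avoid leaking training labels into the second level.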

Appendix B

Figure A1. YelpChi correlation plot.
Figure A2. Cumulative distribution function of YelpChi dataset features.
Table A2. Classifier performance on the YelpChi dataset.
Features | Metric | XGBoost | MLP Classifier | KNN | Random Forest | Logistic Regression | Stacking
Experiment 1—Features Only
FUAcc0.711130.50.7651850.6911770.6545220.800676
AP0.8089270.4999970.791860.775330.6856250.880453
Recall0.6551390.543870.7144290.6782980.6015020.773096
AUC0.7993020.4999940.8231250.7688830.702670.875608
F10.6851830.567430.7466770.6819410.6270780.791612
Log loss−0.54263−0.69337−3.63036−0.66948−0.64286−0.44127
FPAcc0.597430.539690.530380.562280.535490.57013
AP0.629260.550210.524070.582180.556780.59872
Recall0.607580.409580.528310.560490.674720.59815
AUC0.639500.544720.538490.596160.560500.60374
F10.601430.462700.529410.561430.587250.58063
Log loss−0.67369−0.75793−4.86395−0.97317−0.68334−0.69055
FRAcc0.623610.586940.568340.642840.619410.61363
AP0.685490.646470.563440.698270.668640.66947
Recall0.607010.416820.576180.602750.497580.55846
AUC0.676670.634270.588700.693430.669780.66006
F10.616190.423680.571090.626590.565460.58490
Log loss−0.66959−0.72660−4.59863−0.63181−0.65315−0.66721
FUPAcc0.653830.537390.516700.635610.609650.62888
AP0.711740.562600.515350.680210.615130.68704
Recall0.646920.389590.498600.633020.585370.61430
AUC0.715130.557320.520480.692000.638310.68712
F10.650330.393990.507560.633350.599070.62252
Log loss−0.63640−1.26942−4.92633−0.66012−0.66680−0.64466
FURAcc0.644300.561550.541040.660160.619850.63791
AP0.711450.621650.535800.724400.658510.70726
Recall0.636160.468730.527970.635940.504420.61363
AUC0.705880.622140.554630.721320.662510.69799
F10.640420.466270.534480.650200.568280.62698
Log loss−0.65150−0.90026−4.95076−0.61356−0.65748−0.64313
FPRAcc0.659550.593060.562390.667510.613580.65271
AP0.726510.623740.553600.727480.654490.71574
Recall0.655220.542540.562500.629320.544780.62372
AUC0.721570.624990.580050.727240.652370.70867
F10.657140.569580.562020.653090.583510.64123
Log loss−0.63693−0.71511−4.58457−0.61016−0.65758−0.63310
FAAcc0.664310.554320.539180.670810.617670.65428
AP0.738320.596750.531510.737100.652820.72696
Recall0.656010.527370.521470.650630.557790.61800
AUC0.732240.593320.548450.734810.652120.71915
F10.660590.508200.530480.663070.591350.64049
Log loss−0.62650−1.14665−4.78246−0.60312−0.65784−0.62072
Experiment 2—BERT embedding along with features
BAcc0.493830.484640.493670.495520.482170.49540
AP0.496230.482700.497500.499750.480810.49824
Recall0.539100.239810.655690.431080.392950.55054
AUC0.495190.475830.494940.498810.475000.49809
F10.471910.231960.538410.416140.362510.47902
Log loss−0.71741−0.70104−6.37769−0.99291−0.69750−0.72700
BUAcc0.629610.519960.524220.611730.585210.61066
AP0.674360.566760.522220.644830.597900.65503
Recall0.614410.652140.512500.598940.594450.54927
AUC0.681770.566230.533270.655140.618160.66025
F10.622600.523120.518450.605080.587480.58398
Log loss−0.66111−1.34999−5.60317−0.71427−0.67780−0.66742
BPAcc0.592780.543220.531110.564250.538290.56514
AP0.623800.557480.524570.580810.556660.59568
Recall0.602980.352610.528980.558360.659250.58201
AUC0.632580.558080.539470.595470.562790.60011
F10.596810.431080.530110.561610.583810.57150
Log loss−0.67896−0.73740−4.83348−0.79076−0.68289−0.68931
BRAcc0.624560.600070.568280.637850.622320.62182
AP0.681580.672720.563400.694770.670030.67664
Recall0.610040.748410.576400.593220.501170.60432
AUC0.674620.673380.588610.689640.670440.66903
F10.618100.642910.571160.619500.569040.61400
Log loss−0.66977−0.69504−4.60242−0.63447−0.65263−0.67041
BUPAcc0.646990.542940.516700.640090.608200.63224
AP0.703740.575350.515350.682450.612900.68689
Recall0.639520.433470.498600.632350.575060.61071
AUC0.706250.569890.520480.695820.634710.68880
F10.643450.463050.507560.636020.594040.62277
Log loss−0.64318−1.11987−4.92633−0.64184−0.66831−0.64627
BURAcc0.642560.545190.541040.658540.618000.64183
AP0.708520.604860.535800.724220.656220.70850
Recall0.634700.643530.527970.632350.494890.61699
AUC0.703440.598670.554630.720170.660070.70066
F10.638340.557230.534480.647770.561880.63025
Log loss−0.65133−1.03051−4.95076−0.61468−0.65853−0.63997
BPRAcc0.656740.600010.562390.665990.612060.64699
AP0.723670.633450.553600.725130.652880.71094
Recall0.650510.538870.562500.629660.552410.63325
AUC0.719190.630850.580050.726770.650480.70725
F10.653760.564360.562020.652370.585550.64143
Log loss−0.63726−0.77768−4.58457−0.61157−0.65776−0.63340
BAAcc0.658260.544460.539180.669240.617280.65086
AP0.733310.586110.531510.733410.654090.72270
Recall0.648500.636700.521470.644910.552410.62899
AUC0.726610.588030.548450.733040.652530.71329
F10.654000.572990.530480.659850.588770.64209
Log loss−0.63099−1.05104−4.78246−0.60592−0.65750−0.62920
Experiment 3—ALBERT embedding along with features
AAcc0.483690.473710.482060.473150.472760.50987
AP0.485330.467520.487290.475850.475340.51442
Recall0.481450.380100.481110.436600.416310.52001
AUC0.475150.454490.476450.461890.462190.51184
F10.481910.413200.481320.452000.435590.51475
Log loss−0.83122−0.79217−5.47128−0.71731−0.70086−0.71996
AUAcc0.604830.564070.524330.596810.587570.59513
AP0.643080.591950.522320.621770.600940.63293
Recall0.597370.351240.512610.570460.598160.54770
AUC0.651140.592290.533370.637690.621440.63945
F10.600220.409200.518570.583510.590310.57287
Log loss−0.73088−0.84285−5.60492−0.66523−0.67647−0.67737
APAcc0.549560.538290.531170.514800.544290.55169
AP0.565680.563740.525090.516000.566440.56396
Recall0.558470.536040.528200.490300.617010.57092
AUC0.571340.568900.539840.526050.577340.57387
F10.553450.511010.529770.502090.552440.55695
Log loss−0.76990−0.76927−4.81465−0.69993−0.68179−0.70383
ARAcc0.593280.626410.568340.606570.619630.60903
AP0.642170.684790.563180.641520.668590.65755
Recall0.574840.564640.576520.569340.490070.56452
AUC0.634980.683100.588430.644170.669800.65538
F10.584220.594100.571220.590140.561090.58881
Log loss−0.74441−0.64562−4.60443−0.66054−0.65389−0.66765
AUPAcc0.615820.555610.516650.606290.608920.61520
AP0.660300.590650.515240.628840.612390.65382
Recall0.609140.570910.498710.581220.576740.58033
AUC0.666610.589430.520330.648520.634530.65897
F10.612150.541880.507590.594330.595200.60009
Log loss−0.71584−0.81014−4.92646−0.66074−0.66843−0.66356
AURAcc0.619970.596200.540920.634660.616490.62193
AP0.681660.643510.535830.678950.656620.68058
Recall0.610490.516870.527970.600170.490740.60398
AUC0.674820.645140.554580.680450.660480.67007
F10.615060.552060.534430.620020.558820.61382
Log loss−0.70762−0.69041−4.94896−0.64238−0.65808−0.65507
APRAcc0.627370.613300.562390.621200.615370.62199
AP0.682300.644910.553520.665320.655020.67302
Recall0.617550.518540.562390.579880.540520.59546
AUC0.681660.653550.580000.666350.653190.66887
F10.622600.565210.561980.603520.582740.60987
Log loss−0.70174−0.70164−4.58461−0.64976−0.65773−0.66110
AAAcc0.632080.550570.539130.637850.616940.62417
AP0.693160.608260.531400.687470.653900.68721
Recall0.619680.747160.521580.606340.554760.60073
AUC0.688320.598330.548300.688880.652660.67797
F10.626650.624100.530500.624280.589400.61232
Log loss−0.69416−0.89231−4.78630−0.63770−0.65728−0.65726
Experiment 4—DistilBERT embedding along with features
DAcc0.470290.471970.472360.474770.465240.51486
AP0.470160.464870.478140.468570.464190.52854
Recall0.468550.347130.459920.431780.434240.53582
AUC0.459180.450480.460530.456270.445870.52520
F10.469090.394360.465280.449510.444790.52411
Log loss−0.86615−0.83000−5.87407−0.71713−0.70460−0.71702
DUAcc0.596760.535200.524330.591270.587850.59093
AP0.625220.590790.522380.613050.601000.62071
Recall0.591310.623650.512610.559590.598050.62787
AUC0.636560.586550.533420.628670.621520.62852
F10.593120.521100.518570.575870.590430.60520
Log loss−0.76028−0.86846−5.60488−0.66880−0.67646−0.70422
DPAcc0.536940.535600.531390.512610.550290.54188
AP0.547070.560920.525100.506750.564090.55104
Recall0.532680.595390.528420.476520.719810.57631
AUC0.552370.565470.539820.517800.577300.56027
F10.534770.537330.530000.493620.613970.55495
Log loss−0.80466−0.78940−4.82944−0.70176−0.68105−0.71054
DRAcc0.587790.618170.568280.602760.620920.60382
AP0.637480.681210.563190.638010.669230.65901
Recall0.573380.533230.576520.570350.494100.52582
AUC0.628480.680850.588410.639770.670600.65374
F10.580370.560760.571210.588040.564020.56009
Log loss−0.75852−0.65251−4.60445−0.66259−0.65396−0.67340
DUPAcc0.613630.558250.516590.609200.608530.60932
AP0.653070.601680.515220.632110.612570.64361
Recall0.613520.598020.498710.585820.575620.59075
AUC0.660780.605670.520290.649860.634520.65319
F10.612420.555910.507560.598360.594500.59990
Log loss−0.72987−0.82009−4.92650−0.66033−0.66840−0.67235
DURAcc0.608360.589080.540920.628770.618340.61083
AP0.668590.630120.535830.674270.658250.66723
Recall0.595580.390770.527970.592770.499150.55723
AUC0.661020.628580.554580.676630.662190.65571
F10.602030.461610.534430.613450.565280.58372
Log loss−0.73206−0.80277−4.94896−0.64445−0.65737−0.67706
DPRAcc0.620810.610940.562390.619740.614190.62204
AP0.674910.653030.553520.662460.654850.67071
Recall0.608470.547930.562390.586940.541420.58773
AUC0.672330.659250.580000.664430.653330.66624
F10.615110.581540.561980.605490.581940.60737
Log loss−0.71715−0.67030−4.58461−0.65107−0.65742−0.66416
DAAcc0.625690.567830.539130.635050.617000.62798
AP0.687720.618110.531400.682360.652420.68189
Recall0.618340.620720.521580.605890.555660.62417
AUC0.682750.608630.548300.684670.650920.67610
F10.621870.576350.530500.622610.589870.62618
Log loss−0.70495−0.82448−4.78630−0.63988−0.65817−0.65485
Experiment 5—RoBERTa embedding along with features
RAcc0.479310.477970.481280.479200.463790.51200
AP0.482000.481200.486120.477550.464720.51173
Recall0.476290.471020.459920.428530.435820.52080
AUC0.472770.469420.475810.468140.449230.51279
F10.477460.473550.469600.450510.446180.51605
Log loss−0.83244−2.58975−5.51829−0.70854−0.70808−0.72229
RUAcc0.603040.573660.524390.606290.585490.60382
AP0.640690.606490.522310.637310.598010.63886
Recall0.597480.588540.512610.580210.593670.55634
AUC0.648440.612280.533410.651300.618370.64450
F10.599170.538460.518600.593860.587280.57900
Log loss−0.73830−0.77159−5.60303−0.65943−0.67778−0.67932
RPAcc0.550620.554550.532230.527300.535600.54614
AP0.563430.568630.525420.528800.556890.56287
Recall0.550510.475850.529770.498600.674830.56384
AUC0.572840.573720.540430.543040.560670.56754
F10.550460.500420.531080.513040.587410.54896
Log loss−0.77311−0.75215−4.82701−0.69216−0.68331−0.70985
RRAcc0.598610.627980.568450.609880.616940.60982
AP0.646060.678800.563720.649040.666050.66122
Recall0.583470.643120.576740.579320.495450.58291
AUC0.639540.678810.588960.651590.667180.65405
F10.591060.633310.571390.596480.563000.59829
Log loss−0.73955−0.64498−4.59471−0.65686−0.65495−0.66775
RUPAcc0.625240.568060.516700.616600.608640.62008
AP0.667290.593550.515350.648730.612450.65451
Recall0.618340.507240.498600.595460.576180.59804
AUC0.674670.606130.520480.663790.634680.66343
F10.621580.505490.507560.607120.594790.61010
Log loss−0.70950−0.85802−4.92633−0.65407−0.66836−0.66392
RURAcc0.620420.579880.541040.637960.618510.62552
AP0.679610.638970.535800.686240.656030.68307
Recall0.611050.511340.527970.600840.504080.58885
AUC0.675500.637630.554630.687510.660420.67341
F10.615360.516830.534480.622550.567130.61014
Log loss−0.71209−0.70740−4.95076−0.63859−0.65845−0.65642
RPRAcc0.625740.603370.562390.628940.614080.62182
AP0.685230.638260.553600.674660.653320.67625
Recall0.616880.511040.562500.590190.548260.60376
AUC0.680580.644590.580050.676500.650210.67162
F10.620980.548710.562020.613100.585670.61299
Log loss−0.70495−0.68527−4.58457−0.64550−0.65827−0.65922
RAAcc0.637350.575740.539180.642220.618340.63068
AP0.698970.616130.531510.696790.653140.69176
Recall0.626300.649520.521470.608470.558570.63907
AUC0.695650.614030.548450.698700.652400.68689
F10.632080.593980.530480.628430.592110.63230
Log loss−0.68758−0.80225−4.78246−0.63235−0.65753−0.64827

Appendix C

Figure A3. Correlation plot of YelpNYC features.
Figure A4. Cumulative distribution function of YelpNYC dataset features.
Table A3. Classifier performance on YelpNYC dataset.
Features | Metric | XGBoost | MLP Classifier | KNN | Random Forest | Logistic Regression | Stacking
Experiment 1—Features Only
FUAcc0.8233430.5079030.8748140.8251050.6284940.897397
AP0.9095260.5076890.8806840.8822600.6320960.939389
Recall0.7700960.2185440.8276540.7892640.6237770.888220
AUC0.9042650.5086220.9049420.8855310.6600670.943354
F10.8098020.1692930.8645040.8158490.6249630.895255
Log loss−0.400635−0.690340−2.846283−1.449639−0.659209−0.294573
FPAcc0.6643080.5843160.6182870.6432150.5709640.655741
AP0.7031910.5813520.6098890.6768810.6027390.689264
Recall0.6254030.5952280.6198450.5922460.6203060.623424
AUC0.7169100.6106410.6494700.6865360.6094360.703967
F10.6466500.5816730.6174780.6218780.5889330.640241
Log loss−0.623862−0.677843−4.608506−0.975397−0.680482−0.633884
FRAcc0.6896710.6510370.6325610.6808320.5998510.667819
AP0.7751140.7285500.6334150.7576920.6580440.765502
Recall0.6328860.5881520.5940080.6303650.4843970.580588
AUC0.7667790.7248230.6699600.7514870.6462930.752572
F10.6655410.6117350.6126400.6587930.5433670.621959
Log loss−0.571589−0.624506−4.471583−0.726444−0.657270−0.591367
FUPAcc0.8328590.5837470.8118070.8552390.6545610.896408
AP0.9173040.5806120.8255670.9234620.6810590.950505
Recall0.7693370.2482040.7982920.8272740.5844920.873797
AUC0.9121800.5876280.8700040.9192850.6900790.947489
F10.8176980.3422920.8028610.8462190.6242430.889673
Log loss−0.383452−0.668992−3.036401−0.716218−0.644859−0.321913
FURAcc0.8089330.5327230.8629520.8335370.6290500.887339
AP0.9022080.5334780.8747220.8915340.6813310.936354
Recall0.7599300.1072520.8180830.8020880.6931820.871221
AUC0.8959320.5336170.9012240.8921440.6832900.938758
F10.7957390.1706730.8520240.8251470.6492140.883618
Log loss−0.413841−0.684622−2.822766−1.353577−0.645459−0.328789
FPRAcc0.7379020.6131900.6402870.7219470.6090280.730568
AP0.8397160.6383690.6415300.8197470.6438720.834517
Recall0.6425920.6366820.6439750.6737430.4556050.625946
AUC0.8187480.6713900.6811630.7959340.6278500.809419
F10.7044920.6034310.6395210.7042140.5294200.692354
Log loss−0.512079−0.656739−4.299946−0.728527−0.668499−0.520907
FAAcc0.8261620.5791110.8091370.8597130.6531380.891053
AP0.9137840.5801440.8243290.9281750.6930190.949228
Recall0.7587100.1985630.7925170.8322080.5890470.869730
AUC0.9075990.5798130.8671650.9232340.6928100.945513
F10.8088460.3129720.7992180.8507790.6248420.883170
Log loss−0.392645−0.666210−3.106835−0.716756−0.639341−0.348125
Experiment 2—BERT embedding along with features
BAcc0.4229900.5001360.4230580.4244810.5000140.526108
AP0.4596200.5063260.4738260.4671100.6896880.535742
Recall0.4078080.7992140.4078620.3896980.0000270.586065
AUC0.4239430.5041830.4231120.4177570.6739640.536971
F10.3470040.5332190.3471640.3356370.0000540.492660
Log loss−3.008050−0.708783−19.921263−14.641646−0.693594−1.748505
BUAcc0.4825950.5045280.4465910.5585060.6294840.662342
AP0.7106850.5059750.4839850.6134280.6257510.773045
Recall0.4757760.6207940.4583980.6047990.6336720.809923
AUC0.6787940.5073550.4473700.6118610.6553080.766311
F10.4331480.4357700.4253740.5686920.6317130.710851
Log loss−1.866017−1.027121−16.278267−1.770812−0.662008−0.799907
BPAcc0.4233970.4939410.4307040.4502640.5669510.569351
AP0.5371930.4993410.4768720.4993200.6125920.615051
Recall0.4077810.5748950.4273550.4401520.4622480.758303
AUC0.5193060.4994400.4340660.4936940.6199570.625872
F10.3471580.3911670.3818950.4146420.4697610.631950
Log loss−2.630268−0.792793−18.583398−2.389219−0.690698−0.762318
BRAcc0.4354070.5114140.4238440.5558220.5812660.586634
AP0.5930290.5078930.4738750.5524420.6426400.674425
Recall0.4107900.0354890.4090280.6157250.4677240.720320
AUC0.5543670.5061100.4240930.5652150.6233830.647706
F10.3551900.0581100.3487220.5705810.4874940.633000
Log loss−2.519455−0.726955−19.818545−3.837709−0.675526−0.749183
BUPAcc0.5203610.5097060.4683070.6318960.6503320.633144
AP0.7359650.5109560.4899100.7558880.6751510.812676
Recall0.5320860.1824860.4893860.6897110.5674390.881090
AUC0.7080550.5144310.4708030.7439230.6850300.800779
F10.4959570.1480840.4554170.6509900.6100030.709953
Log loss−1.670196−0.735608−14.617474−0.800951−0.650039−0.965161
BURAcc0.4954050.5305000.4486780.6169450.6278700.665433
AP0.7183060.5272720.4849110.7148050.6845970.794144
Recall0.4965980.0827980.4615700.6796260.6488550.755131
AUC0.6850330.5288670.4500790.7098910.6820870.775330
F10.4527780.1131170.4284040.6359590.6233500.692075
Log loss−1.777515−0.732621−16.141086−1.082843−0.651799−0.774105
BPRAcc0.4577610.5257560.4328180.6245900.6008540.660946
AP0.6677470.5250110.4780110.7259890.6358830.780919
Recall0.4452220.1213770.4302830.6942120.3481360.730216
AUC0.6201160.5275120.4367060.7095310.6199000.763433
F10.4000060.1272590.3872510.6456660.4447380.682300
Log loss−2.138377−0.716985−18.263558−0.961516−0.676973−0.677129
BAAcc0.5401110.5631290.4693510.6761420.6516330.677240
AP0.7436790.5615170.4906440.8263190.6907230.811853
Recall0.5670330.1835710.4913380.7381320.5643220.812119
AUC0.7132720.5648570.4719910.8060440.6906890.802636
F10.5247590.2631760.4571730.6970210.6072610.725856
Log loss−1.582503−0.714157−14.567983−0.685565−0.644227−0.840616
Experiment 3—ALBERT embedding along with features
AAcc0.4644030.4734040.4915140.4666530.4689170.500136
AP0.4702880.4703290.4966420.4693120.4720010.504103
Recall0.4144230.3275860.3808590.4355430.4465770.381591
AUC0.4535170.4537330.4914260.4527600.4575660.501986
F10.4341750.3811540.4242060.4480060.4560890.415525
Log loss−0.726548−0.733304−6.115318−0.731689−0.702506−0.710033
AUAcc0.7938320.5001220.8746370.6742850.6286840.882351
AP0.8920650.5073460.8805170.7142430.6321900.932566
Recall0.7481900.7975600.8276540.6616780.6240750.870462
AUC0.8828170.5083500.9048420.7225440.6601850.933315
F10.7809330.5341690.8643640.6677230.6251930.878646
Log loss−0.432160−0.691376−2.848825−1.235472−0.659607−0.345269
APAcc0.6336720.5986170.6161310.5772130.5748140.637359
AP0.6540970.6025280.6078360.5856580.6046870.663407
Recall0.6042290.6357330.6158060.5646740.5940630.606588
AUC0.6772200.6344020.6467020.6058740.6114490.683147
F10.6189570.6052710.6145530.5705890.5817390.620935
Log loss−0.656376−0.669623−4.658350−0.844455−0.681444−0.647074
ARAcc0.6757900.6528530.6243050.6501830.6050020.661028
AP0.7561450.7342600.6251290.7076190.6606150.746476
Recall0.6295510.5223260.6159410.6102750.4677240.551444
AUC0.7455550.7238040.6612270.7058230.6510080.731871
F10.6557650.5879590.6189850.6321570.5351480.612574
Log loss−0.595190−0.614549−4.915387−0.847924−0.656562−0.602799
AUPAcc0.8106680.6164020.8118070.7137180.6528810.889576
AP0.9015560.6219720.8254850.7818060.6780180.939777
Recall0.7470790.3198050.7983730.6890060.5841130.869974
AUC0.8938030.6212980.8699390.7757270.6876600.940048
F10.7937660.4487910.8028930.7024280.6228780.883347
Log loss−0.413048−0.650256−3.037827−0.614518−0.645798−0.340803
AURAcc0.7891960.5919070.8628300.6976820.6311370.883937
AP0.8880690.6038040.8745940.7489230.6876330.931048
Recall0.7378340.2670460.8182190.6841530.6783520.856852
AUC0.8789950.6029380.9011300.7525020.6848980.933928
F10.7739840.3654960.8519650.6903710.6433970.878501
Log loss−0.438109−0.672227−2.824785−1.100490−0.646827−0.330495
APRAcc0.7213910.6347700.6378610.6848180.6062630.719954
AP0.8233540.6781430.6399620.7566220.6446140.817200
Recall0.6406400.6197640.6457100.6738240.4674530.605341
AUC0.8011610.6987370.6792190.7437850.6297610.791903
F10.6915200.6180310.6385330.6784500.5328950.677537
Log loss−0.535193−0.641814−4.296107−0.696352−0.667802−0.540149
AAAcc0.8092180.6369390.8091230.7386340.6530030.875803
AP0.9016540.6546580.8242810.8267160.6913650.939394
Recall0.7405450.3964210.7925990.7112920.5822690.863278
AUC0.8926820.6522790.8671240.8112940.6916270.939105
F10.7906480.5174790.7992400.7271680.6202300.871564
Log loss−0.415887−0.639657−3.107363−0.578119−0.640208−0.345767
Experiment 4—DistilBERT embedding along with features
DAcc0.4639960.4724550.4806150.4669920.4689300.515969
AP0.4692050.4690840.4897270.4690940.4700850.522134
Recall0.4211740.3251460.4797610.4361940.3898060.459347
AUC0.4526120.4535320.4790490.4529490.4547400.522263
F10.4379700.3756720.4780310.4485410.4204650.464800
Log loss−0.726284−0.716010−6.663249−0.730338−0.702663−0.703843
DUAcc0.7962990.5387150.8745700.6763450.6288060.889250
AP0.8932670.5551630.8803860.7142900.6323320.929447
Recall0.7522300.3927070.8276260.6628440.6246440.885021
AUC0.8848580.5580870.9047010.7231110.6603530.934898
F10.7835730.3487110.8642980.6694790.6255030.887415
Log loss−0.430092−0.702534−2.852570−1.250693−0.659422−0.316440
DPAcc0.6349060.5915280.6168900.5790020.5739190.631720
AP0.6571810.6072420.6064290.5854610.6038190.657528
Recall0.6039850.5805610.6186800.5677650.6048530.598699
AUC0.6786730.6286070.6453580.6051220.6105710.674496
F10.6192770.5815670.6161540.5730460.5843840.615222
Log loss−0.654246−0.670598−4.704216−0.843267−0.681525−0.653589
DRAcc0.6757350.6452080.6251730.6504000.5941170.661543
AP0.7573060.7300090.6237760.7072170.6476530.748202
Recall0.6247800.5305950.6186800.6108450.4638470.574678
AUC0.7459820.7186610.6602440.7062040.6316520.732337
F10.6540330.5814080.6204210.6326260.5207090.622468
Log loss−0.593284−0.623136−5.079042−0.850721−0.662627−0.600561
DUPAcc0.8121050.6188020.8117660.7126750.6531380.886553
AP0.9025610.6225180.8254800.7810930.6787660.937333
Recall0.7477020.3206180.7983730.6869190.5854950.867751
AUC0.8951660.6245130.8699170.7755660.6877590.939136
F10.7951270.4550170.8028600.7011790.6236200.880667
Log loss−0.410696−0.647028−3.038303−0.608179−0.645595−0.338815
DURAcc0.7913790.5933310.8627760.6983060.6306900.880019
AP0.8888750.5955710.8745430.7491420.6883910.932360
Recall0.7476750.2676430.8181650.6852920.6859700.869974
AUC0.8797690.5954520.9010620.7526700.6844730.934043
F10.7785860.3490840.8519000.6910910.6467940.876226
Log loss−0.437108−0.663626−2.826221−1.090813−0.646802−0.355753
DPRAcc0.7228820.6361800.6379690.6848180.6068590.718666
AP0.8251830.6874030.6380210.7571650.6430640.817608
Recall0.6397180.5987800.6464690.6728750.4689710.610411
AUC0.8017250.7024610.6776680.7437840.6265950.791503
F10.6924020.6022460.6389820.6782060.5324550.678204
Log loss−0.533660−0.647868−4.344742−0.697351−0.667907−0.539344
DAAcc0.8106410.5801680.8090820.7376850.6535850.880060
AP0.9025270.5897330.8242510.8262510.6923410.937490
Recall0.7427140.3803710.7925720.7117260.5910260.855849
AUC0.8944220.5898940.8670890.8105000.6920900.937472
F10.7919630.4008490.7991960.7266310.6260520.872295
Log loss−0.413589−0.746263−3.108298−0.575551−0.639337−0.327304
Experiment 5—RoBERTa embedding along with features
RAcc0.4639420.4659620.4996340.4658260.4628850.500732
AP0.4695240.4703310.4948310.4681110.4672350.508341
Recall0.4212820.4160230.5969090.4350550.4152090.428955
AUC0.4526130.4529550.4911820.4516330.4495270.501905
F10.4380780.4319940.5379760.4473770.4344270.440534
Log loss−0.724444−0.760509−4.928406−0.730004−0.704515−0.706719
RUAcc0.7884100.5038500.8746640.6758440.6288060.888193
AP0.8877360.5050970.8804700.7163130.6322480.936153
Recall0.7412230.2093260.8276540.6613260.6248880.873200
AUC0.8780490.5060390.9048030.7243530.6603030.935274
F10.7746010.1517500.8643780.6686240.6255870.884040
Log loss−0.439150−0.692630−2.848513−1.254115−0.659066−0.344099
RPAcc0.6323440.5893050.6163750.5787990.5739460.638593
AP0.6486650.6064120.6064490.5869130.6038760.659591
Recall0.5989430.6023590.6181650.5670600.5972890.594442
AUC0.6744030.6293840.6455740.6068890.6104510.679281
F10.6157210.5892220.6156710.5725480.5829320.618436
Log loss−0.657602−0.670727−4.674726−0.839443−0.680725−0.649592
RRAcc0.6744480.6529350.6241150.6504950.5946590.663278
AP0.7568930.7334840.6254430.7089230.6509110.750643
Recall0.6277620.5772540.6159690.6097330.4320730.587583
AUC0.7457820.7259410.6608090.7069020.6371550.735540
F10.6542730.6090370.6187410.6321970.5085340.628815
Log loss−0.593553−0.612419−5.006094−0.842396−0.664129−0.598193
RUPAcc0.8072520.5945910.8117930.7155750.6523380.885916
AP0.8984930.5938980.8254900.7857780.6777320.936537
Recall0.7402200.2459810.7984280.6899010.5837870.866260
AUC0.8906830.5973860.8699310.7786460.6873410.938387
F10.7889540.3645080.8029040.7041770.6224280.879668
Log loss−0.419379−0.663754−3.037858−0.612832−0.646393−0.358529
RURAcc0.7837330.6318690.8628710.6997560.6305140.879314
AP0.8837100.6349170.8745150.7505840.6820350.933317
Recall0.7390000.4239930.8182730.6866480.6852920.876318
AUC0.8739310.6424000.9010480.7543220.6819000.932056
F10.7702880.5246230.8520110.6925070.6473810.876698
Log loss−0.445633−0.643519−2.827084−1.082337−0.645318−0.352196
RPRAcc0.7223260.6356380.6383080.6847910.6128510.718354
AP0.8222300.6683180.6384860.7577390.6427650.815394
Recall0.6384710.6114950.6471190.6739870.4389590.603497
AUC0.8003220.6871400.6782400.7445490.6277780.790138
F10.6914850.6183530.6393770.6783810.5203050.675061
Log loss−0.53561−0.648846−4.320839−0.696305−0.668128−0.543369
RAAcc0.8041890.6263660.8090960.7415210.6540190.869093
AP0.8986310.6346930.8242450.8283050.6926560.932614
Recall0.7335500.3568120.7926260.7116990.5875020.856988
AUC0.8898930.6346620.8670770.8126560.6925370.931487
F10.7846980.4852240.7992290.7293790.6250150.862228
Log loss−0.420916−0.652366−3.108753−0.573428−0.639423−0.373861

References

  1. Kim, S.; Kandampully, J.; Bilgihan, A. The Influence of EWOM Communications: An Application of Online Social Network Framework. Comput. Human Behav. 2018, 80, 243–254. [Google Scholar] [CrossRef]
  2. Rudolph, S. The Impact of Online Reviews on Customers’ Buying Decisions [Infographic]. Available online: http://www.business2community.com/infographics/impact-online-reviews-customers-buying-decisions-infographic-01280945#oaFtOjCMhi5CD7de.97 (accessed on 27 March 2020).
  3. Mukherjee, A.; Venkataraman, V. Opinion Spam Detection: An Unsupervised Approach Using Generative Models. 2014. Available online: https://www2.cs.uh.edu/~arjun/tr/UH_TR_2014_07.pdf (accessed on 20 June 2020).
  4. He, S.; Hollenbeck, B.; Proserpio, D. The Market for Fake Reviews. Mark. Sci. 2022, 41, 896–921. [Google Scholar] [CrossRef]
  5. Christopher, S.L.; Rahulnath, H.A. Review Authenticity Verification Using Supervised Learning and Reviewer Personality Traits. In Proceedings of the 2016 International Conference on Emerging Technological Trends (ICETT), Kollam, India, 21–22 October 2016. [Google Scholar] [CrossRef]
  6. Phil. Trip Advisor Changes Its Slogan | TripAdvisorWatch: Hotel Reviews in Focus. Available online: https://tripadvisorwatch.wordpress.com/2010/01/19/trip-advisor-changes-its-slogan/ (accessed on 6 December 2021).
  7. Witts, S. TripAdvisor Blocked More than One Million Fake Reviews in 2022—The Caterer. Available online: https://www.thecaterer.com/news/tripadvisor-block-fake-reviews-2022-hospitality (accessed on 16 May 2023).
  8. Butler, O. I Made My Shed the Top-Rated Restaurant on TripAdvisor. Available online: https://www.vice.com/en/article/434gqw/i-made-my-shed-the-top-rated-restaurant-on-tripadvisor (accessed on 12 December 2021).
  9. Marciano, J. Fake Online Reviews Cost $152 Billion a Year. Here’s How e-Commerce Sites Can Stop Them|World Economic Forum. Available online: https://www.weforum.org/agenda/2021/08/fake-online-reviews-are-a-152-billion-problem-heres-how-to-silence-them/ (accessed on 24 March 2023).
  10. Govindankutty, S.; Gopalan, S.P. From Fake Reviews to Fake News: A Novel Pandemic Model of Misinformation in Digital Networks. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 1069–1085. [Google Scholar] [CrossRef]
  11. Online Product and Service Reviews|ACCC. Available online: https://www.accc.gov.au/business/advertising-and-promotions/online-product-and-service-reviews (accessed on 24 March 2023).
  12. Press Information Bureau (PIB). Available online: https://pib.gov.in/PressReleasePage.aspx?PRID=1877733 (accessed on 24 March 2023).
  13. EUR-Lex-32019L2161-EN-EUR-Lex. Available online: https://eur-lex.europa.eu/eli/dir/2019/2161/oj (accessed on 24 March 2023).
  14. Crawford, M.; Khoshgoftaar, T.M.; Prusa, J.D.; Richter, A.N.; Al Najada, H. Survey of Review Spam Detection Using Machine Learning Techniques. J. Big Data 2015, 2, 23. [Google Scholar] [CrossRef]
  15. Vidanagama, D.U.; Silva, T.P.; Karunananda, A.S. Deceptive Consumer Review Detection: A Survey. Artif. Intell. Rev. 2020, 53, 1323–1352. [Google Scholar] [CrossRef]
  16. Mayzlin, D.; Dover, Y.; Chevalier, J. Promotional Reviews: An Empirical Investigation of Online Review Manipulation. Am. Econ. Rev. 2014, 104, 2421–2455. [Google Scholar] [CrossRef]
  17. Moon, S.; Kim, M.Y.; Bergey, P.K. Estimating Deception in Consumer Reviews Based on Extreme Terms: Comparison Analysis of Open vs. Closed Hotel Reservation Platforms. J. Bus. Res. 2019, 102, 83–96. [Google Scholar] [CrossRef]
  18. Barbado, R.; Araque, O.; Iglesias, C.A. A Framework for Fake Review Detection in Online Consumer Electronics Retailers. Inf. Process. Manag. 2019, 56, 1234–1244. [Google Scholar] [CrossRef]
  19. Jindal, N.; Liu, B. Review Spam Detection. In Proceedings of the 16th International World Wide Web Conference, WWW2007, Banff, AB, Canada, 8–12 May 2007; ACM Press: New York, NY, USA, 2007; pp. 1189–1190. [Google Scholar]
  20. Ziora, L. Machine Learning Solutions in the Management of a Contemporary Business Organisation. J. Decis. Syst. 2020, 29, 344–351. [Google Scholar] [CrossRef]
  21. Fontanarava, J.; Pasi, G.; Viviani, M. Feature Analysis for Fake Review Detection through Supervised Classification. In Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; pp. 658–666. [Google Scholar] [CrossRef]
  22. Kumar, N.; Venugopal, D.; Qiu, L.; Kumar, S. Detecting Review Manipulation on Online Platforms with Hierarchical Supervised Learning. J. Manag. Inf. Syst. 2018, 35, 350–380. [Google Scholar] [CrossRef]
  23. Wolpert, D.H. Stacked Generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
  24. Van der Laan, M.J.; Polley, E.C.; Hubbard, A.E. Super Learner. Stat. Appl. Genet. Mol. Biol. 2007, 6, 25. [Google Scholar] [CrossRef] [PubMed]
  25. Patel, N.A.; Patel, R. A Survey on Fake Review Detection Using Machine Learning Techniques. In Proceedings of the 2018 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 14–15 December 2018; pp. 1–6. [Google Scholar] [CrossRef]
  26. Rayana, S.; Akoglu, L. Collective Opinion Spam Detection: Bridging Review Networks and Metadata. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 985–994. [Google Scholar] [CrossRef]
  27. Malbon, J. Taking Fake Online Consumer Reviews Seriously. J. Consum. Policy 2013, 36, 139–157. [Google Scholar] [CrossRef]
  28. Zinko, R.; Patrick, A.; Furner, C.P.; Gaines, S.; Kim, M.D.; Negri, M.; Orellana, E.; Torres, S.; Villarreal, C. Responding to Negative Electronic Word of Mouth to Improve Purchase Intention. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 109. [Google Scholar] [CrossRef]
  29. Luca, M.; Zervas, G. Fake It till You Make It: Reputation, Competition, and Yelp Review Fraud. Manage. Sci. 2016, 62, 3412–3427. [Google Scholar] [CrossRef]
  30. Lappas, T.; Sabnis, G.; Valkanas, G. The Impact of Fake Reviews on Online Visibility: A Vulnerability Assessment of the Hotel Industry. Inf. Syst. Res. 2016, 27, 940–961. [Google Scholar] [CrossRef]
  31. Ismagilova, E.; Slade, E.; Rana, N.P.; Dwivedi, Y.K. The Effect of Characteristics of Source Credibility on Consumer Behaviour: A Meta-Analysis. J. Retail. Consum. Serv. 2020, 53, 101736. [Google Scholar] [CrossRef]
  32. Hunt, K.M. Gaming the System: Fake Online Reviews v. Consumer Law. Comput. Law Secur. Rev. 2015, 31, 3–25. [Google Scholar] [CrossRef]
  33. Lau, R.Y.K.; Liao, S.Y.; Chi-Wai Kwok, R.; Xu, K.; Xia, Y.; Li, Y. Text Mining and Probabilistic Language Modeling for Online Review Spam Detection. ACM Trans. Manag. Inf. Syst. 2011, 2, 1–30. [Google Scholar] [CrossRef]
  34. Yelp. Yelp Trust & Safety Report 2021. Available online: https://trust.yelp.com/wp-content/uploads/2022/02/Yelp-Trust-and-Safety-Report-2021.pdf (accessed on 3 January 2023).
  35. Zhang, D.; Zhou, L.; Kehoe, J.L.; Kilic, I.Y. What Online Reviewer Behaviors Really Matter? Effects of Verbal and Nonverbal Behaviors on Detection of Fake Online Reviews. J. Manag. Inf. Syst. 2016, 33, 456–481. [Google Scholar] [CrossRef]
  36. Yoo, K.-H.; Gretzel, U. Comparison of Deceptive and Truthful Travel Reviews. In Information and Communication Technologies in Tourism 2009; Springer: Vienna, Austria, 2009; pp. 37–47. [Google Scholar]
  37. Lai, C.L.; Xu, K.Q.; Lau, R.Y.K.; Li, Y.; Song, D. High-Order Concept Associations Mining and Inferential Language Modeling for Online Review Spam Detection. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, Sydney, NSW, Australia, 13 December 2010; IEEE: Piscataway, NJ, USA; pp. 1120–1127. [Google Scholar]
  38. Jindal, N.; Liu, B.; Lim, E.-P. Finding Unusual Review Patterns Using Unexpected Rules. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM ’10), Toronto, ON, Canada, 26–30 October 2010; ACM Press: New York, NY, USA, 2010; p. 1549. [Google Scholar]
  39. Ott, M.; Choi, Y.; Cardie, C.; Hancock, J.T. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19-24 June 2011; Volume 1, pp. 309–319. [Google Scholar]
  40. Mukherjee, A.; Liu, B.; Glance, N. Spotting Fake Reviewer Groups in Consumer Reviews. In Proceedings of the WWW ′12—21st Annual Conference on World Wide Web Companion, Lyon, France, 16–20 April 2012; pp. 191–200. [Google Scholar] [CrossRef]
  41. Feng, S.; Banerjee, R.; Choi, Y. Syntactic Stylometry for Deception Detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Republic of Korea, 8–14 July 2012; pp. 171–175. [Google Scholar]
  42. Mukherjee, A.; Kumar, A.; Liu, B.; Wang, J.; Hsu, M.; Castellanos, M.; Ghosh, R. Spotting Opinion Spammers Using Behavioral Footprints. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Chicago, IL, USA, 11–14 August 2013; Part F1288. pp. 632–640. [Google Scholar] [CrossRef]
  43. Lu, Y.; Zhang, L.; Xiao, Y.; Li, Y. Simultaneously Detecting Fake Reviews and Review Spammers Using Factor Graph Model. In Proceedings of the 5th Annual ACM Web Science Conference, WebSci ′13, Paris, France, 2–4 May 2013; pp. 225–233. [Google Scholar] [CrossRef]
  44. Anderson, E.T.; Simester, D.I. Reviews without a Purchase: Low Ratings, Loyal Customers, and Deception. J. Mark. Res. 2014, 51, 249–269. [Google Scholar] [CrossRef]
  45. Banerjee, S.; Chua, A.Y.K. A Linguistic Framework to Distinguish between Genuine and Deceptive Online Reviews. Lect. Notes Eng. Comput. Sci. 2014, 2209, 501–506. [Google Scholar]
  46. Banerjee, S.; Chua, A.Y.K.; Kim, J.J. Using Supervised Learning to Classify Authentic and Fake Online Reviews. In Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM ′15, Bali, Indonesia, 8–10 January 2015. [Google Scholar] [CrossRef]
  47. Li, Y.; Feng, X.; Zhang, S. Detecting Fake Reviews Utilizing Semantic and Emotion Model. In Proceedings of the 2016 3rd International Conference on Information Science and Control Engineering, ICISCE 2016, Beijing, China, 8–10 July 2016; pp. 317–320. [Google Scholar] [CrossRef]
  48. Sun, C.; Du, Q.; Tian, G. Exploiting Product Related Review Features for Fake Review Detection. Math. Probl. Eng. 2016, 2016, 4935792. [Google Scholar] [CrossRef]
  49. Shehnepoor, S.; Salehi, M.; Farahbakhsh, R.; Crespi, N. NetSpam: A Network-Based Spam Detection Framework for Reviews in Online Social Media. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1585–1595. [Google Scholar] [CrossRef]
  50. Ren, Y.; Ji, D. Neural Networks for Deceptive Opinion Spam Detection: An Empirical Study. Inf. Sci. 2017, 385–386, 213–224. [Google Scholar] [CrossRef]
  51. Zhuang, M.; Cui, G.; Peng, L. Manufactured Opinions: The Effect of Manipulating Online Product Reviews. J. Bus. Res. 2018, 87, 24–35. [Google Scholar] [CrossRef]
  52. Nakayama, M.; Wan, Y. Exploratory Study on Anchoring: Fake Vote Counts in Consumer Reviews Affect Judgments of Information Quality. J. Theor. Appl. Electron. Commer. Res. 2017, 12, 1–20. [Google Scholar] [CrossRef]
  53. Jain, N.; Kumar, A.; Singh, S.; Singh, C.; Tripathi, S. Deceptive Reviews Detection Using Deep Learning Techniques. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 11608 LNCS; Springer: Heidelberg, Germany, 2019; pp. 79–91. [Google Scholar] [CrossRef]
  54. Plotkina, D.; Munzel, A.; Pallud, J. Illusions of Truth—Experimental Insights into Human and Algorithmic Detections of Fake Online Reviews. J. Bus. Res. 2020, 109, 511–523. [Google Scholar] [CrossRef]
  55. Hajek, P.; Barushka, A.; Munk, M. Fake Consumer Review Detection Using Deep Neural Networks Integrating Word Embeddings and Emotion Mining. Neural Comput. Appl. 2020, 32, 17259–17274. [Google Scholar] [CrossRef]
  56. Li, L.; Lee, K.Y.; Lee, M.; Yang, S.B. Unveiling the Cloak of Deviance: Linguistic Cues for Psychological Processes in Fake Online Reviews. Int. J. Hosp. Manag. 2020, 87, 102468. [Google Scholar] [CrossRef]
  57. Mohawesh, R.; Tran, S.; Ollington, R.; Xu, S. Analysis of Concept Drift in Fake Reviews Detection. Expert Syst. Appl. 2021, 169, 114318. [Google Scholar] [CrossRef]
  58. Shan, G.; Zhou, L.; Zhang, D. From Conflicts and Confusion to Doubts: Examining Review Inconsistency for Fake Review Detection. Decis. Support Syst. 2021, 144, 113513. [Google Scholar] [CrossRef]
  59. Wang, E.Y.; Fong, L.H.N.; Law, R. Detecting Fake Hospitality Reviews through the Interplay of Emotional Cues, Cognitive Cues and Review Valence. Int. J. Contemp. Hosp. Manag. 2022, 34, 184–200. [Google Scholar]
  60. Hajek, P.; Sahut, J.M. Mining Behavioural and Sentiment-Dependent Linguistic Patterns from Restaurant Reviews for Fake Review Detection. Technol. Forecast. Soc. Chang. 2022, 177, 121532. [Google Scholar] [CrossRef]
  61. Kumar, A.; Gopal, R.D.; Shankar, R.; Tan, K.H. Fraudulent Review Detection Model Focusing on Emotional Expressions and Explicit Aspects: Investigating the Potential of Feature Engineering. Decis. Support Syst. 2022, 155, 113728. [Google Scholar] [CrossRef]
  62. Carlens, H. State of Competitive Machine Learning in 2022. 2023. Available online: https://mlcontests.com/state-of-competitive-machine-learning-2022/ (accessed on 5 January 2023).
  63. Weise, K. A Lie Detector Test for Online Reviewers. Bloomberg. Available online: https://www.bloomberg.com/news/articles/2011-09-29/a-lie-detector-test-for-online-reviewers?leadSource=uverifywall#xj4y7vzkg (accessed on 7 November 2023).
  64. Li, J.; Ott, M.; Cardie, C.; Hovy, E. Towards a General Rule for Identifying Deceptive Opinion Spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 23–25 June 2014; Volume 1, pp. 1566–1576. [Google Scholar] [CrossRef]
  65. Aghakhani, H.; Machiry, A.; Nilizadeh, S.; Kruegel, C.; Vigna, G. Detecting Deceptive Reviews Using Generative Adversarial Networks. In Proceedings of the 2018 IEEE Symposium on Security and Privacy Workshops, San Francisco, CA, USA, 24 May 2018; pp. 89–95. [Google Scholar] [CrossRef]
  66. Yuan, C.; Zhou, W.; Ma, Q.; Lv, S.; Han, J.; Hu, S. Learning Review Representations from User and Product Level Information for Spam Detection. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 1444–1449. [Google Scholar]
  67. Ott, M.; Cardie, C.; Hancock, J. Estimating the Prevalence of Deception in Online Review Communities. In Proceedings of the 21st International Conference on World Wide Web, WWW ′12, Lyon France, 16–20 April 2012; pp. 201–210. [Google Scholar] [CrossRef]
  68. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-Learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. 2017, 18, 1–5. [Google Scholar]
  69. Desikan, B.S. Natural Language Processing and Computational Linguistics: A Practical Guide to Text Analysis with Python, Gensim, SpaCy, and Keras; Packt Publishing Ltd.: Birmingham, UK, 2018; ISBN 9781788837033. [Google Scholar]
  70. Jindal, N.; Liu, B. Opinion Spam and Analysis. In Proceedings of the International Conference on Web Search and Web Data Mining—WSDM ′08, Palo Alto, CA, USA, 11–12 February 2008; ACM Press: New York, NY, USA, 2008; p. 219. [Google Scholar]
  71. Li, F.; Huang, M.; Yang, Y.; Zhu, X. Learning to Identify Review Spam. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 2488–2493. [Google Scholar] [CrossRef]
  72. Hutto, C.J.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, MI, USA, 1–4 June 2014. [Google Scholar]
  73. McCarthy, P.M. An Assessment of the Range and Usefulness of Lexical Diversity Measures and the Potential of the Measure of Textual, Lexical Diversity (MTLD). Ph.D. Thesis, The University of Memphis, Memphis, TN, USA, 2005. [Google Scholar]
  74. Dewang, R.K.; Singh, A.K. Identification of Fake Reviews Using New Set of Lexical and Syntactic Features. In Proceedings of the Sixth International Conference on Computer and Communication Technology 2015—ICCCT ′15, Allahabad India, 25–27 September 2015; pp. 115–119. [Google Scholar] [CrossRef]
  75. Mohammad, S.M.; Turney, P.D. Crowdsourcing a Word-Emotion Association Lexicon. Comput. Intell. 2013, 29, 436–465. [Google Scholar] [CrossRef]
  76. Li, J.; Cardie, C.; Li, S. TopicSpam: A Topic-Model-Based Approach for Spam Detection. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013; Volume 2, pp. 217–221. [Google Scholar]
  77. Džeroski, S.; Ženko, B. Is Combining Classifiers with Stacking Better than Selecting the Best One? Mach. Learn. 2004, 54, 255–273. [Google Scholar] [CrossRef]
  78. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar]
  79. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features through Propagating Activation Differences. In Proceedings of the 34th International Conference on Machine Learning—Volume 70, Sydney, Australia, 6–11 August 2017; pp. 3145–3153. [Google Scholar]
  80. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777. [Google Scholar]
  81. Yilmaz, C.M.; Durahim, A.O. SPR2EP: A Semi-Supervised Spam Review Detection Framework. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28–31 August 2018; IEEE: Piscataway, NJ, USA; pp. 306–313. [Google Scholar]
  82. Mohawesh, R.; Xu, S.; Springer, M.; Jararweh, Y.; Al-Hawawreh, M.; Maqsood, S. An Explainable Ensemble of Multi-View Deep Learning Model for Fake Review Detection. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101644. [Google Scholar] [CrossRef]
  83. Kennedy, S.; Walsh, N.; Sloka, K.; McCarren, A.; Foster, J. Fact or Factitious? Contextualized Opinion Spam Detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, 28 July–2 August 2019; Alva-Manchego, F., Choi, E., Khashabi, D., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 344–350. [Google Scholar]
  84. Farrelly, C.M. Deep vs. Diverse Architectures for Classification Problems. arXiv 2017, arXiv:1708.06347. [Google Scholar]
  85. Lundberg, S. Interpretable Machine Learning with XGBoost|by Scott Lundberg|Towards Data Science. Available online: https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 (accessed on 12 December 2021).
  86. Chowdhury, T.; Oredo, J. AI Ethical Biases: Normative and Information Systems Development Conceptual Framework. J. Decis. Syst. 2022, 32, 617–633. [Google Scholar] [CrossRef]
Figure 1. Proposed framework for fake review detection.
Figure 2. Correlation among the features of the YelpZip dataset.
Figure 3. Cumulative distribution plot of YelpZip dataset features.
Figure 4. Graph showing the model’s performance on the YelpZip dataset.
Figure 5. Graph showing the model’s performance on the YelpChi dataset.
Figure 6. Graph showing the model’s performance on the YelpNYC dataset.
Figure 7. Feature importance plot for all three datasets.
Figure 8. Force plot of a fake record in the YelpChi dataset.
Figure 9. Force plot of a fake record in the YelpNYC dataset.
Figure 10. Force plot of a fake record in the YelpZip dataset.
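For readers who wish to reproduce explanation plots of the kind shown in Figures 7–10, the minimal Python sketch below illustrates how SHAP summary and force plots can be generated with the shap library; the classifier and synthetic data are illustrative stand-ins, not the stacked model or feature set used in this study.

```python
# Minimal sketch: producing SHAP feature-importance and force plots.
# The model and data here are synthetic placeholders, not the study's artefacts.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # 500 toy reviews, 5 toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic fake/truthful labels

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global feature importance, analogous to Figure 7
shap.summary_plot(shap_values, X, plot_type="bar", show=False)

# Local explanation of a single record, analogous to the force plots in Figures 8-10
shap.force_plot(explainer.expected_value, shap_values[0], X[0],
                matplotlib=True, show=False)
```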
Table 3. Description of the features extracted from the dataset.

| Type | Features | Taken From | Description |
|---|---|---|---|
| User-Centric Features | avg_Urating | [70] | average rating provided by a user |
| | UCcounts | [18] | number of comments provided by each user |
| | day_Urating | [42] | number of ratings provided by the user on a single day |
| | Uavg#word | | average number of words in a comment of a user |
| | Var_Urating | [71] | variance of a user's ratings |
| | Uvar#word | [new] | variance of the number of words in a user's reviews |
| Product-Centric Features | avg_Prating | [new] | average rating received by a product |
| | PCcounts | [new] | number of comments received by the product |
| | day_Prating | [new] | number of comments received by a product on a single day |
| | Pavg#word | [new] | average number of words in a review received by the product |
| | Var_Prating | [new] | variance of ratings received by a product |
| | Pvar#word | [new] | variance of the number of words in reviews of a product |
| Review-Centric Features | #ofCharacter | [71] | number of characters in a review |
| | #ofword | [70] | number of words in a review |
| | count_punct | [new] | number of punctuation marks in a review |
| | Uppercase | [new] | number of uppercase characters in a review |
| | Lowercase | [new] | number of lowercase characters in a review |
| | subjectivity | [71] | a number representing the proportion of subjective words as opposed to objective words |
| | Exclaim | [new] | presence of an exclamation mark (!) in the review |
| | positiveSent | [72] [new] | positive sentiment score between 0 and 10 |
| | negativeSent | | negative sentiment score between 0 and 10 |
| | ld | [73] [new] | lexical diversity: ratio of unique words to all words in a review |
| | le_d | [74] | lexical density: ratio of opinion words to all words in a review |
| | anger | [75] [new] | anger emotion on a scale of 0 to 1 |
| | anticipation | | anticipation emotion on a scale of 0 to 1 |
| | trust | | trust emotion on a scale of 0 to 1 |
| | fear | | fear emotion on a scale of 0 to 1 |
| | sadness | | sadness emotion on a scale of 0 to 1 |
| | joy | | joy emotion on a scale of 0 to 1 |
| | surprise | | surprise emotion on a scale of 0 to 1 |
| | disgust | | disgust emotion on a scale of 0 to 1 |
| | Entropy | [26] | rating entropy |
| | singleton | | whether a review is the only review posted by the user |
| | date_entropy | | time gap between each consecutive review |
| | similarity | [42] | maximum content similarity |
| | Ext | | rating extremity |
| | ratio_LCAPS | [76] | ratio of uppercase to lowercase characters |
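As a rough illustration of how several of the features in Table 3 can be derived from raw review data, the short pandas sketch below computes a handful of user-, product-, and review-centric attributes; the DataFrame layout and column names (user_id, prod_id, rating, text) are assumptions made for the example, not the exact schema used in this work.

```python
# Illustrative derivation of a few Table 3 features; column names are assumed.
import string
import pandas as pd

reviews = pd.DataFrame({
    "user_id": [1, 1, 2],
    "prod_id": [10, 11, 10],
    "rating":  [5, 4, 1],
    "text":    ["Great stay!", "Nice rooms, friendly STAFF", "terrible"],
})

# User-centric features
reviews["avg_Urating"] = reviews.groupby("user_id")["rating"].transform("mean")
reviews["UCcounts"]    = reviews.groupby("user_id")["rating"].transform("count")
reviews["Var_Urating"] = reviews.groupby("user_id")["rating"].transform("var").fillna(0)

# Product-centric analogue
reviews["avg_Prating"] = reviews.groupby("prod_id")["rating"].transform("mean")

# Review-centric features
reviews["#ofword"]     = reviews["text"].str.split().str.len()
reviews["count_punct"] = reviews["text"].apply(lambda t: sum(c in string.punctuation for c in t))
reviews["Uppercase"]   = reviews["text"].apply(lambda t: sum(c.isupper() for c in t))
reviews["ld"]          = reviews["text"].apply(
    lambda t: len(set(t.lower().split())) / max(len(t.split()), 1))  # lexical diversity
```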
Table 4. A binary confusion matrix.

| Actual Class | Predicted: Truthful [1] | Predicted: Fake [0] |
|---|---|---|
| Truthful [1] | True Positive (TP) | False Negative (FN) |
| Fake [0] | False Positive (FP) | True Negative (TN) |
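For clarity, the brief sketch below shows how the cells of Table 4 relate to the evaluation metrics reported in Table 5, using the same label convention (1 = truthful, 0 = fake); the toy labels are purely illustrative.

```python
# Mapping the confusion-matrix cells of Table 4 to standard metrics (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# labels=[1, 0] orders rows/columns as in Table 4: truthful first, fake second,
# so ravel() yields TP, FN, FP, TN.
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```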
Table 5. Comparative performance evaluation.

| Approach | Dataset | Accuracy | Average Precision | Recall | AUC | F1 |
|---|---|---|---|---|---|---|
| SpEagle [26] | YelpChi | - | - | - | 0.7887 | - |
| | YelpNYC | - | - | - | 0.7695 | - |
| | YelpZip | - | - | - | 0.7942 | - |
| [21] | YelpChi | - | - | - | - | - |
| | YelpNYC | - | - | - | - | - |
| | YelpZip | 0.806 | - | 0.861 | - | 0.816 |
| [22] | YelpChi | - | - | - | - | - |
| | YelpNYC | - | - | - | - | - |
| | YelpZip | - | - | - | 0.7686 | 0.7901 |
| SPR2EP [81] | YelpChi | - | 0.3351 | - | 0.8071 | - |
| | YelpNYC | - | 0.3202 | - | 0.8129 | - |
| | YelpZip | - | 0.422 | - | 0.8318 | - |
| HFAN [66] | YelpChi | - | 0.4887 | - | 0.8324 | - |
| | YelpNYC | - | 0.5382 | - | 0.8478 | - |
| | YelpZip | - | 0.6235 | - | 0.8728 | - |
| [60] | YelpChi | - | - | - | - | - |
| | YelpNYC | - | - | - | - | - |
| | YelpZip | - | - | - | 0.916 | 0.830 |
| [82] | YelpChi | - | - | - | - | - |
| | YelpNYC | - | 0.9693 | 0.8416 | - | 0.9009 |
| | YelpZip | - | 0.8318 | 0.7786 | - | 0.8043 |
| [83] | YelpChi | - | - | - | - | - |
| | YelpNYC | - | - | - | - | - |
| | YelpZip | 0.731 | - | - | - | - |
| Our Model | YelpChi | 0.8006 | 0.8804 | 0.773 | 0.8756 | 0.7916 |
| | YelpNYC | 0.8973 | 0.9393 | 0.8882 | 0.9433 | 0.8952 |
| | YelpZip | 0.8389 ¹ | 0.9293 | 0.7922 | 0.9246 | 0.8812 |
¹ Figures in bold are the cases where our model outperformed the existing state of the art.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
