Article

Efficient Detection of Irrelevant User Reviews Using Machine Learning

1 School of EECS, Korea Aerospace University, Hanggongdaehak-ro 76-10, Deogyang-gu, Goyang-si 10540, Republic of Korea
2 Division of Computer Science and Engineering, Sahmyook University, Hwarang-ro 815, Nowon-gu, Seoul 01795, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 6900; https://doi.org/10.3390/app14166900
Submission received: 15 June 2024 / Revised: 29 July 2024 / Accepted: 2 August 2024 / Published: 7 August 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

User reviews such as SNS feeds and blog writings have been widely used to extract opinions, complaints, and requirements about a given place or product from the users’ perspective. However, during the process of collecting them, many reviews that are irrelevant to a given search keyword can be included in the results. Such irrelevant reviews may lead to distorted results in data analysis. In this paper, we discuss a method to detect irrelevant user reviews efficiently by combining various oversampling and machine learning algorithms. About 35,000 user reviews collected from 25 restaurants and 33 tourist attractions in Ulsan Metropolitan City, South Korea, were used for learning, where the ratio of irrelevant reviews in the two kinds of data sets was 53.7% and 71.6%, respectively. To deal with skewness in the collected reviews, oversampling algorithms such as SMOTE, Borderline-SMOTE, and ADASYN were used. To build a model for the detection of irrelevant reviews, RNN, LSTM, GRU, and BERT were adopted and compared, as they are known to provide high accuracy in text processing. The performance of the detection models was examined through experiments, and the results showed that the BERT model presented the best performance, with an F1 score of 0.965.

1. Introduction

Big data can be defined as data with 3V (volume, velocity, variety) properties [1]. Well-known examples include credit card transactions, mobile phone call logs, SNS feeds, and blog writings. Among them, user reviews, such as SNS feeds and blog writings, are widely used because they contain opinions, complaints, and requirements about a given place or product from the users’ point of view [2,3]. The reviews can easily be obtained using open search APIs provided by online portal services such as Google and Twitter. However, search results returned by the APIs may include a number of reviews that are not related to a given search term. For example, suppose that “Daewangam”, a popular tourist attraction in Ulsan Metropolitan City, South Korea, is given as a search keyword. Then, the following reviews can be included in the search results.
  • Ulsan Daewangam Headong Chinese restaurant, a fine dining place…
  • Introducing Haragum, a café near Daewangam park with a clean and spacious store…
The above reviews do not describe Daewangam itself, but introduce other places near it. Note that the rate of such irrelevant reviews is significantly high. In the case of tourist attractions, about 71.6% of the collected reviews were found to have poor relevance to a given search term (refer to Section 3). If these reviews are included in data analysis or opinion mining for popular place or product recommendation, the accuracy of the analysis can be degraded significantly. For example, suppose we perform a keyword analysis for Daewangam based on the reviews collected online. If irrelevant reviews similar to the above examples are included in the analysis, irrelevant expressions such as Chinese, restaurant, café, and Haragum will be recommended as keywords for Daewangam, which will confuse users. This example shows that it is essential to detect and filter out irrelevant reviews prior to data analysis.
Irrelevant reviews can be viewed as a sub-type of spam reviews discussed in literature. Jindal and Liu [4] introduced three types of spam reviews, as follows:
  • Type 1 (untruthful opinions): those that deliberately mislead readers by providing undeserving positive reviews or by giving malicious negative reviews.
  • Type 2 (reviews on brands only): those that do not provide any useful information on the target products in reviews, but only the brands or manufacturers of the products; these reviews are considered as spam because they are not targeted at specific products or places.
  • Type 3 (non-reviews): those that are not reviews at all, with two main sub-types: advertisements and other irrelevant texts containing no opinions.
Jindal and Liu noted that Type 2 and Type 3 reviews are easier to identify than Type 1 reviews. Consequently, most follow-up studies have focused on the detection of Type 1 reviews [5,6,7], while Types 2 and 3 have received relatively little attention. Meanwhile, irrelevant reviews can easily be found nowadays, as data analysis and opinion mining are actively conducted in various domains. Many applications, such as POI (Point-of-Interest) recommendation and sentiment analysis, require the analysis of a large amount of user reviews and use open search APIs to collect them [8,9,10]. Regarding this, Kim and Park [11] recently discussed the detection of irrelevant reviews using supervised learning, but they did not consider various combinations of machine learning algorithms, which is essential to build an efficient model.
In this paper, we discuss a method to detect irrelevant user reviews efficiently by combining various oversampling and machine learning algorithms. About 35,000 user reviews collected from 25 restaurants and 33 tourist attractions in Ulsan Metropolitan City, South Korea, were used for learning. The ratio of irrelevant reviews from the restaurants and tourist attractions was 53.7% and 71.6%, respectively. To deal with data skewness, Random Undersampler [12], Random Oversampler [13], SMOTE (Synthetic Minority Oversampling Technique) [14], Borderline-SMOTE [15], and ADASYN (Adaptive Synthetic Sampling Approach) [16] were used.
To build a model for the detection of irrelevant reviews, machine learning algorithms, including RNN (Recurrent Neural Network) [17], LSTM (Long Short-Term Memory) [18], GRU (Gated Recurrent Unit) [19], and BERT (Bidirectional Encoder Representation from Transformer) [20] were adopted. These algorithms are known to provide high accuracy in text processing, which was discussed in [21]. By combining the data imbalance processing algorithms and machine learning algorithms together, a total of 24 models were implemented and compared for the experiments.
The contributions of this paper can be summarized as follows:
  • Unlike existing studies focusing on the detection of Type 1 reviews, the proposed method addresses the efficient detection of Type 3 reviews using machine learning algorithms.
  • To ensure that the detection model properly reflects the characteristics of real data, 35,000 user reviews, collected from 25 restaurants and 33 tourist attractions in Ulsan, South Korea, were used for learning.
  • Various combinations of oversampling and machine learning algorithms were considered in order to build an effective model for detecting irrelevant reviews. A total of 24 models were implemented and compared to find the best-performing one.
The remainder of this paper is organized as follows: Section 2 introduces existing studies on the detection of spam reviews using machine learning. Section 3 describes the proposed method regarding the data preparation and model implementation for the detection of irrelevant user reviews. Section 4 provides the experimental results, which compare the performance of the detection models implemented by combining the oversampling and machine learning algorithms. Section 5 concludes the paper with future research directions.

2. Related Work

Review analysis is a representative application area of sentiment analysis [22]. Sentiment analysis, also referred to as sentiment classification, aims to extract opinions from a large number of unstructured texts and classify them into sentiment polarities: positive, neutral, or negative [23]. Conventional approaches usually focused on textual data [24]. Ligthart et al. [21] surveyed 112 papers discussing sentiment analysis and showed that LSTM (33.5%) is the most commonly used algorithm for text analysis, followed by GRU (8.77%), RNN (7.89%), and BERT (3.07%). With the advances in social media, it has become important to capture sentiment precisely across different modalities (i.e., textual, acoustic, and visual) [25]. Accordingly, recent research has focused more on aspect-based multimodal sentiment analysis [26,27,28].
In the case of review analysis, most of the existing studies have still focused on textual data, aiming to determine whether a given review is spam or ham (legitimate review). Table 1 summarizes the existing studies related to the review analysis in terms of source (learning) data, machine learning algorithm with the best performance, and prediction performance.
As mentioned before, the majority of the existing studies discussed the detection of Type 1 reviews. Ott et al. [29] obtained hotel reviews through Amazon Mechanical Turk (AMT), a crowdsourcing site, and extracted LIWC (Linguistic Inquiry and Word Count) features and bigrams from the reviews. Using these features, they achieved an accuracy of 0.898 using SVM. Shojaee et al. [30] adopted stylometric features, a mixture of lexical and syntactic features of the reviews, and obtained an F1 score of 0.84 using SVM for the hotel reviews presented by Ott et al. Feng et al. [31] extracted Context Free Grammar (CFG) parse trees from the hotel reviews and obtained an accuracy of 0.912 using SVM. Li et al. [32] extracted review sentiments (positive or negative) and reviewers’ behaviors from 6000 reviews collected from Epinions and obtained an F1 score of 0.631 using Naïve Bayes. Mukherjee et al. [33] suggested that higher accuracy can be achieved using the reviewers’ abnormal behavioral features together with the linguistic features of the reviews, and their SVM model showed an accuracy of 0.861 for the Yelp data. Li et al. [34] tried to increase accuracy by using reviewers’ temporal and spatial patterns and obtained an F1 score of 0.85 using SVM.
Recently, neural networks have been used more frequently to achieve higher accuracy. Barushka and Hajek [35] obtained an accuracy of 0.89 using a DNN (Deep Neural Network) for the hotel reviews of Ott et al. [29]. Li et al. [36] employed a CNN (Convolutional Neural Network) [37] for the dataset presented in [31] and achieved an F1 score of 0.823. Zhao et al. [38] also used a CNN and achieved an F1 score of 0.828 for 24,166 hotel reviews collected online. Shahariar et al. [39] showed that the LSTM model provided better performance than the CNN model and achieved an F1 score of 0.946 using LSTM for the hotel reviews discussed in the work of Ott et al., while Liu et al. [40] used bidirectional LSTM with feature combination and obtained an F1 score of 0.876 for the dataset from [31].
Several studies have attempted to combine two or more algorithms for better performance. Wang et al. [41] adopted Attention [42] to express the relationship between two features, in addition to CNN to express linguistic features. Their algorithm provided an F1 score of 0.889 and 0.912 for the hotel and restaurant reviews in the Yelp data, respectively. Ren and Zhang [43] combined CNN and bidirectional RNN and obtained an F1 score of 0.839 for the dataset from [31]. Bhuvaneshwari et al. [44] used CNN and LSTM to build a document vector and obtained an F1 score of 0.87 for the YelpZip data. Duma et al. [45] combined Transformers [46] with CNN and LSTM and achieved an F1 score of 0.965.
Studies on Type 2 and 3 reviews remain relatively scarce. Jindal and Liu [47] obtained an AUC (Area Under the Curve) of 0.987 using logistic regression for 470 reviews. Lau et al. [48] achieved an AUC of 0.95 using SVM for 1032 synthesized reviews. Although the two studies achieved high detection accuracy, it is difficult to say that the characteristics of real data are sufficiently reflected in the results. Recently, Kim and Park [11] used LSTM and BERT to build a detection model and achieved an F1 score of 0.930 for user reviews collected from tourist attractions, but their approach needs to be examined from more diverse perspectives, including data imbalance processing.
Meanwhile, Li et al. [49] indicated that it is difficult to identify the difference between deceptive (spam) and truthful (legitimate) reviews when using supervised learning and suggested that topic modeling can be used to reveal the patterns in a more interpretable way. Regarding this, Ya et al. [50] adopted LDA (Latent Dirichlet Allocation) topic modeling with an estimated degree of reviewer abnormality and achieved an F1 score of 0.93 for the Dianping reviews. Ahsan and Sharma [51] suggested an optimal feature set for spam detection from tweet messages, which can be used with LDA topic modeling. Jakupov et al. [52] noted that LDA does not take advantage of dense word representations, which can capture semantically meaningful regularities between words, and extended their topic modeling algorithm using lda2vec [53] and BERT to enhance accuracy.
Table 1. Summarization of the existing studies: source data, best algorithm, performance measure, and best detection score.
Type        | Ref. No. | Source Data                 | Best Algorithm             | Measure  | Best Score
Type 1      | [29]     | Hotel reviews from AMT      | SVM                        | Accuracy | 0.898
Type 1      | [30]     | Hotel reviews of [29]       | SVM                        | F1 score | 0.840
Type 1      | [31]     | Hotel reviews               | SVM                        | Accuracy | 0.912
Type 1      | [32]     | 6000 reviews from Epinions  | Naïve Bayes                | F1 score | 0.631
Type 1      | [33]     | Yelp data                   | SVM                        | Accuracy | 0.861
Type 1      | [34]     | Yelp data                   | SVM                        | F1 score | 0.850
Type 1      | [35]     | Hotel reviews of [29]       | DNN                        | Accuracy | 0.890
Type 1      | [36]     | Hotel reviews of [31]       | CNN                        | F1 score | 0.823
Type 1      | [38]     | 24,166 hotel reviews        | CNN                        | F1 score | 0.828
Type 1      | [39]     | Hotel reviews of [29]       | LSTM                       | F1 score | 0.946
Type 1      | [40]     | Hotel reviews of [31]       | LSTM                       | F1 score | 0.876
Type 1      | [41]     | Yelp data                   | Attention + CNN            | F1 score | 0.912
Type 1      | [43]     | Hotel reviews of [31]       | CNN + RNN                  | F1 score | 0.839
Type 1      | [44]     | YelpZip data                | CNN + LSTM                 | F1 score | 0.870
Type 1      | [45]     | Yelp data                   | Transformers + CNN + LSTM  | F1 score | 0.965
Type 2 or 3 | [47]     | 470 reviews                 | Logistic regression        | AUC      | 0.987
Type 2 or 3 | [48]     | 1032 synthesized reviews    | SVM                        | AUC      | 0.950
Type 2 or 3 | [11]     | Tourist attraction reviews  | BERT                       | F1 score | 0.930

3. Proposed Method

To perform learning in the proposed method, approximately 35,000 user reviews collected from 25 restaurants and 33 tourist attractions in Ulsan Metropolitan City, South Korea, were used. The user reviews for each place were collected using the blog search APIs [54] provided by Naver, a popular online portal service in South Korea. From the collection process, 58 data sets were prepared, each of which had a set of reviews for a restaurant or a tourist attraction. Table 2 and Table 3 show the number of reviews collected from restaurants and tourist attractions, respectively, in each district of Ulsan.
The review data sets were then tailored for learning. The preprocessing steps for learning include labeling, morphological analysis, and stopword removal. First, manual labeling was performed for supervised learning. In this step, each review in the data sets was manually labeled to indicate whether it is irrelevant or not. From the labeling, about 53.7% of the reviews collected from restaurants were found irrelevant, and about 71.6% of the reviews collected from tourist attractions were found irrelevant, as shown in Table 2 and Table 3.
The words in the reviews were then standardized by conducting morphological analysis. In this step, adjectives or verbs were converted into their basic forms. Note that user reviews often include proper nouns or new words reflecting recent trends, which are not included in a dictionary. Thus, instead of using conventional morphological analysis approaches based on a dictionary, we adopted a statistical approach using cohesion scoring [55,56], which can be calculated based on word frequencies. After the conversion, the stopwords were removed from the reviews. In this step, meaningless words such as articles were excluded. To improve accuracy, rarely occurring words were also excluded from the reviews. In the proposed method, words with fewer than three occurrences were removed.
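A minimal sketch of the rare-word filter is shown below; the function name is illustrative, and the reviews are assumed to have already been tokenized into lists of words.
from collections import Counter

def remove_rare_words(tokenized_reviews, min_count = 3):
    # Count how often each token appears across all reviews.
    counts = Counter(token for review in tokenized_reviews for token in review)
    # Keep only the tokens that occur at least min_count times.
    return [[token for token in review if counts[token] >= min_count]
            for review in tokenized_reviews]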
Table 4 and Table 5 show the statistics for the reviews of the restaurants and tourist attractions after the preprocessing, respectively. The number of unique tokens and the maximum number of tokens in a sentence are used for the parameter settings of the detection models that will be implemented below.
The refined reviews in each data set were then split into training and test data sets, with a ratio of 8:2. The training data set was again split into X_train and T_train, where the former represents the independent variables used to train the model, and the latter represents the dependent variable with category labels against the independent variables. In this case, X_train contains the user reviews, while T_train has the class labels for the reviews (1 if a review is irrelevant, i.e., noisy, and 0 otherwise). The test data set was also split into X_test and T_test in the same manner.
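A minimal sketch of this split, assuming the preprocessed reviews and their manual labels are held in illustrative lists named reviews and labels (the stratify option is one reasonable choice, not specified in the paper):
from sklearn.model_selection import train_test_split

# 8:2 split into training and test data; labels are 1 for irrelevant reviews, 0 otherwise.
X_train, X_test, T_train, T_test = train_test_split(
    reviews, labels, test_size = 0.2, random_state = 0, stratify = labels)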
To conduct learning, each review in X_train must be converted to a numerical vector. For this purpose, the Tokenizer class in Keras [57], an open source software library provided by Google for building artificial neural networks, was used as follows.
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 2000)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
The num_words parameter of Tokenizer denotes the maximum number of word tokens to keep, based on word frequency, i.e., the size of the token ID space. In the above code, 2000 was given for the parameter, which was sufficient for the training data, since the total number of unique tokens does not exceed 2000, as shown in Table 4 and Table 5. The fit_on_texts method creates the token index based on the word frequency of the reviews in X_train. The texts_to_sequences method transforms each text review into a sequence of integers representing the token indexes. The transformed integer sequences were stored in X_train_seq. The reviews in X_test were also converted into the corresponding integer sequences in the same manner.
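The paper does not show a padding step, but the fixed input length of 40 used by the models below, and the rectangular array expected by the resampling algorithms, imply that the integer sequences are padded to a uniform length. A minimal sketch, assuming Keras pad_sequences and an illustrative X_test_seq name for the converted test reviews:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad (or truncate) every review to the fixed length of 40 tokens assumed below.
X_train_seq = pad_sequences(X_train_seq, maxlen = 40)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen = 40)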
Data imbalance processing algorithms were then applied to the training data set, consisting of X_train_seq and T_train. To balance the skewed data, either undersampling of the majority class or oversampling of the minority class can be used. These algorithms were implemented with imbalanced-learn, a Python library built on top of scikit-learn [58]. To perform undersampling, the RandomUnderSampler class of the imblearn.under_sampling package was used as follows. The random_state parameter of the class was set to 0 to obtain the same data in each run, and default values were used for the other parameters. The resampled data for X_train_seq and T_train were stored in X_train_samp and T_train_samp, respectively.
from imblearn.under_sampling import RandomUnderSampler
sampler = RandomUnderSampler(random_state = 0)
X_train_samp, T_train_samp = sampler.fit_resample(X_train_seq, T_train)
To perform oversampling, RandomOverSampler, SMOTE, Borderline-SMOTE, and ADASYN were used. The following shows the code to apply RandomOverSampler to the training data; the class was imported from the imblearn.over_sampling package. The code for the other algorithms is the same except that the class name is changed to SMOTE, BorderlineSMOTE, or ADASYN.
from imblearn.over_sampling import RandomOverSampler
sampler = RandomOverSampler(random_state = 0)
X_train_samp, T_train_samp = sampler.fit_resample(X_train_seq, T_train)
The oversampled data are then fed into a model to detect irrelevant reviews. As mentioned earlier, machine learning algorithms such as RNN, LSTM, GRU, and BERT, which are known to provide high accuracy in text processing, were used in the proposed method. The RNN, LSTM, and GRU models can be implemented using Keras. For example, the LSTM model can be built as follows:
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(2000, 50, input_length = 40),
    keras.layers.LSTM(32),
    keras.layers.Dense(128, activation = 'relu'),
    keras.layers.Dense(1, activation = 'sigmoid')
])
# A compile step is required before fit; the optimizer and loss shown here are
# typical choices for binary classification (not specified in the paper).
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(X_train_samp, T_train_samp, epochs = 30)
The Embedding layer holds the corpus, i.e., the set of unique words appearing in the training data. Each word in the corpus is represented as a multi-dimensional vector that keeps information about its relationship to the other words in the reviews. The layer receives three parameters: the corpus size, the dimension of the word vector, and the maximum length of a sentence. In the above code, values of 2000, 50, and 40 were given, which were sufficient for the training data. In this case, each word is represented as a 50-dimensional vector, and a 40 × 50 matrix representing a given review is passed to the next layer. The LSTM layer receives the matrix and performs training to remember the sequence of the words in the review. The parameter 32 denotes the number of hidden nodes used to maintain the learning information, which can be increased if more information needs to be trained. The remaining two Dense layers are used for classification, i.e., to determine whether the review from the LSTM layer is noisy. Each Dense layer takes two parameters denoting the number of hidden nodes and the activation function. As an activation function, relu is commonly used in the middle layers to prevent the vanishing gradient problem, and sigmoid is used in the last layer to determine the class to which a given output belongs. The fit function trains the configured model on the given training data; training was set to run for up to 30 epochs.
The RNN and GRU models can be implemented in the same manner. To implement the RNN models, the keras.layers.SimpleRNN class can be used, instead of the keras.layers.LSTM in the above code. Similarly, the GRU model can be implemented using the keras.layers.GRU class. The rest of the code, except the class name, is the same.
The three algorithms mentioned above use only the given input data for learning. For example, the number of trainable parameters in the Embedding layer can be calculated by multiplying the corpus size by the word vector dimension; thus, in the above code, 2000 × 50 = 100,000 parameters were used for training. On the other hand, BERT improves the accuracy of text processing by using pre-trained knowledge in addition to the given input data. For example, when the basic multilingual model of BERT is used, the number of parameters used for learning in the embedding layer reaches approximately 91,000,000. This implies that BERT encodes much richer information about the corpus, which is the basis for its relatively higher accuracy compared to the other algorithms.
To implement the BERT model, ktrain [59] was used, which is a lightweight wrapper for Keras. The Transformer class in ktrain’s text package was used for the implementation. As a pre-trained model, the basic multilingual model of BERT was used to process the Korean text in the reviews. The maxlen parameter was set to 40, the same as in the other models, and default values were used for the other parameters.
from ktrain import text
transformer = text.Transformer('bert-base-multilingual-uncased', maxlen = 40, …)
model = transformer.get_classifier()
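The paper shows only the construction of the classifier; a sketch of how the remaining training steps could look with ktrain is given below, where the preprocessing calls, class names, learning rate, batch size, and number of epochs are assumptions rather than the settings used in the paper.
import ktrain
from ktrain import text

transformer = text.Transformer('bert-base-multilingual-uncased', maxlen = 40,
                               class_names = ['relevant', 'irrelevant'])
# X_train/X_test hold the raw review texts; T_train/T_test hold the 0/1 labels
# corresponding to the class names above.
train_data = transformer.preprocess_train(X_train, T_train)
val_data = transformer.preprocess_test(X_test, T_test)
model = transformer.get_classifier()
learner = ktrain.get_learner(model, train_data = train_data, val_data = val_data,
                             batch_size = 16)
learner.fit_onecycle(5e-5, 4)  # learning rate and number of epochs are illustrative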
Figure 1 shows the possible combinations of the data imbalance processing algorithms and the machine learning algorithms discussed above, which can be used to build a model for the detection of irrelevant reviews. Because the case where the original data are used without undersampling or oversampling is also included, a total of 24 models can be implemented and compared, as illustrated by the sketch below.
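As an illustration of how these combinations could be enumerated, the following sketch loops over the six data options and the four model builders; the build_* factory functions are hypothetical placeholders for the model definitions above, and the uniform Keras-style fit interface is a simplification (in the paper, the BERT model is trained through ktrain as shown earlier).
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

# None stands for using the original (non-resampled) training data.
samplers = {'Original': None,
            'RandomUnderSampler': RandomUnderSampler(random_state = 0),
            'RandomOverSampler': RandomOverSampler(random_state = 0),
            'SMOTE': SMOTE(random_state = 0),
            'Borderline-SMOTE': BorderlineSMOTE(random_state = 0),
            'ADASYN': ADASYN(random_state = 0)}
# build_rnn, build_lstm, build_gru, and build_bert are illustrative factory functions.
builders = {'RNN': build_rnn, 'LSTM': build_lstm, 'GRU': build_gru, 'BERT': build_bert}

models = {}
for s_name, sampler in samplers.items():
    if sampler is None:
        X_samp, T_samp = X_train_seq, T_train
    else:
        X_samp, T_samp = sampler.fit_resample(X_train_seq, T_train)
    for m_name, build in builders.items():
        model = build()                          # 6 x 4 = 24 models in total
        model.fit(X_samp, T_samp, epochs = 30)
        models[(s_name, m_name)] = model         # evaluated in Section 4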

4. Experimental Results

In this section, we compare the performance of the 24 detection models in Figure 1 through experiments with the two kinds of data sets discussed in the previous section, including 25 data sets for restaurants and 33 data sets for tourist attractions. To investigate the performance more precisely, the precision, recall, F1 score, and balanced accuracy of the models were compared. For each model with a data set, five-fold cross-validation was performed at each run, and the average score of five runs was used for the performance comparison. Experiments were conducted on Google Colab with a V100 GPU and 40 GB of memory.

4.1. Performance Measure

In general, a confusion matrix is used to measure the performance of binary classification models. It consists of four values: TP (True Positive), FP (False Positive), FN (False Negative), and TN (True Negative). TP is the number of records whose labels are predicted as positive and are actually positive. FP is the number of records whose labels are predicted as positive but are actually negative. TN and FN are interpreted analogously.
A traditional measure to evaluate classification performance is accuracy, which is defined as the number of classifications that a model correctly predicts, divided by the total number of predictions.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
Accuracy may not be an appropriate measure when the data are skewed toward one class. For example, suppose that the ratio of the positive class is 5%. One way to achieve high accuracy in this case is simply to predict all records as negative, which will result in 95% accuracy. On the other hand, if the opposite prediction is made, the accuracy drops to 5%. This example indicates that FP and FN should be considered together when measuring classification performance. To measure the performance from the perspective of FP, the following precision can be used.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
To measure the performance from the perspective of FN, recall can be used.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
A simple way to measure performance considering both FP and FN is to combine precision and recall. The F1 score is defined as the harmonic mean of precision and recall, as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The F1 score can effectively express the performance of models when the positive class is a minority with a small ratio; however, it may not be effective in the opposite case. To complement this, balanced accuracy can be used, which is defined as the arithmetic mean of the true positive rate (also called sensitivity or recall) and the true negative rate (also called specificity). Balanced accuracy can express the performance of models appropriately when the positive class is a majority.
$$\mathrm{Balanced\ accuracy} = \frac{1}{2} \times \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)$$
In this paper, precision, recall, the F1 score, and balanced accuracy were used together to validate the performance of each model.
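As an illustration (not part of the paper’s code), the four measures can be computed with scikit-learn; the toy labels below reproduce the all-negative predictor from the 5%-positive example above.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             balanced_accuracy_score)

# 5% positive records, and a model that simply predicts every record as negative.
T_true = np.array([1] * 5 + [0] * 95)
T_pred = np.zeros(100, dtype = int)

print(precision_score(T_true, T_pred, zero_division = 0))   # 0.0
print(recall_score(T_true, T_pred, zero_division = 0))      # 0.0
print(f1_score(T_true, T_pred, zero_division = 0))          # 0.0
print(balanced_accuracy_score(T_true, T_pred))              # 0.5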

4.2. Precision

We first examined the models’ precision from the perspective of the data imbalance processing algorithms and the machine learning algorithms. Table 6 shows the mean and standard deviation of the models’ precision, obtained from the 25 review data sets for the restaurants in Ulsan, where the ratio of irrelevant reviews is about 53.7%. Each score in the table represents an average value over the 25 data sets, with five runs performed for each data set; the same applies to the tables below that report the models’ recall, F1 scores, and balanced accuracy for the same restaurant review data sets. The precision of the models using RandomUnderSampler was too low (less than 0.5, or even 0, depending on the model), so its scores were excluded.
In terms of data imbalance processing algorithms, the models using the original data presented the best performance, with an average precision of 0.963, while the performance of the models with ADASYN was the worst, with an average of 0.940. In terms of machine learning algorithms, the BERT model provided the best performance, with an average of 0.963, while the LSTM model performed the worst, with an average of 0.919. Among the individual models, the GRU model with the original data provided the highest precision, at 0.967.
Table 7 shows the mean and standard deviation of the models’ precision, obtained from the 33 review data sets for the tourist attractions in Ulsan, where the ratio of irrelevant reviews is about 71.6%. Each score in the table represents an average value over the 33 data sets, with five runs performed for each data set; the same applies to the tables below that report the models’ recall, F1 scores, and balanced accuracy for the same tourist attraction review data sets. The precision of the models using RandomUnderSampler was again too low (less than 0.5, or even 0, depending on the model), so its scores were excluded.
In terms of data imbalance processing algorithms, the average precision of the algorithms was similar, ranging from 0.965 to 0.969. Among the machine learning algorithms, the GRU model provided the best performance, with an average of 0.973, while the RNN model performed the worst, with an average of 0.961. Among the individual models, the GRU model with Borderline-SMOTE provided the highest precision, at 0.976.
Figure 2 compares the precision of the models in terms of the data imbalance processing algorithms. The left-hand figure shows the models’ precision obtained from the 25 review data sets for the restaurants. The LSTM model was the most affected by oversampling, with its precision dropping from 0.966 to as low as 0.897. The precision of the GRU model also dropped from 0.967 to as low as 0.941 after oversampling. On the other hand, there was no significant difference in the precision of the models obtained from the 33 review data sets for the tourist attractions, as shown in the figure on the right.
To figure out why the performance of the LSTM and GRU models was degraded, we examined the changes in the TP and FP values of the models after oversampling was applied. Figure 3 compares the TP and FP values of the models obtained from the restaurant reviews. In both the LSTM and GRU models, the TP values decreased, while the FP values increased after oversampling. This leads to the performance degradation of the two models.
In general, when oversampling is applied, the ratio of the minority class increases. In our experimental data, the negative class with legitimate reviews is the minority, so its ratio increases from 46.3% to 50% after oversampling. This increases the possibility that a given record in the test data set is classified as negative. As a result, the TN and FN values are expected to increase, while the TP and FP values decrease. In Figure 3, however, the FP values increased; more study is necessary in future work to find out the reason.
When the data skewness was higher, both TP and FP decreased. Figure 4 compares the TP and FP values of the models obtained from the tourist attraction reviews. In the RNN, LSTM, and GRU models, the TP values decreased when RandomOverSampler, SMOTE, and Borderline-SMOTE were applied. In these cases, the FP values also decreased, which is in line with our expectation. The reduction rates of the TP and FP values were small, with a maximum of 0.09. As a result, no significant difference in precision was observed after oversampling.
The above results regarding the models’ precision can be summarized as follows:
  • When oversampling was applied to nearly balanced data, such as the review data sets for the restaurants, the precision of the LSTM and GRU models degraded because the TP values of the models decreased while their FP values increased.
  • When oversampling was applied to skewed data, such as the review data sets for the tourist attractions, there was no significant difference in the models’ precision; both TP and FP decreased, and their reduction rates were small.
  • The performance of the BERT model was not affected by oversampling algorithms.

4.3. Recall

Table 8 compares the mean and standard deviation of the models’ recall, obtained from 25 review data sets for the restaurants in Ulsan. Among data imbalance processing algorithms, the models using the original data presented the best performance, with an average recall of 0.968, while the performance of the models with ADASYN was the worst, with an average of 0.958. In terms of machine learning algorithms, the BERT model provided the best performance, with an average of 0.969, while the LSTM model performed the worst, with an average of 0.956. Among the individual models, the LSTM model with the original data provided the highest recall, at 0.975. Compared to the precision shown in Table 6, the performance gap between the algorithms was not so significant in this case.
Table 9 compares the mean and standard deviation of the models’ recall, obtained from 33 review data sets for the tourist attractions. In terms of data imbalance processing algorithms, the models using the original data and RandomOverSampler presented the best performance, with an average of 0.948, while the performance of the models with ADASYN was the worst, with an average of 0.883. Among the machine learning algorithms, the BERT model provided the best performance, with an average of 0.958, while the RNN model performed the worst, with an average of 0.873. Among the individual models, the BERT model using SMOTE provided the highest recall, at 0.961.
Figure 5 compares the recall of the models in terms of the data imbalance processing algorithms. The figures on the left and right show the recall obtained from the restaurant and tourist attraction data sets, respectively. Note that the patterns of the two figures contrast with those of Figure 2. In the left-hand figure, there was no significant difference in the models’ performance. In the right-hand figure, on the other hand, the models’ performance was significantly affected by the oversampling algorithms. The RNN model was the most affected by oversampling, with its recall dropping from 0.931 to as low as 0.825. The recall of the LSTM and GRU models also dropped from 0.951 to 0.887 and from 0.952 to 0.860, respectively, after oversampling. The performance of the BERT model was not affected by the oversampling algorithms. Interestingly, no performance degradation was observed when RandomOverSampler was applied.
To figure out the reason for the performance degradation, we examined the changes in the TP and FN values of the models after oversampling was applied. Figure 6 compares the TP and FN values of the models obtained from the restaurant reviews. As shown on the left side of Figure 5, minor performance degradation was observed only in the LSTM model; its TP values decreased, while its FN values increased after oversampling, causing the model’s recall to drop from 0.975 to 0.949.
Figure 7 compares the TP and FN values of the models obtained from the tourist attraction reviews. In the RNN, LSTM, and GRU models, the TP values decreased when SMOTE, Borderline-SMOTE, and ADASYN were applied. Note that the FN values also drastically increased in these cases. This led to the significant drop of the models’ recall. For example, in case of the RNN model with SMOTE, the TP values decreased from 475 to 423, while the FN values increased from 35 to 87. As a result, the recall of the model significantly dropped from 0.931 to 0.835. The situation of the other models, except BERT, was similar. The performance of the BERT model was not affected by oversampling algorithms. In addition, the performance degradation was not so severe when RandomOverSampler was applied.
The above results regarding the models’ recall can be summarized as follows:
  • When oversampling was applied to nearly balanced data, such as the review data sets for the restaurants, there was no significant difference in the models’ recall.
  • When oversampling was applied to skewed data, such as the review data sets for the tourist attractions, the recall of the RNN, LSTM, and GRU models degraded because the TP values of the models decreased while their FN values drastically increased.
  • The performance of the BERT model was not affected by oversampling algorithms.

4.4. F1 Scores

Table 10 shows the mean and standard deviation of the models’ F1 scores, obtained from 25 review data sets for the restaurants. In terms of data imbalance processing algorithms, the models using the original data presented the best performance, with an average F1 score of 0.965, while the performance of the models with ADASYN was the worst, with an average of 0.946. In terms of machine learning algorithms, the BERT model provided the best performance, with an average of 0.966, while the LSTM model performed the worst, with an average of 0.933. Among the individual models, the LSTM and GRU models with the original data provided the highest F1 score, at 0.970.
Table 11 compares the mean and standard deviation of the models’ F1 scores obtained from 33 review data sets for the tourist attractions. Among the data imbalance processing algorithms, the models using the original data presented the best performance, with an average F1 score of 0.957, while the performance of models with ADASYN was the worst, with an average of 0.921. Among the machine learning algorithms, the BERT model provided the best performance, with an average of 0.963, while the RNN model performed the worst, with an average of 0.913. Among the individual models, the BERT model using SMOTE presented the highest F1 score, at 0.964.
The results of Table 10 and Table 11 require further comparative analyses. For example, the performance difference between the LSTM and BERT models using the original data in Table 11 is very small, and it is not clear whether the difference is due to chance or to some other factor of interest. To clarify this, a paired t-test was performed to examine the statistical significance of the differences between the experimental models. For this purpose, the significance level was set to 0.05, and the degrees of freedom were set to 4, since each average was taken from five observations. None of the compared model pairs had t-values within the interval [−2.776, +2.776] from the t-table, which would be needed to accept the null hypothesis [60]. For example, the t-value calculated from the LSTM and BERT models using the original data, where the performance difference was the smallest in Table 11, was about 4.379. When the paired t-test was performed on the two models with the largest performance difference, namely the BERT model with SMOTE and the RNN model with ADASYN, the t-value increased to 9.662. The results showed that the models are statistically different from each other.
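A sketch of such a paired t-test over the five runs, using scipy; the score values below are illustrative and are not taken from the paper.
from scipy import stats

# F1 scores of two models over the same five runs (illustrative values).
scores_a = [0.956, 0.957, 0.958, 0.956, 0.957]
scores_b = [0.962, 0.963, 0.964, 0.962, 0.963]

t_value, p_value = stats.ttest_rel(scores_a, scores_b)
# With 4 degrees of freedom, |t| > 2.776 means the null hypothesis of equal
# means is rejected at the 0.05 significance level.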
Figure 8 compares the F1 scores of the models in terms of the data imbalance processing algorithms. The left-hand figure shows the models’ F1 scores obtained from the restaurant reviews. In this case, precision is the key factor that determines a model’s F1 score. For example, the precision of the LSTM and GRU models deteriorated significantly after oversampling, as shown on the left side of Figure 2, which led to the degradation of the F1 scores of those models. The performance of the other two models, the RNN and BERT models, was not affected by oversampling. The right side of Figure 8 shows the models’ F1 scores obtained from the tourist attraction reviews. In this case, recall is the key factor that determines a model’s F1 score. For example, the recall of the RNN, LSTM, and GRU models deteriorated significantly after oversampling, as shown on the right side of Figure 5, which resulted in the degradation of the F1 scores of those models. The performance of the BERT model was not affected by oversampling.
The above results regarding the models’ F1 scores can be summarized as follows:
  • None of the oversampling algorithms improved the performance of the models. The models using the original data provided the best F1 scores for both kinds of data sets.
  • When oversampling was applied to nearly balanced data, such as the review data sets for the restaurants, the F1 scores of the models were heavily influenced by precision.
  • When oversampling was applied to skewed data, such as the review data sets for the tourist attractions, the F1 scores of the models were significantly influenced by recall.
  • The average F1 score of the models for the nearly balanced data was 0.953, which was higher than the average F1 score of the models for the skewed data, whose score was 0.937.
  • The BERT model provided the best performance, with an average F1 score of 0.965. Its performance was not affected by the oversampling algorithms.

4.5. Balanced Accuracy

Table 12 shows the mean and standard deviation of the models’ balanced accuracy, obtained from 25 review data sets for the restaurants. The results show the same pattern as the F1 scores in Table 10. In terms of data imbalance processing algorithms, the models using the original data presented the best performance, with an average score of 0.961, while the performance of models with ADASYN was the worst, with an average of 0.932. In terms of the machine learning algorithms, the BERT model provided the best performance with an average of 0.962, while the LSTM model performed the worst, with an average of 0.912. Among the individual models, the LSTM and GRU models with the original data provided the highest accuracy, at 0.967.
Table 13 compares the mean and standard deviation of the models’ balanced accuracy, obtained from the 33 review data sets for the tourist attractions. The results also show the same pattern as the F1 scores in Table 11. Among the data imbalance processing algorithms, the models using the original data presented the best performance, with an average score of 0.933, while the performance of the models with ADASYN was the worst, with an average of 0.901. Among the machine learning algorithms, the BERT model provided the best performance, with an average of 0.938, while the RNN model performed the worst, with an average of 0.891. Among the individual models, the BERT model using SMOTE presented the highest balanced accuracy, at 0.940.
Figure 9 compares the balanced accuracy of the models in terms of the data imbalance processing algorithms. The results show the same pattern as the F1 scores in Figure 8; the only difference is that each model’s balanced accuracy is lower than its F1 score.

5. Conclusions and Future Work

In this paper, we discussed a method to detect irrelevant user reviews efficiently by combining various oversampling and machine learning algorithms. About 35,000 user reviews collected from 25 restaurants and 33 tourist attractions in Ulsan were used for learning, where the ratio of irrelevant reviews in the two kinds of data sets was about 53.7% and 71.6%, respectively. To deal with the data skewness in the collected reviews, data imbalance processing algorithms such as RandomUnderSampler, RandomOverSampler, SMOTE, Borderline-SMOTE, and ADASYN were adopted. To build a model for the detection of irrelevant reviews, RNN, LSTM, GRU, and BERT were used, which are known to provide good performance in text processing. By combining these algorithms, a total of 24 detection models were implemented and compared to find the best-performing one. The performance of the models was examined through experiments with the two kinds of review data sets. The results can be summarized as follows:
  • When oversampling algorithms were applied to the nearly balanced data, such as the review data sets for the restaurants where the ratio of irrelevant reviews was about 53.7%, the precision of the LSTM and GRU models significantly degraded. In this case, the F1 scores of the models were heavily influenced by precision.
  • When oversampling algorithms were applied to the skewed data, such as the review data sets for the tourist attractions where the ratio of irrelevant reviews was about 71.6%, the recall of the RNN, LSTM, and GRU models significantly degraded. In this case, the F1 scores of the models were heavily influenced by recall.
  • None of the oversampling algorithms improved the performance of the models. The models using the original data provided the best F1 scores for both kinds of review data sets.
  • The average F1 score of the models for the nearly balanced data was 0.953, which was higher than the average F1 score of the models for the skewed data, whose score was 0.937.
  • The BERT model provided the best performance, with an average F1 score of 0.965. In addition, its performance was not affected by oversampling algorithms.
As summarized, the BERT model is well suited for the detection of irrelevant reviews. Its superior performance can be explained by the fact that BERT uses pre-trained knowledge for learning, in addition to the given input data. As discussed in Section 3, when the basic multilingual model of BERT is used, the number of parameters used for learning in the embedding layer reaches approximately 91,000,000, which is much larger than the 100,000 parameters when using LSTM. This also implies that other pre-trained language models, such as RoBERTa [61], XLNet [62], ALBERT [63], T5 [64], and ELECTRA [65], can be used for the efficient detection of irrelevant reviews.
We also observed that conventional oversampling algorithms, such as SMOTE, Borderline-SMOTE, and ADASYN, could not improve the models’ performance when processing text reviews. This implies that the new records generated by oversampling do not properly reflect the original contexts and act as noise during classification. Interestingly, the performance of the BERT model did not degrade after oversampling. Regarding this, further analysis is required to understand how each word is mapped and oversampled when these algorithms are applied.
The discussion in this paper was limited to the detection of irrelevant reviews, especially when the collected reviews have opinions about popular places, i.e., when a search term to collect reviews is the name of a popular tourist attraction, restaurant, or cafe. In this case, irrelevant reviews commonly have opinions about other places near a given target place. On the other hand, when the collected reviews have opinions about popular menus, products, or other themes, rather than the places discussed in this paper, the characteristics of the irrelevant reviews may differ. For example, when the name of a restaurant’s menu is given as a search term, the search results may contain reviews introducing not only the menu of the given restaurant, but also those of other popular restaurants. In this case, a more sophisticated model, such as an ensemble model combining multiple classifiers, can be adopted to deal with the diverse types of irrelevant reviews, whose characteristics may vary depending on a given search term.
Regarding the above discussion, we plan to continue our research to identify more types of irrelevant reviews that can be collected online and used for data analysis. In addition, we will continue research on using various pre-trained language models, such as RoBERTa, XLNet, ALBERT, T5, and ELECTRA, for the efficient detection of irrelevant reviews.

Author Contributions

Methodology, C.K.; Writing—original draft, H.G.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sagiroglu, S.; Sinanc, D. Big data: A review. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA, 20–24 May 2013; pp. 42–47. [Google Scholar]
  2. Sachdeva, N.; McAuley, J. How useful are reviews for recommendation? A critical review and potential improvements. In Proceedings of the SIGIR ‘20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 1845–1848. [Google Scholar]
  3. Xu, Z.; Zeng, H.; Ai, Q. Understanding the effectiveness of reviews in e-commerce top-N recommendation. In Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval, Virtual, 11–15 July 2021; pp. 149–155. [Google Scholar]
  4. Jindal, N.; Liu, B. Analyzing and detecting review spam. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 547–552. [Google Scholar]
  5. Crawford, M.; Khoshgoftaar, T.M.; Prusa, J.D.; Richter, A.N.; Al Najada, H. Survey of review spam detection using machine learning techniques. J. Big Data 2015, 2, 23. [Google Scholar] [CrossRef]
  6. He, L.; Wang, X.; Chen, H.; Xu, G. Online spam review detection: A survey of literature. Hum.-Centric Intell. Syst. 2022, 2, 14–30. [Google Scholar] [CrossRef]
  7. Mewada, A.; Dewang, R.K. A comprehensive survey of various methods in opinion spam detection. Multimed. Tools Appl. 2023, 82, 13199–13239. [Google Scholar] [CrossRef]
  8. Diaz-Garcia, J.A.; Ruiz, M.D.; Martin-Bautista, M.J. NOFACE: A new framework for irrelevant content filtering in social media according to credibility and expertise. Expert Syst. Appl. 2022, 208, 118063. [Google Scholar] [CrossRef]
  9. Pezoa-Fuentes, C.; García-Rivera, D.; Matamoros-Rojas, S. Sentiment and emotion on Twitter: The case of the global consumer electronics industry. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 765–776. [Google Scholar] [CrossRef]
  10. Patel, R.; Passi, K. Sentiment analysis on Twitter data of world cup soccer tournament using machine learning. IoT 2020, 1, 218–239. [Google Scholar] [CrossRef]
  11. Kim, H.G.; Park, Y.H. Efficient detection of noise reviews over a large number of places. IEEE Access 2023, 11, 114390–114402. [Google Scholar] [CrossRef]
  12. Random Undersampler. Available online: https://imbalanced-learn.org/stable/under_sampling.html (accessed on 14 June 2024).
  13. Random Oversampler. Available online: https://imbalanced-learn.org/stable/over_sampling.html (accessed on 14 June 2024).
  14. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  15. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, ICIC 2005, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  16. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, Hong Kong, China, 1–6 June 2008; pp. 1322–1328. [Google Scholar]
  17. Yadav, S.P.; Zaidi, S.; Mishra, A.; Yadav, V. Survey on machine learning in speech emotion recognition and vision systems using a recurrent neural network (RNN). Arch. Comput. Methods Eng. 2022, 29, 1753–1770. [Google Scholar] [CrossRef]
  18. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  19. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  20. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  21. Ligthart, A.; Catal, C.; Tekinerdogan, B. Systematic reviews in sentiment analysis: A tertiary study. Artif. Intell. Rev. 2021, 54, 4997–5053. [Google Scholar] [CrossRef]
  22. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
  23. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–4 November 2016; pp. 606–615. [Google Scholar]
  24. Xu, N.; Mao, W.; Chen, G. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 371–378. [Google Scholar]
  25. Yu, Y.; Lin, H.; Meng, J.; Zhao, Z. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms 2016, 9, 41. [Google Scholar] [CrossRef]
  26. Gu, D.; Wang, J.; Cai, S.; Yang, C.; Song, Z.; Zhao, H.; Xiao, L.; Wang, H. Targeted aspect-based multimodal sentiment analysis: An attention capsule extraction and multi-head fusion network. IEEE Access 2021, 9, 157329–157336. [Google Scholar] [CrossRef]
  27. Liu, H.; Li, X.; Lu, W.; Cheng, K.; Liu, X. Graph augmentation networks based on dynamic sentiment knowledge and static external knowledge graphs for aspect-based sentiment analysis. Expert Syst. Appl. 2024, 251, 123981. [Google Scholar] [CrossRef]
  28. Xiao, L.; Xue, Y.; Wang, H.; Hu, X.; Gu, D.; Zhu, Y. Exploring fine-grained syntactic information for aspect-based sentiment classification with dual graph neural networks. Neurocomputing 2022, 471, 48–59. [Google Scholar] [CrossRef]
  29. Ott, M.; Choi, Y.; Cardie, C.; Hancock, J.T. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 21 June 2011; pp. 309–319. [Google Scholar]
  30. Shojaee, S.; Murad, M.A.A.; Azman, A.B.; Sharef, N.M.; Nadali, S. Detecting deceptive reviews using lexical and syntactic features. In Proceedings of the 13th International Conference on Intelligent Systems Design and Applications, Bangi, Malaysia, 8–10 December 2013; pp. 53–58. [Google Scholar]
  31. Feng, S.; Banerjee, R.; Choi, Y. Syntactic stylometry for deception detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Republic of Korea, 8–14 July 2012; pp. 171–175. [Google Scholar]
  32. Li, J.; Ott, M.; Cardie, C.; Hovy, E. Towards a general rule for identifying deceptive opinion spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, 22–27 June 2014; pp. 1566–1576. [Google Scholar]
  33. Mukherjee, A.; Venkataraman, V.; Liu, B.; Glance, N. What yelp fake review filter might be doing? In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media, Cambridge, MA, USA, 11 July 2013; pp. 409–418. [Google Scholar]
  34. Li, H.; Chen, Z.; Mukherjee, A.; Liu, B.; Shao, J. Analyzing and detecting opinion spam on a large-scale dataset via temporal and spatial patterns. In Proceedings of the 9th International AAAI Conference on Web and Social Media, Oxford, UK, 26–29 May 2015; pp. 634–637. [Google Scholar]
  35. Barushka, A.; Hajek, P. Review spam detection using word embeddings and deep neural networks. In Proceedings of the Artificial Intelligence Applications and Innovations: 15th IFIP WG 12.5 International Conference, AIAI 2019, Hersonissos, Crete, Greece, 24–26 May 2019; pp. 340–350. [Google Scholar]
  36. Li, L.; Ren, W.; Qin, B.; Liu, T. Learning document representation for deceptive opinion spam detection. In Proceedings of the 14th China National Conference on Computational Linguistics, Guangzhou, China, 13–14 November 2015; pp. 393–404. [Google Scholar]
  37. O’Shea, K.; Nash, R. An introduction to convolutional neural network. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  38. Zhao, S.; Xu, Z.; Liu, L.; Guo, M.; Yun, J. Towards accurate deceptive opinions detection based on word order-preserving CNN. Math. Probl. Eng. 2018, 2018, 2410206. [Google Scholar]
  39. Shahariar, G.M.; Biswas, S.; Omar, F.; Shah, F.M.; Binte Hassan, S. Spam review detection using deep learning. In Proceedings of the IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; pp. 27–33. [Google Scholar]
  40. Liu, W.; Jing, W.; Li, Y. Incorporating feature representation into BiLSTM for deceptive review detection. Computing 2020, 102, 701–715. [Google Scholar] [CrossRef]
  41. Wang, X.; Liu, K.; Zhao, J. Detecting deceptive review spam via attention-based neural networks. In Proceedings of the Natural Language Processing and Chinese Computing: 6th CCF International Conference, NLPCC 2017, Dalian, China, 8–12 November 2017; pp. 866–876. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar]
  43. Ren, Y.; Zhang, Y. Deceptive opinion spam detection using neural network. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 140–150. [Google Scholar]
  44. Bhuvaneshwari, P.; Rao, A.N.; Robinson, Y.H. Spam review detection using self-attention based CNN and bi-directional LSTM. Multimed. Tools Appl. 2021, 80, 18107–18124. [Google Scholar] [CrossRef]
  45. Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Tchaye-Kondi, J.; Yusuf, A.A. A deep hybrid model for fake review detection by jointly leveraging review text, overall ratings, and aspect ratings. Soft Comput. 2023, 27, 6281–6296. [Google Scholar] [CrossRef]
  46. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  47. Jindal, N.; Liu, B. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 219–230. [Google Scholar]
  48. Lau, R.Y.K.; Liao, S.Y.; Kwok, R.C.-W.; Xu, K.; Xia, Y.; Li, Y. Text mining and probabilistic language modeling for online review spam detection. ACM Trans. Manag. Inf. Syst. 2011, 2, 1–30. [Google Scholar] [CrossRef]
  49. Li, J.; Cardie, C.; Li, S. Topicspam: A topic-model based approach for spam detection. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013; pp. 217–221. [Google Scholar]
  50. Ya, Z.; Qingqing, Z.; Yuhan, W.; Shuai, Z. LDA_RAD: A Spam review detection method based on topic model and reviewer anomaly degree. J. Phys. Conf. Ser. 2020, 1550, 022008. [Google Scholar] [CrossRef]
  51. Ahsan, M.; Sharma, T.P. Spams classification and their diffusibility prediction on Twitter through sentiment and topic models. Int. J. Comput. Appl. 2022, 44, 365–375. [Google Scholar] [CrossRef]
  52. Jakupov, A.; Mercadal, J.; Zeddini, B.; Longhi, J. Analyzing deceptive opinion spam patterns: The topic modeling approach. In Proceedings of the 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), Macao, China, 31 October–2 November 2022; pp. 1251–1261. [Google Scholar]
53. Moody, C.E. Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv 2016, arXiv:1605.02019. [Google Scholar]
  54. Naver Blog Search API. Available online: https://developers.naver.com/docs/serviceapi/search/blog/blog.md (accessed on 14 June 2024).
  55. Jin, Z.; Tanaka-Ishii, K. Unsupervised segmentation of Chinese text by use of branching entropy. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, 18 July 2006; pp. 428–435. [Google Scholar]
  56. Kim, H.G. Efficient keyword extraction from social big data based on cohesion scoring. J. KSCI 2020, 25, 87–94. [Google Scholar]
  57. Keras. Available online: https://www.tensorflow.org/guide/keras (accessed on 14 June 2024).
  58. Scikit-Learn. Available online: https://en.wikipedia.org/wiki/Scikit-learn (accessed on 14 June 2024).
  59. Maiya, A.S. Ktrain: A low-code library for augmented machine learning. J. Mach. Learn. Res. 2022, 23, 7070–7075. [Google Scholar]
  60. Student’s t-Distribution. Available online: https://en.wikipedia.org/wiki/Student%27s_t-distribution (accessed on 14 June 2024).
61. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
62. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 1–18. [Google Scholar]
  63. Liu, H.; Singh, V.; Filipiuk, M.; Hari, S.K.S. ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures. IEEE Open J. Comput. Soc. 2024, 1–12. [Google Scholar] [CrossRef]
  64. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv 2020, arXiv:2010.11934. [Google Scholar]
65. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
Figure 1. Combination of data imbalance processing algorithms and machine learning algorithms that can be used to implement a model for the detection of irrelevant reviews.
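To make the combination in Figure 1 concrete, the following is a minimal sketch (not the authors' exact pipeline) that pairs each oversampling algorithm with a classifier using the imbalanced-learn and scikit-learn libraries. The synthetic data set, the roughly 70/30 class skew, and the logistic-regression classifier are illustrative stand-ins only; the paper itself trains RNN, LSTM, GRU, and BERT models on tokenized review text.

```python
# Sketch: combine an oversampler with a classifier for skewed review labels.
# Synthetic data and a linear classifier are used purely for illustration.
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Roughly 70/30 class skew, similar to the tourist-attraction review sets.
X, y = make_classification(n_samples=2000, n_features=50, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

for sampler in (None, RandomOverSampler(random_state=42), SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42), ADASYN(random_state=42)):
    # Oversample the minority class in the training split only.
    X_bal, y_bal = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    y_hat = clf.predict(X_te)
    name = "Original data" if sampler is None else type(sampler).__name__
    print(f"{name:20s} F1={f1_score(y_te, y_hat):.3f} "
          f"BalAcc={balanced_accuracy_score(y_te, y_hat):.3f}")
```

In the paper's setting, the classifier in this loop would be replaced by one of the sequence models, with the oversampler applied to the vectorized training reviews before fitting.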
Figure 2. Comparison of the models’ precision in terms of data imbalance processing algorithms obtained from (a) 25 review data sets for the restaurants and (b) 33 review data sets for the tourist attractions.
Figure 3. Comparison of TP and FP values of the models obtained from 25 review data sets for the restaurants where the ratio of irrelevant reviews is about 53.7%.
Figure 4. Comparison of TP and FP values of the models obtained from 33 review data sets for the tourist attractions where the ratio of irrelevant reviews is about 71.6%.
Figure 5. Comparison of the models’ recall in terms of data imbalance processing algorithms obtained from (a) 25 review data sets for the restaurants and (b) 33 review data sets for the tourist attractions.
Figure 6. Comparison of the TP and FN values of the models obtained from 25 review data sets for the restaurants where the ratio of irrelevant reviews is about 53.7%.
Figure 7. Comparison of the TP and FN values of the models obtained from 33 review data sets for the tourist attractions where the ratio of irrelevant reviews is about 71.6%.
Figure 8. Comparison of the models’ F1 scores in terms of data imbalance processing algorithms obtained from (a) 25 review data sets for the restaurants and (b) 33 review data sets for the tourist attractions.
Figure 9. Comparison of the models’ balanced accuracy in terms of data imbalance processing algorithms, obtained from (a) 25 review data sets for the restaurants and (b) 33 review data sets for the tourist attractions.
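For reference, the precision, recall, F1 score, and balanced accuracy compared in Figures 2–9 (and reported in Tables 6–13 below) follow the standard definitions, where TP, FP, FN, and TN count true positives, false positives, false negatives, and true negatives, assuming the irrelevant class is treated as the positive class, consistent with the TP/FP and TP/FN comparisons in Figures 3–7:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Balanced\ accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$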
Table 2. Number of restaurants and reviews collected from each district in Ulsan Metropolitan City, South Korea.
                             Namgu     Dongu     Bukgu     Ulju      Joonggu   Total
No. of places                6         5         6         3         5         25
No. of irrelevant reviews    2199      2020      2054      922       2189      9384
No. of legitimate reviews    1822      1762      2173      750       1597      8104
Ratio of irrelevant reviews  54.7%     53.4%     48.6%     55.1%     57.8%     53.7%
Table 3. Number of tourist attractions and reviews collected from each district in Ulsan.
                             Namgu     Dongu     Bukgu     Ulju      Joonggu   Total
No. of places                5         7         8         5         8         33
No. of irrelevant reviews    2269      2461      2357      3304      2154      12,545
No. of legitimate reviews    925       1041      1040      1029      936       4971
Ratio of irrelevant reviews  71.0%     70.3%     69.4%     76.3%     69.7%     71.6%
Table 4. Statistics for the reviews of restaurants for each district in Ulsan.
                                     Namgu     Dongu     Bukgu     Ulju      Joonggu   Average
No. of tokens                        68,333    67,990    73,016    24,218    61,000    58,911
No. of unique tokens                 1637      1633      1596      656       1523      1409
Maximum no. of tokens in a sentence  39        37        38        33        41        38
Average no. of tokens in a sentence  18.0      17.4      18.3      15.4      17.1      17.2
Ratio of top-5 frequent words        10.6%     9.8%      11.8%     10.7%     11.3%     10.9%
Table 5. Statistics for the reviews of tourist attractions for each district in Ulsan.
                                     Namgu     Dongu     Bukgu     Ulju      Joonggu   Average
No. of tokens                        56,077    61,179    56,264    77,209    54,535    61,053
No. of unique tokens                 1465      1569      1560      2159      1643      1679
Maximum no. of tokens in a sentence  38        36        36        37        38        37
Average no. of tokens in a sentence  18.6      18.5      17.6      18.8      18.6      18.4
Ratio of top-5 frequent words        7.5%      9.4%      7.8%      7.2%      8.5%      8.1%
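The district-level statistics in Tables 4 and 5 can be reproduced from tokenized reviews with a few lines of code. The sketch below is illustrative only: the `reviews` list is a hypothetical stand-in, and the paper tokenizes Korean text with a cohesion-based extraction method [56] rather than the pre-split English tokens shown here.

```python
# Sketch: compute the per-district review statistics reported in Tables 4 and 5.
from collections import Counter

# Hypothetical tokenized reviews; one inner list per review sentence.
reviews = [
    ["daewangam", "park", "walk", "sea", "view"],
    ["daewangam", "sunrise", "rock", "sea"],
    ["cafe", "near", "daewangam", "park"],
]

tokens = [tok for review in reviews for tok in review]
counts = Counter(tokens)

n_tokens = len(tokens)                                    # "No. of tokens"
n_unique = len(counts)                                    # "No. of unique tokens"
max_len = max(len(r) for r in reviews)                    # "Maximum no. of tokens in a sentence"
avg_len = n_tokens / len(reviews)                         # "Average no. of tokens in a sentence"
top5_ratio = sum(c for _, c in counts.most_common(5)) / n_tokens  # "Ratio of top-5 frequent words"

print(n_tokens, n_unique, max_len, round(avg_len, 1), f"{top5_ratio:.1%}")
```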
Table 6. Mean and standard deviation of the models’ precision, obtained from 25 review data sets for the restaurants in Ulsan, where the ratio of irrelevant reviews is about 53.7%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.965 ± 0.026  0.966 ± 0.018  0.967 ± 0.017  0.962 ± 0.020  0.963 ± 0.020
RandomOverSampler    0.951 ± 0.023  0.910 ± 0.099  0.946 ± 0.052  0.965 ± 0.019  0.943 ± 0.048
SMOTE                0.955 ± 0.024  0.911 ± 0.096  0.950 ± 0.041  0.962 ± 0.019  0.944 ± 0.045
Borderline-SMOTE     0.953 ± 0.024  0.912 ± 0.087  0.941 ± 0.059  0.963 ± 0.019  0.942 ± 0.047
ADASYN               0.954 ± 0.026  0.897 ± 0.099  0.943 ± 0.053  0.966 ± 0.022  0.940 ± 0.050
Average              0.954 ± 0.025  0.919 ± 0.080  0.949 ± 0.044  0.963 ± 0.020  0.946 ± 0.042
Table 7. Mean and standard deviation of the models’ precision obtained from 33 review data sets for the tourist attractions in Ulsan, where the ratio of irrelevant reviews is about 71.6%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.960 ± 0.017  0.972 ± 0.012  0.969 ± 0.014  0.968 ± 0.012  0.967 ± 0.014
RandomOverSampler    0.958 ± 0.016  0.960 ± 0.021  0.971 ± 0.009  0.969 ± 0.012  0.965 ± 0.014
SMOTE                0.960 ± 0.019  0.968 ± 0.017  0.975 ± 0.008  0.968 ± 0.012  0.968 ± 0.014
Borderline-SMOTE     0.964 ± 0.012  0.970 ± 0.014  0.976 ± 0.009  0.968 ± 0.015  0.969 ± 0.013
ADASYN               0.962 ± 0.022  0.960 ± 0.039  0.974 ± 0.013  0.967 ± 0.013  0.966 ± 0.022
Average              0.961 ± 0.017  0.966 ± 0.021  0.973 ± 0.010  0.968 ± 0.005  0.967 ± 0.015
Table 8. Mean and standard deviation of the models’ recall obtained from 25 review data sets for the restaurants in Ulsan, where the ratio of irrelevant reviews is about 53.7%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.953 ± 0.033  0.975 ± 0.019  0.974 ± 0.019  0.969 ± 0.023  0.968 ± 0.023
RandomOverSampler    0.963 ± 0.021  0.953 ± 0.038  0.967 ± 0.031  0.968 ± 0.028  0.963 ± 0.030
SMOTE                0.963 ± 0.025  0.949 ± 0.033  0.966 ± 0.029  0.973 ± 0.022  0.963 ± 0.027
Borderline-SMOTE     0.960 ± 0.025  0.952 ± 0.037  0.969 ± 0.029  0.972 ± 0.023  0.963 ± 0.029
ADASYN               0.953 ± 0.029  0.949 ± 0.043  0.965 ± 0.032  0.964 ± 0.030  0.958 ± 0.034
Average              0.959 ± 0.027  0.956 ± 0.034  0.968 ± 0.028  0.969 ± 0.025  0.963 ± 0.028
Table 9. Mean and standard deviation of the models’ recall obtained from 33 review data sets for the tourist attractions in Ulsan, where the ratio of irrelevant reviews is about 71.6%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.931 ± 0.013  0.951 ± 0.015  0.952 ± 0.012  0.956 ± 0.013  0.948 ± 0.013
RandomOverSampler    0.941 ± 0.026  0.949 ± 0.028  0.948 ± 0.022  0.955 ± 0.015  0.948 ± 0.023
SMOTE                0.835 ± 0.058  0.894 ± 0.061  0.873 ± 0.055  0.961 ± 0.010  0.891 ± 0.046
Borderline-SMOTE     0.831 ± 0.049  0.890 ± 0.052  0.864 ± 0.048  0.957 ± 0.013  0.885 ± 0.040
ADASYN               0.825 ± 0.047  0.887 ± 0.048  0.860 ± 0.040  0.960 ± 0.011  0.883 ± 0.037
Average              0.873 ± 0.039  0.914 ± 0.041  0.900 ± 0.035  0.958 ± 0.012  0.911 ± 0.032
Table 10. Mean and standard deviation of the models’ F1 scores obtained from 25 review data sets for the restaurants in Ulsan, where the ratio of irrelevant reviews is about 53.7%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.954 ± 0.027  0.970 ± 0.016  0.970 ± 0.017  0.966 ± 0.020  0.965 ± 0.020
RandomOverSampler    0.957 ± 0.021  0.927 ± 0.054  0.955 ± 0.035  0.966 ± 0.022  0.951 ± 0.033
SMOTE                0.959 ± 0.024  0.926 ± 0.053  0.957 ± 0.030  0.967 ± 0.020  0.952 ± 0.032
Borderline-SMOTE     0.957 ± 0.024  0.928 ± 0.046  0.954 ± 0.037  0.968 ± 0.020  0.952 ± 0.032
ADASYN               0.954 ± 0.027  0.914 ± 0.062  0.953 ± 0.035  0.965 ± 0.025  0.946 ± 0.037
Average              0.956 ± 0.024  0.933 ± 0.046  0.958 ± 0.031  0.966 ± 0.021  0.953 ± 0.031
Table 11. Mean and standard deviation of the models’ F1 scores, obtained from 33 review data sets for the tourist attractions in Ulsan, where the ratio of irrelevant reviews is about 71.6%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.945 ± 0.011  0.961 ± 0.013  0.961 ± 0.012  0.962 ± 0.011  0.957 ± 0.012
RandomOverSampler    0.949 ± 0.014  0.954 ± 0.012  0.959 ± 0.013  0.962 ± 0.011  0.956 ± 0.013
SMOTE                0.892 ± 0.034  0.928 ± 0.033  0.921 ± 0.034  0.964 ± 0.010  0.926 ± 0.028
Borderline-SMOTE     0.891 ± 0.029  0.927 ± 0.030  0.916 ± 0.031  0.962 ± 0.011  0.924 ± 0.025
ADASYN               0.888 ± 0.029  0.920 ± 0.023  0.913 ± 0.025  0.963 ± 0.010  0.921 ± 0.022
Average              0.913 ± 0.024  0.938 ± 0.022  0.934 ± 0.023  0.963 ± 0.010  0.937 ± 0.020
Table 12. Mean and standard deviation of the models’ balanced accuracy, obtained from 25 review data sets for the restaurants in Ulsan, where the ratio of irrelevant reviews is about 53.7%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.950 ± 0.030  0.967 ± 0.019  0.967 ± 0.019  0.961 ± 0.025  0.961 ± 0.023
RandomOverSampler    0.952 ± 0.025  0.903 ± 0.094  0.946 ± 0.050  0.962 ± 0.025  0.941 ± 0.048
SMOTE                0.954 ± 0.027  0.903 ± 0.092  0.949 ± 0.040  0.963 ± 0.023  0.942 ± 0.045
Borderline-SMOTE     0.951 ± 0.027  0.907 ± 0.079  0.942 ± 0.054  0.963 ± 0.023  0.941 ± 0.046
ADASYN               0.947 ± 0.030  0.880 ± 0.113  0.941 ± 0.048  0.959 ± 0.029  0.932 ± 0.055
Average              0.951 ± 0.028  0.912 ± 0.079  0.949 ± 0.042  0.962 ± 0.025  0.943 ± 0.043
Table 13. Mean and standard deviation of the models’ balanced accuracy, obtained from 33 review data sets for the tourist attractions in Ulsan, where the ratio of irrelevant reviews is about 71.6%.
                     RNN            LSTM           GRU            BERT           Average
Original data        0.916 ± 0.028  0.940 ± 0.027  0.936 ± 0.029  0.938 ± 0.026  0.933 ± 0.027
RandomOverSampler    0.919 ± 0.025  0.928 ± 0.020  0.939 ± 0.020  0.939 ± 0.023  0.931 ± 0.022
SMOTE                0.874 ± 0.035  0.912 ± 0.033  0.908 ± 0.037  0.940 ± 0.024  0.909 ± 0.032
Borderline-SMOTE     0.876 ± 0.029  0.911 ± 0.036  0.905 ± 0.036  0.937 ± 0.027  0.907 ± 0.032
ADASYN               0.871 ± 0.037  0.895 ± 0.043  0.901 ± 0.033  0.937 ± 0.025  0.901 ± 0.034
Average              0.891 ± 0.031  0.917 ± 0.032  0.918 ± 0.031  0.938 ± 0.025  0.916 ± 0.030
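As a final illustration, the sketch below shows how one mean ± standard deviation entry in Tables 6–13 could be aggregated from per-data-set scores, together with a Student's t-based confidence interval [60]. The `scores` array is a randomly generated stand-in for 25 per-data-set balanced-accuracy values, not the paper's measurements, and the small label/prediction arrays are hypothetical.

```python
# Sketch: compute balanced accuracy for one data set, then aggregate one metric
# across many data sets as mean ± standard deviation with a t-based 95% CI.
import numpy as np
from scipy import stats
from sklearn.metrics import balanced_accuracy_score

# One hypothetical data set: true labels (1 = irrelevant review) and predictions.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 1])
print(balanced_accuracy_score(y_true, y_pred))  # mean of per-class recalls

# Hypothetical stand-in for 25 per-data-set scores of one oversampler/model pair.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.94, scale=0.03, size=25)

mean, std = scores.mean(), scores.std(ddof=1)
ci = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=stats.sem(scores))
print(f"{mean:.3f} ± {std:.3f} (95% t-CI: {ci[0]:.3f} to {ci[1]:.3f})")
```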
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
