Article
Peer-Review Record

ERF-XGB: Ensemble Random Forest-Based XG Boost for Accurate Prediction and Classification of E-Commerce Product Review

Sustainability 2023, 15(9), 7076; https://doi.org/10.3390/su15097076
by Daniyal M. Alghazzawi, Anser Ghazal Ali Alquraishee, Sahar K. Badri and Syed Hamid Hasan *
Reviewer 1:
Reviewer 2:
Submission received: 8 February 2023 / Revised: 11 April 2023 / Accepted: 12 April 2023 / Published: 23 April 2023

Round 1

Reviewer 1 Report

The presentation of the method looks very good - I congratulate the authors on a good job. However, I have very serious comments on what is inside.

 

I've tried to look at the dataset source [25], and it leads to another article, where the authors state:

"We extract up-to-date movie data (movie name, Metascore, Rating, year, votes, and gross income) from IMDB movie site. This is an ethical (grey area) data extraction process from the internet, which is more accurate and reliable than collecting data from a third party."

 

So there is no fixed dataset; thus, if you take the data directly online, please provide information such as the time interval of collection, so that your results can be reproduced. If I am wrong, just provide the link to the data instead of an indirect article.

 

In both datasets there are only "positive" and "negative" categories; you state this and I have also checked it. So the target classification is binary. I have questions, then:

a. Why does your method support three categories (+neutral, see Figure 1, and there are comments such as "the sentiments of online reviews are classified using the proposed ERF-XGB approach into three categories as positive, negative and neutral sentiments")?

b. If your method supports three categories, then we have no idea whether it works, simply because no three-category dataset was used in the current article. Maybe it simply won't work?

Discuss the topic of the number of categories explicitly, please.

 

I don't see the dataset being split into training and testing parts, and the proportions are not provided either, so I assume that the metrics are calculated on the same data the methods are trained on. There are examples on the internet where the results are better. Let's take a look at this amateur Kaggle research:

https://www.kaggle.com/code/thikhuyenle/combination-of-cnn-lstm-for-text-classification

There are some key points in it:

1. It uses 50,000 IMDB reviews instead of 25,000.

2. The accuracy reaches 97%.

3. The author also splits the dataset into testing and validation sets (which is often omitted).

 

I am absolutely sure that for practical purposes the accuracy on the training part is not representative at all. If you want to convince me that your method works, provide the performance metrics on the testing part: e.g. 30% of the data must not be involved in the training process and must be used for testing (metric calculation) purposes only. My doubts would be gone if you provided training and validation accuracy curves, where the X axis is the number of iterations of the optimization method, similarly to that amateur Kaggle mini-research.

 

The selection of methods used in this article is very strange, and I have these questions:

1. You use a modified XGBoost, so why didn't you compare it to the original XGBoost?

2. In many studies (e.g. https://informatica.vu.lt/journal/INFORMATICA/article/1257/info) the LGBM classifier gives better results; this method is close to XGBoost but newer. Thus, this 5-year-old but still state-of-the-art method was not compared against.

3. Simple (or regularized) random forests with appropriate parameters sometimes perform surprisingly well (e.g. https://link.springer.com/article/10.1007/s11042-020-09658-z), and you do not include them in your article. That is, you omit an important, well-known method that usually shows how well your technique performs compared to a well-working classical algorithm.

 

Also, another important note: as we can see from the mentioned article https://informatica.vu.lt/journal/INFORMATICA/article/1257/info, the hyperparameters are very important. There is a high chance that your dataset works badly with the default hyperparameters of some methods, so you absolutely must check different hyperparameters, i.e. perform hyperparameter tuning. Only then can we talk about your method's superiority over other state-of-the-art methods. Moreover, you do tune parameters for your own method, so it simply looks like a dishonest comparison.

 

There is a chance that my comments can be adequately addressed, so I am looking forward to the revised version. A major revision is necessary in this case.

Author Response

The presentation of the method looks very good - I congratulate the authors on a good job. However, I have very serious comments on what is inside.

 Response:

I've tried to look at the dataset source [25], and it leads to another article, where the authors state:

"We extract up-to-date movie data (movie name, Metascore, Rating, year, votes, and gross income) from the IMDB movie site. This is an ethical (grey area) data extraction process from the internet, which is more accurate and reliable than collecting data from a third party."

Response: Thanks for the comment. The dataset is a collection of 50,000 reviews from IMDB, with no more than 30 reviews per movie. It contains an equal number of positive and negative reviews, so random guessing yields 50% accuracy. We split the dataset equally into training and test sets; the training set is the same 25,000 labeled reviews used to induce word vectors with our model. Neutral reviews were not included in the dataset.

https://ai.stanford.edu/~amaas/data/sentiment/

https://github.com/hiDaDeng/cnsenti/tree/master/cnsenti
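
For reference, a minimal sketch of how this public corpus can be loaded is given below. It assumes the archive from the Stanford link above (aclImdb_v1.tar.gz) has been downloaded and extracted; the function name and paths are illustrative, not taken from the manuscript.

```python
import os

def load_imdb_split(root, split):
    """Read reviews and binary labels from aclImdb/<split>/{pos,neg}."""
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(root, split, label_name)
        for fname in sorted(os.listdir(folder)):
            with open(os.path.join(folder, fname), encoding="utf-8") as fh:
                texts.append(fh.read())
            labels.append(label)
    return texts, labels

# The archive ships with a fixed 25,000/25,000 train/test split and only
# positive/negative labels (no neutral class).
train_texts, train_labels = load_imdb_split("aclImdb", "train")
test_texts, test_labels = load_imdb_split("aclImdb", "test")
```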

 

 

So there is no fixed dataset; thus, if you take the data directly online, please provide information such as the time interval of collection, so that your results can be reproduced. If I am wrong, just provide the link to the data instead of an indirect article.

Response: We agree and the dataset links are provided below.

 https://ai.stanford.edu/~amaas/data/sentiment/

https://github.com/hiDaDeng/cnsenti/tree/master/cnsenti

 

In both datasets there are only "positive" and "negative" categories; you state this and I have also checked it. So the target classification is binary. I have questions, then:

 

 

  a. Why does your method support three categories (+neutral, see Figure 1, and there are comments such as "the sentiments of online reviews are classified using the proposed ERF-XGB approach into three categories as positive, negative and neutral sentiments")?

Response: Thanks for the comment. The proposed ERF-XGB approach performs binary sentiment classification, i.e. positive versus negative. No multiclass classification is utilized here; we have only used binary classification.

  b. If your method supports three categories, then we have no idea whether it works, simply because no three-category dataset was used in the current article. Maybe it simply won't work?

Response: Sorry for the mistake. The proposed method supports only two categories, which is now noted in Section 4.2 (Dataset Description).

Discuss the topic of the number of categories explicitly, please.

Response: Our aim is to develop a web application that embeds a machine learning model and provides an analysis of user reviews for a particular product. It shows the positive and negative polarity of reviews for the searched product, which will be helpful for users. In this application, when a user searches for a product, review data is collected from e-commerce websites and passed to the ERF-XGB classifier, which classifies the reviews into positive and negative sentiments based on the features extracted by the model. We show the user the overall positive and negative polarity of reviews for the searched product, as well as how accurately the results were obtained. These results help the user decide about the product.

I don't see the dataset being split into training and testing parts, and the proportions are not provided either, so I assume that the metrics are calculated on the same data the methods are trained on. There are examples on the internet where the results are better. Let's take a look at this amateur Kaggle research:

https://www.kaggle.com/code/thikhuyenle/combination-of-cnn-lstm-for-text-classification

Response: Thanks for the comment. The dataset is split into training and testing sets, where the training set contains 70% of the data and the testing set contains 30% of the data. The source corpus provides 25,000 highly polar movie reviews for training and 25,000 for testing, so the number of positive and negative reviews can be predicted using either classical classification or deep learning algorithms.

There are some key points in it:

  1. It uses 50,000 IMDB reviews instead of 25,000.
  2. The accuracy reaches 97%.
  3. The author also splits the dataset into testing and validation sets (which is often omitted).

Response: Thank you for giving us the chance to revise our manuscript. The dataset is split into a training phase and a testing phase. Details about the dataset are included in Section 4.2 (Dataset Description). Kindly check Table 2 and Figures 3 and 4.

 

 

I am sure that for practical purposes the accuracy on the training part is not representative at all. If you want to convince me that your method works, provide the performance metrics on the testing part: e.g. 30% of the data must not be involved in the training process and must be used for testing (metric calculation) purposes only. My doubts would be gone if you provided training and validation accuracy curves, where the X axis is the number of iterations of the optimization method, similarly to that amateur Kaggle mini-research.

Response: We agree, and in response to the above comment we have corrected this in the revised manuscript. A total of 70% of the data is used for training, while the remaining 30% is used for testing.
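
As a point of reference only (not the authors' code), a 70/30 hold-out split of this kind is commonly produced as in the sketch below; the variable names and toy data are placeholders.

```python
from sklearn.model_selection import train_test_split

# Toy placeholder data standing in for the review texts and their
# positive/negative labels.
reviews = ["great movie"] * 50 + ["terrible movie"] * 50
labels = [1] * 50 + [0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels,
    test_size=0.30,      # 30% held out strictly for metric computation
    stratify=labels,     # keep the positive/negative balance in both parts
    random_state=42,
)
```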

 

The selection of methods used in this article is very strange, and I have these questions:

  1. You use a modified XGBoost, so why didn't you compare it to the original XGBoost?

Response: Thanks for the comment. The comparative analysis with XGBoost is given in Table 2 of the revised manuscript.

  2. In many studies (e.g. https://informatica.vu.lt/journal/INFORMATICA/article/1257/info) the LGBM classifier gives better results; this method is close to XGBoost but newer. Thus, this 5-year-old but still state-of-the-art method was not compared against.

Response: Thanks for the comment. The comparison with existing techniques is added in Section 4 (Experimental Results and Discussions). Kindly refer to Figures 3 and 4.

 

  3. Simple (or regularized) random forests with appropriate parameters sometimes perform surprisingly well (e.g. https://link.springer.com/article/10.1007/s11042-020-09658-z), and you do not include them in your article. That is, you omit an important, well-known method that usually shows how well your technique performs compared to a well-working classical algorithm.

Response: We agree. In response to the above comment, the proposed model is compared with standard and regularized random forest classifiers, and the results are provided in Section 4.5 of the revised manuscript.

Also, another important note: as we can see from the mentioned article https://informatica.vu.lt/journal/INFORMATICA/article/1257/info, the hyperparameters are very important. There is a high chance that your dataset works badly with the default hyperparameters of some methods, so you must necessarily check different hyperparameters, i.e. perform hyperparameter tuning. Only then can we talk about your method's superiority over other state-of-the-art methods. Moreover, you do parameter tuning for your method - it simply looks like a dishonest comparison.

Response: Thanks for the comment. The hyperparameters and their values are added in Section 4.4 (Hyperparameter Configuration), Table 3:

Parameter                               Value
Base learner                            Tree
Gamma                                   0
Learning rate                           0.03
Number of pruning after control         0.2
Regularization                          L2
Tree depth                              4
Random sampling decision tree ratio     0.7
Minimum leaf node sample weight         2
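
As an illustration only, the sketch below shows one plausible way these values could map onto standard XGBoost hyperparameters. The mapping of "random sampling decision tree ratio" to `subsample` and "minimum leaf node sample weight" to `min_child_weight` is our assumption, and "number of pruning after control" has no obvious XGBoost counterpart, so it is omitted.

```python
from xgboost import XGBClassifier

# Assumed mapping of the reported settings onto XGBoost's scikit-learn API.
clf = XGBClassifier(
    booster="gbtree",          # base learner: tree
    gamma=0,                   # minimum loss reduction required for a split
    learning_rate=0.03,
    max_depth=4,               # tree depth
    subsample=0.7,             # random sampling ratio per tree (assumed mapping)
    min_child_weight=2,        # minimum leaf node sample weight
    reg_lambda=1.0,            # L2 regularization (strength not given in the table)
    objective="binary:logistic",
)
```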

 

 

There is a chance that my comments can be adequately addressed, so I am looking forward to the revised version. A major revision is necessary in this case.

Reviewer 2 Report

The methodology, mainly Figure 2, needs further explanation. Other improvements can be made based on the comments on the paper.

Comments for author File: Comments.pdf

Author Response

We first thank the reviewers for the time taken to analyze this manuscript and for their valuable feedback. Sorry for the mistake. An explanation of the proposed flowchart has been added in Section 3 (Proposed Methodology) above Figure 2. Figure 2 represents the flow diagram of the proposed ERF-XGB approach. The input dataset is sampled randomly, and the ensemble RF and XGB parameters are initialized for bootstrap sampling. An RF classifier is constructed and trained on each bootstrap sample, and its outputs are combined by majority voting. The capability of the model is then improved on further instances via XGBoost. If the minimum loss value is not obtained, the model is trained again until the loss value is minimized. The ERF-XGB model predicts the polarity of the input as negative or positive. In response to the above comment, the improvements are made as per the reviewer's suggestions and the instructions provided in the attached PDF file. The added references are listed in the References section (refer to references 26-30).
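
To make the described flow concrete, a rough sketch of one possible reading is given below: a bootstrap-trained random forest whose positive-class vote share is appended as a feature for an XGBoost model that keeps adding boosting rounds until the validation loss stops improving. This is an illustration of the described steps under our own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def fit_erf_xgb_like(X_train, y_train, X_valid, y_valid):
    # Random forest trained on bootstrap samples; its predictions are combined
    # internally by majority voting over the trees.
    rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0)
    rf.fit(X_train, y_train)

    # Append the forest's positive-class vote share as an extra feature.
    aug_train = np.column_stack([X_train, rf.predict_proba(X_train)[:, 1]])
    aug_valid = np.column_stack([X_valid, rf.predict_proba(X_valid)[:, 1]])

    # Boosting continues until the validation log-loss stops improving.
    xgb = XGBClassifier(n_estimators=400, learning_rate=0.03, max_depth=4,
                        eval_metric="logloss", early_stopping_rounds=20)
    xgb.fit(aug_train, y_train, eval_set=[(aug_valid, y_valid)], verbose=False)
    return rf, xgb
```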

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

My first note is about English ... I usually do not pay much attention to language issues, but here I see "ProposedMethodology" in a caption, then there is "Harris hawk’s" (Hawk is a name), then I saw "system ERF-XGB gives high accuracy than others", and I assume that there are more such drawbacks - the authors need to polish their text. By the way, you classify ERF-XGB as an algorithm, so I don't know why you have used the term "system" here.

 

About the comparison with standard XGBoost: do both methods (the original and yours) use the same set of hyperparameters? Otherwise, it is not clear which parameters were used; if you want an honest comparison, hyperparameter tuning for the original XGBoost must be performed separately.

 

From Figure 5 it can be clearly seen that most methods have not converged. Thus, what is the final iteration count? I can't see any information on that. If you limit all your calculations to 50 iterations, you must state that explicitly; however, the motivation for doing so is unclear.

 

Also, you have ignored my comment: "My doubts would be gone if you would provide train and validation accuracy curves, where X would be a number of iterations of the optimization method, similarly as it is done in that amateur Kaggle mini-research." I don't see any reason not to show readers these informative figures for your best method (check https://www.kaggle.com/code/thikhuyenle/combination-of-cnn-lstm-for-text-classification); readers would really appreciate seeing when the actual errors begin to rise, i.e. when the overfitting effect happens. It is very interesting, especially because your method shows high accuracy, i.e. overfitting is expected to be small.

Author Response

My first note is about English ... I usually do not pay much attention to language issues, but here I see "ProposedMethodology" in a caption, then there is "Harris hawk’s" (Hawk is a name), then I saw "system ERF-XGB gives high accuracy than others", and I assume that there are more such drawbacks - the authors need to polish their text. By the way, you classify ERF-XGB as an algorithm, so I don't know why you have used the term "system" here.

Response: As per the comment, we have corrected the mentioned errors and revised the language throughout the paper.

 

About the comparison with standard XGBoost: do both methods (the original and yours) use the same set of hyperparameters? Otherwise, it is not clear which parameters were used; if you want an honest comparison, hyperparameter tuning for the original XGBoost must be performed separately.

Response: Thank you for pointing this out. To clarify, we have not used the same set of hyperparameters. We have used different hyperparameters for XGBoost and ERF-XGB, as the two models take different hyperparameters and the same settings would not be optimal for both. Moreover, the choice of hyperparameters can significantly impact model performance; tuning involves finding the combination of hyperparameters that gives optimal performance for each model.
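
For the record, a separately tuned plain-XGBoost baseline of the kind the reviewer asks for could be obtained roughly as sketched below; the grid is illustrative, not the one used in the paper, and X_train/y_train denote the 70% training partition.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative search grid for the plain XGBoost baseline.
param_grid = {
    "max_depth": [3, 4, 6],
    "learning_rate": [0.01, 0.03, 0.1],
    "subsample": [0.7, 1.0],
    "min_child_weight": [1, 2, 5],
}

search = GridSearchCV(
    estimator=XGBClassifier(objective="binary:logistic", n_estimators=300),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
# search.fit(X_train, y_train)   # tune on the training partition only
# print(search.best_params_, search.best_score_)
```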

From Figure 5 it can be clearly seen that most methods have not converged. Thus, what is the final iteration count? I can't see any information on that. If you limit all your calculations to 50 iterations, you must state that explicitly; however, the motivation for doing so is unclear.

Response: Thank you for the comment. In this work, the number of iterations ranges from 50 to 400. The revised figure is updated in Section 4, Figure 5.

The motivation of the work is to address challenges related to sentiment analysis of e-commerce product reviews. E-commerce platforms rely heavily on product reviews to understand customer opinions and preferences, and sentiment analysis provides a means to automatically extract and classify the sentiment expressed in these reviews. However, existing sentiment analysis techniques often face challenges such as polysemy, disambiguation, and dimension mapping of words, which can lead to inaccurate classification of sentiments. The proposed approach, Ensemble Random Forest-based XGBoost (ERF-XGB), aims to overcome these challenges and enhance the accuracy of sentiment polarity classification in product reviews. The approach utilizes a combination of ensemble learning, random forest, and XGBoost algorithms to improve the accuracy of sentiment classification. Additionally, the Harris hawks optimization (HHO) algorithm is used to select the most relevant features from the datasets, which further enhances the accuracy of the classification.

Also, you have ignored my comment "My doubts would be gone if you would provide train and validation accuracy curves, where X would be a number of iterations of the optimization method, similarly as it is done in that amateur Kaggle mini-research." I don't see any reasons not to show readers these informative figures for your best method (check https://www.kaggle.com/code/thikhuyenle/combination-of-cnn-lstm-for-text-classification), readers would really appreciate seeing when the actual errors begin to raise, i.e. when the overfitting effect happens - it is very interesting, especially because your method shows high accuracy, i.e. overfitting is expected to be small.

Response: Thank you. As per the comment, we have added the performance analysis of accuracy for testing and validation over different iterations in Figure 5 in Section 4, which is marked in red.
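
For completeness, a sketch of how such train/validation accuracy curves are typically produced from XGBoost's evaluation history is given below; the function and data names are placeholders, not the manuscript's code.

```python
import matplotlib.pyplot as plt
from xgboost import XGBClassifier

def plot_accuracy_curves(X_train, y_train, X_valid, y_valid, n_rounds=400):
    """Plot training/validation accuracy versus the boosting iteration."""
    model = XGBClassifier(n_estimators=n_rounds, learning_rate=0.03,
                          max_depth=4, eval_metric="error")  # error = 1 - accuracy
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_valid, y_valid)], verbose=False)
    history = model.evals_result()
    plt.plot([1 - e for e in history["validation_0"]["error"]], label="train")
    plt.plot([1 - e for e in history["validation_1"]["error"]], label="validation")
    plt.xlabel("boosting iteration")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```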

Round 3

Reviewer 1 Report

The response was adequate; however, I have two more comments:

1. You still haven't mentioned anywhere explicitly that you perform binary classification - it is a game changer and an important key term. The reader must see it in the abstract and in the comments about your methodology and Figure 1.

2. "Accuracy for Testing and validation for different Iterations" Figure looks very strange for any specialist in ML field. First of all, it is clearly seen that accuracy haven't started to drop, thus, there was no reason to stop iterating. Secondly, the gap between validation and training curves becomes smaller as iterations number increases - a very non-standard and unexpected behaviour. Thirdly, there is strange jump of accuracy at the end of the validation curve.

As for (2), I suggest the authors search for mistakes, either in the code or in some other technical part of their work, and provide correct results. In any case, from the provided graph there was no reason to stop iterating, so at the least I suggest continuing in order to see the actual performance of your model. Since there are no signs of the accuracy reaching its peak, you haven't properly measured your model, and the research is unfinished in its current state. As a reviewer, within the time frame that MDPI gives I do not have enough time to reproduce the results, but I would be eager to, because from what I have seen I believe there is a high chance of a mistake somewhere.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
