Article
Peer-Review Record

Improving Minority Class Recall through a Novel Cluster-Based Oversampling Technique

Informatics 2024, 11(2), 35; https://doi.org/10.3390/informatics11020035
by Takorn Prexawanprasut and Thepparit Banditwattanawong *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 28 February 2024 / Revised: 19 April 2024 / Accepted: 20 May 2024 / Published: 28 May 2024
(This article belongs to the Section Machine Learning)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes a method for improving minority class recall in imbalanced datasets.

While problems associated with imbalanced datasets are of concern in machine learning, the method proposed in this paper to address the problem is not adequately described. 

The presentation is poor, especially in Sections 1 and 2, where the authors have used inappropriate synonyms that do not suit the context of a technical discussion. For example, words such as "engender", "enmeshing", "intricate", "delineate", and "nuanced" do not fit well in the context of the paper.

In lines 73-76, it is stated that the claimed benefit of using the proposed method is "contingent upon specific parameter settings whose generalizability is yet to be comprehensively established". It is not clear from the paper what these specific parameter settings are.

The details of the implementation are not sufficient. Why is the cluster with the highest G-mean identified/selected (as mentioned in line 284)?
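For readers unfamiliar with the metric this comment refers to, the geometric mean (G-mean) of per-class recall is a standard score that is high only when a model performs well on every class simultaneously. Below is a minimal illustrative sketch; the labels and predictions are hypothetical and are not taken from the paper under review.

```python
import math

def g_mean(y_true, y_pred, classes=(0, 1)):
    """Geometric mean of per-class recall (sensitivity and specificity
    in the binary case). High only when all classes are recalled well."""
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total if total else 0.0)
    return math.prod(recalls) ** (1 / len(recalls))

# Hypothetical labels/predictions for the members of one cluster:
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(round(g_mean(y_true, y_pred), 3))  # 0.791
```

Because the geometric mean collapses to zero if any single class has zero recall, it penalizes classifiers that ignore the minority class, which is presumably why a cluster-selection heuristic would favor it over plain accuracy.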

How is the predefined oversampling ratio (delta) specified (line 307)? What value is used in the implementation in this paper?

Line 310: "The algorithm increases the oversampling ratio ..." By what factor is the oversampling ratio increased?

Line 462: It is stated that the information entropy is assessed in the context of misclassified minority instances. Please elaborate and detail the procedure used to compute the information entropy.
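One standard reading of "information entropy" in this context is the Shannon entropy of the class distribution in a region (e.g., the neighborhood around a misclassified minority instance). Whether the paper uses exactly this definition is the point of the reviewer's question; the sketch below only illustrates the conventional computation.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a label multiset: H = -sum p_i*log2(p_i).
    0.0 for a pure region, log2(k) for k equally frequent classes."""
    n = len(labels)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    return h + 0.0  # normalize IEEE -0.0 to 0.0

print(shannon_entropy([0, 0, 0, 0]))  # 0.0  (pure neighborhood)
print(shannon_entropy([0, 1, 0, 1]))  # 1.0  (maximally mixed)
```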

The authors also fail to properly explain why the proposed algorithm works well on certain datasets but does not work well for some datasets.

Much of the attempted explanation is generic. Specific engineering reasoning is missing from much of the discussion.

It is stated in the paper that, for optimal performance, meticulous parameter tuning is required. However, details of the parameter tuning are missing in the paper.

It is also not clear how the synthesized samples are generated for the majority and minority classes. 

Comments on the Quality of English Language

The authors have tried to use synonymous words that are not suited to the context of the paper. The paper contains inappropriate jargon, which makes it appear vague and ambiguous.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper is well written, the structure and organization are correct, the presentation of the proposed method is clear, and the experiments are sufficient. However, there are some issues that could improve the work and that should be addressed before it is accepted for publication. In particular, the following points must be taken into account in the revised version:

1. The Authors should refer to recent clustering-based algorithms that address class imbalance. For instance:

- R. Choudhary, S. Shukla (2021) A clustering based ensemble of weighted kernelized extreme learning machine for class imbalance learning, Expert Systems with Applications, Vol. 164, 114041.

- Z. Xu, D. Shen, T. Nie, Y. Kou, N. Yin, X. Han (2021) A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Information Sciences, Vol. 572, pp. 574-589.

- X.W. Liang, A.P. Jiang, T. Li, Y.Y. Xue, G.T. Wang (2020) LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM. Knowledge-Based Systems, Vol. 196, 105845.

- X. Tao, Q. Li, W. Guo, C. Ren, Q. He, R. Liu, J.-R. Zou (2020) Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Information Sciences, Vol. 519, pp. 43-73.

- Z. Li, M. Huang, G. Liu, C. Jiang (2021) A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Systems with Applications, Vol. 175, 114750.

- A. Guzmán-Ponce, R.M. Valdovinos, J.S. Sánchez, J.R. Marcial-Romero (2020) A new under-sampling method to face class overlap and imbalance. Applied Sciences, Vol. 10, No. 15, 5164.

- E. Elyan, C.F. Moreno-Garcia, C. Jayne (2021) CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification. Neural Computing and Applications, Vol. 33, pp. 2839-2851.

2. The Authors must include a statistical analysis to verify that the differences in the results of the proposed method are significant.
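The statistical analysis the reviewer asks for is typically a paired nonparametric test over per-dataset results (e.g., Wilcoxon signed-rank, or the simpler exact sign test). The sketch below illustrates an exact two-sided sign test; the per-dataset recall differences shown are hypothetical, not results from the paper.

```python
from math import comb

def sign_test_p(deltas):
    """Two-sided exact sign test. Under H0 (no difference between methods),
    each nonzero per-dataset difference is equally likely to be + or -;
    returns the probability of a win/loss split at least this extreme."""
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    wins = sum(1 for d in nonzero if d > 0)
    k = min(wins, n - wins)                      # the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-dataset recall improvements over a baseline:
deltas = [0.04, 0.02, 0.05, 0.01, 0.03, 0.02, -0.01, 0.04]
print(round(sign_test_p(deltas), 4))  # 0.0703
```

The sign test ignores the magnitudes of the differences, so with few datasets a Wilcoxon signed-rank test (which ranks magnitudes) usually has more power; either would address the reviewer's request.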

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors proposed a novel cluster-based oversampling technique to reduce false-negative errors and improve the recall of minority classes without compromising the performance of other classification metrics. The method is interesting.

Comments:

The method utilizes K-means clustering for resampling, and the selection of the cluster number, k, is crucial for the generation of samples. However, the authors only describe the value of k for the first dataset (Delinquency Telecom) in Section 5.1. Additionally, in the conclusion section, the authors acknowledge the sensitivity to factors such as the number of samples, classes, features, and imbalance ratios. Therefore, the authors should provide a more detailed description of the process for selecting the value of k in K-means clustering for each dataset.
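A common way to select k, which the reviewer's request could be answered with, is the elbow method: run K-means for a range of k values and look for the point where the within-cluster sum of squared errors (SSE) stops dropping sharply. The sketch below uses a minimal pure-Python Lloyd's algorithm on synthetic, hypothetical data; it is not the paper's implementation.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, restarts=5, iters=30):
    """Lloyd's k-means; returns the best (lowest) within-cluster SSE
    over several random restarts, for use in an elbow plot."""
    best = float("inf")
    for seed in range(restarts):
        rng = random.Random(seed)
        centers = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:  # assignment step
                clusters[min(range(k), key=lambda j: dist2(p, centers[j]))].append(p)
            centers = [  # update step (keep old center if a cluster empties)
                tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
        sse = sum(min(dist2(p, c) for c in centers) for p in points)
        best = min(best, sse)
    return best

# Three well-separated synthetic blobs (hypothetical data, 20 points each):
rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in blobs for _ in range(20)]

for k in (1, 2, 3, 4):  # SSE drops sharply until k=3, then flattens
    print(k, round(kmeans_sse(points, k), 2))
```

Reporting such an SSE-versus-k curve (or a silhouette score) per dataset would document the selection process the reviewer asks for.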

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have significantly revised the manuscript and addressed most of the comments. However, some suggestions have not been addressed properly in the revised manuscript.

Responding to a comment by stating that "the statement in question is removed in the revised manuscript" is not sufficient.


Comments on the Quality of English Language

English presentation still needs improvement. The readability of the article is poor.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

No additional comments. The Authors have addressed all my suggestions. 

Author Response

Thank you for your thorough review and for acknowledging our efforts in addressing your suggestions. We appreciate your time and valuable feedback. If you have any further comments or concerns in the future, please do not hesitate to let us know. Your insights have been immensely helpful in improving our manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

All comments are well addressed.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have revised the paper significantly and addressed all the major concerns raised in the review report. 

All the language related issues have been fixed. Details of implementation of the proposed technique have been improved and the inclusion of section 5.4 regarding parameter tuning adds clarity to the implementation. 

Details of information entropy have also been improved in the revised manuscript.

There seems to be a typo in the response to comment 6 in the cover letter. I presume the instances labeled as majority and minority should be 24 and 109, respectively (in the illustration provided), rather than 23 and 108. But the point has been made clear and the comment addressed.
