Open Access Article

Peer-Review Record

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

Symmetry 2024, 16(3), 273; https://doi.org/10.3390/sym16030273

by Yiheng Chen ¹, Jinbai Zou ¹,*, Lihai Liu ² and Chuanbo Hu ¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Asma Channa

Reviewer 4: Anonymous

Submission received: 25 January 2024 / Revised: 22 February 2024 / Accepted: 23 February 2024 / Published: 26 February 2024

(This article belongs to the Section Computer)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Journal: symmetry

Manuscript ID: symmetry-2866814

The authors proposed a modified reclassification algorithm based on Gaussian distribution. The minority class samples are reclassified by the KNN algorithm. Different synthesis strategies are selected according to the combination of the minority class samples, and the Gaussian distribution is used to replace the uniform random distribution for the interpolation operation under certain classification conditions, to reduce the possibility of generating noise samples. The manuscript raises the following concerns:

1. Abstract should be revised to show the motivation behind this research. Some quantitative achievement are also required to show in the abstract section.
2. What are the real-world problems where oversampling has a significant role and how the proposed model will solve such problems?
3. Introduction section is missing some important discussion on oversampling in context with real world problems and datasets. The author should link the applications with some image and tabular datasets.
4. In the algorithm 1 steps 3 and 13 are repeating without any justification. Further, the input is not being given the algorithm properly and also how the required output is returned by the algorithm.
5. The algorithms show the simple steps of comparison and increments, what are the novel contributions of the authors and in which algorithm?
6. Sub figures in Figure 2 are needed to be labelled horizontally and vertically. Also how the proposed algorithm shows better achievement.
7. What are the limitations and the overhead of the proposed strategy and how it will be covered in future work?
8. It is not mentioned that how much dataset is divided into training and testing. How the authors validated the results like using cross validation or any other technique?
9. Some important performance evaluation parameters like sensitivity and specificity are missing.
10. Discussion should be further extended with some real-world applications where the proposed approach should be useful.
11. What are the ethical aspects of the proposed approach in context with oversampling?
12. Future work should be extended with some more future directions.

Comments on the Quality of English Language

Minor editing of English language required

Author Response

Thanks very much for taking your time to review this manuscript. I really appreciate all your comments and suggestions! Please find my itemized responses below and my revisions in the re-submitted files.

Comments 1: Abstract should be revised to show the motivation behind this research. Some quantitative achievement are also required to show in the abstract section.

Response 1: Thank you for pointing this out. We have revised the abstract accordingly and added the quantitative achievements to it.

Comments 2: What are the real-world problems where oversampling has a significant role and how the proposed model will solve such problems?

Comments 3: Introduction section is missing some important discussion on oversampling in context with real world problems and datasets. The author should link the applications with some image and tabular datasets.

Response 2 and 3: Thanks for pointing this out. We have added detailed information on these points to the introduction. Starting from the prevalence of imbalanced datasets in the real world, we give examples from application scenarios such as fault diagnosis and medical analysis, and we emphasize that dataset imbalance significantly affects the accuracy of artificial intelligence models in data analysis. The introduction further highlights oversampling as a data-based improvement method that is currently widely applied and researched. In short, oversampling can ameliorate the imbalance of the datasets, and training an artificial intelligence model on the optimized datasets can greatly improve its accuracy on diagnosis and prediction problems. The relevant content has been expanded in Section 1 on page 2.

Comments 4: In the algorithm 1 steps 3 and 13 are repeating without any justification. Further, the input is not being given the algorithm properly and also how the required output is returned by the algorithm.

Response 4: Thank you for pointing this out. The first loop in Algorithm 1 traverses all minority samples and obtains the k nearest neighbors of each one. The second loop classifies each minority sample by comparing the ratio of majority-class to minority-class samples among its k nearest neighbors. The output, which is produced in the second loop, consists of the noise dataset, the danger dataset, and the safe dataset.
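
The two loops described in Response 4 can be sketched roughly as follows. This is a minimal illustration rather than the authors' Algorithm 1: the record does not give the thresholds that separate the three groups, so the rules used here (all k neighbors in the majority class means noise, at least half means danger, otherwise safe) are an assumption borrowed from Borderline-SMOTE-style methods, and the function names are hypothetical.

```python
import math

def k_nearest(index, points, k):
    """Indices of the k nearest neighbours of points[index] (Euclidean distance)."""
    dists = [(math.dist(points[index], p), j)
             for j, p in enumerate(points) if j != index]
    dists.sort()
    return [j for _, j in dists[:k]]

def classify_minority(points, labels, minority_label, k=5):
    """Split minority samples into noise / danger / safe index lists."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    # First loop: collect the k nearest neighbours of every minority sample.
    neighbours = {i: k_nearest(i, points, k) for i in minority}
    noise, danger, safe = [], [], []
    # Second loop: compare majority and minority counts among those neighbours.
    for i in minority:
        n_major = sum(1 for j in neighbours[i] if labels[j] != minority_label)
        if n_major == k:          # assumed rule: every neighbour is majority
            noise.append(i)
        elif n_major >= k / 2:    # assumed rule: at least half are majority
            danger.append(i)
        else:                     # mostly minority neighbours
            safe.append(i)
    return noise, danger, safe
```

The inputs are a list of feature tuples, a parallel list of class labels, and the label of the minority class; the three returned lists hold indices into the input.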

Comments 5: The algorithms show the simple steps of comparison and increments, what are the novel contributions of the authors and in which algorithm?

Response 5: Thank you for pointing this out. The novel contributions of this improved algorithm can be seen by comparison with the SMOTE method mentioned earlier. The steps themselves are the contributions: first, the minority class samples are reclassified. Then, based on the types determined by this reclassification, a Gaussian distribution is used instead of a uniform distribution when synthesizing data by interpolation among the minority class samples. The synthetic data generated in this way are more effective and reliable.
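
To make the difference between the two interpolation schemes concrete, the sketch below contrasts a uniform interpolation coefficient, as in classic SMOTE, with a Gaussian one. It is only an illustration under stated assumptions: the record does not say how the Gaussian is parameterized or how the paper's standard deviation parameters enter the formula, so the mean `mu`, the spread `sigma`, their default values, and the clipping to [0, 1] are all hypothetical choices.

```python
import random

def interpolate(x, neighbour, gap):
    """SMOTE-style synthesis: a point on the segment from x towards neighbour."""
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbour))

def uniform_gap(rng):
    """Classic SMOTE: coefficient drawn uniformly from [0, 1)."""
    return rng.random()

def gaussian_gap(rng, mu=0.5, sigma=0.15):
    """Assumed variant: coefficient drawn from N(mu, sigma), clipped to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(mu, sigma)))

rng = random.Random(0)            # fixed seed so the sketch is repeatable
x, nn = (0.0, 0.0), (2.0, 4.0)    # a minority sample and one of its neighbours
synthetic_uniform = interpolate(x, nn, uniform_gap(rng))
synthetic_gaussian = interpolate(x, nn, gaussian_gap(rng))
```

With these placeholder parameters the Gaussian coefficients cluster around the chosen mean instead of spreading evenly along the segment, which is the kind of concentration the response appeals to when it says fewer noisy samples are generated.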

Comments 6: Sub figures in Figure 2 are needed to be labelled horizontally and vertically. Also how the proposed algorithm shows better achievement.

Response 6: Thank you for pointing this out. We have labelled the subfigures in Figure 2 and provided additional explanation of how the proposed algorithm demonstrates better performance. These changes can be found in subsection 4.2 on pages 7 and 8.

Comments 7: What are the limitations and the overhead of the proposed strategy and how it will be covered in future work?

Response 7: Thank you for pointing this out. As noted in the conclusion section, one limitation is that the standard deviation parameters sigma_1 and sigma_2 used in the interpolation process must be set manually in advance. We think a direction for further research is an adaptive method for choosing the most appropriate standard deviation parameters. The overhead of the proposed strategy is that it requires more computational time than traditional oversampling algorithms. This change can be found in the conclusion section on page 12.

Comments 8: It is not mentioned that how much dataset is divided into training and testing. How the authors validated the results like using cross validation or any other technique?

Response 8: Thank you for pointing this out. We divided the datasets into a training set and a test set according to the "80-20 rule", that is, approximately 80% of the available data for training and 20% for testing. The references cited in that section show that this technique is widely employed in related studies. This change can be found in Section 4 on page 7.
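
A split of this kind might be sketched as below. The record states only the 80/20 proportion, so the shuffling, the fixed seed, and the function name are assumptions made for illustration; in particular, it is not stated whether the authors' split was stratified by class.

```python
import random

def split_80_20(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seed is a hypothetical choice
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(range(100))
```

For strongly imbalanced data a plain random split can leave very few minority samples in the test part, which is one reason stratified splits or cross-validation are often preferred.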

Comments 9: Some important performance evaluation parameters like sensitivity and specificity are missing.

Response 9: Thank you for pointing this out. The evaluation parameters G-mean, F-measure, and AUC were selected based on relevant studies such as reference-2, reference-16, reference-25, and reference-27, ensuring their sufficient representativeness. Parameters like sensitivity and specificity have not appeared as important performance evaluation parameters in related studies of oversampling algorithms.
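
For reference, the metrics named in this exchange can all be computed from the four confusion-matrix counts, as in the sketch below. It assumes the usual definitions, which the record itself does not spell out: G-mean as the geometric mean of sensitivity and specificity, and F-measure as the harmonic mean of precision and recall, with the minority class treated as positive.

```python
import math

def sensitivity(tp, fn):
    """True positive rate (recall on the positive, here minority, class)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate (recall on the negative, here majority, class)."""
    return tn / (tn + fp)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions G-mean is a function of sensitivity and specificity, so the two reviewer-requested quantities are already intermediate values of a metric the authors report.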

Comments 10: Discussion should be further extended with some real-world applications where the proposed approach should be useful.

Response 10: Thank you for pointing this out. We have discussed some real-world applications of the proposed approach in the conclusion. This change can be found in the conclusion section on page 12.

Comments 11: What are the ethical aspects of the proposed approach in context with oversampling?

Response 11: Thank you for pointing this out. The proposed approach in context with oversampling does not involve any ethical concerns.

Comments 12: Future work should be extended with some more future directions.

Response 12: Thank you for pointing this out. Compared with traditional oversampling algorithms, the proposed method requires more computing time. How to reduce the running time of the oversampling algorithm will therefore also be one of the future research directions. This change can be found in the conclusion section on page 12.

Reviewer 2 Report

Comments and Suggestions for Authors

Performance values obtained in classification problems with class imbalance may not reflect reality. Focusing on the class imbalance problem, the authors propose a modified classification algorithm based on Gaussian distribution.

F-beta Score and balanced accuracy parameters should be examined.

The approach you suggest to eliminate class imbalance is for discrete objects. Discuss your approach in the discussion section in case the class examples are signals or images.

Author Response

Thanks very much for taking your time to review this manuscript. I really appreciate all your comments and suggestions!

Comments 1: F-beta Score and balanced accuracy parameters should be examined.

Response 1: Thank you for pointing this out. We have described these parameters in a more detailed and professional manner. This change can be found in subsection 4.2.
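
The two quantities the reviewer asks about have standard definitions, sketched below from confusion-matrix counts. This is a generic illustration, not the formulation used in the revised subsection 4.2, which the record does not reproduce; the minority class is assumed to be the positive class.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the per-class recalls (sensitivity and specificity)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

With `beta=1.0` the first function reduces to the ordinary F-measure.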

Comments 2: The approach you suggest to eliminate class imbalance is for discrete objects. Discuss your approach in the discussion section in case the class examples are signals or images.

Response 2: We agree that this approach is for discrete objects. In fact, current oversampling algorithms are designed for discrete problems. Owing to the principles of their formulas, these algorithms are not suitable for continuous objects. We have also explored how to handle image and signal data. We believe that extracting characteristic parameters from such data, discretizing them, and then performing oversampling is an effective way to address this issue.
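
One way to read the feature-extraction step proposed in Response 2 is sketched below: each raw signal is reduced to a short, fixed-length vector that a tabular oversampling method could then consume. The response does not say which characteristic parameters the authors have in mind, so the three features chosen here (mean, standard deviation, peak-to-peak amplitude) and the function name are purely hypothetical.

```python
import statistics

def signal_features(signal):
    """Map one raw signal (a sequence of floats) to a fixed-length feature tuple."""
    return (
        statistics.fmean(signal),      # average level
        statistics.pstdev(signal),     # spread around the mean
        max(signal) - min(signal),     # peak-to-peak amplitude
    )

# Each signal becomes one row of a tabular dataset suitable for oversampling.
rows = [signal_features(s) for s in ([0.0, 1.0, 0.0, -1.0], [2.0, 2.0, 2.0, 2.0])]
```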

Reviewer 3 Report

Comments and Suggestions for Authors

The article presents a significant contribution to the field of class imbalance learning problems by proposing an improved oversampling algorithm.

1. In abstract, provide a brief insight into the evaluation criteria or key metrics that underscore the proposed algorithm's superiority over traditional methods.

2. In conclusion, emphasize the practical implications and potential real-world applications of your algorithm to underscore its relevance and impact. This could involve discussing how your algorithm might contribute to more equitable or accurate predictive modeling in contexts where minority class samples are critical (e.g., medical diagnoses, fraud detection).

Overall, your article makes a valuable contribution to addressing class imbalance in machine learning.

Comments on the Quality of English Language

The quality of English language is generally good, with a clear conveyance of the main ideas and contributions of the paper.

Author Response

Thanks very much for taking your time to review this manuscript. I really appreciate all your comments and suggestions!

Comments 1: In abstract, provide a brief insight into the evaluation criteria or key metrics that underscore the proposed algorithm's superiority over traditional methods.

Response 1: Thank you for pointing this out. We have added a specific description of the proposed algorithm's superiority. This change can be found in the abstract.

Comments 2: In conclusion, emphasize the practical implications and potential real-world applications of your algorithm to underscore its relevance and impact. This could involve discussing how your algorithm might contribute to more equitable or accurate predictive modeling in contexts where minority class samples are critical (e.g., medical diagnoses, fraud detection).

Response 2: Thank you for pointing this out. We have added the practical implications to the conclusion. This change can be found in Section 6.

Reviewer 4 Report

Comments and Suggestions for Authors

A modified reclassification algorithm based on Gaussian distribution was proposed, and it was confirmed that the proposed algorithm was 2 to 8% superior to the existing oversampling algorithm. There is a need for research on the problem of processing imbalanced data in various industries, and although it is difficult to say that it is a tremendous development, its effectiveness has been confirmed. There are no special technical problems found, and overall it appears to be a well-written article.

Author Response

Thanks very much for taking your time to review this manuscript. We are honored to receive your recognition. This serves as a crucial motivation for us to further advance in subsequent research.
