Peer-Review Record

Denying Evolution Resampling: An Improved Method for Feature Selection on Imbalanced Data

Electronics 2023, 12(15), 3212; https://doi.org/10.3390/electronics12153212
by Li Quan, Tao Gong * and Kaida Jiang
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 1 July 2023 / Revised: 17 July 2023 / Accepted: 20 July 2023 / Published: 25 July 2023
(This article belongs to the Special Issue Smart Sensing, Monitoring, and Control in Industry 4.0)

Round 1

Reviewer 1 Report

This article presents a data resampling technique that employs an evolutionary approach to tackle the issue of data imbalance. By utilizing the evolutionary process, it generates supplementary positive samples, thereby mitigating the bias resulting from class imbalance. This method enhances the diversity of the training set in the nearest neighbor classification procedure, consequently diminishing the likelihood of overfitting and augmenting the classifier's generalization capability.

This article is very interesting and adds to the knowledge of machine learning techniques. I strongly believe that it will interest the readers of this journal. I recommend that the article be accepted in its present form.

Author Response

We are grateful to reviewer #1 for his/her effort reviewing our paper and his/her positive feedback. The summary of our work as written by this reviewer is precise.

Reviewer 2 Report

In this manuscript the authors discuss an approach to feature selection on imbalanced data. They first describe the challenges and issues raised by imbalanced data, leading to the proposed data resampling technique that mitigates the bias resulting from class imbalance.

 

Overall, this is well-studied research, and the authors should consider the following comments:

1.       In the abstract please provide some details regarding the evaluation of the proposed methodology, and a preliminary discussion of the derived results. Were these sufficient? Did you face any limitations? These should also be addressed here.

2.       In the introduction, why do you specifically refer to the problems of credit card fraud? Please provide additional details regarding imbalanced data challenges in additional domains before narrowing to this specific domain.

3.       I would not expect to see the dataset used in the introduction section – please refer to it in the methodology section

4.       Please include some additional challenges of imbalanced data and try to report the research findings of the below studies that refer to bias detection and data cleaning actions:

a.       Biran, Ofer, et al. "PolicyCLOUD: A prototype of a cloud serverless ecosystem for policy analytics." Data & Policy 4 (2022): e44.

b.       Mavrogiorgou, Argyro, et al. "Adjustable data cleaning towards extracting statistical information." Public Health and Informatics. IOS Press, 2021. 1013-1014.

c.       Guo, Lisa N., et al. "Bias in, bias out: underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection—a scoping review." Journal of the American Academy of Dermatology 87.1 (2022): 157-159.

d.       Mavrogiorgos, Konstantinos, et al. "Automated Rule-Based Data Cleaning Using NLP." 2022 32nd Conference of Open Innovations Association (FRUCT). IEEE, 2022.

e.       Vaid, Shashank, Reza Kalantar, and Mohit Bhandari. "Deep learning COVID-19 detection bias: accuracy through artificial intelligence." International Orthopaedics 44 (2020): 1539-1542.

5.       Please add a paragraph indicating the structure of the document at the end of the introduction

6.       Regarding the related work, it would be nice to have a summary table of the optimization strategies, indicating their primary features

7.       Regarding section 3, I cannot find the matching between the figure of the architecture and the sub-sections

8.       In section 4, please include the dataset of section 1

9.       Please try to increase the visibility of Figure 4

10.   Regarding the overall evaluation, the evaluation criteria are not clearly identified. Moreover, regarding the proposed methodology have you considered the energy efficiency of the proposed method? Please have a look at the following research and try to include its methodology in your criteria:

a.       Karabetian, Andreas, et al. "An Environmentally-sustainable Dimensioning Workbench towards Dynamic Resource Allocation in Cloud-computing Environments." 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA). IEEE, 2022.

11.   In the conclusions section I would like to see the limitations and the assumptions made for this work. Moreover, please try to provide a list of the stakeholders of this research and explain how they are going to benefit from it. Lastly, indicate how you are going to disseminate and communicate your research outcomes and findings.

Author Response

We are grateful to reviewer #2 for his/her effort reviewing our paper and his/her positive feedback. We have carefully addressed all the reviewer's concerns. Please see below our replies. We hope he/she is satisfied with our answers and the new figures, data, and conclusions we provided. Changes highlighted in red have been made accordingly in the revised manuscript.

 

Comments and Suggestions for Authors

In this manuscript the authors discuss an approach to feature selection on imbalanced data. They first describe the challenges and issues raised by imbalanced data, leading to the proposed data resampling technique that mitigates the bias resulting from class imbalance.

Overall, this is well-studied research, and the authors should consider the following comments:

  1. In the abstract please provide some details regarding the evaluation of the proposed methodology, and a preliminary discussion of the derived results. Were these sufficient? Did you face any limitations? These should also be addressed here.

Response:

Thank you for pointing out the issues. The abstract section of the paper has been revised.

  2. In the introduction, why do you specifically refer to the problems of credit card fraud? Please provide additional details regarding imbalanced data challenges in additional domains before narrowing to this specific domain.

Response:

Thank you for your question. Firstly, the introduction section of the paper has been modified to address the relevant issues regarding imbalanced data. As for why the credit card fraud dataset was used, it was because we happened to find a large-scale dataset that exhibited significant data imbalance, which provided an appropriate case for modeling and testing the proposed resampling method in the paper.

  3. I would not expect to see the dataset used in the introduction section – please refer to it in the methodology section

Response:

The part of the paper that discusses the dataset has been modified.

  4. Please include some additional challenges of imbalanced data and try to report the research findings of the below studies that refer to bias detection and data cleaning actions:
     a. Biran, Ofer, et al. "PolicyCLOUD: A prototype of a cloud serverless ecosystem for policy analytics." Data & Policy 4 (2022): e44.
     b. Mavrogiorgou, Argyro, et al. "Adjustable data cleaning towards extracting statistical information." Public Health and Informatics. IOS Press, 2021. 1013-1014.
     c. Guo, Lisa N., et al. "Bias in, bias out: underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection—a scoping review." Journal of the American Academy of Dermatology 87.1 (2022): 157-159.
     d. Mavrogiorgos, Konstantinos, et al. "Automated Rule-Based Data Cleaning Using NLP." 2022 32nd Conference of Open Innovations Association (FRUCT). IEEE, 2022.
     e. Vaid, Shashank, Reza Kalantar, and Mohit Bhandari. "Deep learning COVID-19 detection bias: accuracy through artificial intelligence." International Orthopaedics 44 (2020): 1539-1542.

Response:

Thank you for your reminder. We have reviewed the relevant literature you suggested and made modifications to the sections of the paper involved. We have also cited the findings of studies a, b, and d, which appear as references [12,13,21] in the revised paper.

  5. Please add a paragraph indicating the structure of the document at the end of the introduction

Response:

The introduction has been revised.

  6. Regarding the related work, it would be nice to have a summary table of the optimization strategies, indicating their primary features

Response:

Regarding the section on related work, we have included a summary and presentation, as shown in Fig 1 in the paper.

  7. Regarding section 3, I cannot find the matching between the figure of the architecture and the sub-sections

Response:

Thank you for pointing out the issue. We have now redrawn Fig 2.

  8. In section 4, please include the dataset of section 1

Response:

Section 4 of the paper has been revised.

  9. Please try to increase the visibility of Figure 4

Response:

Fig 4 in the original manuscript has been modified and is now labeled as Fig 5. The size of the modified image has been increased to improve visibility.

  10. Regarding the overall evaluation, the evaluation criteria are not clearly identified. Moreover, regarding the proposed methodology have you considered the energy efficiency of the proposed method? Please have a look at the following research and try to include its methodology in your criteria:
     a. Karabetian, Andreas, et al. "An Environmentally-sustainable Dimensioning Workbench towards Dynamic Resource Allocation in Cloud-computing Environments." 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA). IEEE, 2022.

Response:

Thank you for pointing out the issue. When evaluating resampling methods, the current main metrics used are F1 score, confusion matrix, or receiver operating characteristic (ROC) curve. In the paper, we only presented the comparison using F1 score for two reasons. Firstly, a higher F1 score indicates better overall performance of the model in the prediction task. F1 score is a commonly used metric to evaluate the performance of classification models, as it takes into account both precision and recall. Precision measures the proportion of true positives among the samples predicted as positives, while recall measures the ability of the model to correctly identify all positives. F1 score is the harmonic mean of precision and recall, thus balancing the trade-off between the two. Secondly, while the results from confusion matrix and ROC curve also provide insights, they can only show the performance of individual models, whereas the F1 score curve represents the overall performance. Therefore, the results from other evaluation methods were not presented.
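The definitions of precision, recall, and F1 score given in this response can be illustrated with a short calculation. The following is a minimal sketch only; the function name `precision_recall_f1` and the counts used in the example are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the model identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, which is why it penalizes a classifier that gains precision by ignoring most positive samples.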

At the end of the paper, we present Table 2, which compares the logistic regression validation F1 score, precision, and recall for the different resampling models.

Regarding your suggestion of additional evaluation criteria, we believe that it can add more objectivity to the assessment of our resampling method. However, currently, we are not able to effectively integrate testing and evaluation. The mentioned research result is one of the optimization directions for our resampling method. Please refer to reference [48] for more details.

  11. In the conclusions section I would like to see the limitations and the assumptions made for this work. Moreover, please try to provide a list of the stakeholders of this research and explain how they are going to benefit from it. Lastly, indicate how you are going to disseminate and communicate your research outcomes and findings.

Response:

The conclusion part has been modified.

Author Response File: Author Response.pdf

Reviewer 3 Report

This article primarily discusses a novel data resampling technique that employs an evolutionary approach to address the issue of data imbalance. In scenarios where positive samples are scarce and negative samples are abundant, the classifier may prioritize the prediction of negative samples, overlooking the positive ones, thereby compromising the overall classification performance. This new resampling technique mitigates the bias resulting from class imbalance by generating supplementary positive samples. Furthermore, this method enhances the diversity of the training set, reducing the likelihood of overfitting and augmenting the classifier's generalization capability. The results of this study are reasonable and practical. However, the manuscript can be improved on the following points:

1.  During the data evolution process, this method exhibits high time complexity. When selecting two parents, the time complexity increases, leading to a significant increase in computation time as the data size grows. Although this method achieves high precision, it suffers from low time efficiency. The constraints used for data evolution also affect computational efficiency. The authors should provide explanations for these issues.

2. Detailed Method Explanation: Although the paper provides some explanation of the method, more detailed descriptions may be needed for certain aspects, such as the selection of specific parameters and the specific steps of the evolution process. This would help readers better understand and reproduce the research results.

3.  More Experimental Results: The paper could provide more experimental results, including results under different parameter settings and comparisons with other existing methods. This would help readers more comprehensively evaluate the performance of the method.

4. In-depth Discussion: The paper could further discuss the advantages and disadvantages of the method, as well as possible directions for improvement. This would help readers understand the potential and limitations of the method more deeply.

5.  Practical Application Cases: If possible, the paper could provide some application cases of the method in real-world problems. This would help readers understand the practical effects and application value of the method.

6.  Emphasis on Conclusion: In the conclusion section, the main findings of the research should be clearly summarized, and their contributions and impacts on the field should be emphasized. This would help readers understand the importance and value of the research.

The English quality is satisfactory, adhering to the standard conventions of academic writing. Only minor enhancements are required to further refine the English quality.

Author Response

We are grateful to reviewer #3 for his/her effort reviewing our paper and his/her positive feedback. We have carefully addressed all the reviewer's concerns. Please see below our replies. We hope he/she is satisfied with our answers and the new figures, data, and conclusions we provided. Changes highlighted in red have been made accordingly in the revised manuscript.

Please see the attachment.

 

 

Comments and Suggestions for Authors

This article primarily discusses a novel data resampling technique that employs an evolutionary approach to address the issue of data imbalance. In scenarios where positive samples are scarce and negative samples are abundant, the classifier may prioritize the prediction of negative samples, overlooking the positive ones, thereby compromising the overall classification performance. This new resampling technique mitigates the bias resulting from class imbalance by generating supplementary positive samples. Furthermore, this method enhances the diversity of the training set, reducing the likelihood of overfitting and augmenting the classifier's generalization capability. The results of this study are reasonable and practical. However, the manuscript can be improved on the following points:

  1. During the data evolution process, this method exhibits high time complexity. When selecting two parents, the time complexity increases, leading to a significant increase in computation time as the data size grows. Although this method achieves high precision, it suffers from low time efficiency. The constraints used for data evolution also affect computational efficiency. The authors should provide explanations for these issues.

Response:

Thank you for your attention and reminder regarding the time complexity issue in the evolutionary process of data resampling. We have re-evaluated the time complexity associated with randomly selecting parents in our resampling method. Even when two parents are selected at random, the time complexity still remains . However, we acknowledge that the running time of the overall evolutionary process for imbalanced data will increase as the data size, , grows. We have addressed and explained this issue in our paper and have proposed some potential solutions to improve this process. We will explore and optimize these solutions in future work.
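To make the discussion of parent selection concrete, the following is a minimal sketch of generating synthetic minority-class samples from two randomly chosen parents. It is not the authors' algorithm: the blend crossover, the function name `evolve_minority`, and all parameters are assumptions made for illustration, and the constraints and weighted selection described in the paper are omitted.

```python
import random
from typing import List, Sequence


def evolve_minority(
    minority: Sequence[Sequence[float]],
    n_children: int,
    seed: int = 0,
) -> List[List[float]]:
    """Create n_children synthetic samples from a minority-class set.

    Hypothetical illustration: each child is a random convex blend of
    two distinct parents drawn uniformly from the minority samples.
    """
    rng = random.Random(seed)
    children: List[List[float]] = []
    for _ in range(n_children):
        # Draw two distinct parents uniformly at random.
        mother, father = rng.sample(list(minority), 2)
        # Blend crossover: one mixing weight shared by all features.
        alpha = rng.random()
        child = [alpha * m + (1 - alpha) * f for m, f in zip(mother, father)]
        children.append(child)
    return children


# Hypothetical minority samples with two features each.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
synthetic = evolve_minority(samples, n_children=5)
```

In this sketch the blending work per child is proportional to the number of features, so the total work grows with the number of children requested. Balancing a heavily skewed dataset requires many children, which is one way the overall running time can grow with data size.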

  2. Detailed Method Explanation: Although the paper provides some explanation of the method, more detailed descriptions may be needed for certain aspects, such as the selection of specific parameters and the specific steps of the evolution process. This would help readers better understand and reproduce the research results.

Response:

Thank you for your attention to the evolutionary process of data. We have added relevant parameters and constraint methods regarding weighted selection of evolved data in our paper.

  3. More Experimental Results: The paper could provide more experimental results, including results under different parameter settings and comparisons with other existing methods. This would help readers more comprehensively evaluate the performance of the method.

Response:

At the end of the paper, we present Table 2, which compares the logistic regression validation F1 score, precision, and recall for the different resampling models.

  4. In-depth Discussion: The paper could further discuss the advantages and disadvantages of the method, as well as possible directions for improvement. This would help readers understand the potential and limitations of the method more deeply.

Response:

Indeed, in response to this issue, we have already made revisions to the conclusion section of our paper, highlighting the current challenges faced by the method and discussing potential directions for future research.

  5. Practical Application Cases: If possible, the paper could provide some application cases of the method in real-world problems. This would help readers understand the practical effects and application value of the method.

Response:

Regarding this issue, unfortunately, the current method mainly involves the process of resampling imbalanced data and has not been validated with specific classification detection cases. Among the known public classification detection models, the results obtained from this method are superior to other resampling methods, and these relevant results have also been presented in the paper.

  6. Emphasis on Conclusion: In the conclusion section, the main findings of the research should be clearly summarized, and their contributions and impacts on the field should be emphasized. This would help readers understand the importance and value of the research.

Response:

The conclusion section of the paper has been revised.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have carefully taken all of my comments into consideration and have improved their research work. Congratulations, the manuscript can be accepted in its current form.

Reviewer 3 Report

The authors have made corrections to the manuscript based on the reviewers' suggestions. The manuscript has been significantly improved and is now suitable for publication in Electronics.
