Peer-Review Record

An Automated Big Data Quality Anomaly Correction Framework Using Predictive Analysis

Data 2023, 8(12), 182; https://doi.org/10.3390/data8120182

by Widad Elouataoui *, Saida El Mendili * and Youssef Gahi *

Reviewer 1: Anonymous

Reviewer 2: Claudia Duran

Reviewer 3: Anonymous

Reviewer 4: Huijun Wu

Reviewer 5: Luisa D'Amore

Submission received: 14 September 2023 / Revised: 23 November 2023 / Accepted: 27 November 2023 / Published: 1 December 2023

(This article belongs to the Section Information Systems and Data Management)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a framework to address data quality by considering six key quality dimensions: Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability. The framework is independent of any specific field and is designed to be applicable across various areas, offering a generic approach to addressing data quality anomalies. The authors evaluate their framework on two datasets, reporting an accuracy of 98.22%. The experimental results show that the framework raised the data quality score to 99%, an improvement of up to 14.76% in the quality score.

Author Response

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript is very general and it is not clear what knowledge contribution it generates. Improvement is needed:

1. The research methodology is not clear. Some general concepts of pre-processing and data processing are studied, and different applications in industries are described in little detail. It would be helpful to define the research hypotheses in order to understand what is to be tested.

2. The introduction shows a general context of big data and the metrics by which predictions are measured, but the motivation and what is to be investigated are lacking.

3. Section II lacks a method. Figure 2 develops the KDD, but does not visualise the new knowledge to be generated. The dimensions in Figure 3 do not visualise new knowledge.

4. In Section III, the classification given in the table is very general and does not show what the contribution to the research is. They seem to name some industries in which big data is used, but it is not clear how important it is (advantages and disadvantages). Why are these methods useful in these industries? What are the characteristics of the industries?

5. Improve the objectives as it is not clear what research is being carried out.

6. The research needs to be tidied up and better explained as the possible scenarios are not understood, especially how they relate to the objectives.

7. In the implementation, further explain the sample: why are these attributes chosen? It is necessary to know the database and the information process to understand the relevance of the research architecture. It is not clear which system is being studied.

8. As the research is not clear, the contributions of the results, discussion and conclusions are not understood.

9. There is a lack of references showing the state of the art in more detail and there are many self-references.

10. Write a summary that is in line with the content of the research.

Comments on the Quality of English Language

Moderate editing of English language required

Author Response

Thank you for taking the time to review our paper. Your thoughtful insights and constructive feedback have been immensely valuable to us. We have carefully considered your comments and have made the necessary corrections to enhance the overall quality of our paper. Please find the corrections in the attached file.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In this paper, the authors present an advanced framework designed to automatically rectify anomalies in the quality of large datasets by employing an intelligent predictive model. This comprehensive framework effectively addresses key dimensions of data quality, encompassing Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability. What sets this framework apart is its adaptability across diverse domains, offering a versatile solution to rectify data quality anomalies. Here are some recommendations for enhancing the paper:

(1) The paper currently contains a multitude of sections and subsections, resulting in a somewhat convoluted structure. Consider reorganizing the paper to streamline the content and improve its overall clarity.

(2) While it's crucial to provide an exhaustive review of related work, it's also vital to maintain conciseness. You might consider summarizing the related work with textual descriptions rather than presenting lengthy tables.

(3) Clearly articulate the specific domains or application areas where the proposed automated framework for correcting big data quality anomalies is most effective. Explain why the framework is particularly well-suited for these areas.

(4) To demonstrate the validity of your method, consider conducting a comparative analysis with other similar frameworks for correcting big data quality anomalies, such as a comparison with the approach presented in doi: 10.1016/j.ijcce.2022.10.001.

(5) Ensure that the figures in the paper are of high quality, particularly with respect to resolution and legibility.

(6) Conclude Section 1 by providing a clear roadmap of the paper's structure. Highlight the primary motivations for combining intelligent predictive models in your research.

(7) Present Algorithm 1 in text format rather than as an image. Additionally, include an analysis of the algorithm's time complexity and computational cost.

(8) Provide a detailed explanation for the selection of Data Sets 1 and 2. Describe the logic behind your experimental design choices.

(9) Enrich the list of references, particularly with recent review papers or relevant sources related to big data notions.

Author Response

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The authors present a sophisticated framework in this conference paper, which leverages an intelligent predictive model to automatically correct big data quality anomalies.

Here are some comments:

1. Line 521: Change "figure-4" to "figure-12."

2. Line 378: Change "figure-3" to "figure-11."

3. Line 377: Change "figure-2" to "figure-10."

4. After line 372, update the figure caption from "figure-1" to "figure-9."

5. Line 477: Remove "(HDFS)" since it's already mentioned in the previous line (line 476).

6. Consider condensing the content in the first 10 pages, which primarily cover background and related work. This will help maintain a balanced distribution of information throughout the paper.

7. Move Section V, "Possible Scenarios," between Section 2 and Section 3 for improved logical flow.

8. Line 595: Remove "(HDFS)" as it's already mentioned in line 476.

9. Line 642: Use "Credit" instead of "CRediT."

Comments on the Quality of English Language

The quality of the English presentation meets the required standards.

Author Response

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

The authors propose a framework that addresses the main aspects of data quality by considering six key quality dimensions: Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability.

By introducing this framework, the authors aim to provide a comprehensive solution for addressing big data quality anomalies. In my opinion, this is quite a challenging aim. The present contribution seems to be one step towards this aim, mainly because the implementation was in Python.

Author Response

Thank you for taking the time to review our paper. Your thoughtful insights and constructive feedback have been immensely valuable to us. We have made the necessary corrections to enhance the overall quality of our paper. The framework was actually implemented using PySpark on Cloudera Data Platform (CDP), a platform designed with different tools for big data processing. Thus, tools such as Hadoop and Apache Spark were used in the implementation.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Comments on the Quality of English Language

Minor editing of English language required

Author Response

We appreciate your feedback and would like to inform you that we have revised the manuscript. Our focus during the revision process included comprehensive English editing to enhance clarity, coherence, and overall readability.
