Article
Peer-Review Record

Data Quality Analysis and Improvement: A Case Study of a Bus Transportation System

Appl. Sci. 2023, 13(19), 11020; https://doi.org/10.3390/app131911020
by Shuyan Si 1, Wen Xiong 1,2,* and Xingliang Che 1
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 24 August 2023 / Revised: 2 October 2023 / Accepted: 4 October 2023 / Published: 6 October 2023
(This article belongs to the Special Issue Big Data Applications in Transportation)

Round 1

Reviewer 1 Report

Summary

This manuscript presents a taxonomy of data quality-related challenges found in the scientific literature. The authors then investigate common data-cleaning strategies to overcome said challenges. A bus transportation network is then used as an example to showcase the various categories of data quality issues found in real-world scenarios. Lastly, the authors perform a set of standard data-cleaning operations and propose two alternative approaches for noisy data cleaning and data imputation. Experimental results show good performance of the proposed approaches. 

Review

Overall, the manuscript has some merits for publication. For instance, it proposes a data quality taxonomy that could be of interest to the research community and provides a couple of simple yet effective data-cleaning methods. However, there are a number of important issues that must be addressed before this work can be accepted. Please see my detailed comments below:

Content

- The novelty of this work is very limited. One of the two data cleaning procedures proposed is very simplistic while the second appears to be a mere application of an existing algorithm. The authors should try to better emphasize where the novelty of this work lies.

- The motivation behind the need for good quality data in a general context is clear. However, the motivation behind the need for good data in the specific case of the bus transportation network should be made clearer (e.g. because better quality data would allow us to get more accurate predictions of bus arrival times at stations).

- The authors mention, as one of the main contributions of this manuscript, the analysis of 26,160 papers to create the data quality taxonomy. However, there is no evidence to support this claim. The authors must back this claim by, for example, adding a new column to Table 1 containing, for each data quality challenge, the papers from the literature that have addressed said challenge. This would be a very valuable resource for the community.

- In the Related Work section, the authors mention that "Based on our statistics, we can confidently conclude that DQ problems have been garnering increasing attention year by year". Evidence must be provided to support this claim. For instance, the authors could aggregate the 26,160 data quality-related papers by year and plot the totals to show the trend.

- Also in the Related Work section, some example papers and the data quality issues studied in them are presented. However, the authors do not specifically say how these manuscripts would be mapped into their proposed taxonomy. For instance, would Ref. 6 be put into categories 2 and 4 based on the data issues it studies?

- In the Introduction, the authors claim that "no studies have analyzed DQ issues on public transportation systems" yet several examples of manuscripts doing this are presented in the related work section. I would advise modifying or removing that claim. 

- On line 50, the authors mention that DQ problems have an impact on downstream applications. Please give some examples of such applications.

- The manuscript seems to implement either existing or novel data cleaning procedures and cover the entire taxonomy of data quality issues. This should be made more clear throughout the manuscript. 

- When describing the Spark data cleaning procedure the authors should clarify that Alg. 1 is used in step 5 and Alg. 2 is used in step 6.    

Presentation

- I find the organization of the manuscript somewhat confusing: 

 - If the data quality taxonomy is a main result of the paper, it should have its own section instead of being part of the related work. In this new section, the authors could not only present the taxonomy but also describe the currently available data-cleaning solutions.

 - The data quality issues found on the bus transportation network, as well as the data cleaning solutions proposed, could also have a separate section.

- The manuscript is sufficiently well written but contains many grammatical mistakes, which make it hard to follow at times, e.g.:

 - "poor data quality dataset costs" -> poor quality datasets cost

 - "garnering increasing" -> gathering increasing?

 - "We second describe" -> Secondly, we describe

 - "We third describe" -> Thirdly, we describe

 - "illuminate anomalies" -> illustrate anomalies

- Some figures could be improved:

 - Fig. 8, consider using a log scale to better visualize the missing records on lines 4-6

 - Fig. 3 and others are very small, consider increasing the text size. 

 - Fig. 6, consider a more appropriate naming for the subplots, e.g. "example trajectory 1", "example trajectory 2"

Other comments 

- In line 488 there is an error. The plate number should be A001** not A002**

- In line 486, how many samples exactly were used to compute the results presented in Table 4?

See general comments. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Overall, the paper's introduction provides a clear understanding of the problem the research is addressing and its significance. However, it could be improved for clarity, coherence, and precision. Here are my recommendations:

Introduction:

1. It's great that you've identified a gap in the current research landscape. However, more justification could be provided as to why a comprehensive study of data quality issues is needed, and why specifically in public bus transportation systems. This can help underline the significance of your research.

2. You could strengthen your problem statement by discussing some of the specific challenges associated with studying data quality issues in public bus transportation systems. Are there unique factors or complexities in this field that make data quality a particularly pressing issue? This could bolster your argument for the need for more research in this area.

Related Work

1. Your methodology for selecting relevant literature is clear and robust. An enhancement might be to include a brief justification for your choices. Explaining why you selected the specific databases and the time range would help readers understand your rationale.

2. Regarding your classification of DQ problems into six categories: are these categories widely accepted in the field, or are they new categories you are proposing? Why are these the best or most useful categories?

Conclusions

1. Although you've touched on the potential benefits for public transportation, you might want to detail more about how these findings could be applied, or discuss the potential for your methods to be adapted for other sectors. This could underline the broader implications of your research.

2. Consider adding a section on future work. This could involve identifying limitations of your current study, proposing extensions of your research, or suggesting new related research directions. This can demonstrate a forward-thinking approach and an understanding of the wider research context.

References

1. It is important to ensure that your references are current (within the last 5 years) and from reputable sources. This will enhance the credibility of your paper.

Minor editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have done an excellent job addressing all the issues raised during the first review. In its current form, the manuscript reads much better and the contributions are more precise and backed up with data/citations. Considering these changes, I believe the manuscript is now ready to be published. 

Some more grammatical mistakes have been introduced throughout the manuscript in this revision. Overall, the presentation quality could be improved. 

Reviewer 2 Report

All is OK now.
