Open Access Article

Peer-Review Record

Dataset Construction through Ontology-Based Data Requirements Analysis

Appl. Sci. 2024, 14(6), 2237; https://doi.org/10.3390/app14062237

by Liangru Jiang and Xi Wang^*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Gabriel Pestana

Submission received: 26 January 2024 / Revised: 27 February 2024 / Accepted: 4 March 2024 / Published: 7 March 2024

(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The quality of machine learning (ML) systems has become increasingly important as ML technology has advanced, and the datasets these systems learn from strongly influence their effectiveness. Nevertheless, the dataset collection process lacks uniformity, and there are few techniques for performing requirements analysis to determine which data are required. This lack of consistency weakens the model's capacity to generalize and leads to poor training efficiency and uncertainty about dataset quality. To address these issues, the paper proposes an ontology-based requirements analysis approach that integrates domain knowledge into the data requirements analysis process and defines coverage criteria within the ontology to specify data requirements. These criteria can then guide the creation of high-quality datasets. To validate the method, an experiment was carried out on an image recognition system in the context of autonomous driving. The results show that ML systems perform better when trained on datasets created with this data requirements analysis method.

The topic is very interesting and addresses one of the major issues with AI systems. The proposal and experimentation are good. My major concern is that the author needs to improve the following sections:

a) Literature review. It is too brief and does not include all relevant references. When this section is weak, the research gaps are not evident, and the justification for undertaking the research is less evident as well.

b) Concluding remarks are too brief. That's not a scientific way of concluding the paper.

Finally, add the following points:

a) A discussion section in which you provide interdisciplinary perspectives and some out-of-the-box discussion. Remember that your results are meaningful for many stakeholders, so also include the implications for them (this need not be a separate section; it could be added as several paragraphs or in any way you deem suitable).

Comments on the Quality of English Language

Minor English modifications are required.

Author Response

Thank you very much for taking the time to review this manuscript. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This article discussed the importance of dataset quality in machine learning systems and the role of data requirements analysis in improving the performance and accuracy of such systems. The authors highlighted the significance of domain knowledge in determining specific data requirements and emphasized the need for structured and formal methods to analyze data requirements effectively.

I have the following major comments.

1. The authors should explain why ontologies, rather than simple graph neural networks, are used in the data preparation tasks. A graph-based method is expressive enough to represent relationships between data and is also well suited to expressing simple connections. The authors must fully explain whether they applied ontology concepts rather than plain graph structures.

2. The related work section does not adequately cover the relevant body of knowledge. The authors should add more content to this section, including data preparation schemes and methods that use ontologies, methods that do not, and the differences between them.

3. I guess that the decision not to use a deep learning model in this task can be attributed to the focus on methodology, simplicity, resource constraints, interpretability, and the specific objectives of the study, which relate to data requirements analysis rather than model complexity. The authors should further explain the scope of application of the proposed method with specific examples and datasets. It is difficult to evaluate the effectiveness of the proposed method from a single scenario such as autonomous driving.

4. When generating the scene graph (Scene Graph Generation, SGG), how is the relation between detected objects determined? For example, given the three elements 'car', 'behind', and 'sign', YOLO can only identify 'car' and 'sign'; how is a semantic relationship such as 'behind' obtained?

5. The authors introduce the unbiased SGG (USGG) framework, which uses causal inference, in Section 4. They should add a more detailed explanation of USGG. The article lacks information on USGG and does not convey the necessary background on causal inference.
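To make the gap raised in comment 4 concrete, the following minimal sketch shows that a detector's output (labels and bounding boxes) contains no predicate, so any relation such as 'behind' has to come from a separate step. The geometric rule used here is a deliberately naive stand-in chosen for illustration, assuming a forward-facing road camera where objects whose boxes end higher in the image are farther away. It is not the USGG framework and not the authors' method; the `Detection` class, the function names, and the box coordinates are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a label and a box (image y grows downward)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def horizontal_overlap(a: Detection, b: Detection) -> bool:
    """True when the two boxes share some horizontal extent."""
    return min(a.x2, b.x2) - max(a.x1, b.x1) > 0


def coarse_relation(subj: Detection, obj: Detection) -> Optional[str]:
    """Naive heuristic (an assumption, not a learned predicate classifier)."""
    # Ground-plane assumption: a box whose bottom edge is higher in the
    # image belongs to an object that is farther from the camera.
    if horizontal_overlap(subj, obj) and subj.y2 < obj.y2:
        return "behind"
    if subj.x2 <= obj.x1:
        return "left of"
    if subj.x1 >= obj.x2:
        return "right of"
    return None


def triples_from_detections(dets: List[Detection]) -> List[Tuple[str, str, str]]:
    """Turn pairwise heuristic relations into (subject, relation, object) triples."""
    result = []
    for subj, obj in permutations(dets, 2):
        relation = coarse_relation(subj, obj)
        if relation is not None:
            result.append((subj.label, relation, obj.label))
    return result


# Illustrative boxes only; the values are made up.
car = Detection("car", 100, 200, 300, 320)
sign = Detection("sign", 250, 150, 290, 400)
print(triples_from_detections([car, sign]))
```

Scene graph generation models typically replace such hand-written rules with a classifier that predicts the predicate for each object pair, which is why the reviewer asks how that second stage works in the manuscript.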

Comments on the Quality of English Language

English needs further correction.

Author Response

Thank you very much for taking the time to review this manuscript. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper's proposed method employs ontology-based domain modelling and criteria-driven data specification. Experimentation on a traffic scene image recognition system validates its effectiveness. The emphasis lies on leveraging semantic information from the domain model to aid users in crafting precise data requirements. The authors introduce two coverage criteria, based on entity attributes and inter-entity relations, to filter candidate datasets. A succinct overview of related work highlights gaps, notably the absence of systematic semantic data modelling in research scenarios.
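The two coverage criteria summarized above can be pictured with a small sketch. It assumes each candidate image is annotated with entity attributes and with (subject, relation, object) triples, and it measures coverage as the fraction of required items that appear at least once. The class names, the `shape` attribute, the `occludes` relation, and the sample data are hypothetical, and the manuscript's formal definitions of the criteria may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


@dataclass
class Sample:
    """A candidate image described by entity attributes and relation triples."""
    name: str
    attributes: Dict[str, Dict[str, str]]  # entity -> {attribute: value}
    triples: Set[Triple]


def attribute_coverage(samples: List[Sample], entity: str,
                       attribute: str, required_values: List[str]) -> float:
    """Fraction of required attribute values seen at least once in the samples."""
    seen = {
        s.attributes[entity][attribute]
        for s in samples
        if entity in s.attributes and attribute in s.attributes[entity]
    }
    return len(seen & set(required_values)) / len(required_values)


def relation_coverage(samples: List[Sample],
                      required_triples: List[Triple]) -> float:
    """Fraction of required inter-entity relations seen at least once."""
    seen: Set[Triple] = set()
    for s in samples:
        seen |= s.triples
    return len(seen & set(required_triples)) / len(required_triples)


# Made-up annotations for two candidate images.
candidates = [
    Sample("img_001", {"TrafficSign": {"shape": "circle"}},
           {("Tree", "occludes", "TrafficSign")}),
    Sample("img_002", {"TrafficSign": {"shape": "triangle"}},
           {("Car", "behind", "TrafficSign")}),
]

print(attribute_coverage(candidates, "TrafficSign", "shape",
                         ["circle", "triangle", "rectangle"]))
print(relation_coverage(candidates,
                        [("Tree", "occludes", "TrafficSign"),
                         ("Car", "behind", "TrafficSign")]))
```

A dataset builder could keep adding candidate images until both ratios reach a chosen threshold, which is one plausible reading of how the criteria filter candidate datasets.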

It is recommended to move Section 5.1 so that it explains the graph presented in Fig. 2.

Comments on the Quality of English Language

Line 56: The domain model concept should be briefly described. What is a “domain model”, possibly one supported by an ISO/IEC standard?

Lines 77-79: where it is mentioned “We trained the model using a dataset”, the authors should consider providing a link to the dataset. Likewise, for “trained the model using an candidate dataset collected randomly”, a link to the dataset should be provided.

Lines 132-138: data requirements - please review because it is too vague/generic and needs to be more objective with fewer activations. At this point, the reader wants more concrete/verifiable contextualization to properly sustain the affirmations regarding the data requirements landscape and eventually, compare or clarify how it differs from a data-centric approach.

Lines 145-1589: consider summarizing/rephrasing the description of Fig. 1; it is too vague/generic. The authors must provide a deeper explanation of the “workflow”.

Also, consider redesigning the workflow diagram in Fig 1 to become more detailed – this Fig does not map or clarify the Ontology-Based Data Requirements Analysis Approach. The data governance lifecycle schema/workflow should consider more informational artefacts (or processor, business rules, etc.) to reach the Final Dataset

Just in this paragraph, the adjective “crucial” is mentioned five times. You should review the paper to avoid repetition and use fewer adjectives. It is preferable to adopt a pragmatic approach by providing some examples or scenarios to complement or clarify the user's understanding.

Line 193: the authors refer to “four categories, including entity, relation, attribute, and value.”, but they do not explain what these four categories are or when they are used. A mathematical expression is presented in (1) without first introducing the four categories to the reader. It would help to provide a scenario explaining the decision to use/correlate Tree and trafficSign, so that the reader can better follow the authors' viewpoint. This should be added before Fig. 2. The example semantic model provided should be explained to improve learnability and mitigate misinterpretations concerning the domain-related elements.

Lines 292-293: the phrase “As shown in Figure 2, we can see that the relation node Position is directly related to the target entity node TrafficSign” is not clear, because what is directly related is the node TrafficSign with the node Position. In fact, the node Position looks more like a convergence node. A similar observation applies to the node Do.

Lines 308-309: it is not clear how “the proposed research aims to provide engineers with a foundation for describing the learning behaviour of image recognition system and constructing high-quality dataset”. This statement is too generic. Does it mean the approach is tailored to image recognition, and what happens when the image is fuzzy, so that what the picture represents is easily misinterpreted? There is plenty of research work studying the challenges of addressing such fuzziness.

Lines 359-360: the same phrase is repeated: “This research aims to provide engineers with a foundation…”

Line 361: this phrase is repeated… consider reviewing the paragraph Line 361-364; it is too generic, and in this section, it is expected to provide more concrete data-evidence features.

Fig. 5 is oversimplified; a more complete example should be considered so that the reader gets a better understanding of the scene graph drawn from a subset of the generated tuples.

Line 385: “traffic scene dataset”. Until this point, the dataset source has not been mentioned. The manuscript should provide a link or a reference to the traffic scene dataset used for the examples presented in Fig. 7 and Fig. 4 → the CCTSDB2021 China Road traffic data list (mentioned in Line 392).

Line 384: Section 5.1. Experimental setup – the information provided in this section should be closer to Fig. 2 to improve the reader's understanding of the scenario used to present the proposed ontology-based domain model.

Regarding the graph diagram in Fig. 2, it is also not clear whether it is one of the outcomes generated by the supporting tool presented in Fig. 7

Line 433: Table 5 and Table 6 should be merged into a single table.

Also, there are some orthographic mistakes (e.g., line 519: crutial), and a minor grammar review is required (e.g., line 184 “we establish an domain model”); a full review of the paper is recommended to eliminate/correct these writing issues.

Line 515: The conclusion needs to be improved. It does not summarize the outcomes (or gaps) found in the related work or outline the missing point from the authors' perspective. It omits the proposed model, which is the key achievement of the work; instead, it mentions that a tool was developed. This section requires a major review.

Two self-citations were identified: [4] and [10].

Author Response

Thank you very much for taking the time to review this manuscript. Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

You have addressed almost all my concerns. However, the discussion section needs to be further improved. It is not sufficient just to say that it has applicability in.... Elaborate, with some references, on how it can be adopted and on any issues you think are involved. Ideally, this should be a brief section that could motivate interdisciplinary application of your research work.

The rest is fine.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The author has successfully addressed all questions.

Comments on the Quality of English Language

The current English level is sufficient for reading.

Author Response

Thank you for your affirmation.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

Congratulations. You have addressed my concerns. Good luck.

Comments on the Quality of English Language

It's fine.
