Open Access Article

Peer-Review Record

Building Advanced Web Applications Using Data Ingestion and Data Processing Tools

Electronics 2024, 13(4), 709; https://doi.org/10.3390/electronics13040709

by Šimun Šprem¹,*, Nikola Tomažin¹, Jelena Matečić¹ and Marko Horvat²,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 24 December 2023 / Revised: 2 February 2024 / Accepted: 6 February 2024 / Published: 9 February 2024

(This article belongs to the Special Issue Advanced Web Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The objective of the article is an analysis of the tools and technologies used by web applications for data management. It specifies the process of collecting and publishing data in the cloud using data ingestion tools. Next, data processing tools are analysed in order to analyse, refine and derive meaningful information from the raw data.

The paper mainly presents the following issues:

- the data collection and publication process takes the Dataphos technology developed by Syntio as a reference and compares it with the Change Data Capture (CDC) model. The added value of Dataphos is given by referring to the vendor link; there is not a real evaluation phase.

- Concerning the data processing tools, it is not clear whether they are analysed in order to compare them or only to highlight their characteristics. For the key features and challenges listed for each tool, it is unclear whether they are derived from experimentation by the authors or whether they are features specified by the manufacturers. Since these tools have been extensively analysed and compared in the literature, a section on related works would be desirable.

- The discussion section is made with reference to two specific use cases and indicating Dataphos as better than CDC without having reported an experimental comparison. Furthermore, the use of Apache Beam is relegated to the programming language with which to write the pipeline, rather than the specification of reasons for choosing this tool over others. In particular, however, no experiments are reported to show that the choice of Spark, Flink or Kafka is the right one with respect to the use cases presented.

- In most cases, the references in the paper are not pointed with respect to individual assertions but specified at the argument level of the section or subsection.

- All Figures should have a short explanatory title and caption. Report explanations in the text of the section where the figure is recalled.

- All Tables must be cited in the text.

- Check References section with reference to MDPI Reference List and Citations Style Guide.

The keywords in the article do not seem appropriate.

Some typos:

- Fig. 3 should be resized.

- sec. 3.2.2, in title: Flink, not Spark.

- table 8 and table 9: wrong table caption.

Author Response

Response to Reviewer 1 Comments

Point 1: The data collection and publication process takes the Dataphos technology developed by Syntio as a reference and compares it with the Change Data Capture (CDC) model. The added value of Dataphos is given by referring to the vendor link; there is not a real evaluation phase.

Response 1: Thank you very much for your comment.

For the revised version, we have conducted a new experiment comparing a typical CDC platform and the Dataphos Publisher using a standardized dataset. The experiment is described in a substantially expanded Section 2.4 “CDC vs. Dataphos Publisher”. For the experiment we selected the Microsoft Wide World Importers (WWI) database as the dataset and the Debezium platform as the CDC tool. The selected dataset and the CDC tool are described, as well as the reasoning behind their selection. The results show that Dataphos performs better in scenarios where rapid data propagation and minimal latency are critical factors. The results of the experiment are presented in Table 3 and discussed in Section 2.4.

Point 2: Concerning the data processing tools, it is not clear whether they are analysed in order to compare them or only to highlight their characteristics. For the key features and challenges listed for each tool, it is unclear whether they are derived from experimentation by the authors or whether they are features specified by the manufacturers. Since these tools have been extensively analysed and compared in the literature, a section on related works would be desirable.

Response 2: Thank you for your valuable suggestion. In the revised version we have thoroughly transformed and considerably expanded Section 2 “Data Ingestion” (pages 2-8).

We have added a new Section 2.1, “State of the art in data ingestion tools” (pages 3-5), that examines the main capabilities and common applications of state-of-the-art (SOTA) data ingestion tools. In this section two comparative tables (Table 1 and Table 2, page 4) were added. Table 1 contains a qualitative comparison of the main functionalities of the Apache Flume, Apache NiFi and Apache Kafka Connect data ingestion tools, with a recommendation of the optimal tool for each functionality. Table 2 compares the performance indicators of Apache Flume, Apache NiFi and Apache Kafka Connect, with a recommendation of the optimal tool based on their respective capabilities for each indicator.

Section 2.2 “Change Data Capture” (pages 5 and 6) has been improved with new references to relevant literature and the latest developments in this area.

Furthermore, in the revised version we have amended Section 2.3 (page 6) and enhanced the description of the Dataphos architecture together with new references on this topic.

Also, as described in the previous response, in the revised version in Section 2.4 (pages 6-8) we have described a new experiment that compared data processing performance of the Dataphos publisher and a selected typical CDC platform. The experimental results show in which circumstances or usage scenarios the Dataphos is a better platform than CDC.

Point 3: The discussion section is made with reference to two specific use cases and indicating Dataphos as better than CDC without having reported an experimental comparison.

Response 3: In the revised version, we conducted a new experiment comparing the data processing performance of Dataphos and a CDC platform. The experiment is described in Section 2.4 (pages 7 and 8). The experiment utilized a well-known and openly available dataset so it can be replicated by other researchers. The results indicate that Dataphos is a better choice in specific scenarios that are explained in Section 2.3.

Also, the Discussion section has been transformed and improved (pages 23-25) with a description of two (2) distinct use cases in Sections 4.1 and 4.2 (pages 24 and 25, respectively). The discussion summarizes the findings and explains the selection process for the optimal data processing tool in the presented commonplace scenarios. Finally, the references section has been substantially expanded regarding the selection of optimal data engineering tools for different usage scenarios.

Point 4: Furthermore, the use of Apache Beam is relegated to the programming language with which to write the pipeline, rather than the specification of reasons for choosing this tool over others.

Response 4: In the revised version (Section 3 and Section 3.4, pages 9 and 20, respectively), we have addressed the concerns regarding the representation of Apache Beam in our analysis. As correctly pointed out, Apache Beam is fundamentally different from other data processing tools in that it serves primarily as an SDK for defining and describing data processing pipelines rather than as a data processing engine itself.

In Section 2.4, we have conducted an experiment and quantitative comparison specifically between the Dataphos platform and a typical CDC tool. This was done to provide a concrete, measurable basis for comparison between these two tools. In contrast, our discussion of Apache Beam is more qualitative in nature. We clarify that the role of Apache Beam is not directly comparable to other data processing engines such as Apache Spark or Flink. Instead, it provides a layer of abstraction that enables the definition and management of pipelines across different processing engines.

In the revised version we also cover how Apache Beam can be used in conjunction with other data processing tools. This is particularly important as in practice Apache Beam is often used in conjunction with these tools to leverage its capabilities in orchestrating and defining pipelines. In this way, we aim to create a better understanding of Apache Beam's unique position in the data processing ecosystem.

Our approach in the revised version emphasizes that the overview of tools is not meant to be a direct, quantitative comparison between all tools. Rather, it is an examination of their respective roles and strengths in the context of data processing, with a particular focus on practical applications and interoperability, as evident in the detailed analysis of Dataphos and CDC.

Additionally, in the revised version we have endeavored to substantiate our claims with quantitative data wherever possible, drawing on a variety of sources such as technical reports, peer-reviewed papers, and industry publications. This approach ensures that our findings and recommendations are not only based on professional experience but are also supported by measurable metrics and evidence from the wider research community.

Point 5: In particular, however, no experiments are reported to show that the choice of Spark, Flink or Kafka is the right one with respect to the use cases presented.

Response 5: To streamline the findings and to show the decision process behind the choice between Spark, Flink or Kafka, a new Discussion section was added in the revised version of the paper (pages 23 and 24) containing a flowchart (Figure 11, page 24) for selection of the optimal data processing tool. The information was summarized through two (2) distinct use cases in Sections 4.1 and 4.2 (pages 24 and 25, respectively) that involve Spark, Flink and Kafka, and which explain in detail the selection process for the optimal tool in commonplace data processing scenarios.

Also, we have considerably expanded the references section and added new citations to existing research regarding the selection of optimal data engineering tools for different usage scenarios.

Point 6: In most cases, the references in the paper are not pointed with respect to individual assertions but specified at the argument level of the section or subsection.

Response 6: The reference list has been expanded to increase the breadth and depth of academic sources supporting our arguments and assertions. In addition, the references have been redistributed throughout the text so that they are no longer found only in the introductory paragraphs of each chapter or section. Now the references are interwoven throughout the paper, aligning more closely with the individual assertions and key points in the various paragraphs. This approach provides a more detailed citation style and ensures that each important statement is directly supported by a corresponding reference.

Point 7: All Figures should have a short explanatory title and caption. Report explanations in the text of the section where the figure is recalled.

Response 7: All figures and tables have been revised, as well as their references in the text. In the amended version of the paper, all figures and tables have a title and caption with an appropriate explanation.

Point 8: All Tables must be cited in the text.

Response 8: In the revised version, all tables and figures are cited in the text.

Point 9: Check References section with reference to MDPI Reference List and Citations Style Guide.

Response 9: In the revised version all references have been checked to comply with the MDPI Reference List and Citations Style Guide. DOIs were listed where available.

Point 10: The keywords in the article do not seem appropriate.

Response 10: In the revised version the list of keywords has been expanded, optimized with regard to the content of the article and aligned with the IEEE taxonomy vocabulary.

Point 11: Some typos:

- Fig. 3 should be resized.

- sec. 3.2.2, in title: Flink, not Spark.

- table 8 and table 9: wrong table caption.

Response 11: Thank you for bringing these problems to our attention. All typos in the revised version have been corrected.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This article focuses on data ingestion and data processing solutions in the field of Web Applications. In terms of data ingestion it presents a novel tool called Dataphos Publisher and compares it to the Change Data Capture process. In terms of data processing it reviews the strengths and weaknesses of four data processing tools and provides insight on the process of deciding which tool is more suitable for different use cases.

Overall the paper is well written and the reviews of the various tools and their functionality are thorough.

One important issue with the article in its current form, is that it reads more like a technical report than an actual research article.

In order to improve in that regard, the research questions could be stated more explicitly in the introduction section along with a summary of the research’s goals and findings. Additionally, equivalent answers should be provided in the conclusion section detailing the research’s findings. These answers should be discussed in the context of recent research activity in this field and, if possible, compared with the findings of other researchers, and contradictions or similarities should be noted.

Moreover the methodology of comparing the various tools should be made clearer. Did the researchers implement different solutions for the purposes of this research? Are the various claims backed by metrics or by other research in the field? Even if a lot of the provided insight is derived from professional experience, the maximum effort should be made to corroborate any claims with quantitative data or findings from other reports or research articles.

To that end, the article should be improved with more references and a clearer indication of which information is derived from each reference. The introduction section mentions various facts without providing any reference at all. In Sections 2 and 3 the references are presented loosely at the end of each subsection. In order to achieve both better clarity and more reliability each claim must be immediately followed by a relevant reference/source so it is made clear what is a sourced fact and what is the researcher’s opinion.

Besides improving the references to the existing text it would be a good idea to provide a more detailed presentation of state of the art research in the field of data ingestion and processing, based on previously published works in the field. A lot of the article’s references are derived from online documentation and informational Webpages and not enough from peer reviewed articles or reviews.

With regards to data processing it is important to provide the reader with as many sources as possible backing the authors’ claims concerning the various tools.

With regards to data ingestion, it is important to provide the reader with sources about the current landscape, the design philosophy and the actual needs that lead to the proposed Dataphos Publisher, rather than just describe its components and implementation.

Maybe adding a comparative table that presents summary information on all the four tools used would also help with streamlining the findings.

Some more additional minor notes are:

In line 327 the title of the subsection is “Challenges of Apache Spark” but it should be “Challenges of Apache Flink” since this is what the subsection covers.

After line 435, figure 10 is inserted halfway through a sentence. Move the figure to the start or the end of the paragraph to improve readability.

The level of English of the article is very high and only some minor proofreading might be necessary.

Author Response

Response to Reviewer 2 Comments

Point 1: This article focuses on data ingestion and data processing solutions in the field of Web Applications. In terms of data ingestion it presents a novel tool called Dataphos Publisher and compares it to the Change Data Capture process. In terms of data processing it reviews the strengths and weaknesses of four data processing tools and provides insight on the process of deciding which tool is more suitable for different use cases. Overall the paper is well written and the reviews of the various tools and their functionality are thorough.

Response 1: Thank you very much for your kind comments.

Point 2: One important issue with the article in its current form, is that it reads more like a technical report than an actual research article. In order to improve in that regard, the research questions could be stated more explicitly in the introduction section along with a summary of the research’s goals and findings.

Response 2: Thank you very much for your helpful suggestions.

We have made several changes to the revised version to improve the scientific contributions of the paper.

Firstly, in the introduction section, we have explicitly stated the aim of the paper and the research questions. Our main goal with this research is threefold: 1) to present the Dataphos platform, 2) to provide a review of the most common data processing platforms, and 3) to assist other scientists and data engineers in selecting the most appropriate architecture and setup for their needs. We have also summarized the goals and findings of our research at the beginning of the article.

Secondly, in the revised version we have conducted a new experiment comparing a typical CDC platform and the Dataphos Publisher using a standardized dataset. The experiment is described in an expanded Section 2.4 “CDC vs. Dataphos Publisher”. For the experiment we selected the Microsoft Wide World Importers (WWI) database as the dataset and the Debezium platform as the CDC tool. The selected dataset and the CDC tool are described, as well as the reasoning behind their selection. The results show that Dataphos performs better in scenarios where rapid data propagation and minimal latency are critical factors. The results of the experiment are presented in Table 3 and discussed in Section 2.4 (page 8).

Thirdly, we have added a new Section 2.1. “State of the art in data ingestion tools” (pages 3-5) that examines main capabilities and common applications of state-of-the-art (SOTA) data ingestion tools. In this section two comparative tables (Table 1 and Table 2, page 4) were added. Table 1 contains a qualitative comparison between main functionalities of Apache Flume, Apache NiFi and Apache Kafka Connect data ingestion tools with the recommendation for the optimal tool for each of the functionalities. Table 2 presents a comparison between Apache Flume, Apache NiFi and Apache Kafka Connect data ingestion tools performance indicators with the recommendation for the optimal tool based on their respective capabilities related to the specific indicator.

Finally, an improved qualitative analysis of different data processing tools and their optimal application scenarios is described in the Discussion section (pages 23-25). Two commonplace usage scenarios (i.e., use cases) are brought forward and explained. Also, in the revised version, Sections 4.1 and 4.2 (pages 24 and 25, respectively) provide a clear and practical framework for making informed decisions in the field of data engineering, in the form of a flowchart accompanied by a thorough description of the process for selecting the optimal data processing platform for a concrete usage scenario.

All these modifications aim to enhance the academic quality of the paper in the revised version and ensure it aligns more closely with the standards of a research article, rather than a technical report.

Point 3: Additionally, equivalent answers should be provided in the conclusion section detailing the research’s findings.

Response 3: In the revised version, the conclusion section has been substantially improved in scope and depth. The purpose of the paper has been clearly stated, and the research findings have been detailed. An outlook for future research and applications of the research findings has been given.

Point 4: These answers should be discussed in the context of recent research activity in this field and, if possible, compared with the findings of other researchers, and contradictions or similarities should be noted.

Response 4: In the revised version we have endeavored to substantiate our claims with quantitative data wherever possible, drawing on a variety of sources such as technical reports, peer-reviewed papers, and industry publications. This approach ensures that our findings and recommendations are not only based on professional experience but are also supported by measurable metrics and evidence from the wider research community.

Point 5: Moreover the methodology of comparing the various tools should be made clearer. Did the researchers implement different solutions for the purposes of this research? Are the various claims backed by metrics or by other research in the field? Even if a lot of the provided insight is derived from professional experience, the maximum effort should be made to corroborate any claims with quantitative data or findings from other reports or research articles.

Response 5: In the revised version, the methodology for comparing the data processing tools (Apache Spark, Apache Flink, Kafka Streams, and Apache Beam) has been explained in Section 3 “Data processing tools” (page 8). The statements have been corroborated with citations from a substantially expanded and improved set of references.

In terms of comparing data processing tools, our approach is to contrast CDC (Change Data Capture) and Dataphos while providing a comprehensive overview of other important data processing tools. As far as CDC and Dataphos are concerned, we implemented both solutions as part of our research to provide direct comparative insights in Section 2.4. This allowed us to back up our claims with first-hand empirical data and ensure a robust and practical evaluation.

For other tools such as Apache Spark, Apache Flink, Kafka Streams and Apache Beam, the comparison is based on a comprehensive review of existing literature and current research in the field.

Point 6: To that end, the article should be improved with more references and a clearer indication of which information is derived from each reference.

Response 6: Thank you very much for your kind comment. New references have been added throughout the text in the revised version, ensuring that the topics covered are comprehensive and well-supported.

Each reference is now carefully linked to the relevant sections rather than just the chapter's beginning and ending paragraphs. This methodology provides a clearer context for each reference, making it easier for readers to understand where specific information originated and how it relates to the overall narrative and research findings.

Point 7: The introduction section mentions various facts without providing any reference at all.

Response 7: Thank you very much for your comment. The Introduction has been updated with two new references labeled [1] and [2]. These references confirm the facts presented in this section, providing a strong academic basis for the assertions made. The inclusion of these references ensures that the information presented is verifiable and based on existing research.

Point 8: In Sections 2 and 3 the references are presented loosely at the end of each subsection. In order to achieve both better clarity and more reliability each claim must be immediately followed by a relevant reference/source so it is made clear what is a sourced fact and what is the researcher’s opinion.

Response 8: Thank you for your comment. In the revised version, the reference list has been expanded to increase the breadth and depth of academic sources supporting our arguments and assertions. In addition, the references have been redistributed throughout the text so that they are no longer found only in the introductory paragraphs of each chapter or section. Now the references are interwoven throughout the text so that they are better aligned with the individual assertions and key points in the various paragraphs. This approach provides a more detailed citation style and ensures that each important statement is directly supported by a corresponding reference.

Point 9: Besides improving the references to the existing text it would be a good idea to provide a more detailed presentation of state of the art research in the field of data ingestion and processing, based on previously published works in the field.

Response 9: Thank you for your valuable suggestion. In the revised version a new section was added (2.1 “State of the art in data ingestion tools”, pages 3-5) that examines main capabilities and common applications of state-of-the-art (SOTA) data ingestion tools. In this section two comparative tables (Table 1 and Table 2, page 4) were added. Table 1 contains a qualitative comparison between main functionalities of Apache Flume, Apache NiFi and Apache Kafka Connect data ingestion tools with the recommendation for the optimal tool for each of the functionalities. Table 2 presents a comparison between Apache Flume, Apache NiFi and Apache Kafka Connect data ingestion tools performance indicators with the recommendation for the optimal tool based on their respective capabilities related to the specific indicator.

Point 10: A lot of the article’s references are derived from online documentation and informational Webpages and not enough from peer reviewed articles or reviews.

Response 10: Thank you very much for your valuable comment. We completely agree that online documentation is not a desirable academic source. For rapidly evolving technologies such as data processing, however, it is sometimes necessary to use credible online sources because no equivalent references are available in scientific bibliographical databases. In the revised version we replaced as many references to web sources as possible with citations to scientific papers. Some references to online documentation have been retained, but they have been augmented with additional scientific references to conference and journal papers or book chapters.

Point 11: With regards to data processing it is important to provide the reader with as many sources as possible backing the authors’ claims concerning the various tools.

Response 11: The list of references has been substantially expanded, especially those focusing on the domain of data processing. The revised version now contains 49 references, compared to the original version which contained only 29. Additionally, in the revised version, we replaced as many references to web sources as possible with citations to scientific papers.

Point 12: With regards to data ingestion, it is important to provide the reader with sources about the current landscape, the design philosophy and the actual needs that lead to the proposed Dataphos Publisher, rather than just describe its components and implementation.

Response 12: Thank you very much for your valuable comment. Section 2.1 has been modified to more fully address the current landscape and design philosophy of data ingestion. This includes a more in-depth look at current data ingestion methods as well as the specific requirements that motivated the development of Dataphos Publisher. Additionally, new references (2, 3, 4, 5, 7) have been added to provide deeper insights into the most recent advancements and trends in this topic. These enhancements aim to place the detailed description of Dataphos Publisher within a broader understanding of its relevance and application in the dynamic setting of data management.

Point 13: Maybe adding a comparative table that presents summary information on all the four tools used would also help with streamlining the findings.

Response 13: Thank you very much for your helpful comment. To streamline the findings, a new Discussion section was added in the revised version of the paper (pages 23 and 24) containing a flowchart (Figure 11, page 24) for selection of the optimal data processing tool. The information was summarized through two (2) distinct use cases in Sections 4.1 and 4.2 (pages 24 and 25, respectively), which explain in detail the selection process for the optimal tool in commonplace data processing scenarios.

Also, we have considerably expanded the references section and added new citations to existing research regarding the selection of optimal data engineering tools for different usage scenarios.

Point 14: Some more additional minor notes are:

In line 327 the title of the subsection is “Challenges of Apache Spark” but it should be “Challenges of Apache Flink” since this is what the subsection covers.

After line 435, figure 10 is inserted halfway through a sentence. Move the figure to the start or the end of the paragraph to improve readability.

Response 14: Thank you. In the revised version these issues have been corrected as suggested.

Point 15: The level of English of the article is very high and only some minor proofreading might be necessary.

Response 15: Thank you for your suggestion. In the revised version, English spelling and grammar have been re-checked and improved throughout the manuscript by professional services.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed all critical aspects of the first revision and have made the necessary changes.

Typos:
- sec 2.1, delete “(SOTA)” acronym.
- sec 3.4, “in Java, PythonGo, ..” should be “in Java, Python, Go, …”

Author Response

Response to Reviewer 1 Comments

Point 1: The authors have addressed all critical aspects of the first revision and have made the necessary changes.

Typos:

- sec 2.1, delete “(SOTA)” acronym.

- sec 3.4, “in Java, PythonGo, ..” should be “in Java, Python, Go, …”

Response 1: The typos have been corrected in the revised version. Also, we have made several amendments throughout the text and in figures:

- Improvements have been made in wording, punctuation, formatting, and the use of references.

- Importantly, all Figures have been enhanced in quality and detail.

The authors deeply appreciate Reviewer 1's thoughtful feedback and the recognition of the efforts we have made in addressing the critical aspects pointed out in the first revision. Your constructive comments have greatly helped us to refine our paper and ensure its academic rigor and clarity.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The numerous points of the first review were addressed in a satisfactory manner. The manuscript in its current form presents a very interesting read in a cutting edge technological field, is well sourced and contains valuable new insights.

Author Response

Response to Reviewer 2 Comments

Point 1: The numerous points of the first review were addressed in a satisfactory manner. The manuscript in its current form presents a very interesting read in a cutting edge technological field, is well sourced and contains valuable new insights.

Response 1: The authors sincerely thank Reviewer 2 for acknowledging our efforts to address the points raised in the first review. We are delighted to hear that the paper now presents an interesting read and contributes valuable insights in this cutting-edge technological field. Your feedback and guidance have greatly improved the quality of our work.

Author Response File: Author Response.pdf
