Peer-Review Record

An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments

Appl. Sci. 2022, 12(1), 122; https://doi.org/10.3390/app12010122

by Jongtae Lim¹, Byounghoon Kim¹, Hyeonbyeong Lee¹, Dojin Choi¹, Kyoungsoo Bok² and Jaesoo Yoo¹,*

Reviewer 1: Dominik Tomaszuk

Reviewer 2: Anonymous

Submission received: 7 December 2021 / Revised: 18 December 2021 / Accepted: 21 December 2021 / Published: 23 December 2021

(This article belongs to the Special Issue Integration of AI and Database Technologies, Its Applications)

Round 1

Reviewer 1 Report

The paper proposes a new distributed SPARQL query processing scheme considering communication costs in Spark environments to reduce I/O costs during SPARQL query processing.

The paper is quite innovative and well suited to be considered an introductory text at an advanced level for the specific academic and industrial research communities.

Minor comments:
1. Ref. 5 is not correct. It is better to cite RDF 1.1 Concepts and Abstract Syntax: https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
2. SPARQL does not stand for "Simple Protocol and RDF Query Language". It is a recursive acronym for SPARQL Protocol and RDF Query Language.
3. Ref. 11 is not correct. First, SPARQL is now at version 1.1 (the authors cite 1.0). Second, it is better to cite it as a standard, with editors etc.
4. It is better to use a monospace font for the inline code.
5. In chapter 3 -> In Section 3
6. Figure 3 and Figure 5 can be more readable if the authors prepare those figures as vectors.

Overall, the approach is not absolutely novel; however, the motivation and experiments justify my choice to accept the paper.

Author Response

Dear sir,

We would like to sincerely thank you for your attentive indications and good comments. Our paper was partially rewritten to address your comments. Please refer to the attached file, named "Reply to the reviewer1.docx".

Many thanks.

Jaesoo Yoo

Author Response File: Author Response.docx

Reviewer 2 Report

Dear authors,

This article addresses an interesting topic that could be of interest to the community. The reported research may be a valuable finding that deserves to be published. The content structure of this article follows a logical order. However, the quality of the writing is unacceptable in some parts: the text does not seem to be well constructed, the details are imprecise, the use of some verb tenses is not correct and/or some expressions are not clear. This must be improved to reach the minimum level expected for a scientific journal article.

Below, I provide a few specific comments and possible recommendations for improving the article.

General comment

A question that usually arises for articles based on an implemented solution is whether they provide any link or data that enable other researchers and reviewers to verify what is presented in the article. I could not find that the authors provided this information.

Abstract: is appropriate and reflects enough what readers would expect to find in the article.

1. Introduction

Line 35: “SPARQL is an ontology query language” should say “SPARQL is an RDF query language”, since it is usually used to query RDF datasets, not ontologies.

Lines 36-38: the text “which mainly consists of PREFIX, SELECT, and WHERE clauses, which express an RDF graph set used in a query, query results, and conditions to produce query results, respectively” is quite imprecise. It should be rephrased.

Lines 40-43: The context is not clear in sentence “As web-based services increase in scale, it is no longer feasible that all RDF graphs are stored in a single repository, and performance degradation occurs when a large scale of RDF graph are queried and processed”. It has never been feasible for all RDF graphs to be stored in a single repository, or, what do you mean by “all”?

2. Related work

Line 84: consider changing the beginning of the sentence “[39] proposed…” to “In [39], Leida and Chu proposed…”. The same applies to the sentences beginning on lines 93, 104, 111 and 139.

Perhaps the authors could develop a little more the aspects (nature, causes, etc.) that surround the problem that occurs when a large amount of RDF graphs have to be stored in a single node and why, mentioned in the last part of this section (to highlight the value of what they propose as a solution).

3. The Proposed Distributed SPARQL Query Processing Scheme

At the end of section 1 the sections of the article are described, but starting with the first paragraph of this section, authors talk about “chapters”. Please choose one or the other (journal articles are usually made up of sections).

5. Conclusions

Although the authors could provide more details about the possible applications and potential use of the distributed SPARQL query processing scheme they propose, the content meets what a reader would expect to find in this section.

Typographic errors: Not detected.

References:

The number of references is sufficient for the research. All references are in the same format and are apparently well formatted. However, the authors could provide the DOI for all references that have one.

Please, check if reference 24 on line 45 is repeated.

Check if references [43,44,45] should be [43-45] (authors should use the same style).

Additional comments

Are the authors aware of this reference?

Amann, B., Curé, O., & Naacke, H. (2018). Distributed SPARQL Query Processing: a Case Study with Apache Spark. NoSQL Data Models: Trends and Challenges, 1, 21-55.

It is not important, but in addressing the topic of distributed SPARQL query processing schemes, I would expect to read something about concepts like query mediation / mediators, mediation layer,...

Author Response

Dear sir,

Many thanks.

Jaesoo Yoo

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

Very poor paper, which does not have the dignity to even be considered for review.
It looks like a work written by a teenager who is not brilliant at writing.

The text is full of English errors, misuse of words, repetitions of concepts, no clear description of ideas, confusing sentences, lack of methodological soundness.
In other words, what a scientific paper should not be.

With these premises, it is impossible for me to read it all: the corrections I put in the paper (you can see the attached manuscript with my annotations) were so many that I decided to stop reading the paper after only 5 pages.

In detail:
- You write "query processing scheme" but you mean "query-processing scheme".
- "WHERE clause conditions": you should write either "WHERE clause" or "WHERE condition", not both.
By the way, how many conditions are in the WHERE clause? It appears that you confuse the concept of "condition" with the concept of "predicate" or "sub-condition". This creates a lot of confusion for the reader.
- You introduce the name "Spark" without any reference. References must be put immediately after the first occurrence of a name.
- In Section 2, you say that the Map-Reduce programming paradigm is an extension of Spark. Are you sure? As far as I know, Spark is natively based on Map-Reduce and there are extensions (such as SparkSQL) that hide the Map-Reduce approach to simplify processing.
- Section 2 is in general written in a confused way. In particular, since you identified a pool of characteristics/features that can characterize parallel processing of SPARQL queries, you should put them into a table and refer to it, by isolating those contributions from the other ones. Furthermore, you should always denote in what aspect your work differs from the previous proposals.
Again, you mix papers based on Spark and papers that are not based on Spark. Make order!!!
- Section 3 starts immediately with subsection 3.1. That's very bad!!! An introductory sentence or paragraph is essential to conduct the reader through the remainder of the section.
- The first paragraph of Section 3.1 is simply "awful". What a mess!!!
- In Section 3.1 you are introducing fundamental concepts on the fly, without any formalism. Very bad!!!
Formal concepts must be introduced formally, with clear and numbered definitions. The concept of a SPARQL query must be formally specified!!! You cannot rely on the hypothesis that "the reader already knows SPARQL in detail"!!! And you also have to define what a SPARQL query is supposed to produce.
Only at this point, you can introduce the "Problem": what kind of problem are you addressing? It is not enough to say "to quickly execute SPARQL queries in a distributed environment", I suppose this is not the first proposal in this sense.
- You write "I/O data cost and communication cost increase during query processing".
However, "to increase" is a relative concept, i.e., there are basic costs that increase. So, with respect to what basic costs (and in what configuration/scenario) are you considering that they increase? Perhaps a previous execution scheme?
- You refer to an "RDF index" without reporting its definition. What is it?

At this point (the end of page 5) I decided to stop reading the paper.

You can understand that your paper is far far far away from the minimal dignity to be considered for publication. I do not know what to say: if you are unable to properly write and your level in English writing is low, involve some experienced authors that can help you in this sense.
You did not make it possible for me to evaluate the technical contribution of the paper.

Comments for author File: Comments.pdf

Author Response

Dear Reviewer,

We would like to sincerely thank you for your attentive indications and good comments. We asked native speakers to carefully review our manuscript.

Please refer to the attached file and revised manuscript for the detailed revisions.

Many thanks.

Jaesoo Yoo

Author Response File: Author Response.docx

Reviewer 2 Report

This paper presents a cost-based distributed SPARQL query processing for distributed RDF data in the Spark environment. The authors claim that existing SPARQL query processing schemes either do not consider the communication cost or rely on special assumptions such as data replication. The proposed scheme takes the communication, join, and queuing costs into consideration when processing a SPARQL query on distributed RDF data. The disk I/O cost is also reduced due to the in-memory Spark environment. The query processing process is presented, which consists of decomposing the user query into subqueries, grouping subqueries based on related nodes via indexing and adjacent subqueries, generating possible query execution paths, calculating costs of query execution paths, selecting the optimal query execution path with the least cost, performing query processing tasks at the slaves, and integrating the intermediate results sent from the slaves into the final query result at the master. The cost calculation process is described. The performance evaluation results are also reported.
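The selection step summarized above (generate candidate execution paths, cost each one, keep the cheapest) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' cost model: the subquery names, row counts, node names, the `path_cost` formula and its `comm_per_row` and `join_per_row` weights are all hypothetical.

```python
from itertools import permutations

# Hypothetical statistics per subquery: estimated result rows and the node holding the data.
SUBQUERIES = {
    "sq1": {"rows": 1000, "node": "slave1"},
    "sq2": {"rows": 50, "node": "slave2"},
    "sq3": {"rows": 400, "node": "slave1"},
}


def path_cost(path, stats, comm_per_row=1.0, join_per_row=0.1):
    """Toy cost of one execution path (assumed formula, not taken from the paper)."""
    cost = 0.0
    rows = stats[path[0]]["rows"]
    node = stats[path[0]]["node"]
    for name in path[1:]:
        nxt = stats[name]
        if nxt["node"] != node:
            # Communication cost: ship the intermediate result to the other node.
            cost += comm_per_row * rows
            node = nxt["node"]
        # Join cost: proportional to the sizes of both inputs.
        cost += join_per_row * (rows + nxt["rows"])
        # Simplifying assumption: a join never grows the intermediate result.
        rows = min(rows, nxt["rows"])
    return cost


def best_path(stats):
    """Enumerate every ordering of the subqueries and return the cheapest one."""
    return min(permutations(stats), key=lambda p: path_cost(p, stats))


if __name__ == "__main__":
    chosen = best_path(SUBQUERIES)
    print(chosen, path_cost(chosen, SUBQUERIES))
```

In this toy setting, starting from the most selective subquery keeps the shipped intermediate results small, which is the intuition behind weighing communication cost when ordering subqueries.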

The paper is easy to follow. The overall logic flows well. The discussion on related work is proper. However, the paper could be improved in several aspects:

* The relationship between the proposed scheme and the scheme from reference [43] is unclear. From the related work discussion, the two schemes are quite similar to each other, and both consider the communication and other relevant costs. What is new for the proposed scheme? Also, what was the existing scheme used in the experiments? The last sentence of the first paragraph of Section 4 is confusing.

* The query processing procedure is described via specific examples. It is unclear if all the general cases are considered. The authors may either present the general algorithms/processing rules or cite related references or sources (e.g., a source code website).

* Cost estimation is essential for the cost-based query processing scheme presented in the paper. Although how to combine the costs of subqueries for a node or query path is discussed, it is unclear how to estimate the cost of a basic operation such as a join or a union. Is it estimated by an assumed analytical formula or via a query sampling method like the one used for estimating local query costs for a multidatabase system in the literature? Can such an estimation capture a dynamic (varying workloads) environment?

* The performance evaluation for the paper is rather weak. The datasets used are small. The details of the tested queries (randomly generated?) are not given. Some performance observations (such as why the performance of “Group” is better than that of “Optimal” in Figure 7, and why the performance difference between “Proposed” and “SparkRDF” in Figure 8(d) is smaller than in the other subfigures) need to be analyzed and discussed. The claim “Figure 10(a) shows the small communication and join costs due to fewer intermediate results than other queries, whereas Figure 10(b) shows the large communication and join costs incurred due to larger intermediate results.” is not supported by the figures.

* Generating all the query execution paths could incur a big overhead especially when the query is large. Was this overhead included in the query performance observed in the experiments?
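The concern about enumeration overhead in the last point can be made concrete with a small count. The sketch below assumes, purely for illustration, that every linear ordering of the subqueries counts as one candidate execution path; the manuscript's grouping step may prune this space, so the numbers are an upper bound for that simplified model rather than a measurement of the proposed scheme. The function name `count_orderings` is hypothetical.

```python
from math import factorial


def count_orderings(n_subqueries: int) -> int:
    """Number of distinct linear orderings of n subqueries, i.e. n!."""
    return factorial(n_subqueries)


if __name__ == "__main__":
    for n in (3, 5, 8, 10):
        print(f"{n} subqueries -> {count_orderings(n)} candidate orderings")
```

Under this assumption, a query with 10 subqueries already yields 3,628,800 orderings, so whether plan generation time is included in the reported query times matters for larger queries.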

Author Response

Dear Reviewer,

We would like to sincerely thank you for your attentive indications and good comments. We asked native speakers to carefully review our manuscript.

Please refer to the attached file and revised manuscript for the detailed revisions.

Many thanks.

Jaesoo Yoo

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The general impression I had of the first version of the paper is confirmed:
"a homework made by a teenager without interest for the work itself".
This is the impression I still have reading the paper. This impression is confirmed by the way you corrected the paper: when I signalled a specific error, you made the correction uncritically, introducing a lot of errors too. It seems that your English-speaking person does not know written English.

Focusing on the issues, they are the following ones.
1) Quality of English
2) Quality of presentation
3) Lack of essential information
4) Very poor results

Issue 1
-------
If you do not know how to use inversion, do not use it.
I signalled that "query processing scheme" is wrong and that it should become "query-processing scheme", but if you say (e.g.) "for query-processing", this is wrong too: you have to write "for query processing" (without the hyphen, because now the main word is "processing", so "query" refers to "processing").
And if you write "query execution plan", of course this is the same situation as before, so the correct version is "query-execution plan", because "query" refers to "execution" and not to "plan".
This is why I say that your approach reminds me of a teenager!!!

Finally, it seems that you do not know how to use articles ...

Issue 2
-------
The quality of presentation is poor throughout the paper. Often, sentences are not clear, you used the wrong tense for verbs and you assume wrong meanings for words.
I already told you that some concepts are necessarily relative: if you say that something "increases", it is necessary to say with respect to what situation it increases.
Again this is the approach of a teenager, not the approach of a researcher. And you have not tried to correct the problem.
Furthermore, you repeated three times what Spark is!!! Why? This gives the impression that the paper is a collage of pieces coming from other papers, so the suspicion of plagiarism arises naturally.
Section 2 is short and confused. I told you to use a table to summarize the characteristics of the approaches you compare with, and to re-organize the presentation. But nothing.
In fact, it seems that there are several approaches based on Spark, but at the end you cite only one. So, what is the situation?

Issue 3
-------
I find it absolutely absurd that in a paper that presents how to optimize the execution of a SPARQL query, no SPARQL queries are reported at all. This approach might be acceptable only for very specialized workshops, not for a journal paper.
In my previous review, I told you to introduce a specific subsection in Section 2, moving sentences and paragraphs that tried to present SPARQL features from the Introduction to a novel subsection of Section 2.
So, when you say that joins can affect the performance, you should explain when and why joins are necessary in a SPARQL query, how RDF graphs are queried, the semantics of the query, and so on.
A reader that is not familiar with SPARQL should be quickly introduced to it: you should widen the audience!!! This is particularly true for a journal like Applied Sciences.
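As an illustration of the point about joins, the following Python sketch shows why two triple patterns that share a variable force a join over their intermediate results. It is not taken from the manuscript: the tiny graph, the predicate names (`worksAt`, `locatedIn`) and the helper functions `match` and `join` are all hypothetical, and the evaluation is a naive nested loop rather than anything Spark-based.

```python
# Toy RDF graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "initech"),
    ("alice", "knows", "bob"),
    ("acme", "locatedIn", "berlin"),
    ("initech", "locatedIn", "austin"),
]


def match(pattern, triples):
    """Return variable bindings for one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated inside one pattern must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def join(left, right):
    """Natural join: keep pairs of bindings that agree on every shared variable."""
    joined = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined


# Roughly: SELECT ?person ?city WHERE { ?person :worksAt ?org . ?org :locatedIn ?city }
people = match(("?person", "worksAt", "?org"), TRIPLES)
places = match(("?org", "locatedIn", "?city"), TRIPLES)
rows = join(people, places)

if __name__ == "__main__":
    for row in rows:
        print(row["?person"], row["?city"])
```

Each triple pattern is matched independently, and the shared variable `?org` is what ties the two result sets together. When the matching triples live on different nodes, one of the intermediate results has to be shipped before the join can run, which is where communication cost enters.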

In Section 4, what queries did you actually perform on the data sets? How many joins were necessary?

The baseline approach is never explicitly mentioned or referred to, but it appears in the figures as SparkRDF. So, is this the baseline? Give a reference and explain how it works, highlighting its limitations; otherwise it is impossible to understand the improvement you propose.

By the way, I understood that your technique finds the "optimal" execution plan, but in Figure 7 there are "Optimal execution plan" and "Proposed" ... so, does this mean that the optimal one is not actually optimal? is your technique more than optimal?

You always claim that RDF graphs are now large, but you never say how large.
Likewise, you never say why RDF graphs are stored in a distributed way.
Furthermore, storage does not mean "computational power": is the storage server a node in the peer-to-peer network on which Spark is executed? In my view, this is not mandatory and it is probably often unrealistic, so to process the RDF graph you have to move its RDF representation to the nodes of the peer-to-peer network before processing it.

So, you understand that if so much information is missing, it is not possible for another researcher to validate your approach. Remember, the mantra of science is "Reproducibility, Reproducibility, Reproducibility".

Issue 4
-------
The improvement you obtain w.r.t. SparkRDF is negligible. In the best case, only 14%, but it means 5 seconds over 35 seconds.
So, this does not appear to be a significant scientific contribution.
Furthermore, are you sure that how data are distributed does not affect performance?
Are you sure that more complex queries are executed faster than by the competitors?
Are you sure that your technique performs properly when the number of nodes in the peer-to-peer network increases?
Of course not, because you have not considered these aspects.

So, this is why I propose to reject this paper. It does not at all have the dignity for a journal.
