Article
Peer-Review Record

A Hybrid Approach Combining R*-Tree and k-d Trees to Improve Linked Open Data Query Performance

Appl. Sci. 2021, 11(5), 2405; https://doi.org/10.3390/app11052405
by Yuxiang Sun, Tianyi Zhao, Seulgi Yoon and Yongju Lee *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 January 2021 / Revised: 22 February 2021 / Accepted: 23 February 2021 / Published: 8 March 2021

Round 1

Reviewer 1 Report

In order to efficiently organise, retrieve and evaluate the LOD cloud, the authors propose a novel hybrid approach that combines the index and live exploration approaches to improve LOD join query performance.

The proposed approach makes it possible to discover interlinked data distributed across multiple resources by using a two-step index structure; it combines a disk-based 3D R*-tree with the extended multidimensional histogram and flash memory-based k-d trees (Figure 6).

The authors also propose a hot-cold segment identification algorithm to identify regions of high interest (i.e., to determine how to reallocate data between SSD and HDD) (Figure 8).

For the experiments, they used the Lehigh University Benchmark (LUBM) dataset, modifying the original version as needed to support all join query types. Their results demonstrate the efficiency of their approach: the query performance of HYBRID is always superior to that of the other methods for all join query types used (based on the real benchmark dataset) (Figure 12).

The paper is well structured and the references used are in accordance with the content.

Strengths:

  • Regarding storage, the proposed HYBRID approach requires the least main memory compared with QUAD, DARQ, and MIDAS, because it can significantly reduce the filtering result size (Figure 13).
  • It offers the best join query processing performance by reducing unnecessary data scanning.

Weaknesses:

  • It is not clear when the RECENT attribute (marking that the data have recently been accessed) is set to 1 (Figure 1).
  • Row 292 states: "From our experiments, we find that the 3D R*-tree stored in SSD does not have much performance effect." Please be more specific about this statement.
  • There is no explanation of the source of the elements used in Eq. 1.
  • The hot-cold segment identification algorithm requires a clearer, more detailed explanation.
  • The expression in row 385 is not clear.
  • The dynamic update of the two-step index structure is a sensitive problem that can influence the efficiency of the entire approach.

Author Response

We thank the anonymous reviewers for their helpful suggestions and comments.

It is not clear when the RECENT attribute (marking that the data have recently been accessed) is set to 1 (Figure 7)

This has been revised on page 9 (rows 365-367), as shown below.

Our algorithm applies the aging mechanism of HotDataTrap and its recency capturing mechanism sets the corresponding RECENT bit to 1 if any data are accessed.
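
As a rough illustration of the recency-capturing behaviour described in this revised sentence, the sketch below keeps a RECENT bit and an access counter per data item, sets the bit to 1 on any access, and clears it during a periodic aging pass. The class name `RecencyTracker`, the counter, the halving rule, the `hot_threshold` parameter, and the hotness test are all assumptions made for illustration; they are not taken from the manuscript or from HotDataTrap itself.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    counter: int = 0  # hypothetical access counter
    recent: int = 0   # RECENT bit: 1 if accessed since the last aging pass


@dataclass
class RecencyTracker:
    hot_threshold: int = 2  # assumed cut-off, not a value from the paper
    table: dict = field(default_factory=dict)

    def access(self, key):
        """Record an access: bump the counter and set the RECENT bit to 1."""
        entry = self.table.setdefault(key, Entry())
        entry.counter += 1
        entry.recent = 1

    def age(self):
        """Assumed aging pass: halve every counter and clear every RECENT bit."""
        for entry in self.table.values():
            entry.counter //= 2
            entry.recent = 0

    def is_hot(self, key):
        """Assumed rule: hot means recently accessed and frequently accessed."""
        entry = self.table.get(key)
        if entry is None:
            return False
        return entry.recent == 1 and entry.counter >= self.hot_threshold
```

In this sketch an item stays hot only if it keeps being accessed between aging passes, which is one plausible reading of how a recency bit and an aging mechanism interact.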

Row 292 states: "From our experiments, we find that the 3D R*-tree stored in SSD does not have much performance effect." Please be more specific about this statement.

This has been revised on page 8 (row 340) and page 15 (rows 583-587), as shown below.

Figure 14 shows the join query performance on the real LOD dataset listed in Table 2 when the 3D R*-tree is kept in SSD and HDD, respectively. This experimental result shows no apparent performance difference between keeping the 3D R*-tree on SSD and on HDD. Hence, we place the 3D R*-tree on HDD, since SSD costs more than HDD.

There is no explanation of the source of the elements used in Eq. 1.

This has been revised on page 9 (row 379), as shown below.

where Hot(#), All(#), and Cold(#) denote the number of hot data, all data, and cold data in the decision table, respectively.

The hot-cold segment identification algorithm requires a clearer, more detailed explanation.

This is explained on page 10 (Figure 8).

The expression in row 385 is not clear.

This has been revised on page 11 (rows 437-438), as shown below.

A similar algorithm is applied to β for the (?s, p, o) ⋈ (?s, p, ?o) query pattern (e.g., select ?n where {?f foaf:lastName Lucy. ?n foaf:knows ?f}).
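
To make the quoted query pattern concrete, the following sketch evaluates the example query over a tiny in-memory triple list by joining the two triple patterns on the shared variable ?f. The sample triples and the names `match` and `who_knows_lucy` are hypothetical, and the plain scan only illustrates the join semantics; it is not the two-step index described in the paper.

```python
# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:p1", "foaf:lastName", "Lucy"),
    ("ex:p2", "foaf:lastName", "Smith"),
    ("ex:p3", "foaf:knows", "ex:p1"),
    ("ex:p4", "foaf:knows", "ex:p2"),
    ("ex:p2", "foaf:knows", "ex:p1"),
]


def match(triples, s=None, p=None, o=None):
    """Yield triples matching a pattern; None marks a variable position."""
    for ts, tp, to in triples:
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o):
            yield ts, tp, to


def who_knows_lucy(triples):
    """Evaluate: select ?n where {?f foaf:lastName Lucy. ?n foaf:knows ?f}."""
    # First pattern (?s, p, o): bind ?f to subjects whose last name is Lucy.
    lucy_ids = {s for s, _, _ in match(triples, p="foaf:lastName", o="Lucy")}
    # Second pattern (?s, p, ?o): keep ?n whose foaf:knows object is a bound ?f.
    return sorted(s for s, _, o in match(triples, p="foaf:knows") if o in lucy_ids)
```

Binding ?f first and then probing the second pattern mirrors the usual way a join on a shared variable is evaluated; a real engine would replace the linear scans with index lookups.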

The dynamic update of the two-step index structure is a sensitive problem that can influence the efficiency of the entire approach.

This is explained on page 7 (rows 296-298) and page 17 (rows 650-653).

Because calculating the count frequencies would be expensive, counting and inserting the frequencies can be performed as batch processing after compressed triple numbers are established.

In future work, it is desirable to discuss the maintenance problem of our index structure in detail. After initializing the two-step index structure, the index must be dynamically updated using up-to-date data. Recently, Vidal et al. [32] addressed this issue; however, a more detailed study is required.

Reviewer 2 Report

Summary

 

The topic of the paper is how the Linked Open Data (LOD) cloud can be more effectively organised, queried, and evaluated. The authors propose a novel, two-step approach combining indexing and live exploration approaches.

 

The paper begins with an introductory section, in which the authors discuss the challenges inherent in querying the LOD cloud with current approaches. After briefly introducing their proposed novel hybrid approach to addressing the described challenges, they provide an overview of the structure of the paper.

 

The second section presents the background of the authors' research. It begins with an overview of LOD. This includes an introduction to the structure of RDF data and how it is queried using SPARQL. Problems inherent in querying large amounts of RDF-structured LOD are also highlighted in this section. The following subsection looks at physical storage systems for large databases and discusses the advantages and disadvantages of hard disk drive (HDD) and solid-state drive (SSD) systems. The potential benefits of a hybrid storage system combining HDDs and SSDs are outlined.

Section 3 introduces the authors' proposed hybrid index system for querying LOD data. First, an extended MDH technique for efficient join query processing without significant storage demand is introduced. Then, the authors describe a two-step index structure which is an extension of the widely adopted R*-tree structure. The next subsection describes how the authors segment data into frequently accessed ("hot") and less frequently accessed ("cold") data to improve access speed by moving hot data to SSD. To conclude this section, the authors explain how efficient, two-step SPARQL queries may be designed to access LOD data.

Section 4 presents the experimental evaluation of the proposed approach described in the previous section. Evaluation is done by comparing the authors' proposed approach to three popular alternatives: QUAD, DARQ and MIDAS. Using an appropriate methodology, the authors show that their approach outperforms the alternatives in performance and memory use. Subsequently, they evaluate the query performance on SSD, HDD, and the authors' hot/cold segmented storage, showing that the hybrid storage approach is slightly less performant than SSD but much faster than pure HDD solutions.

Section 5 closes the paper with conclusions and an outlook to future work.

 

Comments and Recommendations

 

The language used is generally very good. Some stylistic errors are nevertheless present, which could be rectified by having the paper read by a native speaker. Figures and tables are generally appropriate and of good quality.

The research presented here is novel and topical. The methodology applied for evaluation is appropriate and well described. However, the relevance of the work to the Special Issue is not made clear. Linked Open Data is indeed relevant to healthcare applications, and the method the authors have developed could indeed positively affect Cloud Computing, Big Data, and Internet of Things in Healthcare and Industry, but the authors do not even mention the word "healthcare" anywhere in their paper. In its current state, the paper is barely within the scope of the Special Issue.

The relevance could be made clearer by 1) explaining the relevance of LOD to the healthcare sector, 2) showing examples of LOD used in healthcare, 3) explaining the impact of the authors' research on healthcare, e.g. in the conclusions, and 4) using healthcare-related examples throughout the paper.

The methods compared to the authors' proposed approach in the evaluation section should be mentioned and explained in the background section. This is, after all, the state-of-the-art benchmark the authors use to measure the success of their work.

 

Subsection headings should not immediately follow section headings - some text should be given between the two.

 

To summarise, this is a well-written paper presenting very good research in the field of LOD. However, it needs additional work to bring it fully within the scope of the Special Issue.

Author Response

We thank the anonymous reviewers for their helpful suggestions and comments.

The research presented here is novel and topical. The methodology applied for evaluation is appropriate and well described. However, the relevance of the work to the Special Issue is not made clear. Linked Open Data is indeed relevant to healthcare applications, and the method the authors have developed could indeed positively affect Cloud Computing, Big Data, and Internet of Things in Healthcare and Industry, but the authors do not even mention the word "healthcare" anywhere in their paper. In its current state, the paper is barely within the scope of the Special Issue.

The relevance could be made clearer by 1) explaining the relevance of LOD to the healthcare sector, 2) showing examples of LOD used in healthcare, 3) explaining the impact of the authors' research on healthcare, e.g. in the conclusions, and 4) using healthcare-related examples throughout the paper.

This is explained on page 3 (rows 112-119), revised as shown below.

In particular, the emergence of LOD has been revolutionizing the healthcare sector. For example, coronavirus disease 2019 (COVID-19) is an infectious disease caused by a newly discovered coronavirus. In the midst of a global pandemic, Big Data such as LOD are showing an outstanding capability to analyze complex interactions among groups of people and locations. Through work such as sharing and analyzing the LOD available worldwide, a cure for COVID-19 will be discovered. Efficient use of large-scale LOD could help to determine how the virus is spreading and how the number of infections can be reduced.

The methods compared to the authors' proposed approach in the evaluation section should be mentioned and explained in the background section. This is, after all, the state-of-the-art benchmark the authors use to measure the success of their work.

 ⇒ This is explained on pages 5-6 (Subsection 2.3).

Subsection headings should not immediately follow section headings - some text should be given between the two.

 ⇒ This has been revised on page 6 (rows 234-237) and page 12 (rows 456-458).

To summarise, this is a well-written paper presenting very good research in the field of LOD. However, it needs additional work to bring it fully within the scope of the Special Issue.

This is explained on page 3 (rows 112-119).

Reviewer 3 Report

The authors propose a novel method to address efficient storage and query processing speed for large RDF graphs in Linked Open Data.

Their approach uses a two-step (filtering and refinement) index structure that combines a 3D R*-tree and k-d trees to achieve efficient storage and high join query performance.

The manuscript is very well written, easy to follow and understand. The experimental evaluation is well-conducted and the results are very promising.

There is only one point that I find confusing. In Section 4.1 the authors mention that DARQ requires the least amount of storage, a claim that is supported by Figure 13. Yet, the authors later claim that their approach requires the least amount of storage among all four approaches compared. The latter statement contradicts the former. I think the authors should clarify this point.

Overall, I believe this is a very good manuscript.

An extensive empirical evaluation is conducted to show the performance of their system. Their approach outperforms all the baselines it is compared with; that is, their join queries perform much faster on all SPARQL query types (star, chain, cycle, complex, and tree).

Author Response

We thank the anonymous reviewers for their helpful suggestions and comments.

There is only one point that I find confusing. In Section 4.1 the authors mention that DARQ requires the least amount of storage, a claim that is supported by Figure 13. Yet, the authors later claim that their approach requires the least amount of storage among all four approaches compared. The latter statement contradicts the former. I think the authors should clarify this point.

This has been revised on page 14 (rows 534-535), as shown below.

Although DARQ requires the smallest storage space, it uses a large amount of memory to compute the join of intermediate results at the control site.

Round 2

Reviewer 2 Report

Most of my comments and recommendations have been addressed. Whilst I still think a better example could help make this contribution more relevant to the Special Issue, it can be accepted in its current form.
