Article
Peer-Review Record

Linked Data Platform for Solanaceae Species

Appl. Sci. 2020, 10(19), 6813; https://doi.org/10.3390/app10196813
by Gurnoor Singh 1,†,‡, Arnold Kuzniar 2,*,‡, Matthijs Brouwer 1, Carlos Martinez-Ortiz 2, Christian W. B. Bachem 1, Yury M. Tikunov 1, Arnaud G. Bovy 1, Richard G. F. Visser 1 and Richard Finkers 1,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 13 August 2020 / Revised: 10 September 2020 / Accepted: 14 September 2020 / Published: 28 September 2020
(This article belongs to the Special Issue Semantic Technologies Applied to Agriculture)

Round 1

Reviewer 1 Report

The authors propose a structured data approach, namely Linked Data, to create a semantic platform that combines unstructured data from the scientific literature with structured data from widely used public biological datasets for three Solanaceae species.

They propose an approach to incorporate and combine genotypic and phenotypic data from many unstructured or structured data sources into a single semantically structured platform for Solanaceae species that could be used to address the challenge of candidate gene prediction (in crop species). Their work differs from others by offering data that (1) are in an interoperable format and (2) cover both genotypes and phenotypes. The document is well written and easy to follow. The authors provide 2 test cases to demonstrate the usability, utility, and effectiveness of their platform.

 There are a few typos in the manuscript such as:

Page 4 line 120: “Softerware” -> “Software”

Appendix B Table A1: “genrated” -> “generated”

 

It is unclear to me whether the graphs presented in Figure 5 were generated through the platform. If they were, I suggest that the authors make this clearer.

In biological use case 2, the authors test their platform on a candidate gene prediction task. Although the platform often yields accurate predictions, they also provide an example where the system fails to predict the candidate gene correctly. I would therefore suggest that the authors indicate this in their prediction results (i.e., Table 3), so that the user is aware of possible failures. Displaying a prediction certainty/uncertainty in the resulting table could be helpful for the community, although I am aware of how difficult this task would be. Another possibility would be to let the user know whether or not the result is plainly “doubtful”.

The presented platform offers great potential and great features but is only applicable to a limited number of Solanaceae species. It would be beneficial to the scientific community if the platform could be extended, or would offer the ability for a user to extend it by themselves.

Finally, in the conclusion, it seems clear to me that the platform (Pbg-ld) offers more capabilities than Planteome; however, the authors question the quality of the data provided by the KNETMiner software. I suggest that the authors cite a source for or justify this claim, so that the reader clearly understands the advantage of Pbg-ld over KNETMiner.

 

 

Author Response

The changes suggested by reviewer 1 are highlighted in yellow in the manuscript.

 

The authors propose a structured data approach, namely Linked Data, to create a semantic platform that combines unstructured data from the scientific literature with structured data from widely used public biological datasets for three Solanaceae species.

They propose an approach to incorporate and combine genotypic and phenotypic data from many unstructured or structured data sources into a single semantically structured platform for Solanaceae species that could be used to address the challenge of candidate gene prediction (in crop species). Their work differs from others by offering data that (1) are in an interoperable format and (2) cover both genotypes and phenotypes. The document is well written and easy to follow. The authors provide 2 test cases to demonstrate the usability, utility, and effectiveness of their platform.

Response:

We thank the referee for the accurate summary and the positive assessment.

There are a few typos in the manuscript such as:

Page 4 line 120: “Softerware” -> “Software”

Response: The suggested changes have been adopted.

Appendix B Table A1: “genrated” -> “generated”

Response: The suggested changes have been adopted.

 

It is unclear to me whether the graphs presented in Figure 5 were generated through the platform. If they were, I suggest that the authors make this clearer.

Response: The data for Figure 5 were obtained using the /countFeatures endpoint of the Pbg-ld Web API. The caption of Figure 5 has been updated accordingly.

FROM
Figure5: Bar Chart  to count the total number of genomic feature 3 genome graphs. These three genomic graphs are https://plants.ensembl.org/Solanum_lycopersicum, http://solgenomics.net/genome/Solanum_lycopersicum, http://solgenomics.net/genome/Solanum_pennellii.

TO

Figure 5. Bar charts of genomic feature counts for three Solanum genomes (graph IRIs): Ensembl: S. lycopersicum (http://plants.ensembl.org/Solanum_lycopersicum); SGN: S. lycopersicum (http://solgenomics.net/genome/Solanum_lycopersicum); SGN: S. pennellii (http://solgenomics.net/genome/Solanum_pennellii). The data were obtained through the pbg-ld Web API /countFeatures endpoint.
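The kind of per-graph tally summarized in Figure 5 can be illustrated with a minimal Python sketch. It is not the implementation behind /countFeatures, whose parameters and response format are not described here; the function name `count_features`, the (graph IRI, feature IRI, feature type) record layout, and the toy records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_features(quads):
    """Tally feature types per named graph.

    `quads` is an iterable of (graph_iri, feature_iri, feature_type) tuples.
    This layout is a hypothetical simplification of RDF quads.
    """
    counts = defaultdict(Counter)
    for graph_iri, _feature_iri, feature_type in quads:
        counts[graph_iri][feature_type] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {graph: dict(by_type) for graph, by_type in counts.items()}


# Toy records (made up for illustration, not real counts).
example_quads = [
    ("http://solgenomics.net/genome/Solanum_lycopersicum", "f1", "gene"),
    ("http://solgenomics.net/genome/Solanum_lycopersicum", "f2", "gene"),
    ("http://solgenomics.net/genome/Solanum_lycopersicum", "f3", "exon"),
    ("http://solgenomics.net/genome/Solanum_pennellii", "f4", "gene"),
]
feature_counts = count_features(example_quads)
```

The resulting dictionary maps each graph IRI to its feature-type counts, which is the shape of data a grouped bar chart would need.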

In biological use case 2, the authors test their platform on a candidate gene prediction task. Although the platform often yields accurate predictions, they also provide an example where the system fails to predict the candidate gene correctly. I would therefore suggest that the authors indicate this in their prediction results (i.e., Table 3), so that the user is aware of possible failures. Displaying a prediction certainty/uncertainty in the resulting table could be helpful for the community, although I am aware of how difficult this task would be. Another possibility would be to let the user know whether or not the result is plainly “doubtful”.

 

Response: The workflow of biological use case 2 depends on both the Pbg-ld data platform and a third-party tool, QTLSearch, to prioritize candidate genes for a given trait. It will not be possible to add a certainty/uncertainty score using the third-party tool. However, Table 2 already presents the list of possible candidate genes for the selected traits, together with their references in the published literature.

Further, we noticed a mistake in the manuscript. A paragraph on page 16, lines 325-331, contained redundant text and has been replaced by text that summarizes the results of our workflow.

FROM

Out of the selected 5 QTLs for metabolic traits, 3 QTLs which relate to the following traits, soluble solids, lycopene beta-cyclase activity and phenolic compounds (2-phenylethanol, phenylacetaldehyde), have known candidates. Further, these candidate genes are already annotated with related GO terms in publically available databases like UniProt. While, for one of the QTL regions which relates to terpenoids, Terpene synthase is a known candidate gene, however, Terpene synthase is not annotated with GO terms which show an association with the trait of interest (i.e. terpenoids). The 2 GO terms related to this gene are DNA binding and DNA methylation. Lastly, for the QTL region selected for volatile compounds ( i.e. 3-methylbuthanal, 3-methylbuthanol) there are no well-known candidate genes in the QTLs that had experimentally proven significance.

TO

Out of the total 5 QTLs, our workflow performed significantly well in detecting candidate genes for the QTLs of soluble solids, lycopene beta-cyclase activity, and phenolic compounds. Our workflow did not perform well in the detection of candidate genes within the QTL for terpenoids on chromosome 1. This is most probably due to the fact that this QTL region is not well annotated, and there are no GO terms related to Terpenoids. Lastly, our workflow predicts a candidate gene called LDH, for the previously unknown QTL region associated with volatile compounds.

The presented platform offers great potential and great features but is only applicable to a limited number of Solanaceae species. It would be beneficial to the scientific community if the platform could be extended, or would offer the ability for a user to extend it by themselves.

 

Response: Figure 1 illustrates the data generation and ingestion pipeline. All tools used to generate the data, such as QTLTableMiner++, SIGA.py, OpenRefine, grlc, and the Virtuoso linked data platform, are open source and can be used by the wider scientific community for other plant species.

We have also added the following sentence to the Discussion section (page 18, lines 358-360):

Lastly, the open accessibility of all the above mentioned tools used to generate and publish these data sets, offers the ability for the scientific community users to extend this tool for other crop species themselves.

Finally, in the conclusion, it seems clear to me that the platform (Pbg-ld) offers more capabilities than Planteome; however, the authors question the quality of the data provided by the KNETMiner software. I suggest that the authors cite a source for or justify this claim, so that the reader clearly understands the advantage of Pbg-ld over KNETMiner.

Response: KNETMiner has recently released a new version in which the integrated data sets for both tomato and potato are no longer freely accessible. Hence, we have removed the remark on the quality of the data and added the following information to our Discussion section:

However, the current version of KNETMiner provides free access to integrated data sets for only some important crops, such as wheat (Triticum aestivum), rice (Oryza sativa), and Arabidopsis thaliana. The integrated data sets for other crop species, such as tomato (S. lycopersicum), potato (S. tuberosum), and sorghum (Sorghum bicolor), are not openly accessible.

Reviewer 2 Report

The proposed computing architecture is very robust and is clearly described in the article. This kind of solution is commonly recommended in the literature for the exploitation of linked open data.

Improvements:

The authors could explain in more detail how they convert the data from the selected sources to the RDF format in order to integrate data and metadata. The article explains the ontologies and tools used but not how the conversion was done. In the conclusions section, they detail quite a few improvements that could be made, including the automation of this process.

The platform uses various tools and this can complicate its use by new researchers.

An improvement to the article could be an evaluation of the platform by a small number of genetic selection experts.

Author Response

The changes suggested by reviewer 2 are highlighted in green in the manuscript.

The proposed computing architecture is very robust and is clearly described in the article. This kind of solution is commonly recommended in the literature for the exploitation of linked open data.

Response: Thank you for the positive response.

 

Improvements:

The authors could explain in more detail how they convert the data from the selected sources to the RDF format in order to integrate data and metadata. The article explains the ontologies and tools used but not how the conversion was done. In the conclusions section, they detail quite a few improvements that could be made, including the automation of this process.

Response: Figure 1 illustrates the data generation and ingestion pipeline in detail. The idea is to convert all non-RDF resources, such as QTL information in the scientific literature and genome annotations in GFF3 files, into RDF-based resources. Once everything is available as an RDF-based resource (including known ontologies), OpenLink Virtuoso is used to host the linked data platform. The following lines have been added to the manuscript, starting at line 73:

All data sets used in the creation of this linked data platform were initially classified as non-RDF based and RDF based data sources. The first step in our data ingestion pipeline was to convert non-RDF based data sources to RDF based data graphs. Later these RDF data graphs were integrated together and published as a linked data platform.
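The first step of this pipeline, turning a non-RDF source into RDF, can be sketched for a single GFF3 record as follows. This is only an illustrative sketch using the Python standard library and is not the SIGA.py implementation: the `http://example.org/` namespaces, the predicate names (`seqid`, `start`, `end`, `strand`), and the function name `gff3_line_to_triples` are assumptions, whereas the platform itself relies on established ontologies.

```python
from urllib.parse import quote

# Hypothetical namespaces used only for this sketch.
BASE = "http://example.org/feature/"
VOCAB = "http://example.org/vocab/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def gff3_line_to_triples(line):
    """Convert one tab-separated GFF3 feature line into N-Triples statements.

    Comment lines and malformed lines (not exactly 9 columns) yield no triples.
    """
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9:
        return []
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Column 9 holds semicolon-separated key=value attribute pairs.
    attributes = dict(p.split("=", 1) for p in attrs.split(";") if "=" in p)
    # Fall back to a coordinate-based identifier when no ID attribute exists.
    feature_id = attributes.get("ID", f"{seqid}:{start}-{end}")
    subject = f"<{BASE}{quote(feature_id, safe='')}>"
    return [
        f"{subject} <{RDF_TYPE}> <{VOCAB}{quote(ftype, safe='')}> .",
        f'{subject} <{VOCAB}seqid> "{escape_literal(seqid)}" .',
        f'{subject} <{VOCAB}start> "{int(start)}"^^<{XSD_INT}> .',
        f'{subject} <{VOCAB}end> "{int(end)}"^^<{XSD_INT}> .',
        f'{subject} <{VOCAB}strand> "{escape_literal(strand)}" .',
    ]


# Made-up example record for illustration.
example = "chr01\tsrc\tgene\t100\t200\t.\t+\t.\tID=gene:G1;Name=demo"
triples = gff3_line_to_triples(example)
```

Statements produced this way for each source can then be grouped into one RDF graph per data set and loaded into a triple store, which corresponds to the integration and publication step described above.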

 

The platform uses various tools and this can complicate its use by new researchers.

An improvement to the article could be an evaluation of the platform by a small number of genetic selection experts.

Response: The author list includes researchers with expertise in bioinformatics approaches as well as researchers with expertise in molecular biology and genetics. The latter group has driven the bioinformatics developments and contributed the testable hypotheses used as use cases in the manuscript.

We understand the reviewer's comment that the initial set-up might be complicated for new researchers and that help from a data scientist or bioinformatician would be required. The authors can be approached with suggestions, but we expect that many research teams nowadays will also have skilled personnel available locally to assist them.
