Peer-Review Record

OpenGDC: Unifying, Modeling, Integrating Cancer Genomic Data and Clinical Metadata

Appl. Sci. 2020, 10(18), 6367; https://doi.org/10.3390/app10186367

by Eleonora Cappelli ¹,†, Fabio Cumbo ²,*,†, Anna Bernasconi ³, Arif Canakoglu ³, Stefano Ceri ³, Marco Masseroli ³ and Emanuel Weitschek ⁴

Reviewer 1: Anonymous

Reviewer 2: Surya Saha

Reviewer 3: Anonymous

Submission received: 31 July 2020 / Revised: 31 August 2020 / Accepted: 3 September 2020 / Published: 12 September 2020

(This article belongs to the Section Applied Biosciences and Bioengineering)

Round 1

Reviewer 1 Report

Cappelli et al. present a data integration layer API over the GDC API.

The article is well written and easy to follow; I have only a few minor comments.

Is the software GUI-only? GUI tools are often unsuitable for inclusion in automated pipelines; although they provide a friendly option, they have a limited audience.

It was not clear whether GMQL is supported by OpenGDC or was just used "externally" on the generated data. Why not include it in the pipeline? (The authors did mention that OpenGDC is an extraction tool; however, GMQL would provide advanced query-based extraction, as shown in the use cases.)

The authors should review the line citations to the Listings in the use cases (for example, l:418 "We first use a simple query (lines 1-6 in Listing 2)" should be lines 3-8? l:442 "(lines 4 and 14 in Listing 3);", etc.).

Author Response

Cappelli et al. present a data integration layer API over the GDC API.

The article is well written and easy to follow; I have only a few minor comments.

Thank you for the positive evaluation and for suggesting minor comments that improved our manuscript.

Is the software GUI-only? GUI tools are often unsuitable for inclusion in automated pipelines; although they provide a friendly option, they have a limited audience.

We are currently providing OpenGDC with a GUI only. However, we are already working on extending some features including the possibility to run our software from a command line interface. This will allow us to propose our software to the BioConda community to make it available to a bigger audience of bioinformaticians and developers.

OpenGDC allows users to extract data from the GDC and standardize them with a well-defined schema. Therefore, starting from the standardized data, researchers can use their own data extraction, processing and analysis tools to extract the information of interest. We adopted GMQL, which is a powerful query tool suitable for this type of data. In the future, we may consider including GMQL in the pipeline, while still leaving the option of applying a different extraction and query tool.

The authors should review the line citations to the Listings in the use cases (for example, l:418 "We first use a simple query (lines 1-6 in Listing 2)" should be lines 3-8? l:442 "(lines 4 and 14 in Listing 3);", etc.).

All references to the lines in Listing 1, 2, and 3 have been fixed. Thanks to the reviewer for reporting this issue.

Reviewer 2 Report

The manuscript builds upon the widely used Genomic Data Commons (GDC) resources established by National Cancer Institute by creating a federated database and associated tools. It describes the Browser Extensible Data (BED) format and OpenGDC, a tool to extract and integrate genomic and clinical data of The Cancer Genome Atlas (TCGA) from GDC. The resources also include a repository containing TCGA data enhanced with information extracted from external public databases, i.e., GENCODE, HGNC, miRBase and NCBI annotations. The manuscript is well written but could use some grammar improvements (see below).

The authors provide a number of valuable resources and I am a big proponent of data accessibility and reuse. However, I have a few concerns.

A case is made that OpenGDC is fulfilling a need created due to the delay in transition to the second representation at NCI. I am not familiar with upcoming developments at NCI but it seems like this tool will be rendered somewhat obsolete once NCI makes the transition. Can the authors please comment on this issue and clarify in the paper? Are there other NCI tools and extensions that will provide the functionality offered by OpenGDC in the future?

I have seen many a noble initiative fold as they could not be sustained long term. I would like the authors to mention in the paper how this resource will be updated and supported by funding as additional data sets are added to GDC and integrated with external data sources i.e., GENCODE, HGNC, miRBase and NCBI annotations. What is the minimum length of time the authors expect to maintain the resource?

Major revisions

The software is stable although slow. Since these are large data sets, it will help if the download size is estimated and presented to the user before the actual download starts. Existing free disk space should also be communicated, especially if it is running low. Checkpointing is required in all data tools today. A user should be able to restart the download and conversion from the point it was stopped or interrupted.

It is not clear to me which repository contains the code for OpenGDC, which I expect is open source. Please create a publicly accessible issue tracker for the software. GitHub would be a good place to host the code for the long term if the authors are looking for suggestions. Please create a repository (e.g., on GitHub) for this paper, include a license and generate a DOI for it using Zenodo so that the code is archived and citable.
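As an illustration of the pre-download check requested above, the following is a minimal sketch only. It assumes that the size of each file is known before the download starts (for instance from file metadata); the class `DownloadPreflight` and all of its method names are hypothetical and are not part of OpenGDC.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical pre-download check; all names are illustrative, not OpenGDC code. */
final class DownloadPreflight {
    private static final long MIB = 1024L * 1024L;

    private DownloadPreflight() {}

    /** Sums the per-file sizes (in bytes) and fails loudly on overflow. */
    static long totalBytes(long[] fileSizes) {
        long total = 0L;
        for (long size : fileSizes) {
            total = Math.addExact(total, size);
        }
        return total;
    }

    /** Usable space, in bytes, on the file store holding an existing target directory. */
    static long usableBytes(Path targetDir) throws IOException {
        FileStore store = Files.getFileStore(targetDir);
        return store.getUsableSpace();
    }

    /** True when the estimated download fits into the available space. */
    static boolean fits(long requiredBytes, long usableBytes) {
        return requiredBytes <= usableBytes;
    }

    /** One-line message that could be shown before the download starts. */
    static String summary(long requiredBytes, long usableBytes) {
        String warning = fits(requiredBytes, usableBytes) ? "" : " (not enough space)";
        return String.format(Locale.ROOT, "Download size: %d MiB, free space: %d MiB%s",
                requiredBytes / MIB, usableBytes / MIB, warning);
    }
}
```

A caller would combine `totalBytes` over the selected files with `usableBytes` on the output folder and show the `summary` line before asking the user to confirm.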

Minor revisions

Data redundancy is a common problem when federating data, and it is often exacerbated by identifier bleed and ambiguity. Was this a challenge for the authors? It might be an interesting discussion point if the authors were to elaborate on how they arrived at the set of heuristics used for resolving the redundancy.

There are a number of grammatical errors in sentence construction and missing words (see examples below). A thorough copy editing by a native English speaker will improve the readability of the manuscript.

Ln 56: and some other only or also. Confusing construction

Ln 75: in order to “create”

There is no control to pause the long conversion process once it is started. There is no estimate of time required (i.e., a progress bar).

There are lots of runtime errors in the conversion process.

java.io.FileNotFoundException: ./appdata/gencode/gencode.v22.annotation.gtf (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
    at opengdc.integration.Gencode.loadGencodeTableByType(Gencode.java:31)
    at opengdc.integration.Gencode.extractGencodeInfo(Gencode.java:135)
    at opengdc.parser.GeneExpressionQuantificationParser.convert(GeneExpressionQuantificationParser.java:184)
    at opengdc.action.ConvertDataAction.execute(ConvertDataAction.java:112)
    at opengdc.Controller.execute(Controller.java:21)
    at opengdc.GUI$24.run(GUI.java:909)

Author Response

The authors provide a number of valuable resources and I am a big proponent of data accessibility and reuse. However, I have a few concerns.

Thank you very much for evaluating our manuscript and for raising a few concerns that enabled us to improve it. Please find a point-by-point answer below.

To the best of our knowledge, there are no other NCI tools that offer the same features provided by OpenGDC. The software has a modular architecture, so implementing new features or new converters is straightforward; we are currently working on extending its functionality to the other programs hosted on the GDC. We are also planning to integrate it into BioConda and other bio-oriented environments such as Galaxy as an external synchronous/asynchronous data source tool.

As a direct consequence of our previous answer, we plan to maintain our software in the long term. Additionally, we will provide support for all the BioConda and Galaxy users who are interested in using our software for their research. OpenGDC is currently hosted on GitHub, so we will use that platform as the main channel for technical support. Additionally, the software is part of a European H2020 project called GeCo, which will provide funds for its maintenance and updates.

Major:

The software is stable although slow. Since these are large data sets, it will help if the download size is estimated and presented to the user before the actual download starts. Existing free disk space should also be communicated, especially if it is running low. Checkpointing is required in all data tools today. A user should be able to restart the download and conversion from the point it was stopped or interrupted.

A checkpoint-based system is already implemented. It is possible to interrupt the software execution by closing the interface. A recovery procedure for both the download and conversion processes is already implemented in OpenGDC, and it is automatically triggered every time the software is executed with the same configuration as the previous run. We checked and improved these functions and released a new version of the software on GitHub.
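A recovery procedure of this kind can be sketched as follows. This is not the OpenGDC implementation; it is a minimal illustration that assumes each output file is first written under a temporary `.part` name and renamed only once complete, so a later run with the same configuration can skip every file that already exists. The class `ResumableBatch` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resume logic; names and the ".part" convention are assumptions. */
final class ResumableBatch {
    private ResumableBatch() {}

    /** Returns the expected file names that have no completed output yet. */
    static List<String> pending(Path outDir, List<String> expected) {
        List<String> todo = new ArrayList<>();
        for (String name : expected) {
            // Only fully written files carry their final name, so existence means "done".
            if (!Files.isRegularFile(outDir.resolve(name))) {
                todo.add(name);
            }
        }
        return todo;
    }

    /** Writes the data under a temporary name and renames it when the copy is complete. */
    static void store(Path outDir, String name, InputStream data) {
        try {
            Files.createDirectories(outDir);
            Path partial = outDir.resolve(name + ".part");
            Files.copy(data, partial, StandardCopyOption.REPLACE_EXISTING);
            Files.move(partial, outDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If a run is interrupted during `store`, only a `.part` file is left behind, so the next call to `pending` still reports that file as outstanding.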

The software has been made open source and it is available on GitHub at https://github.com/DEIB-GECO/OpenGDC

Additionally, the source code of OpenGDC is also available on ZENODO at https://doi.org/10.5281/zenodo.4000250

We added both these references under the section “Availability of software and data”.

Minor:

At page 7, rows 279-298 (of the previously submitted paper) we briefly discuss the issue related to data redundancy in metadata, which is due to a partial mismatch between the novel data model imposed by GDC and the clinical/biospecimen supplements, continuously submitted by data providers. We agree with the reviewer that a more thorough discussion on how the set of rules to resolve such redundancy is extracted can be an interesting addition to the manuscript. We have added the following text at rows 293-312 page 8 in the revised version of the manuscript (highlighted in blue):

“The preliminary profiling activity was used to provide guidance to create a list of data redundancy heuristics — with the aim to remove the redundant metadata attributes and their values — applied by the Data redundancy solver (at the center of Figure 1).

The heuristics have been primarily devised as a result of a long email exchange with the GDC Support team ([email protected]) that helped us to understand how the ingestion process works: a restricted number of attributes from the supplements are already provided with a defined mapping to the data model attributes, while for others the relation is still uncertain (i.e. not curated yet by the GDC) — for these we reconstructed common semantics through a semi-automated approach.

Moreover, clinical and biospecimen supplements cover overlapping semantics spaces (as it can be understood by their definitions in Section 2.2). Thus we make the deliberate decision of extracting only one of them.

Finally, the new data model entities are non overlapping but the APIs provide their content in a nested fashion. For example, a project is related to a case with a functional dependency, therefore the project information can be univocally reached through the case entity. As a consequence, any information related to the case__project group is redundant w.r.t. the one given by a dual attribute with the same suffix. Analogously, aliquots are comprised in analytes (N aliquots are in 1 analyte), therefore we keep the information that is most specific, pertaining to the aliquot.

We have summarized our approach to solve redundancy in four rules. These cover the whole space of possibilities at the time of writing this manuscript; however this set will be updated as the need for new rules will arise, in conjunction with updates of OpenGDC scheduled releases:”
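To make the `case__project` example in the quoted text concrete, the following sketch shows one possible reading of that single heuristic. It is not the Data redundancy solver itself: the key names are invented for illustration, and requiring the dual attribute to carry the same value is an additional, conservative assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration of one redundancy heuristic; not OpenGDC code. */
final class RedundancySketch {
    private static final String PARENT = "case__";          // assumed nesting prefix
    private static final String GROUP = "case__project__";  // group named in the text

    private RedundancySketch() {}

    /** Drops every GROUP attribute that has a dual attribute with the same suffix and value. */
    static Map<String, String> dropRedundant(Map<String, String> metadata) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(GROUP) && hasDual(metadata, key, entry.getValue())) {
                continue; // the same information is reachable through the dual attribute
            }
            kept.put(key, entry.getValue());
        }
        return kept;
    }

    /** A dual is a key outside GROUP that equals, or ends with, the suffix after "case__". */
    private static boolean hasDual(Map<String, String> metadata, String key, String value) {
        String suffix = key.substring(PARENT.length()); // e.g. "project__name"
        for (Map.Entry<String, String> other : metadata.entrySet()) {
            String otherKey = other.getKey();
            boolean sameSuffix = otherKey.equals(suffix) || otherKey.endsWith("__" + suffix);
            if (!otherKey.startsWith(GROUP) && sameSuffix
                    && Objects.equals(value, other.getValue())) {
                return true;
            }
        }
        return false;
    }
}
```

With the invented keys `case__project__name` and `project__name` holding the same value, only `project__name` is kept, while a `case__project__` attribute without a dual is left untouched.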

We re-read the whole manuscript and further improved the English language.

There is no control to pause the long conversion process once it is started. There is no estimate of time required (i.e., a progress bar).

Please see our answer to the first Major point about the recovery procedure. At the current state of development, OpenGDC does not provide a progress bar, but the log window reports how many files are going to be downloaded or converted, which gives a good approximation of how long the process will take. Additionally, in the newly released version of the software we added some log messages to let the user follow the progress of both the download and conversion processes.

There are lots of runtime errors in the conversion process.

The runtime errors were raised when the JAR file was launched from a location on the file system that is different from the folder in which the JAR file is located. The problem has been promptly fixed in the new version of the software. Thanks to the reviewer for reporting this issue.
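The stack trace in the report shows a relative path (`./appdata/gencode/...`), which the JVM resolves against the current working directory rather than the folder containing the JAR. One common way to avoid this class of error is sketched below; it is not necessarily the fix applied in OpenGDC, and the helper `AppDataLocator` is hypothetical.

```java
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.CodeSource;

/** Hypothetical helper that anchors data paths to the install folder; not OpenGDC code. */
final class AppDataLocator {
    private AppDataLocator() {}

    /** Folder that contains the JAR (or class folder) from which the given class was loaded. */
    static Path installDir(Class<?> anchor) {
        CodeSource source = anchor.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            throw new IllegalStateException("Cannot determine where the code was loaded from");
        }
        try {
            Path location = Paths.get(source.getLocation().toURI());
            // For a JAR the code source is the JAR file, so its parent folder is returned;
            // for unpacked classes the code source is already a folder.
            return Files.isDirectory(location) ? location : location.getParent();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Resolves a relative data path against the install folder, not the working directory. */
    static Path resolve(Path installDir, String relative) {
        return installDir.resolve(relative).normalize();
    }
}
```

A call such as `AppDataLocator.resolve(AppDataLocator.installDir(SomeClass.class), "./appdata/gencode/gencode.v22.annotation.gtf")`, where `SomeClass` stands for any class packaged in the JAR, then yields the same file regardless of where the JAR is launched from.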

Reviewer 3 Report

The authors present the OpenGDC software, which retrieves genomic and clinical data from the GDC. It would be a useful tool, but there are some issues of concern.

The authors focus on the datasets produced by TCGA projects in the manuscript, although there are many other data, such as CPTAC, TARGET, and so on, at the GDC. In addition, the overall design of the OpenGDC software seems too specific to the TCGA data. Please explain how the authors can expand the software to general GDC datasets.

There have been several tools to retrieve and analyze the TCGA data, including some R packages like TCGAbiolinks as well as cbioportal, Xena, and so on. The authors should describe the merits of OpenGDC much more clearly.

Author Response

The authors present the OpenGDC software, which retrieves genomic and clinical data from the GDC. It would be a useful tool, but there are some issues of concern.

Thank you for your positive comments. Please find the answers to the raised issues below.

The authors focus on the datasets produced by TCGA projects in the manuscript, although there are many other data, such as CPTAC, TARGET, and so on, at the GDC. In addition, the overall design of the OpenGDC software seems too specific to the TCGA data. Please explain how the authors can expand the software to general GDC datasets.

The GDC provides the genomic data processed through a harmonization pipeline, one for each specific type of experiment. Therefore, the genomic data of each project (such as CPTAC, TARGET) can be processed by OpenGDC and thus standardized as described in our manuscript. In contrast, the biospecimen and clinical data (see Paragraph “2.2. Metadata format”) do not follow a well-defined scheme; therefore, it is necessary to apply specific procedures to standardize them. Moreover, the data organized in the GDC Data Model (see Paragraph “2.2. Metadata format”) are easily managed by our tool since they have the same graph structure for each project currently hosted on the GDC.

In the current version of OpenGDC we focused on the TCGA program only. Indeed, it was a very long project, and the work described in our paper is the result of more than three years of work by two research groups. We will consider these additional datasets in future software updates, implementing ad hoc procedures for the management of biospecimen and clinical data, thus expanding our dataset with both genomic data and metadata.

There have been several tools to retrieve and analyze the TCGA data, including some R packages like TCGAbiolinks as well as cbioportal, Xena, and so on. The authors should describe the merits of OpenGDC much more clearly.

Thank you for your valuable suggestion. We mentioned these tools in section “Background”, lines 106-111 in the revised version of the manuscript (highlighted in blue):
“Moreover, there are several state-of-the-art tools to retrieve and analyze TCGA data, including some R packages like (i) TCGAbiolinks [33], which provides algorithms for data mining and analysis of cancer genomics, (ii) cbioportal [34], an open platform for interactively exploring multidimensional cancer genomics data sets in the context of clinical data and biologic pathways, (iii) Xena [35], an easy-to-use cancer genomics visualization tool for large public data resources of GDC.”

We have also added lines 113-115 in section “Background” in the revised version of the manuscript (highlighted in blue):
“In particular, OpenGDC provides a structured data format of the different types of genomic experiments through a single schema, and considers the clinical and biospecimen information as strict defined structured metadata.”

Round 2

Reviewer 2 Report

The authors have addressed my concerns satisfactorily. Congratulations on your publication!

Reviewer 3 Report

The authors clearly addressed the issues concerned.
