1. Background
The large amount of genomic data generated by Next Generation Sequencing (NGS) technologies [1,2] and their related clinical data bring significant value to medical research, especially to cancer studies [3]. Thanks to NGS techniques, different types of experimental data are produced, whose storage and analysis can be very demanding [4,5,6]. More and more often, researchers have to face big biological data [7,8] that frequently lack integrated data models and accessible schema representations. Thus, storing, retrieving, integrating, comparing, and analyzing heterogeneous biomedical data becomes a major challenge.
In cancer research, several organizations are involved in the collection, management and publication of genomic and clinical data. In particular, the Genomic Data Commons (GDC [9,10]) is a recent initiative of the National Cancer Institute (NCI) with the aim of creating a unified system to promote the sharing of these data. The GDC supports several programs and defines bioinformatics pipelines: it provides Clinical/Biospecimen Supplements and genomic data harmonization procedures related to DNA-sequencing [11], RNA-sequencing [12,13], miRNA-sequencing [14], Copy Number Variation [15] and DNA-methylation [16]. The processed data are publicly available through the GDC portal, which hosts different cancer programs; The Cancer Genome Atlas (TCGA) [17] is the most relevant project within the GDC, collecting genomic and clinical data on 33 different tumor types from over 11,000 patients [18].
TCGA data were available through a dedicated portal until late 2016; in early 2017 they were migrated to the new GDC portal, resulting in a major change of genomic and clinical/biospecimen formats and schemas. In the GDC portal, experimental data (i.e., DNA-sequencing, RNA-sequencing, miRNA-sequencing, Copy Number Variation and DNA-methylation data) are produced by harmonization procedures applied to different analysis strategies, improving the quality of the data previously available at the old TCGA portal. Indeed, the GDC provides programmatic access to these harmonized data through Application Programming Interfaces (APIs), e.g., to obtain the aliquot Universal Unique Identifiers (UUIDs) that uniquely identify GDC experiments. The harmonization procedures provide standardized and comparable data, depending on the type of NGS experiment, regardless of the program in which they were generated.
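As a concrete illustration, the following Python sketch queries the public GDC REST API for the files of a given project and data type and prints the aliquot UUID associated with each file. The endpoint and filter syntax follow the GDC API documentation; the chosen project, field selection and page size are illustrative.

```python
# Hedged sketch: query the public GDC REST API for files of one project
# and data type, and print the aliquot UUID associated with each file.
# The project id, data type and page size are illustrative choices.
import json
import requests

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "cases.project.project_id",
                                "value": "TCGA-LAML"}},
        {"op": "=", "content": {"field": "data_type",
                                "value": "Gene Expression Quantification"}},
    ],
}
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,cases.samples.portions.analytes.aliquots.aliquot_id",
    "format": "JSON",
    "size": "5",
}

hits = requests.get(FILES_ENDPOINT, params=params).json()["data"]["hits"]
for hit in hits:
    # The response nests the requested fields along the GDC Data Model path.
    aliquots = hit["cases"][0]["samples"][0]["portions"][0]["analytes"][0]["aliquots"]
    print(hit["file_id"], aliquots[0]["aliquot_id"])
```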
As for metadata, Clinical/Biospecimen Supplements were represented in an unstructured format in the TCGA portal; conversely, the GDC introduced a new structured data model (i.e., the GDC Data Model). The transition is however still incomplete: the GDC provides some relevant clinical/biospecimen information only in the old unstructured format, and some other information only, or also, in the new format. Correspondingly, the GDC exposes two different methods for retrieving clinical and biospecimen information. The first is the direct download of Supplements from the GDC portal in XML format, which is semi-structured and does not adhere to a specific data model. The second is through the GDC APIs, which allow downloading structured information according to the GDC Data Model and provide output in JSON format. These methods reach two different materializations of the metadata, which partly overlap with each other.
The GDC is proceeding with the migration from the first representation to the second, importing the data contained in the first into the second. However, in this transitory phase (which has lasted for several months and will probably last much longer), much of the information in the first model is not yet replicated in the second, and there is no single source that provides information from both models. To obtain a comprehensive representation of such information, it is therefore necessary to extract data using a pipeline that deals with model differences and identifies, manages and removes the overlapping information. The first contribution of our work is the design and development of such a pipeline, so that the clinical and biospecimen data (referred to as metadata) are represented in a common format.
In our work, we solve the issues arising from the transition from the TCGA data portal to the GDC one, providing genomic data and their associated clinical/biospecimen metadata in a standardized format that makes both of them seamless, straightforward and easy to use. We enhance GDC harmonized data by applying a state-of-the-art data model for genomics, in order to make genomic and clinical/biospecimen data uniform. We automatically standardize data by mapping them to such a unique common schema, thereby supporting scientists in data integration and analyses [19,20,21]. We also integrate information extracted from external public databases, i.e., GENCODE [22], HGNC [23], miRBase [24] and NCBI genome annotations [25], enriching the content of the experimental data.
This work is an evolution of another project, TCGA2BED [26], which faced partially similar but much simpler issues, focusing on the old TCGA portal. Unlike TCGA2BED, besides extending TCGA genomic data and standardizing the format in which they are provided by the GDC, we integrate, normalize and make non-redundant their multiple metadata available in different representations; we do so by mapping them to a unique data model and widely exploiting the GDC APIs to interact with and extract GDC data. Our main contribution is the integrative representation of experimental and clinical/biospecimen data by applying the Genomic Data Model (GDM [27]); this then allows querying them, together with other data from multiple sources, uniformly and comprehensively through the GenoMetric Query Language (GMQL [28,29]) directly on a new publicly available repository of standardized data. GDM consists of two parts, one describing processed datasets with a genomic region-based format, and one describing the metadata. For the former, we map the content of GDC data to GDM, thereby transforming the experimental data of the GDC into a new data collection, which we denote as OpenGDC, harmonized and extended by linking with other public databases. For the latter, the Clinical and Biospecimen Supplements (which are semi-structured and not part of a data model) are extracted and merged with all the information on clinical and biospecimen data available through the GDC APIs (which is structured and adheres to the GDC Data Model), and finally converted to the metadata format of GDM, used by OpenGDC.
Other works have dealt with the problem of storing, retrieving and enhancing GDC data, almost all of them focusing on the TCGA program. Among them, we mention: (i) TCGA-assembler 2 [30], a software pipeline that allows downloading TCGA data from the GDC, defining filtering criteria to merge the extracted data files of samples into a single data table, and finally processing them; (ii) the International Cancer Genome Consortium (ICGC [31]), which provides a data portal to characterize genomic abnormalities in different cancer types, including data from TCGA; (iii) the Seven Bridges Cancer Genomics Cloud (CGC [32]), which allows accessing data from public cancer genomic datasets (e.g., TCGA) and analyzing them in the cloud by using bioinformatics tools and workflows. All these works are of great interest and improve the access to GDC data; in particular, they aggregate them, identify important genomic features, and analyze them with cloud computing resources. Moreover, there are several state-of-the-art tools to retrieve and analyze TCGA data, including some R packages such as (i) TCGAbiolinks [33], which provides algorithms for data mining and analysis of cancer genomics, (ii) cBioPortal [34], an open platform for interactively exploring multidimensional cancer genomics data sets in the context of clinical data and biologic pathways, and (iii) Xena [35], an easy-to-use cancer genomics visualization tool for large public data resources of the GDC. Conversely, our approach is different, as it aims at facilitating the use of the TCGA data of the GDC by providing them in a standardized and extended format, enriched with multiple integrated metadata. In particular, OpenGDC provides a structured data format for the different types of genomic experiments through a single schema, and considers the clinical and biospecimen information as strictly defined structured metadata. For a more detailed overview of the available tools for TCGA data we refer to the work [36], where the authors identify two main categories of TCGA tools, for data extraction and for integrative data analysis. We can use this distinction and classify our novel system in the first category.
The rest of this manuscript is organized as follows. Section 2 presents the Genomic Data Model and its application to the different data types retrieved from the GDC; there, we also describe the pipeline used to build metadata from Clinical/Biospecimen Supplements and from additional information retrieved through the GDC APIs, and illustrate the procedure for detecting and removing redundant metadata attributes. In Section 3 we show the architecture of our novel software system, OpenGDC, for the extraction, harmonization and extension of genomic data and metadata from the GDC; we also describe the structure of the created FTP repository, containing all the publicly accessible genomic and clinical data of the TCGA program of the GDC and their harmonized and extended OpenGDC version, already produced using our software system. Section 4 shows examples of querying and processing of the new OpenGDC data with GMQL, highlighting the advantages provided by the performed harmonization and extension. In Section 5 we discuss the main aspects of our contribution, summarize our final remarks and mention future developments.
2. Methods
In this section we describe the standardization of GDC experimental data and Clinical/Biospecimen Supplements through the application of GDM, which represents the genomic experimental data in Browser Extensible Data (BED) format [37] and their biological/clinical properties (i.e., metadata) in a key-value format. Genomic data are extended with additional information extracted from external public databases. Using GDM, experimental data are unified into a single format, thus becoming homogeneous, coherent, and comparable. Metadata are also unified, as the original GDC metadata formats are all mapped to a single format of key-value pairs, although the keys and the number of pairs may vary across different datasets. Because of the heterogeneous nature of the data, it is not possible to know a priori all the clinical, biological and experimental properties of the experimental samples; these are produced as a result of metadata mapping. Furthermore, to generate metadata we develop intelligent procedures for identifying the redundant metadata information that is present in the two different GDC sources: Clinical/Biospecimen Supplements from the data portal and GDC Data Model information from the APIs.
In the next two subsections we detail the genomic data and metadata formats obtained by applying the Genomic Data Model to all open data types provided by the GDC.
2.1. The Genomic Data Format
For genomic data, we use a free-BED data representation, in which we fix the coordinate fields (chromosome, start position, end position, strand) and include additional fields according to the specific type of experiment; for every data type we provide a specific ready-to-use schema in XML format. We implemented automatic procedures for converting the original GDC genomic data into such free-BED format; to index our BED output files, we introduce the opengdc_id, an extension of the aliquot Universal Unique Identifier (UUID), which is the unit of analysis for GDC genomic data, identifying an analyzed sample portion. Since in the GDC an aliquot relates to different data types, the opengdc_id concatenates the aliquot UUID with the specific data type. In the following, we provide an overview of the input and output data of our standardization procedures; for a detailed description of all input and output fields of each data type, the reader may refer to the OpenGDC Format Definition (Supplementary File 1).
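The composition of the identifier can be sketched as follows; the separator and the data type suffix naming shown here are illustrative assumptions, not the verbatim OpenGDC convention.

```python
# Hedged sketch of composing an opengdc_id from an aliquot UUID and a
# data type; separator and suffix naming are illustrative assumptions.
def make_opengdc_id(aliquot_uuid: str, data_type: str) -> str:
    suffix = data_type.lower().replace(" ", "_")
    return f"{aliquot_uuid}-{suffix}"

# Hypothetical aliquot UUID, for illustration only.
print(make_opengdc_id("aliquot-uuid-0000", "Gene Expression Quantification"))
# -> aliquot-uuid-0000-gene_expression_quantification
```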
Gene Expression Quantification data are provided in the GDC for each aliquot in three tab-delimited files, each of which reports the Ensembl ID of the gene and one of the following values: FPKM, the number of Fragments Per Kilobase of transcript per Million mapped reads; FPKM-UQ, the Upper Quartile normalized FPKM value; counts, the number of reads aligned to each gene, calculated by HTSeq.
We merge the content of these files using the common Gene_Ensembl field. Then, we extract additional information to describe the gene regions. In the final free-BED structure we include the genomic coordinates (i.e., chromosome, start position, end position and strand), the gene_symbol from GENCODE (human genome version GRCh38 annotation), and the corresponding entrez_gene_id from the NCBI genome annotation.
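A minimal sketch of this merge step, assuming headerless two-column tab-delimited input files (as distributed by the GDC) and illustrative file names:

```python
# Hedged sketch: merge the three GDC expression files of one aliquot on
# their shared Ensembl gene identifier. File names are illustrative.
import pandas as pd

fpkm = pd.read_csv("aliquot.FPKM.txt.gz", sep="\t",
                   names=["gene_ensembl", "fpkm"])
fpkm_uq = pd.read_csv("aliquot.FPKM-UQ.txt.gz", sep="\t",
                      names=["gene_ensembl", "fpkm_uq"])
counts = pd.read_csv("aliquot.htseq.counts.gz", sep="\t",
                     names=["gene_ensembl", "htseq_count"])

# Drop the HTSeq summary rows (e.g., "__no_feature") before merging.
counts = counts[~counts["gene_ensembl"].str.startswith("__")]

merged = (fpkm.merge(fpkm_uq, on="gene_ensembl")
              .merge(counts, on="gene_ensembl"))
```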
MiRNA Expression Quantification data are derived from the sequencing of micro RNAs (i.e., miRNAs). They contain information about the nucleotide sequence and the expression of miRNAs. One file per aliquot is provided by the GDC, where each row refers to a single miRNA and contains its expression computed on all reads aligning to that particular miRNA. In the free-BED output we keep all fields provided in input, with the addition of the miRNA genomic coordinates extracted from miRBase and the corresponding entrez_gene_id and gene_symbol extracted from HGNC.
Isoform Expression Quantification data contain expression profiles calculated for each isoform of the miRNA sequence. The GDC provides one file for each aliquot, where each row refers to a single isoform. For the free-BED structure, all input fields are left unchanged with the exception of the isoform_coords field, which is parsed to obtain separate genomic coordinate fields. In addition, we retrieve the corresponding entrez_gene_id and gene_symbol from HGNC.
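The parsing of the isoform_coords field can be sketched as follows; the assumed input layout (assembly:chromosome:start-end:strand) reflects the GDC isoform files, while the example value is illustrative.

```python
# Hedged sketch: split the isoform_coords field into separate genomic
# coordinate fields; the example value is illustrative.
def parse_isoform_coords(coords: str):
    assembly, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end), strand

print(parse_isoform_coords("hg38:chr1:17369-17436:-"))
# -> ('chr1', 17369, 17436, '-')
```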
A copy number variation (CNV) is a variation in the number of copies of a given genomic segment per cell. The GDC provides two data types related to CNVs: Copy Number Segment (including both germline and somatic CNVs) and Masked Copy Number Segment (including only somatic CNVs). The internal representation is the same for both data types. A single experiment is represented by a tab-delimited file, where each row refers to a single CNV. For the free-BED representation we reuse all input fields except for the sample id; we add the strand field—required for the BED standard—which we always set to ‘unknown’ using the wildcard character ‘*’.
Masked Somatic Mutation experiments discover mutations by aligning DNA sequences derived from tumor samples to sequences derived from normal samples and to a reference sequence. A Mutation Annotation Format (MAF) file is used to specify, for each sample, the discovered putative or validated mutations, to categorize those mutations (SNP, deletion, or insertion) as 'somatic' (i.e., originating in the tissue) or 'germline' (i.e., originating from the germline), and to specify additional information about the mutations. Four MAF files are provided by the GDC for each tumor sample, each representing DNA-sequencing data. Each file is generated by a specific analysis pipeline [38,39,40,41] and includes 125 attributes. By merging the four input files, we define a free-BED structure with 18 fields including the main information, such as the genomic coordinates, the corresponding gene_symbol and entrez_gene_id (if the mutation involves a gene), the type of mutation, the tumor and matched normal sequencing alleles 1 and 2, and the aliquot barcode/UUID for the tumor and matched normal samples.
A DNA methylation experiment consists of deep sequencing of bisulfite-treated DNA. Methylation occurs as the covalent modification of cytosine bases at the C-5 position, generally within a CpG sequence context. If DNA methylation occurs in promoter regions, it is an epigenetic mark associated with the repression of the transcripts of the promoter gene. We consider both the Illumina Infinium HumanMethylation27 (HM27) and HumanMethylation450 (HM450) DNA methylation platforms, used for measuring the level of methylation as beta values at 27,578 and 485,577 known CpG sites, respectively. By using probe sequence information provided in the manufacturer manifest, HM27 and HM450 probes are remapped to the GRCh38 reference genome. These probe coordinates are then used to identify the associated transcripts from GENCODE, the associated CpG island (CGI), and the position of the CpG site with respect to the island. For each methylated site the GDC reports a list of gene symbols; the genes that fall within 1500 bp from the methylated site are used, considering a gene as extending from its transcription start site (TSS) to the end of the gene body. For each aliquot, the GDC provides a tab-delimited Methylation Beta Value data file with 11 fields. We define a free-BED structure composed of 18 fields, which includes all original fields with the addition of the strand, the entrez_gene_id retrieved from GENCODE or HGNC, the ensembl_transcript_id, the position_to_tss (distance in base pairs of the CpG site from each associated transcript start site; negative values indicate that the CpG site is located downstream with respect to the TSS), and the cgi_coordinate (i.e., the start and end coordinates of the CpG island associated with the CpG site). Moreover, we filter out the methylation sites with missing beta values (i.e., not measured or with unreliable measurement) and, in case the methylated CpG dinucleotide is outside a gene region, we report the gene symbol at minimum bp distance from it.
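These last two steps can be sketched as follows; the record layout (a beta value plus a list of (gene_symbol, position_to_tss) pairs) is a deliberate simplification of the actual Methylation Beta Value fields, and the example values are illustrative.

```python
# Hedged sketch: drop CpG sites with missing beta values, then report the
# gene whose TSS is closest to the CpG site. Values are illustrative.
def nearest_gene(genes):
    """genes: list of (gene_symbol, position_to_tss) pairs."""
    return min(genes, key=lambda g: abs(g[1]))[0]

records = [
    {"beta_value": "0.85", "genes": [("GENE_A", -1392), ("GENE_B", 4651)]},
    {"beta_value": "NA", "genes": [("GENE_C", 210)]},  # unreliable: dropped
]
kept = [r for r in records if r["beta_value"] != "NA"]
for r in kept:
    r["gene_symbol"] = nearest_gene(r["genes"])  # -> GENE_A for first record
```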
2.2. Metadata Format
Each experimental BED file is associated with a metadata file containing a list of key-value pairs. Metadata files are also indexed with an opengdc_id, which identifies the pair of BED-metadata files. To populate the OpenGDC metadata files, we retrieve clinical/biospecimen information from the GDC data types called Clinical and Biospecimen Supplements. In addition, we consider other properties retrieved using the GDC APIs (specifying the aliquot uuid and the data type as parameters).
Clinical and Biospecimen Supplements are a special data type that contains data documentation; this information is stored in two different XML format files, originally provided by Biospecimen Core Repositories (BCRs) under contract of the National Cancer Institute (NCI). A Clinical Supplement is a collection of information about demographics, medical history (i.e., diagnosis, treatments, follow-ups, and molecular tests), and family relationships (i.e., exposure and history) of a particular patient. A Biospecimen Supplement instead includes information associated with the physical sample taken from a patient and its processing.
2.3. Metadata Extraction and Composition
The content of an OpenGDC metadata file is obtained by taking into account: (i) the BCR Biospecimen and Clinical Supplements; (ii) the information retrieved through the GDC APIs; (iii) additional manually curated attributes computed within our standardization pipelines. Given a converted experimental data file in free-BED format, identified by an opengdc_id, the corresponding metadata file is generated according to the pipeline shown in Figure 1.
In the top left corner of Figure 1, we consider Biospecimen and Clinical Supplements; they are organized by patient (identified by the bcr_patient_uuid attribute), with a patient typically related to many aliquots. Multiple OpenGDC metadata files are created, one for each aliquot reported in the patient biospecimen file. We replicate the full content of the Clinical Supplement of a patient over all metadata files regarding the aliquots of that patient; the resulting metadata attribute keys start with the clinical__ prefix. A Biospecimen Supplement, instead, contains a unique section on the patient, but also distinct sections on multiple samples, their portions, and the resulting aliquots. In each aliquot metadata file we replicate the common parts about the patient (and, in case, about related samples/portions), while the remaining content of the biospecimen file is divided among the different metadata files according to the specific aliquot each of them refers to; the resulting metadata attribute keys start with the biospecimen__ prefix.
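A hedged sketch of this flattening into prefixed key-value pairs; the traversal is generic, since the real BCR files are nested and namespaced, and the file name is illustrative.

```python
# Hedged sketch: flatten a Clinical Supplement XML into clinical__-prefixed
# key-value pairs; real BCR schemas are richer, so this generic traversal
# is a simplification. The file name is illustrative.
import xml.etree.ElementTree as ET

def flatten_supplement(xml_path: str, prefix: str) -> dict:
    pairs = {}
    for elem in ET.parse(xml_path).getroot().iter():
        tag = elem.tag.split("}")[-1]        # strip the XML namespace
        if elem.text and elem.text.strip():
            pairs[prefix + tag] = elem.text.strip()
    return pairs

clinical_meta = flatten_supplement("clinical.supplement.xml", "clinical__")
```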
In the bottom left corner of Figure 1, we query GDC Data Model elements using the GDC RESTful APIs. We call the API services once for each aliquot listed in a Biospecimen Supplement and for each data type of interest, by specifying the aliquot uuid and the data type, and then associate with each OpenGDC data file all the information retrieved in the obtained response. The extracted attributes describe a data file along different GDC Data Model conceptual areas (i.e., administrative, biological, clinical and analysis). Relevant administrative entities include the Program (i.e., the broad framework of goals to be achieved by multiple experiments, such as TCGA), the Project (i.e., the specifically defined piece of work that is undertaken or attempted to meet a single requirement, such as TCGA-LAML, which refers to Acute Myeloid Leukemia), and the Case (i.e., the collection of all data related to a specific subject in the context of a specific project, such as a patient). Among the biological entities there are the Sample (i.e., any material sample taken from a biological entity for testing, diagnostic, propagation, treatment, or research purposes) and the Aliquot (i.e., pertaining to a portion of the whole; any one of two or more samples of something, of the same volume or weight). Clinical entities include the Treatment (i.e., therapeutic agents provided, or to be provided, to a patient to alter the course of a pathologic process) and the Diagnosis (i.e., data from the investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms). Analysis entities include harmonization pipelines such as "Copy Number Variation" and "Methylation Liftover", each related to one data type.
In case an OpenGDC data file corresponds to n original GDC files, the JSON response to the corresponding API call is divided into n partitions, each containing information on one single GDC original file and on the related aliquot (the information of the latter is replicated in each partition). Then, in the final OpenGDC metadata file, we group the information from the original files (by concatenating multiple values in a single key-value pair), while we consider the aliquot information only once. All these metadata attribute names are prefixed with gdc__ and obtained by flattening the hierarchical structure of the JSON responses, i.e., through concatenation of the JSON keys at each traversed level of the response structure.
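The flattening and grouping can be sketched as follows; this is a minimal recursive traversal over generic JSON input, not the exact OpenGDC implementation.

```python
# Hedged sketch: flatten a nested GDC API JSON response into gdc__-prefixed
# keys, concatenating keys with "__" at each level and joining with commas
# the values contributed by multiple original files (list entries).
def flatten(obj, prefix="gdc"):
    pairs = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs.update(flatten(value, f"{prefix}__{key}"))
    elif isinstance(obj, list):
        for item in obj:
            for key, value in flatten(item, prefix).items():
                pairs[key] = f"{pairs[key]},{value}" if key in pairs else value
    else:
        pairs[prefix] = str(obj)
    return pairs

response = {"cases": [{"samples": [{"sample_type": "Primary Tumor"}]}]}
print(flatten(response))
# -> {'gdc__cases__samples__sample_type': 'Primary Tumor'}
```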
In addition to the GDC inputs, we generate a set of manually curated key-value pairs (gathered in the group of metadata keys prefixed with manually_curated__). These contain information that is missing in the GDC and derived from other sources or specified by our system. We add the data format (e.g., BED textual format), the URLs of the data and metadata files on the FTP server publicly offered by OpenGDC (see Section 3 for details about the OpenGDC software and the FTP repository), the genome build (i.e., reference assembly), the id, checksum, size and download date of the data file, and the tissue status, which indicates whether it is a normal (control) or tumoral sample.
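An illustrative excerpt of such pairs follows; tissue status appears again in Section 4.3, while the other key names and all values here are hypothetical.

```python
# Illustrative manually curated key-value pairs; values are hypothetical,
# and key names other than tissue_status are assumed for illustration.
manually_curated = {
    "manually_curated__data_format": "bed",
    "manually_curated__genome_build": "GRCh38",
    "manually_curated__tissue_status": "tumoral",   # or "normal"
    "manually_curated__download_date": "2020-05-28",
}
```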
Combining Clinical/Biospecimen Supplement information with GDC Data Model information leads to value redundancy, due to the fact that no specific data model exists for the Supplement data and it is impossible to determine a priori which information is non-overlapping. We ascertained the presence of attributes holding different names but the same semantics and associated values. We profiled all input data, obtaining sets of different keys that present the same values within a same metadata file. Example groups of key-value pairs with different keys and the same value, along with the corresponding candidate key preserved in each group, are shown in Table 1. This preliminary profiling activity provided guidance to create a list of data redundancy heuristics, with the aim of removing the redundant metadata attributes and their values, applied by the Data redundancy solver (at the center of Figure 1).
The heuristics have been primarily devised as a result of a long email exchange with the GDC Support team ([email protected]) that helped us understand how the ingestion process works: a restricted number of attributes from the supplements are already provided with a defined mapping to the data model attributes, while for others the relation is still uncertain (i.e., not curated yet by the GDC); for these we reconstructed common semantics through a semi-automated approach.
Moreover, clinical and biospecimen supplements cover overlapping semantic spaces (as can be understood from their definitions in Section 2.2). Thus, when the same information appears in both, we make the deliberate decision of extracting only one of them.
Finally, the new data model entities are non-overlapping, but the APIs provide their content in a nested fashion. For example, a project is related to a case with a functional dependency; therefore, the project information can be uniquely reached through the case entity. As a consequence, any information related to the case__project group is redundant w.r.t. the one given by a dual attribute with the same suffix. Analogously, aliquots are contained in analytes (N aliquots are in 1 analyte); therefore, we keep the information that is most specific, pertaining to the aliquot.
We have summarized our approach to solve redundancy in the following rules, applied by the Data redundancy solver (at the center of Figure 1); they cover the whole space of possibilities at the time of writing this manuscript, but the set will be updated as the need for new rules arises, in conjunction with scheduled OpenGDC releases (a sketch of the solver follows the list):
when a field from the BCR Biospecimen Supplement is redundant w.r.t. a field of the BCR Clinical Supplement, keep the first one;
when a field belonging to the case group is redundant w.r.t. a case__project group field, keep the first one;
when a field belonging to the analytes group is redundant w.r.t. an analytes__aliquots group field, keep the second one.
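A deliberately simplified sketch of the solver, under the assumption that redundant attributes are detected by grouping keys that hold an identical value within one metadata file; only the first rule (source priority between biospecimen__ and clinical__ keys) is encoded, and the group-level rules can be expressed analogously on the key structure.

```python
# Hedged sketch of the Data redundancy solver: group keys holding the same
# value within one metadata file, then keep a single candidate key per group.
# Only rule 1 is encoded here; rules 2 and 3 follow the same pattern.
from collections import defaultdict

PRIORITY = ("biospecimen__", "clinical__")   # rule 1: biospecimen wins

def rank(key: str) -> int:
    for i, prefix in enumerate(PRIORITY):
        if key.startswith(prefix):
            return i
    return len(PRIORITY)

def solve_redundancy(pairs: dict) -> dict:
    groups = defaultdict(list)               # value -> keys holding it
    for key, value in pairs.items():
        groups[value].append(key)
    return {min(keys, key=rank): value for value, keys in groups.items()}
```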
To facilitate the use of metadata key-value pairs, in case keys are very long and cumbersome, we simplify them through the Data renaming module, which applies renaming rules according to a match-and-replace strategy based on regular expressions. With respect to the original keys retrieved from the GDC APIs, we usually leave the rightmost part unchanged (i.e., the last subgroup and the attribute name); this ensures that the attributes remain uniquely identified. As an example, gdc__cases__samples__portions__analytes__aliquots__aliquot_id becomes gdc__aliquots__aliquot_id. The three levels of the resulting attribute, separated by double underscores, identify respectively an attribute retrieved through the GDC APIs ("gdc"), belonging to the "aliquots" entity of the GDC Data Model, and specifically indicating the identifier of the represented aliquot (i.e., "aliquot_id"). Examples of renaming rules and their results are shown in Table 2.
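A minimal sketch of such a match-and-replace rule in Python; the regular expression below reproduces the example just given and is only illustrative of the actual rule set summarized in Table 2.

```python
# Hedged sketch of the Data renaming module: regular-expression rules
# collapse the intermediate path of an API-derived key while preserving
# its rightmost entity and attribute name. The rule shown is illustrative.
import re

RULES = [
    (re.compile(r"^gdc__cases__samples__portions__analytes__(aliquots__\w+)$"),
     r"gdc__\1"),
]

def rename(key: str) -> str:
    for pattern, replacement in RULES:
        key = pattern.sub(replacement, key)
    return key

assert rename(
    "gdc__cases__samples__portions__analytes__aliquots__aliquot_id"
) == "gdc__aliquots__aliquot_id"
```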
4. Use Case Examples
In this section, we show some examples of application of the GenoMetric Query Language (GMQL [28]) on the OpenGDC standardized data, in order to highlight the advantages of our data representation in terms of information retrieval and integrative processing. GMQL is a high-level domain-specific query language. It can be executed in the system architecture described in [29], which is specific for genomic data processing. The currently available version of the GMQL system uses Apache Spark [43] as its backbone; along with other design choices, this provides high scalability in cloud computing. The GMQL system contains a multiplicity of public genomic datasets from a variety of sources [44], ready to be used within tertiary analysis pipelines (as shown in [29]); among other sources, it features all the datasets available in the OpenGDC FTP service, providing an interface for browsing and processing data curated in OpenGDC. The produced datasets are also made available within another system, GenoSurf (available at [45]) [46], a semantic search engine based on a Conceptual Model [47] that integrates TCGA data, imported by OpenGDC, with several sources such as ENCODE [48], Roadmap Epigenomics [49], and 1000 Genomes [50], among others, using the META-BASE integration framework [51].
In the following, we propose three use cases along with their GMQL queries (available in the Supplementary File 6, ready to be executed on [52], and alternatively expressible using the Python package [53]); we focus on query aspects, acting on both region data and metadata, which highlight the strengths of the datasets produced by OpenGDC, i.e.: (1) enabling the combined use of metadata derived from the GDC Data Model, the Clinical/Biospecimen Supplements, and our manually curated additions; (2) providing positional information (i.e., genomic coordinates) in a standardized structure, which encourages data inter- and intra-source interoperability; (3) allowing the joint use of different data types, even from different sources (e.g., gene expression and methylation, or mutations and gene annotations), based on common gene identifiers (e.g., the HUGO gene symbol) or genomic positions.
4.1. Use Case 1: For Kidney Cancers, Find Mutations and Their Number in Each Exon
For this example, we consider TCGA public somatic mutation data samples of Kidney Adenoma and Adenocarcinoma patients, contained in three TCGA projects, i.e., Kidney Chromophobe (KICH), Kidney Renal Clear Cell Carcinoma (KIRC) and Kidney Renal Papillary Cell Carcinoma (KIRP), and extract novel mutations (i.e., not listed in dbSNP [54]) in gene exons. For each sample, we count the mutations occurring in each exon, filter out the exons without any mutation, and finally return the remaining mutated exons, equipped with their number and the maximum number of mutations in one exon.
In this example: (1) we use GDC mutation data in combination with a GENCODE annotation dataset, demonstrating the interoperability of OpenGDC curated data with other sources; (2) we seamlessly use metadata from the GDC APIs (i.e., first and second conditions in line 2 of Listing 1) and from the Clinical Supplements (third and fourth conditions in lines 3 and 4 of Listing 1), which is not possible on the GDC portal, where only the former are supported; (3) we select three TCGA projects together by using the characterization of the tissue and the classification of diseases (note that the OpenGDC normalized metadata attribute gdc__disease_type represents the type of malignant disease, categorized according to the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O), while the attribute gdc__project__disease_type contains the full name of the project). The output dataset contains in total 227 samples, with 15,517 exon regions and 296 distinct metadata attributes.
4.2. Use Case 2: In Breast Invasive Carcinoma, Find the Genomic Regions Whose miRNA Expression Counts Are above Average in at Least 10% of Tumoral Samples
We translate these specifications into selecting TCGA miRNA expression samples corresponding to patients affected by primary tumors of Breast Invasive Carcinoma, and into selecting the miRNA regions that exhibit a value of reads_per_million_mirna_mapped (in the miRNA Expression Quantification data type, the read count normalized as reads-per-million-miRNA-mapped associated with each miRNA ID) above the average of the dataset in 10% or more of such samples. We first use a simple query (lines 3–8 in Listing 2) to evaluate the average of the miRNA normalized reads. To keep the query as light as possible in terms of computation time, from the selected TCGA dataset we PROJECT only the required field, MERGE all samples into one, compute the average as a metadata attribute (avg_reads) and MATERIALIZE a small dataset in order to get the required average value (531.6 for the considered data). We then perform a query to filter out the miRNA regions that present a reads_per_million_mirna_mapped value equal to or below the calculated average of the dataset (lines 11–13 in Listing 2). In addition, we use COVER to extract, in one sample, only the remaining miRNA regions that are present in at least 10% of the dataset samples, and we equip each extracted region with: (1) the number of samples in which the region is expressed above average; (2) the list of co-located genes, using specifically the entrez_gene_id region attribute, which is a new attribute added in the OpenGDC data with respect to the original GDC data. The output dataset contains a sample with 102 miRNA regions (with reads_per_million_mirna_mapped above average) out of the 1881 distinct ones considered in the initial dataset.
4.3. Use Case 3: In a Comparative Study, for Both Normal and Tumoral Tissue Samples of Each Patient Affected by Cholangiocarcinoma, Extract the Expression and Average Promotorial Methylation Levels of Each Gene
In the OpenGDC standardized data of TCGA, using our manually_curated__tissue_status metadata attribute (added with respect to the original GDC data) with value "normal", we can select normal samples of five different types at once (i.e., Blood Derived Normal, Solid Tissue Normal, Buccal Cell Normal, EBV Immortalized Normal, Bone Marrow Normal, corresponding to sample type codes 10–14, see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes). Similarly, the value "tumoral" of the same attribute refers to ten different types of cancer samples (corresponding to sample type codes 01–09 and 40). Since the methylation sites of interest for gene expression regulation are typically located in the surroundings of a gene TSS, we consider methylation values only in the promotorial region of each gene, extracted around the gene TSS from 2000 bases upstream to 1000 bases downstream (lines 5 and 19 in Listing 3); for gene expression data we only keep the fpkm values and the gene_symbol (line 5 in Listing 3), while for methylation data only the beta_values (line 10 in Listing 3).
Note that the code described in Listing 3 lines 1–13 for normal samples is repeated in lines 15–27 for tumoral samples. For methylation data, we compute the average beta_value in each gene promoter. With the MAP at line 13 we associate each gene expression and promotorial region (in each sample of the normal N_EXPR dataset) with the average of the methylation beta_values in the gene promotorial region (in a sample of the normal N_METH dataset); N_EXPR and N_METH samples are matched only if belonging to the same tissue sample (uniquely identified by the gdc__samples__sample_id).
At line 30 of Listing 3, the datasets resulting from lines 13 and 27 are combined using a JOIN operation, which allows associating each gene promotorial region with the gene_symbol and with the gene expression fpkm value and the methylation avg_beta_value from both the normal and tumoral samples of a patient. Note that the equi predicate on_attributes can only be applied thanks to the addition of the gene_symbol attribute in the OpenGDC gene expression data (as the original GDC data did not include it).
Lines 33–38 in Listing 3 are only needed for shaping the results into a convenient format, as can be appreciated in Table 4, which contains an excerpt from the result dataset (in the column names of Table 4 we use the subscripts n and t for normal and tumoral, respectively).
Occurrences of null in the average beta values correspond to cases where no methylation probes are located in the specified gene promotorial region. Overall, the output dataset contains 9 samples with about 60,670 distinct regions each.
5. Conclusions and Future Work
In this work, we presented a novel approach and its implementation in a set of automatic procedures able to extract, integrate, extend and standardize genomic and clinical data of The Cancer Genome Atlas as included in the Genomic Data Commons portal. Our approach and software were applied to multiple data types obtained from different types of NGS experiments (i.e., Gene-, miRNA-, Isoform-Expression Quantification, Masked Somatic Mutation, Copy Number Segment, Masked Copy Number Segment, Methylation Beta Value). Additionally, we considered clinical and biospecimen information about the experimental data.
To reach our objective, we took advantage of the Genomic Data Model, which allowed us to represent an experimental sample by its genomic regions and its related metadata. The genomic regions are defined by their genomic coordinates (chr, left, right, strand) and genomic features, which are produced by the specific NGS experiment. Conversely, metadata report clinical and biological properties in attribute-value pair format.
Based on the GDM representation, we implemented OpenGDC, a software system that retrieves TCGA experimental data from the GDC portal and then processes them with ad hoc procedures for each data type. Our standardization procedure provides all the data in free-BED format, which contains a set of experiment-specific fields in addition to the genomic coordinates. In order to obtain this standardized format, the software is able to automatically extract additional features from external data sources (e.g., GENCODE, HGNC and miRBase), which are not provided in the original GDC data files. The software also integrates experimental data with clinical and biospecimen information derived from different GDC sources.
Our pipeline extracts metadata attributes from the original Clinical and Biospecimen Supplements and from the GDC RESTful APIs. The obtained attributes are merged into a single metadata file for each experiment, using a tab-delimited key-value format. Then, two software components are used in the metadata pipeline: (i) the Data Redundancy Solver, to detect and remove redundant metadata attributes, and (ii) the Data Renaming Module, to redefine attribute names. In particular, data profiling is performed to identify redundant attributes, i.e., attributes with the same values and different names. All these procedures and their input/output data types are thoroughly described in the OpenGDC Format Definition document, available as Supplementary File 1.
We collected the standardized genomic data and metadata in an FTP repository, which we made publicly available at [42]. We also showed usage examples of these data through the application of GMQL queries, to highlight the validity and utility of our approach. They demonstrate that our data representation facilitates data retrieval, integrated processing and analyses, especially thanks to the combination of filtering on specific clinical/biospecimen attributes and extraction of genomic features.
Future work concerns the application of our data representation and software pipeline to other projects integrated in the GDC portal and to other cancer-related repositories, in order to facilitate knowledge discovery over multiple cancer data. Additionally, we plan to use our approach and software to further enhance the data integration among different biomedical public repositories. Finally, we are going to take advantage of the standardized data, which are easily processable by several state-of-the-art bioinformatics tools, in order to perform new knowledge extraction analyses about cancer.