*Article* **E**ff**ects of High-Dose Ionizing Radiation in Human Gene Expression: A Meta-Analysis**

**Dimitrios S. Kanakoglou 1,2,**† **, Theodora-Dafni Michalettou 1,2,3,**† **, Christina Vasileiou 1,3 , Evangelos Gioukakis 1,3 , Dorothea Maneta 1,3, Konstantinos V. Kyriakidis 1,4 , Alexandros G. Georgakilas <sup>3</sup> and Ioannis Michalopoulos 1,\***


Received: 13 February 2020; Accepted: 9 March 2020; Published: 12 March 2020

**Abstract:** The use of high-dose Ionizing Radiation (IR) is currently one of the most common modalities in treatment of many types of cancer. The objective of this work was to investigate the effects of high-dose ionizing radiation on healthy human tissue, utilizing quantitative analysis of gene expression. To this end, publicly available transcriptomics datasets from human samples irradiated with a high dose of radiation and non-irradiated (control) ones were selected, and gene expression was determined using RNA-Seq data analysis. Raw data from these studies were subjected to quality control and trimming. Mapping of RNA-Seq reads was performed by the partial selective alignment method, and differential gene expression analysis was conducted. Subsequently, a meta-analysis was performed to select differentially expressed genes across datasets. Based on the differentially expressed genes discovered by meta-analysis, we constructed a protein-to-protein interaction network, and we identified biological pathways and processes related to high-dose IR effects. Our findings suggest that cell cycle arrest is activated, supported by our top down-regulated genes associated with cell cycle activation. DNA repair genes are down-regulated in their majority. However, several genes implicated in the nucleotide excision repair pathway are upregulated. Nevertheless, apoptotic mechanisms seem to be activated probably due to severe high-dose-induced complex DNA damage. The significant upregulation of CDKN1A, as a downstream gene of TP53, further validates programmed cell death. Finally, down-regulation of TIMELESS, signifies a correlation between IR response and circadian rhythm. Nonetheless, high-dose IR exposure effects regarding normal tissue (radiation toxicity) and its possible long-term outcomes should be studied to a greater extend.

**Keywords:** high-dose ionizing radiation; RNA-Seq; differential gene expression; DNA damage response

### **1. Introduction**

Regarding human exposure to Ionizing Radiation (IR), doses below 0.1Gy are classified as "low" [1], while doses normally used in medical procedures, such as Radiation Therapy (RT) (2-3Gy) are classified as high [2,3]. As such, apart from understanding the biological consequences of rare very high-dose exposures like in the case of nuclear accidents, a major field of radiobiological and clinical interest is the optimization of RT which usually involves moderate to high fraction doses.

The main therapeutic modality employed in RT are photon beams in the form of low Linear Energy Transfer (LET) radiation (X-rays, Gamma-rays), although high LET radiation (protons, alpha particles, and other heavy ions) are sometimes incorporated due to their precise dose localization. Radiation particles deposit more energy on the targeted tumor areas (a phenomenon known as the "Bragg peak" [4]) and have a higher Relative Biological Effectiveness (RBE) [5–7], while photon beams deposit a relatively small quantity of energy that disperses further to the surrounding healthy tissue due to scattering phenomena. In general, the idea behind RT is that the rapidly proliferating cancer cells are usually more sensitive to radiation than normal cells, while normal cells can usually repair themselves at a faster rate and retain their normal function. Therefore, the goal is to inhibit cancer cell multiplication potential, eventually leading to cell death, while minimizing dosage absorption in normal tissue, to prevent toxicity [5,8]. Nonetheless, IR exposure effects regarding healthy tissue (radiation toxicity) and its possible long-term outcomes should be studied to a greater extend.

Cancer and healthy cells are targeted alike, either directly through damage on their cellular molecules and especially on DNA strands or indirectly via the formation of free radicals, a phenomenon referred to as oxidative stress [9]. In essence, oxidative stress is a procedure of water radiolysis and involves the formation of intermediate, partially reduced oxygen species, collectively termed as Reactive Oxygen Species (ROS), that give rise to the formation of hydroxyl radicals that produce a number of adverse biological reactions by attacking structural and functional molecules [10], thus resulting in generalized cellular stress. Hydroxyl radicals can sometimes indirectly produce Single-Strand Breaks (SSBs) and a plethora of base and sugar lesions in DNA molecules, which can be cytotoxic or mutagenic [11], as well as crosslinks between two complementary DNA strands [12]. On the other hand, direct DNA damage primarily involves the induction of Double-Strand Breaks (DSBs) that represent the most lethal types of DNA damage, leading to cell death or genomic instability if left unrepaired. Finally, closely spaced DNA lesions (referred to as complex or clustered DNA damage) that may occur after IR exposure have been suggested to be highly repair-resistant or non-repairable. Therefore, they are considered highly significant biological lesions [13]. This continuously challenging process may lead to genomic instability and cancer [14], concurrently fueling DNA Damage Response (DDR) activation [15] which constitutes the main component of IR effects on a cellular level.

In general, DDR can be defined as the synthesis of functions (sensors, transducers, effectors) that orchestrate DNA damage sensing and signal transduction, triggering either DNA repair, cell survival, or cell death (apoptosis). Furthermore, pathways of cell cycle checkpoint control are also essential components of DDR [16,17]. Main pathways of DNA repair include Base Excision Repair (BER) and Nucleotide Excision Repair (NER), which repair DNA base damages, and Mismatch Repair (MMR), which corrects base mispairs and small loops that are often found in repetitive sequence DNA. In addition, Homology-dependent Recombination (HR) and Non-Homologous End Joining (NHEJ) act alone or together to repair DSBs and complex events such as inter-strand crosslinks [18,19]. Dysregulation of DDR mechanisms can cause several human disorders that are associated with cancer susceptibility, accelerated aging, and developmental abnormalities [20]. Moreover, like other types of stress, radiation exposure affects the development of the immune system through radiation-induced apoptosis, differentiation, and induction of inflammatory environment via different components of DDR [21,22].

The complete pattern of biological responses to different doses and radiation types is unclear and currently one of the most important questions in radiation biology. The general consensus is that results of IR exposure in any living organism involve a topical and/or systemic stress. A variety of responses is induced, including—but not limited to—oxidative stress in the irradiated area or in the whole body (through systemic non-targeted effects), DDR, DNA repair, and pro-inflammatory pathway initiation [23]. From a systems biology perspective, the aforementioned cellular mechanisms, as well as other related ones, can be examined through altered gene expression. Thus, in this work, we performed a Differential Gene Expression Analysis (DGEA), in human tissues exposed to high-dose IR, taking advantage of the wealth of publicly available RNA-Sequencing (RNA-Seq) data, as previously suggested [24–26]. To this end, we carefully selected five datasets of healthy human cell samples, and in each of them, we identified Differentially Expressed Genes (DEGs) between irradiated and non-irradiated cells. Finally, we performed a meta-analysis, highlighting the common ground of high-dose effects.

### **2. Results**

### *2.1. Data Collection, Filtering, Pre-Processing and Mapping*

ENA queries identified 71 projects that fulfilled our search criteria. After manual curation, the Bioproject [27] accession numbers of the selected datasets (Table 1) were PRJNA494581 [28], PRJNA450083 [29], PRJNA421022 [30], PRJNA436999 [31], and PRJNA396832. We downloaded RNA-Seq-related FASTQ files of those Bioprojects. After the initial quality control, we performed soft trimming on raw RNA-Seq data, choosing PHRED score Q = 20 [32,33], removing on average ~2.4% of the nucleotide reads with more than 1% probability of an incorrect base call. PRJNA396832 trimming rate was 7.7% with soft trimming, due to adapter contamination and poor quality of reads.


**Table 1.** Information of experiment accessions and sample description.

Mapping of the trimmed reads to the reference genome and transcriptome (gentrome) was estimated at a rate of ~83%. There were distinct outliers during the quality control, trimming, and alignment processes, and this served as the first indicator that PRJNA396832 and PRJNA450083 were not of the desired quality for conducting our analysis. Without the outliers, mapping rate was at ~87% across the remaining 3 datasets (PRJNA421022, PRJNA436999, and PRJNA494581).

### *2.2. Di*ff*erential Gene Expression*

It was suggested that the minimal number of biological replicates required for RNA-Seq based DGEA is 6 [34]. PRJNA396832 and PRJNA450083 had a single biological replicate for each condition, and were therefore excluded for subsequent analysis, as statistical significance could not be estimated. Studies that contained combinations of experimental conditions (PRJNA436999 2 Gy for 6 and 24 h and PRJNA494581 2 Gy and 5 Gy for 20 h), were split into distinct studies (Table 2) for DGEA. DESeq2 was used to identify DEGs in PRJNA421022 (Supplementary Materials, Table S1) in one of the distinct studies of PRJNA436999 (Supplementary Materials, Table S2) and in one of the distinct studies of

**Cell Type** iPSC-Derived

**Bioproject**

PRJNA494581 (Supplementary Materials, Table S3). From 2006 genes that were considered DEGs in at least one dataset, 542 overlapped (Figure 1). Cardiomyocytes (IMR90) (HLE) **Dose** 5 Gy 2 Gy 2 Gy 2 Gy 5 Gy **Time Point** 48 h 6 h 24 h 20 h 20 h

**DEG Counts** 721 353 908 59 1003

Primary Human Lung Fibroblasts

Human Lens Epithelial Cells

*Int. J. Mol. Sci.* **2020**, *21*, x FOR PEER REVIEW 4 of 17

DGEA. DESeq2 was used to identify DEGs in PRJNA421022 (Supplementary Materials, Table S1) in one of the distinct studies of PRJNA436999 (Supplementary Materials, Table S2) and in one of the distinct studies of PRJNA494581 (Supplementary Materials, Table S3). From 2006 genes that were

**Table 2.** Statistically significant gene counts derived from the differential gene expression analysis for each comparison within each dataset. The table includes information about experimental sample

**Accession PRJNA421022 PRJNA436999 PRJNA494581**

parameters, dataset accession numbers, and total mapped gene counts for each dataset.

considered DEGs in at least one dataset, 542 overlapped (Figure 1).

**Table 2.** Statistically significant gene counts derived from the differential gene expression analysis for each comparison within each dataset. The table includes information about experimental sample parameters, dataset accession numbers, and total mapped gene counts for each dataset. Meta-analysis was conducted using the three lists that derived from the DESeq2 analysis, yielding 1322 DEGs (Supplementary Materials, Table S4): 872 DEGs were found in at least one DESeq2 analysis but not in the meta-analysis; 187 genes that were not characterized as DEGs in any


analysis. The DEG list of our meta-analysis agrees with all PCR validations.

**Figure 1.** Venn diagram illustrating overlapping Differentially Expressed Genes (DEGs) across selected DEG lists and meta-analysis results. **Figure 1.** Venn diagram illustrating overlapping Differentially Expressed Genes (DEGs) across selected DEG lists and meta-analysis results.

*2.3. Functional Enrichment Results* Functional enrichment analysis of up- and down-regulated genes resulting from meta-analysis produced lists of statistically significant Gene Ontology (GO) [35] biological processes, KEGG [36] Meta-analysis was conducted using the three lists that derived from the DESeq2 analysis, yielding 1322 DEGs (Supplementary Materials, Table S4): 872 DEGs were found in at least one DESeq2 analysis but not in the meta-analysis; 187 genes that were not characterized as DEGs in any DESeq2 analyses were considered DEGs by meta-analysis (Figure 1).

biological pathways (Table 3), and gene-targeting transcription factors (Table 4). A Protein–Protein Interaction (PPI) network of DEGs was also constructed (Figure 2). **Table 3.** Enriched biological processes (Gene Ontology—GO) and enriched biological pathways (KEGG) for up- and down-regulated genes after meta-analysis. Some of the genes that were identified as DEGs in the original RNA-Seq analyses were PCR-validated by their own groups. For PRJNA494581 dataset [28], ddPCR validated CDKN1A, AREG, H2BC13 (HIST1H2BL), GDF15, H3C11 (HIST1H31), H1-4 (HIST1H1E), H2BC10 (HIST1H2BF), and TP53INP1 as DEGs and B2M and RPL13A as non-differentially expressed genes. For PRJNA421022 dataset [30], qPCR validated PLK1, BIRC5, AURKB, KIF20A, TOP2A, and CCNA2 as down-regulated and CDKN1A and FDXR as up-regulated genes. Moreover, qPCR did not validate ANGPTL4 and SOGA3 as up-regulated genes, even if those genes were estimated as DEGs in their RNA-Seq data analysis. The DEG list of our meta-analysis agrees with all PCR validations.

### *2.3. Functional Enrichment Results*

Functional enrichment analysis of up- and down-regulated genes resulting from meta-analysis produced lists of statistically significant Gene Ontology (GO) [35] biological processes, KEGG [36] biological pathways (Table 3), and gene-targeting transcription factors (Table 4). A Protein–Protein Interaction (PPI) network of DEGs was also constructed (Figure 2).



pathways.


GO:0006260 DNA replication GO 0 hsa03030 DNA replication KEGG 0

E2F1\_Q3\_01 E2F1 E2F transcription factor 1 4.39E-05

**Table 4.** Enriched gene-targeting transcription factors for down-regulated genes after meta-analysis. GO:0097193 Intrinsic apoptotic signaling pathway GO 0.010992521 GO:0071496 Cellular response to external stimulus GO 0.01930771 GO:0104004 Cellular response to environmental stimulus GO 0.032524732

*Int. J. Mol. Sci.* **2020**, *21*, x FOR PEER REVIEW 5 of 17

**Gene Set Description Source FDR Up-Regulated Genes ↑** GO:0072331 Signal transduction by p53 class mediator GO 0.0022959

**Figure 2.** Protein–protein interaction network of DEGs from meta-analysis. Formulated protein clusters in the center of the network are associated with cell cycle processes and multiple DNA repair **Figure 2.** Protein–protein interaction network of DEGs from meta-analysis. Formulated protein clusters in the center of the network are associated with cell cycle processes and multiple DNA repair pathways.

### **3. Discussion**

RNA-Seq is gradually becoming the predominant technique for transcriptome analysis, superseding microarrays. RNA-Seq technology is more sensitive in detecting genes with low expression levels, it lacks the associated background noise of hybridization-based techniques, and it is more reproducible [24]. Furthermore, a crucial limitation of microarrays is their inability to study the expression of genes for which no probe is available on the chip. So far, there is no gold-standard methodology for analyzing the transcriptome using RNA-Seq technology. Poorly designed pipelines for differential gene expression can have detrimental effects, compromising the experimental results. Suboptimal pre-processing of raw data, directly affects the mapping process, resulting in poor mapping rates (<60%) [37].

An extensive debate [38–40] on the effects of the pre-processing step of trimming, suggests that reads should be carefully trimmed. Illumina Next-Generation Sequencing (NGS) platforms produce sequences of between 25–250 nucleotides, the colorimetric signals of which are translated by an internal Illumina software (CASAVA) to base calls and are represented in FASTQ [41] file format. Minimal trimming (Q < 10) keeps low quality base calls in NGS analyses, adding unreliable and random sequences to the final dataset [38]. However, hard trimming (Q > 30) of reads can have a particularly strong negative impact on RNA-Seq-based gene expression estimates, as it introduces unpredictable and unwanted biases [40]. Hence, we used soft trimming (Q = 20) as a balance, preserving biological information that was not ideally recorded, while discarding non-sense information.

The expected mapping rate of RNA-Seq reads is between 70% and 90% when mapped against the human genome and slightly less when mapped against the transcriptome [37]. In conventional RNA-Seq mapping, multiple reads are mapped across splice junctions [42]. The inability of conventional algorithms, like BWA [43], to handle spliced transcripts renders them obsolete. Splice-aware aligners [44], such as Hisat2 [45], handle mapping in a more efficient manner. However, these traditional approaches require significant computational resources amidst an explosively growing storage-hungry environment [46]. Alignment-independent methods, such as Salmon [47], bypass the mapping step and proceed to quantify directly transcript abundance, boasting a more lightweight and significantly faster novel approach. Our mapping rate, minus the outliers, was calculated at a satisfactory level of ~87% using Salmon's selective alignment method.

The resulting datasets combined with DESeq2 which utilizes estimates of dispersions and logarithmic fold changes by incorporating data-driven prior distributions [48], yielded substantial statistically significant results. Instead of using transcript levels, as in a classical meta-analysis, we chose Mosteller–Bush, a method which is based on the combination of weighted *z*-values for the independent studies, which are calculated from the *p*-values produced by DESeq2. PCR-based validations showed that our meta-analysis outperformed original RNA-Seq data analyses.

Although a fraction of differentially expressed genes overlap among the 3 DESeq2 studies, a meta-analysis was conducted to enhance the validity of our DGEA (Figure 1). In all three DESeq2 DEG lists (Supplementary Materials, Table S1–S3) and in the meta-analysis DEG list (Supplementary Materials, Table S4), the number of statistically significant under-expressed genes considerably surpasses the number of over-expressed genes. Such a result may indicate that cell cycle arrest has been activated, which is supported by the top down-regulated genes (i.e., MKI67, CCNA2, CDK1, PLK1, and CDCA3) identified by the meta-analysis and associated with cell cycle activation [49]. This difference between up- and down-regulated genes, as well as cell cycle arrest activation, coincides with the primary results for PRJNA421022 dataset [30]. At the same time, observed up-regulation of CDKN1A, responsible for cell cycle G1 phase arrest, in response to a variety of stress stimuli further upholds this suggestion. p21 protein coded by this gene (known to be interdependent with tumor suppressor protein TP53) is also responsible for inhibiting cellular proliferation in response to DNA damage [50]. In addition, it is highly correlated with DNA repair, while also being instrumental in the execution of apoptosis. Although p21 is assumed to play a key role as "genome guardian", it can alternatively act as the mediator of genomic instability, cellular senescence, and carcinogenesis under

certain circumstances, like IR exposure and TP53 deficiency [51]. Another substantial result is the significant up-regulation of GDF15, associated with the response to oxidative stress and induction of inflammatory environment, thus coinciding with DDR and IR response in general [49,52].

TP53 is essential in DDR mechanisms through its downstream responses, which include cell cycle arrest, DNA repair, and apoptosis. The accurate transition from G1 phase of the cell cycle to S phase is crucial for a controlled cell proliferation, and its misregulation promotes oncogenesis [53]. G1 arrest provides the cell adequate time to repair the DNA damage. Should repair be unsuccessful, TP53 levels drop and CDK-cyclin protein kinase activity resumes, leading to entry into S phase and possible apoptosis triggering. [54]. In our results, for samples selected over 20 h post irradiation, TP53 indeed shows no altered expression, while MDM2, as a TP53 downstream gene transcriptionally activated by it [55], is over-expressed.

Regarding enrichment results for down-regulated genes (Table 3), cell cycle checkpoint as well as various DNA repair mechanisms are over-represented. These results may indicate that DNA repair genes (i.e., MSH2, MSH6, XRCC3, and POLA2) are suppressed, due to programmed cell death. However, NER-associated genes DDB1, DDB2, and XPC where found up-regulated. DNA repair in general, due to its complexity, requires balanced expression of its genes in order to avoid erroneous repairs [56]. Over-represented gene-targeting factors E2F1, E2F2, RB1, TFDP1, and TFDP2 (Table 4) are connected through involvement in cell cycle G1/S phase transition, TP53 regulation, and cellular senescence [49]. Moreover, under-expression of DNA repair genes, mediated by the RB/E2F pathway, may play a causal role in senescence induction [57]. In the case of up-regulated genes, p53 signaling pathway, DDR, and apoptotic mechanisms seem to be activated. In addition, platinum drug resistance (Table 3) could arise from increased DNA repair, decreased mismatch repair, defective apoptosis, and altered oncogene expression [58].

Regarding the PPI network (Figure 2), edges corresponding to protein interactions, were constructed with sizeable restrictions regarding the validity of their sources and their assigned score. Distinct clusters were formulated, coinciding with the main components of high-dose IR response. Simultaneously, through these clusters, the network validates the biological processes and pathways derived from the enrichment results. Two major, densely packed clusters are formed, representing multiple cell cycle processes. These clusters are indicative of a collection of DNA repair pathways (GO:0006281), such as DSBs repair (RAD51, BLM, DNA2), Mismatch Repair (MSH6, MSH2), Homologous Recombination (XRCC3), and NER (DDB1, DDB2, XPC, POLE). Furthermore, TIMELESS, a gene found in the center of the network due to its contribution to DSB repair, also acts as a circadian rhythm pathway regulator [49]. Additional circadian rhythm-related results are the under-expression of TYMS and the over-expression of CRY2.

Circadian genes are known to regulate a variety of cellular processes, including cell cycle, apoptosis, and DNA damage repair [59]. Both oxidative defense mechanisms and repair of X-ray induced DSBs in DNA are synchronized by circadian rhythms; thus, RT timing needs to be coordinated as we enter personalized medicine [60]. Furthermore, disruption in circadian gene expression is associated with increased incidence of cancers and gliomas [61]. Finally, the circadian clock system also controls various parameters of the immune system and its biological defense functions [62]. This correlation between circadian clock dysregulation and IR response may reveal possible underlying mechanisms of chronic inflammatory disease development. In addition, understanding the interplay between circadian rhythm, cell cycle, cell proliferation, and DNA repair will deliver benefits in RT by reducing its side effects on healthy tissues.

Biological response to IR (high doses), especially at the organism level, is complicated and partially unknown. RNA-Seq as an -omics methodology provides information on gene expression for several thousand proteins. This opens a unique opportunity to approach this difficult task of delineating the mechanisms triggered after radiation-induced stress, at a systems biology level. Therefore, in this study, after critical screening of several RNA-Seq datasets and applying state-of-the-art bioinformatics and meta-analysis, we were able to identify: (A) 1322 DEGs (371 up-regulated, 951 down-regulated, Supplementary Materials, Table S4); (B) cell cycle checkpoint activation, apoptosis, and various down-regulated repair genes. The last probably relates to late post-irradiation time points (20–48 h), where repair is expected to be completed and the cell is sent to apoptosis. Another suggestion is that transcriptional up-regulation of DNA repair genes by genotoxic stress (p53 activation, Tables 3 and 4) is counteracted by possible DNA damage that blocks transcription [56]; (C) indication of cellular senescence; (D) association of IR response with the circadian clock. *Int. J. Mol. Sci.* **2020**, *21*, x FOR PEER REVIEW 9 of 17 Another suggestion is that transcriptional up-regulation of DNA repair genes by genotoxic stress (p53 activation, Table 3 and Table 4) is counteracted by possible DNA damage that blocks transcription [56]; C) indication of cellular senescence; D) association of IR response with the circadian clock.

#### **4. Materials and Methods 4. Materials and Methods**

data.

Our RNA-Seq analysis involves a pipeline (Figure 3) of in silico processes, each with its own specific parameters (Appendix A, Table A1), where files undergo a series of transformations. In each step, data are manipulated in a way that information is retained and expanded by additional meta-data. Our RNA-Seq analysis involves a pipeline (Figure 3) of *in silico* processes, each with its own specific parameters (Appendix A, Table A1), where files undergo a series of transformations. In each step, data are manipulated in a way that information is retained and expanded by additional meta-

**Figure 3.** Differential gene expression analysis workflow is comprised of 4 distinct steps: A) Data collection from the online repositories ENA and EnsEMBL. B) Quality control and trimming. A conventional analysis pipeline of RNA-Seq data starts with the pre-processing of the raw reads with FastQC, MultiQC, and Trim Galore! C) Gene abundance quantification with Salmon and tximeta by mapping of the reads to a reference genome and/or transcriptome. D) Differential gene expression analysis with DESeq2, where gene expression levels of all mapped transcripts are quantified and normalized in order to define differentially expressed genes. **Figure 3.** Differential gene expression analysis workflow is comprised of 4 distinct steps: (A) Data collection from the online repositories ENA and EnsEMBL. (B) Quality control and trimming. A conventional analysis pipeline of RNA-Seq data starts with the pre-processing of the raw reads with FastQC, MultiQC, and Trim Galore! (C) Gene abundance quantification with Salmon and tximeta by mapping of the reads to a reference genome and/or transcriptome. (D) Differential gene expression analysis with DESeq2, where gene expression levels of all mapped transcripts are quantified and normalized in order to define differentially expressed genes.

#### *4.1. Datasets 4.1. Datasets*

We searched for datasets available in public repositories to identify the studies that performed RNA-Seq in normal and ionized tissues. The appropriate datasets were identified using the European Nucleotide Archive (ENA) advanced search engine [63]. We narrowed our search down to human Illumina RNA-Seq studies which involved ionizing radiation. More specifically, in the read domain of ENA advanced search, our query was: *"instrument\_platform = "ILLUMINA" AND library\_strategy = "RNA-Seq" AND tax\_eq (9606)"*, and in the study domain, our query was: *"(study\_name = "\*ionizing\*"* We searched for datasets available in public repositories to identify the studies that performed RNA-Seq in normal and ionized tissues. The appropriate datasets were identified using the European Nucleotide Archive (ENA) advanced search engine [63]. We narrowed our search down to human Illumina RNA-Seq studies which involved ionizing radiation. More specifically, in the read domain of ENA advanced search, our query was: *"instrument\_platform* = *"ILLUMINA" AND library\_strategy* = *"RNA-Seq" AND tax\_eq (9606)"*, and in the study domain, our query was: *"(study\_name* = *"\*ionizing\*"*

*OR study\_title = "\*ionizing\*" OR study\_description = "\*ionizing\*" OR study\_name = "\*alpha particle\*" OR study\_title = "\*alpha particle\*" OR study\_description = "\*alpha particle\*" OR study\_name = "\*irradiation\*"*

*OR study\_title* = *"\*ionizing\*" OR study\_description* = *"\*ionizing\*" OR study\_name* = *"\*alpha particle\*" OR study\_title* = *"\*alpha particle\*" OR study\_description* = *"\*alpha particle\*" OR study\_name* = *"\*irradiation\*" OR study\_title* = *"\*irradiation\*" OR study\_description* = *"\*irradiation\*" OR study\_name* = *"\*X-ray\*" OR study\_title* = *"\*X-ray\*" OR study\_description* = *"\*X-ray\*" OR study\_name* = *"\*X ray\*" OR study\_title* = *"\*X ray\*" OR study\_description* = *"\*X ray\*" OR study\_name* = *"\*gamma ray\*" OR study\_title* = *"\*gamma ray\*" OR study\_description* = *"\*gamma ray\*" OR study\_name* = *"\*positron\*" OR study\_title* = *"\*positron\*" OR study\_description* = *"\*positron\*" OR study\_name* = *"\*radiotherapy\*" OR study\_title* = *"\*radiotherapy\*" OR study\_description* = *"\*radiotherapy\*" OR study\_name* = *"\*ionising\*" OR study\_title* = *"\*ionising\*" OR study\_description* = *"\*ionising\*" OR study\_name* = *"\*IR-induced\*" OR study\_title* = *"\*IR-induced\*" OR study\_description* = *"\*IR-induced\*") AND tax\_tree(9606)"*. The first query yielded 10,541 studies and the second query yielded 504 studies. Afterwards, we selected the overlapping study names. The studies were then manually curated in order to exclude non-irradiated, UV-irradiated, and tumor samples.

### *4.2. Raw Read Evaluation*

Extensive quality control was performed on RNA-Seq data of each sample using FastQC (*version 0.11.8*) [64], and summaries were produced by MultiQC (*version 1.8*) [65] to evaluate the integrity of RNA-Seq experiments. Quality control reports mainly involved the analysis of sequence accuracy, presence of PCR artifacts, and adaptor sequences that were not automatically cleaned by Illumina platforms, GC content, k-mer levels, etc. Surgical elimination of low-quality regions, known as "trimming", was performed—when necessary—by Trim Galore! (*version 0.6.4*) [66], a wrapper package around Cutadapt (*version 2.8*) [67], and FastQC. Consequently, results where re-evaluated with FastQC and MultiQC to verify that the quality of raw data had improved after the trimming process.

### *4.3. Sequence Alignment*

Mapping aligns trimmed sequence reads against a known genome and transcriptome. Its efficiency mainly depends on the bioinformatics tools used and the quality of the sequences. Reads were directly mapped into *Homo sapiens* reference genome and transcriptome FASTA-formatted sequences. To this end, we used the latest release of Salmon (*version 1.1.0)* [47] which adopts a selective-alignment algorithm in order to overcome the shortcomings of lightweight approaches, without the additional computational burden of traditional alignment [68]. We produced the transcriptome index for Salmon via the partial selective alignment method, mapping the transcriptome to the genome, extracting the relevant portion out of the genome, and, finally, indexing it along with the transcriptome.

### *4.4. Transcript Quantification*

The entirety of the statistical analysis was performed using packages provided by Bioconductor (*BiocManager version 3.10*) [69,70], a suite for analyzing high-throughput genomic data in R (*version 3.6.2*) [71] statistical programming language. All R code was executed through RStudio server (*version 1.2.5033*) [72]. Transcript-level quantification was estimated using tximeta (*version 1.4.3*) [73], an expansion of tximport (*version 1.14.0*) [74].

### *4.5. Di*ff*erential Gene Expression Analysis*

Transcript-level quantification data were processed for DGEA, using DESeq2 (*version 1.26.0*) [48]. Studies with more than two distinct conditions were split in order to analyze genes that were differentially expressed between cells exposed to a specific dose for a specific time point and their corresponding control samples. Exported lists containing statistically significant differentially expressed genes include metrics such as Log<sup>2</sup> Fold Change (Log2FC), *p*-values, and False Discovery Rate (FDR)-adjusted [75] *p*-values for each gene. The lists were further annotated using org.Hs.eg.db (*version 3.8.2*) [76] to include HGNC [77] gene symbols and gene names. The threshold for statistical significance was set at the adjusted *p*-value < 0.05, as non-adjusted *p*-values are not to be considered [75]. statistical significance.

#### *4.6. Meta-Analysis* Representation Analysis (ORA) [81] method which performs a statistical evaluation of the fraction of

based

*4.6. Meta-Analysis* 

DEG lists were further combined in a meta-analysis (Figure 4) to identify genes of differential expression across studies. In order to achieve optimal results, only DEG lists derived from similar condition groups regarding dose and time of post-irradiation collection were considered. To this end, we excluded the DEG list derived from the PRJNA436999 samples that were collected 6 h post-irradiation, leaving only the DEG list from the samples that were collected after more than 20 h. Finally, the DEG list acquired from the PRJNA494581 samples after 2 Gy irradiation was also excluded, as it was considered an outlier, due to its low DEG number. genes in a particular pathway found among the set of genes showing changes in expression. Our terms of interest include biological process GO terms [35], KEGG biological pathways [58], and genetargeting transcription factors. The statistical significance of each over-representation of biological terms was estimated using hyper-geometric distribution. *p*-values were FDR-adjusted, and terms with adjusted *p*-values < 0.05 were considered statistically significant. In order to investigate the interactome of the DEGs and identify possible underlying cell mechanisms, we constructed their Protein–Protein Interaction (PPI) network, using STRING (*version 11.0*) [82]. Edges of the network, corresponding to protein interactions, were determined solely

performed using the WEB-based Gene Set Analysis Toolkit (WebGestalt) [80]. We selected Over-

*Int. J. Mol. Sci.* **2020**, *21*, x FOR PEER REVIEW 11 of 17

excluded, as it was considered an outlier, due to its low DEG number.

from the weighted *z*-score sum, using the normal distribution function (*Φ*):

*4.7. Functional Enrichment Analysis and Gene Network Construction* 

= Φ

⎝

where *p<sup>i</sup>* is the DESeq2-derived *p*-value, and *n<sup>i</sup>* is the number of samples of study *i* and *k* the number of studies. Finally, *p*-values underwent FDR adjustment, and 0.05 was selected as threshold for

To evaluate the efficiency of our method, we compared the experimentally validated DEGs derived

from the studies the RNA-Seq data of which we chose for analysis with our own estimations.

<sup>⎛</sup><sup>∑</sup> 

Φ()

⎞

∑ <sup>⎠</sup>

DEG lists were further combined in a meta-analysis (Figure 4) to identify genes of differential expression across studies. In order to achieve optimal results, only DEG lists derived from similar condition groups regarding dose and time of post-irradiation collection were considered. To this end, we excluded the DEG list derived from the PRJNA436999 samples that were collected 6 h postirradiation, leaving only the DEG list from the samples that were collected after more than 20 h. Finally, the DEG list acquired from the PRJNA494581 samples after 2 Gy irradiation was also

Our meta-analysis combined unadjusted *p*-values of each study for every gene, using a weighted version of Stouffer meta-analysis [78], proposed by Mosteller and Bush [79]. For each gene and study, its two-tail unadjusted *p*-value was converted into an one-tail *p*-value, based on the sign of the corresponding Log2FC. For each one-tailed *p*-value, the corresponding *z*-score was calculated using the inverse normal distribution function (*Φ*−1). Meta-analysis *p*-value for each gene was calculated

**Figure 4.** Functional enrichment analysis workflow. Differential Gene Expression Analysis (DGEA) derived genes, subdued to meta-analysis, yielded the final DEGs of utmost statistical significance. Inputting these DEGs into WebGestalt and STRING generated the resulting enriched terms and PPI network, respectively. **Figure 4.** Functional enrichment analysis workflow. Differential Gene Expression Analysis (DGEA)-derived genes, subdued to meta-analysis, yielded the final DEGs of utmost statistical significance. Inputting these DEGs into WebGestalt and STRING generated the resulting enriched terms and PPI network, respectively.

**Supplementary Materials:** Supplementary materials can be found at www.mdpi.com/xxx/s1. Our meta-analysis combined unadjusted *p*-values of each study for every gene, using a weighted version of Stouffer meta-analysis [78], proposed by Mosteller and Bush [79]. For each gene and study, its two-tail unadjusted *p*-value was converted into an one-tail *p*-value, based on the sign of the corresponding Log2FC. For each one-tailed *p*-value, the corresponding *z*-score was calculated using the inverse normal distribution function (Φ−<sup>1</sup> ). Meta-analysis *p*-value for each gene was calculated from the weighted *z*-score sum, using the normal distribution function (Φ):

$$p = \Phi\left(\frac{\sum\_{i=1}^k n\_i \Phi^{-1}(p\_i)}{\sqrt{\sum\_{i=1}^k n\_i^2}}\right)$$

where *p<sup>i</sup>* is the DESeq2-derived *p*-value, and *n<sup>i</sup>* is the number of samples of study *i* and *k* the number of studies. Finally, *p*-values underwent FDR adjustment, and 0.05 was selected as threshold for statistical significance.

To evaluate the efficiency of our method, we compared the experimentally validated DEGs derived from the studies the RNA-Seq data of which we chose for analysis with our own estimations.

### *4.7. Functional Enrichment Analysis and Gene Network Construction*

To highlight the biological background of DEGs, a functional enrichment analysis (Figure 4) was performed using the WEB-based Gene Set Analysis Toolkit (WebGestalt) [80]. We selected Over-Representation Analysis (ORA) [81] method which performs a statistical evaluation of the fraction of genes in a particular pathway found among the set of genes showing changes in expression. Our terms of interest include biological process GO terms [35], KEGG biological pathways [58], and gene-targeting transcription factors. The statistical significance of each over-representation of biological terms was estimated using hyper-geometric distribution. *p*-values were FDR-adjusted, and terms with adjusted *p*-values < 0.05 were considered statistically significant.

In order to investigate the interactome of the DEGs and identify possible underlying cell mechanisms, we constructed their Protein–Protein Interaction (PPI) network, using STRING (*version 11.0*) [82]. Edges of the network, corresponding to protein interactions, were determined solely based on text mining sources with high confidence (Table A1).

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/1422-0067/21/6/1938/ s1. Table S1: Statistically significant DEGs (Adj.*p*-value<0.05) derived from DGEA of RNA-Seq data from human iPSC-Derived Cardiomyocytes after X-ray irradiation, using DESeq2. The experiment consists of 3 control samples and 3 irradiated with 5 Gy X-ray radiation for 48 hours [Bioproject: PRJNA421022]; Table S2: Statistically significant DEGs (Adj.*p*-value<0.05) derived from DGEA of RNA-Seq data from Primary Human Lung Fibroblasts (IMR90) after X-ray irradiation, using DESeq2. The experiment consists of 3 control samples and 3 irradiated with 2 Gy X-ray radiation for 24 hours [Bioproject: PRJNA436999]; Table S3: Statistically significant DEGs (Adj.*p*-value<0.05) derived from DGEA of RNA-Seq data from Human Lens Epithelial Cells after X-ray irradiation, using DESeq2. The experiment consists of 5 control samples and 5 irradiated with 5 Gy X-ray radiation for 20 hours [Bioproject: PRJNA494581]; Table S4: Statistically significant DEGs derived from a meta-analysis comparing DESeq2 outputs from irradiated and control samples of three datasets [PRJNA421022, PRJNA436999, PRJNA494581].

**Author Contributions:** Conceptualization, I.M. and A.G.G.; methodology, I.M., D.S.K., T.-D.M., C.V., E.G., D.M. and K.V.K.; software, I.M., D.S.K., T.-D.M.; validation, D.S.K. and T.-D.M.; formal analysis, I.M., K.V.K. and T.-D.M.; investigation, D.S.K. and E.G.; data curation, D.S.K.; writing—original draft preparation, I.M., A.G.G., D.S.K. and T.-D.M.; writing—review and editing, I.M., A.G.G., D.S.K. and T.-D.M.; visualization, D.S.K., T.-D.M.; C.V., D.M., E.G. and K.V.K.; supervision, I.M.; project administration, I.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Abbreviations**


### **Appendix A**

**Table A1.** Specific arguments used during the analysis. Unless explicitly specified otherwise, default parameters were used.


### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
