Metascape Gene List Analysis Report

metascape.org [1]

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.
Metascape only visualizes the top 20 clusters. Up to 100 enriched clusters can be viewed here.
The top-level Gene Ontology biological processes can be viewed here.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2023-09-01). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.

Table 1. Statistics of input gene lists.
Name | Total | Unique
MyList | 100 | 100
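
The identifier collapsing described above can be sketched as follows. This is a minimal illustration, not Metascape's implementation: the function name `collapse_to_entrez`, the toy identifiers, and the toy Entrez IDs are all hypothetical, and counting "Total" as the number of mapped input identifiers is an assumption, since the report does not define that column.

```python
def collapse_to_entrez(identifiers, id_to_entrez):
    """Map input identifiers to Entrez gene IDs and collapse duplicates.

    Returns (total, unique_ids): the number of input identifiers that could
    be mapped, and the sorted list of distinct Entrez gene IDs they map to.
    """
    mapped = [id_to_entrez[i] for i in identifiers if i in id_to_entrez]
    unique_ids = sorted(set(mapped))
    return len(mapped), unique_ids


# Toy mapping (hypothetical values): two identifiers point at the same gene.
toy_map = {"SYMBOL_A": 101, "ALIAS_A": 101, "SYMBOL_B": 202}
total, unique_ids = collapse_to_entrez(
    ["SYMBOL_A", "ALIAS_A", "SYMBOL_B", "UNKNOWN_ID"], toy_map
)
print(total, len(unique_ids))
```

Here `SYMBOL_A` and `ALIAS_A` are treated as a single gene downstream, and the unmapped identifier is dropped.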

Gene Annotation

The following annotations were retrieved from the latest version of the database (last updated on 2023-09-01) (Table 2).

Table 2. Gene annotations extracted.
Name | Type | Description
Gene Symbol | Description | Primary HUGO gene symbol.
Description | Description | Short description.
Biological Process (GO) | Function/Location | Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Kinase Class (UniProt) | Function/Location | Detailed kinase classes.
Protein Function (Protein Atlas) | Function/Location | Protein Function (Protein Atlas)
Subcellular Location (Protein Atlas) | Function/Location | Subcellular Location (Protein Atlas)
Drug (DrugBank) | Genotype/Phenotype/Disease | Drug information for the given gene as target.
Canonical Pathways | Ontology | Canonical Pathways
Hallmark Gene Sets | Ontology | Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, CORUM, WikiPathways, and PANTHER Pathway. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution [2], and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing [3]. Kappa scores [4] are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
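
The statistics named above can be illustrated with a short sketch. It assumes the textbook upper-tail hypergeometric test and the standard Benjamini-Hochberg step-up adjustment; Metascape's exact implementation is not shown in this report and may differ in detail. All function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the input list, k: input genes annotated to the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def enrichment_factor(N, K, n, k):
    """Observed count divided by the count expected by chance (n * K / N)."""
    return k / (n * K / N)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def passes_filter(N, K, n, k):
    """Thresholds quoted in the text: p < 0.01, count >= 3, enrichment > 1.5."""
    return (
        k >= 3
        and hypergeom_pvalue(N, K, n, k) < 0.01
        and enrichment_factor(N, K, n, k) > 1.5
    )


# Hypothetical term: 200 of 20000 background genes, 10 of 100 input genes.
print(passes_filter(20000, 200, 100, 10))
```

Exact integer binomials keep the tail sum free of rounding error until the final division, which is convenient for small examples. For many terms a vectorized statistics library would be the more practical choice.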

Table 3. Top 20 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the base-10 logarithm of the p-value. "Log10(q)" is the base-10 logarithm of the multi-test adjusted p-value.
GO | Category | Description | Count | % | Log10(P) | Log10(q)
WP2882 | WikiPathways | Nuclear receptors meta-pathway | 10 | 10.00 | -7.03 | -3.11
GO:0040008 | GO Biological Processes | regulation of growth | 13 | 13.00 | -6.90 | -3.11
GO:0032119 | GO Biological Processes | sequestering of zinc ion | 3 | 3.00 | -6.86 | -3.11
hsa05200 | KEGG Pathway | Pathways in cancer | 11 | 11.00 | -5.82 | -2.26
GO:0010038 | GO Biological Processes | response to metal ion | 9 | 9.00 | -5.51 | -2.01
R-HSA-1280218 | Reactome Gene Sets | Adaptive Immune System | 12 | 12.00 | -5.05 | -1.83
R-HSA-211859 | Reactome Gene Sets | Biological oxidations | 7 | 7.00 | -5.03 | -1.83
R-HSA-6785807 | Reactome Gene Sets | Interleukin-4 and Interleukin-13 signaling | 5 | 5.00 | -4.52 | -1.41
hsa05202 | KEGG Pathway | Transcriptional misregulation in cancer | 6 | 6.00 | -4.35 | -1.29
GO:0071900 | GO Biological Processes | regulation of protein serine/threonine kinase activity | 7 | 7.00 | -4.34 | -1.29
GO:0097006 | GO Biological Processes | regulation of plasma lipoprotein particle levels | 4 | 4.00 | -4.33 | -1.29
R-HSA-2022090 | Reactome Gene Sets | Assembly of collagen fibrils and other multimeric structures | 4 | 4.00 | -4.30 | -1.29
GO:0009725 | GO Biological Processes | response to hormone | 11 | 11.00 | -4.30 | -1.29
WP5094 | WikiPathways | Orexin receptor pathway | 6 | 6.00 | -4.25 | -1.27
GO:0042445 | GO Biological Processes | hormone metabolic process | 6 | 6.00 | -4.19 | -1.27
hsa04927 | KEGG Pathway | Cortisol synthesis and secretion | 4 | 4.00 | -4.19 | -1.27
GO:0006656 | GO Biological Processes | phosphatidylcholine biosynthetic process | 3 | 3.00 | -4.12 | -1.27
WP2880 | WikiPathways | Glucocorticoid receptor pathway | 4 | 4.00 | -4.06 | -1.27
R-HSA-9759194 | Reactome Gene Sets | Nuclear events mediated by NFE2L2 | 4 | 4.00 | -3.86 | -1.12
R-HSA-453279 | Reactome Gene Sets | Mitotic G1 phase and G1/S transition | 5 | 5.00 | -3.86 | -1.12

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape [5], where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be interactively viewed in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser by clicking on the web icon. For clarity, term labels are only shown for one term per cluster, so it is recommended to use Cytoscape or a browser to visualize the network in order to inspect all node labels. The network can also be exported to a PDF file within Cytoscape, and the labels then edited using Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
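
The similarity rule behind the network edges can be sketched as below. This assumes Cohen's kappa computed on binary gene membership (in the term or not) over a chosen gene universe. Which universe Metascape uses is not stated in this report, so the `universe` argument, the function names, and the toy terms are hypothetical.

```python
from itertools import combinations


def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the membership agreement of two terms."""
    universe = set(universe)
    if not universe:
        raise ValueError("universe must contain at least one gene")
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    both = len(a & b)
    neither = n - len(a | b)
    p_observed = (both + neither) / n
    p_expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both terms cover all genes or none: identical membership
    return (p_observed - p_expected) / (1.0 - p_expected)


def similarity_edges(term_genes, universe, threshold=0.3):
    """Return (term1, term2, kappa) for every pair with kappa above threshold."""
    edges = []
    for t1, t2 in combinations(sorted(term_genes), 2):
        score = kappa_score(term_genes[t1], term_genes[t2], universe)
        if score > threshold:
            edges.append((t1, t2, score))
    return edges


# Toy terms over a universe of ten hypothetical genes.
terms = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9}}
for t1, t2, score in similarity_edges(terms, range(10)):
    print(t1, t2, round(score, 3))
```

Unlike a raw overlap count, kappa corrects for agreement expected by chance, so two large terms that share a few genes incidentally do not get connected.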

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: STRING [6], BioGRID [7], OmniPath [8], and InWeb_IM [9]. Only physical interactions in STRING (physical score > 0.132) and BioGRID are used. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm [10] is applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
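
The network construction step above can be sketched as follows. The sketch covers only the filtering that precedes MCODE, not MCODE itself. The function names and toy gene labels are hypothetical, and treating the 3 to 500 size range as inclusive is an assumption.

```python
def physical_subnetwork(gene_list, interactions):
    """Keep interactions whose two partners are both in the gene list.

    Returns (nodes, edges): proteins with at least one retained interaction,
    and the undirected edges between them (self-interactions are dropped).
    """
    members = set(gene_list)
    edges = {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in members and b in members and a != b
    }
    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges


def eligible_for_mcode(nodes, min_size=3, max_size=500):
    """Size check quoted in the text, assumed here to be inclusive."""
    return min_size <= len(nodes) <= max_size


# Toy data: G4 has no partner inside the list, so it is left out.
genes = ["G1", "G2", "G3", "G4"]
pairs = [("G1", "G2"), ("G2", "G3"), ("G3", "G2"), ("G4", "G9"), ("G1", "G1")]
nodes, edges = physical_subnetwork(genes, pairs)
print(sorted(nodes), sorted(edges), eligible_for_mcode(nodes))
```

Sorting each pair before storing it in a set collapses the two directions of the same interaction into one edge.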

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.
GO | Description | Log10(P)
hsa05200 | Pathways in cancer | -7.4
GO:0001558 | regulation of cell growth | -7.2
GO:0040008 | regulation of growth | -6.8

MCODE | GO | Description | Log10(P)
MCODE_1 | hsa04915 | Estrogen signaling pathway | -5.1
MCODE_1 | R-HSA-9658195 | Leishmania infection | -4.9
MCODE_1 | R-HSA-9824443 | Parasitic Infection Pathways | -4.9
MCODE_2 | hsa05204 | Chemical carcinogenesis - DNA adducts | -7.9
MCODE_2 | hsa00982 | Drug metabolism - cytochrome P450 | -7.9
MCODE_2 | hsa00980 | Metabolism of xenobiotics by cytochrome P450 | -7.8

Quality Control and Association Analysis

Gene list enrichments are identified in the following ontology categories: COVID, Cell_Type_Signatures, DisGeNET, PaGenBase, TRRUST, Transcription_Factor_Targets. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. The top few enriched clusters (one term per cluster) are shown in Figures 4-9. The algorithm used here is the same as that used for pathway and process enrichment analysis.

Figure 4. Summary of enrichment analysis in COVID [11].

GO | Description | Count | % | Log10(P) | Log10(q)
COVID347 | RNA_Wilk_B-cells_patient-C6_Up | 6 | 6 | -5.70 | -3.10
COVID258 | RNA_Wilk_CD14+Monocytes_patient-C7_Up | 4 | 4 | -5.40 | -2.90
COVID309 | RNA_Wilk_CD8+T-cells_patient-C3_Up | 4 | 4 | -5.40 | -2.90
COVID052 | RNA_Xiong_BALF_Up | 8 | 8 | -5.10 | -2.70
COVID337 | RNA_Wilk_B-cells_patient-C1B-severe_Up | 5 | 5 | -4.60 | -2.30
COVID036 | RNA_Sun_Calu-3_0h_Up | 7 | 7 | -4.20 | -2.00
COVID363 | Interactome_Laurent_HEK293_24h_E | 7 | 7 | -4.20 | -2.00
COVID293 | RNA_Wilk_NK-cells_patient-C3_Up | 3 | 3 | -4.10 | -2.00
COVID271 | RNA_Wilk_CD16+Monocytes_patient-C7_Up | 3 | 3 | -4.10 | -1.90
COVID228 | Translatome_Bojkova_Caco-2_24h_Down | 5 | 5 | -3.60 | -1.60
COVID071 | Proteome_Bouhaddou_Vero_E6_24h_Down | 4 | 4 | -3.50 | -1.60
COVID264 | RNA_Wilk_CD16+Monocytes_patient-C3_Up | 3 | 3 | -3.50 | -1.60
COVID313 | RNA_Wilk_CD8+T-cells_patient-C5_Up | 3 | 3 | -3.40 | -1.50
COVID007 | RNA_Blanco-Melo_A549_Down | 6 | 6 | -3.30 | -1.40
COVID038 | RNA_Sun_Calu-3_12h_Up | 6 | 6 | -3.30 | -1.40
COVID040 | RNA_Sun_Calu-3_24h_Up | 6 | 6 | -3.30 | -1.40
COVID243 | RNA_Riva_Vero-E6_24h_Up | 6 | 6 | -3.30 | -1.40
COVID385 | Interactome_Laurent_HEK293_24h_ORF7A | 6 | 6 | -3.30 | -1.40
COVID386 | Interactome_Laurent_HEK293_24h_ORF7B | 6 | 6 | -3.30 | -1.40
COVID345 | RNA_Wilk_B-cells_patient-C5_Up | 4 | 4 | -3.20 | -1.30

Figure 5. Summary of enrichment analysis in Cell Type Signatures [12].

GO | Description | Count | % | Log10(P) | Log10(q)
M41656 | TRAVAGLINI LUNG MUCOUS CELL | 9 | 9 | -9.40 | -5.80
M41659 | TRAVAGLINI LUNG ALVEOLAR EPITHELIAL TYPE 1 CELL | 12 | 12 | -8.20 | -4.90
M39111 | AIZARANI LIVER C7 EPCAM POS BILE DUCT CELLS 2 | 9 | 9 | -7.30 | -4.30
M40175 | DESCARTES FETAL EYE CORNEAL AND CONJUNCTIVAL EPITHELIAL CELLS | 9 | 9 | -6.90 | -3.90
M39174 | MURARO PANCREAS ACINAR CELL | 14 | 14 | -6.90 | -3.90
M39209 | HAY BONE MARROW STROMAL | 14 | 14 | -6.60 | -3.80
M41713 | FAN OVARY CL11 MURAL GRANULOSA CELL | 11 | 11 | -6.60 | -3.70
M40007 | BUSSLINGER GASTRIC PREZYMOGENIC CELLS | 5 | 5 | -5.80 | -3.20
M41669 | TRAVAGLINI LUNG BRONCHIAL VESSEL 2 CELL | 8 | 8 | -5.50 | -2.90
M39303 | CUI DEVELOPING HEART C6 EPICARDIAL CELL | 7 | 7 | -5.40 | -2.90
M40292 | DESCARTES FETAL SPLEEN MESOTHELIAL CELLS | 7 | 7 | -5.40 | -2.90
M41655 | TRAVAGLINI LUNG GOBLET CELL | 6 | 6 | -5.00 | -2.60
M39278 | DURANTE ADULT OLFACTORY NEUROEPITHELIUM SUSTENTACULAR CELLS | 4 | 4 | -4.70 | -2.40
M41705 | FAN OVARY CL3 MATURE CUMULUS GRANULOSA CELL 1 | 7 | 7 | -4.70 | -2.40
M41697 | TRAVAGLINI LUNG EREG DENDRITIC CELL | 10 | 10 | -4.70 | -2.40
M41700 | TRAVAGLINI LUNG OLR1 CLASSICAL MONOCYTE CELL | 11 | 11 | -4.50 | -2.30
M39225 | LAKE ADULT KIDNEY C6 PROXIMAL TUBULE EPITHELIAL CELLS FIBRINOGEN POS S3 | 6 | 6 | -4.50 | -2.30
M39321 | CUI DEVELOPING HEART VASCULAR ENDOTHELIAL CELL | 6 | 6 | -4.30 | -2.10
M41717 | FAN OVARY CL15 SMALL ANTRAL FOLLICLE GRANULOSA CELL | 10 | 10 | -4.30 | -2.10
M40229 | DESCARTES FETAL LIVER MYELOID CELLS | 6 | 6 | -4.20 | -2.00

Figure 6. Summary of enrichment analysis in DisGeNET [13].

GO | Description | Count | % | Log10(P) | Log10(q)
C0521158 | Recurrent tumor | 20 | 20 | -12.00 | -8.30
C0019158 | Hepatitis | 18 | 18 | -11.00 | -7.30
C0860207 | Drug-Induced Liver Disease | 16 | 16 | -11.00 | -6.70
C3203102 | Idiopathic pulmonary arterial hypertension | 18 | 18 | -10.00 | -6.30
C0042373 | Vascular Diseases | 17 | 17 | -10.00 | -6.20
C4086152 | Childhood Astrocytoma | 16 | 16 | -9.70 | -6.00
C0030297 | Pancreatic Neoplasm | 17 | 17 | -9.30 | -5.70
C0031099 | Periodontitis | 16 | 16 | -9.10 | -5.60
C0007107 | Malignant neoplasm of larynx | 12 | 12 | -8.70 | -5.30
C1868683 | B-CELL MALIGNANCY, LOW-GRADE | 12 | 12 | -8.70 | -5.30
C0025286 | Meningioma | 15 | 15 | -8.60 | -5.20
C0153381 | Malignant neoplasm of mouth | 16 | 16 | -8.40 | -5.10
C0015672 | Fatigue | 16 | 16 | -8.40 | -5.10
C0278488 | Carcinoma breast stage IV | 14 | 14 | -8.20 | -4.90
C0007785 | Cerebral Infarction | 15 | 15 | -8.10 | -4.80
C0154830 | Proliferative diabetic retinopathy | 9 | 9 | -8.00 | -4.80
C0038525 | Subarachnoid Hemorrhage | 13 | 13 | -7.90 | -4.70
C0279550 | Adult Rhabdomyosarcoma | 13 | 13 | -7.90 | -4.70
C0220611 | Childhood Rhabdomyosarcoma | 13 | 13 | -7.80 | -4.60
C0023465 | Acute monocytic leukemia | 14 | 14 | -7.70 | -4.50

Figure 7. Summary of enrichment analysis in PaGenBase [14].

GO | Description | Count | % | Log10(P) | Log10(q)
PGB:00002 | Cell-specific: HEPG2 | 9 | 9 | -4.90 | -2.50
PGB:00081 | Cell-specific: Bronchial Epithelial Cells | 5 | 5 | -4.00 | -1.90
PGB:00071 | Cell-specific: Vaginal Epithelial | 3 | 3 | -4.00 | -1.90
PGB:00101 | Tissue-specific: Colorectal adenocarcinoma | 3 | 3 | -3.50 | -1.50
PGB:00082 | Cell-specific: Breast cell | 3 | 3 | -3.10 | -1.20
PGB:00018 | Tissue-specific: lung | 7 | 7 | -2.90 | -1.10
PGB:00014 | Cell-specific: DRG | 6 | 6 | -2.40 | -0.75
PGB:00060 | Tissue-specific: retinoblastoma | 3 | 3 | -2.40 | -0.75
PGB:00022 | Tissue-specific: adrenal gland | 4 | 4 | -2.30 | -0.72
PGB:00004 | Tissue-specific: kidney | 6 | 6 | -2.30 | -0.66
PGB:00034 | Cell-specific: OVR278E | 3 | 3 | -2.00 | -0.48

Figure 8. Summary of enrichment analysis in TRRUST.

GO | Description | Count | % | Log10(P) | Log10(q)
TRR01256 | Regulated by: SP1 | 19 | 19 | -14.00 | -9.50
TRR01158 | Regulated by: RELA | 11 | 11 | -7.80 | -4.70
TRR00875 | Regulated by: NFKB1 | 11 | 11 | -7.80 | -4.60
TRR00484 | Regulated by: HIF1A | 6 | 6 | -6.00 | -3.30
TRR01277 | Regulated by: STAT3 | 6 | 6 | -4.80 | -2.40
TRR01557 | Regulated by: ZEB1 | 3 | 3 | -4.60 | -2.30
TRR00366 | Regulated by: FOXO3 | 3 | 3 | -4.40 | -2.20
TRR00869 | Regulated by: NFE2L2 | 3 | 3 | -4.40 | -2.10
TRR00645 | Regulated by: JUN | 5 | 5 | -3.60 | -1.60
TRR00908 | Regulated by: NR3C1 | 3 | 3 | -3.10 | -1.30
TRR01259 | Regulated by: SP3 | 4 | 4 | -3.10 | -1.20
TRR00270 | Regulated by: EP300 | 3 | 3 | -2.90 | -1.10
TRR00230 | Regulated by: E2F1 | 4 | 4 | -2.90 | -1.10
TRR00110 | Regulated by: CEBPB | 3 | 3 | -2.80 | -1.10
TRR01512 | Regulated by: USF1 | 3 | 3 | -2.70 | -0.95
TRR00466 | Regulated by: HDAC1 | 3 | 3 | -2.60 | -0.90
TRR00280 | Regulated by: ETS1 | 3 | 3 | -2.60 | -0.88
TRR01275 | Regulated by: STAT1 | 3 | 3 | -2.40 | -0.76

Figure 9. Summary of enrichment analysis in Transcription Factor Targets.

GO | Description | Count | % | Log10(P) | Log10(q)
M13012 | TGCTGAY UNKNOWN | 10 | 10 | -4.90 | -2.50
M15719 | YYCATTCAWW UNKNOWN | 6 | 6 | -4.40 | -2.10
M29904 | BCL6B TARGET GENES | 5 | 5 | -3.60 | -1.60
M3647 | GR Q6 01 | 6 | 6 | -3.50 | -1.50
M3403 | GTGACGY E4F1 Q6 | 9 | 9 | -3.40 | -1.50
M30176 | SOX3 TARGET GENES | 8 | 8 | -3.30 | -1.40
M11838 | FOXD3 01 | 5 | 5 | -3.20 | -1.40
M551 | TEF1 Q6 | 5 | 5 | -3.00 | -1.20
M6378 | TTCYNRGAA STAT5B 01 | 6 | 6 | -3.00 | -1.20
M11587 | CHX10 01 | 5 | 5 | -3.00 | -1.20
M18386 | STAT5A 02 | 4 | 4 | -2.90 | -1.10
M7349 | WYAAANNRNNNGCG UNKNOWN | 3 | 3 | -2.90 | -1.10
M3263 | STAT3 02 | 4 | 4 | -2.80 | -1.10
M17318 | CGTSACG PAX3 B | 4 | 4 | -2.80 | -1.10
M19455 | PR 01 | 4 | 4 | -2.80 | -1.10
M16200 | ISRE 01 | 5 | 5 | -2.80 | -1.10
M8585 | TFIIA Q6 | 5 | 5 | -2.80 | -1.00
M5 | STAT 01 | 5 | 5 | -2.80 | -1.00
M17883 | STAT5A 01 | 5 | 5 | -2.80 | -1.00
M17779 | PAX2 02 | 5 | 5 | -2.70 | -1.00

References

  1. Zhou et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications (2019) 10(1):1523.
  2. Zar, J.H. Biostatistical Analysis 1999 4th edn., NJ Prentice Hall, pp. 523
  3. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine (1990) 9:811-818.
  4. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:27-46.
  5. Shannon P. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498-2504.
  6. Szklarczyk D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2019) 47:D607-613.
  7. Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-539.
  8. Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods. (2016) 13:966-967.
  9. Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods. (2017) 14:61-64.
  10. Bader, G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics (2003) 4:2.
  11. https://metascape.org/COVID.
  12. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-15550 (2005).
  13. Pinero J, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research 45, D833-D839 (2017).
  14. Pan JB, et al. PaGenBase: a pattern gene database for the global and dynamic understanding of gene function. PLoS One 8, e80747 (2013).