Metascape Gene List Analysis Report

metascape.org [1]

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.
Metascape only visualizes the top 20 clusters. Up to 100 enriched clusters can be viewed here.
The top-level Gene Ontology biological processes can be viewed here.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2023-09-01). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.

Table 1. Statistics of input gene lists.
Name | Total | Unique
MyList | 100 | 100
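
The collapsing of identifiers described above can be sketched as follows. This is a minimal illustration, not Metascape's implementation: the `to_entrez` lookup and the function name are hypothetical, and the way Table 1 counts "Total" is assumed here to be the number of successfully mapped identifiers.

```python
def summarize_gene_list(identifiers, to_entrez):
    """Map identifiers to Entrez gene IDs and count total vs. unique genes."""
    # Identifiers missing from the lookup are skipped (an assumption).
    mapped = [to_entrez[i] for i in identifiers if i in to_entrez]
    return {"total": len(mapped), "unique": len(set(mapped))}

# Hypothetical lookup: a gene symbol and a UniProt accession for the same gene
# resolve to one Entrez gene ID, so they count as a single gene downstream.
to_entrez = {"TP53": 7157, "P04637": 7157, "VDR": 7421}
summary = summarize_gene_list(["TP53", "P04637", "VDR"], to_entrez)
print(summary)
```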

Gene Annotation

The annotations listed in Table 2 were retrieved from the latest version of the database (last updated on 2023-09-01).

Table 2. Gene annotations extracted
Name | Type | Description
Gene Symbol | Description | Primary HUGO gene symbol.
Description | Description | Short description.
Biological Process (GO) | Function/Location | Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Kinase Class (UniProt) | Function/Location | Detailed kinase classes.
Protein Function (Protein Atlas) | Function/Location | Protein Function (Protein Atlas)
Subcellular Location (Protein Atlas) | Function/Location | Subcellular Location (Protein Atlas)
Drug (DrugBank) | Genotype/Phenotype/Disease | Drug information for the given gene as target.
Canonical Pathways | Ontology | Canonical Pathways
Hallmark Gene Sets | Ontology | Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, CORUM, WikiPathways, and PANTHER Pathway. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution [2], and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing [3]. Kappa scores [4] are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
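
The statistics named above can be illustrated with a short, standard-library-only sketch. It is not Metascape's code: the function names are hypothetical, and the background size (20000 genes) and term size (200 genes) in the usage lines are assumed values chosen only for illustration.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def enrichment_factor(k, N, K, n):
    """Observed count divided by the count expected by chance (n * K / N)."""
    return k / (n * K / N)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotone q-values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    qvalues = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Assumed numbers: 8 of 100 input genes hit a 200-gene term, 20000-gene background.
p = hypergeom_sf(8, 20000, 200, 100)
ef = enrichment_factor(8, 20000, 200, 100)
print(p, ef)
```

A term would then be retained only if `p < 0.01`, the count is at least 3, and `ef > 1.5`, matching the thresholds stated in the text.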

Table 3. Top 20 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.
GO | Category | Description | Count | % | Log10(P) | Log10(q)
WP2877 | WikiPathways | Vitamin D receptor pathway | 8 | 8.00 | -6.68 | -2.34
M5885 | Canonical Pathways | NABA MATRISOME ASSOCIATED | 13 | 13.00 | -5.93 | -1.88
GO:0030879 | GO Biological Processes | mammary gland development | 6 | 6.00 | -5.46 | -1.59
R-HSA-1474244 | Reactome Gene Sets | Extracellular matrix organization | 8 | 8.00 | -5.15 | -1.49
GO:1903530 | GO Biological Processes | regulation of secretion by cell | 10 | 10.00 | -4.73 | -1.34
P06664 | PANTHER Pathway | Gonadotropin-releasing hormone receptor pathway | 3 | 3.00 | -4.72 | -1.34
GO:0060349 | GO Biological Processes | bone morphogenesis | 5 | 5.00 | -4.68 | -1.34
GO:0002067 | GO Biological Processes | glandular epithelial cell differentiation | 4 | 4.00 | -4.19 | -1.13
GO:0071396 | GO Biological Processes | cellular response to lipid | 9 | 9.00 | -4.17 | -1.13
WP3942 | WikiPathways | PPAR signaling pathway | 4 | 4.00 | -4.11 | -1.12
GO:0071456 | GO Biological Processes | cellular response to hypoxia | 5 | 5.00 | -4.09 | -1.12
GO:0035239 | GO Biological Processes | tube morphogenesis | 10 | 10.00 | -4.07 | -1.12
hsa04024 | KEGG Pathway | cAMP signaling pathway | 6 | 6.00 | -3.98 | -1.10
R-HSA-2871796 | Reactome Gene Sets | FCERI mediated MAPK activation | 3 | 3.00 | -3.79 | -1.03
GO:0051051 | GO Biological Processes | negative regulation of transport | 8 | 8.00 | -3.78 | -1.02
GO:0045576 | GO Biological Processes | mast cell activation | 3 | 3.00 | -3.75 | -1.01
WP2840 | WikiPathways | Hair follicle development: cytodifferentiation - part 3 of 3 | 4 | 4.00 | -3.70 | -1.01
GO:0006576 | GO Biological Processes | biogenic amine metabolic process | 4 | 4.00 | -3.70 | -1.01
GO:0007631 | GO Biological Processes | feeding behavior | 4 | 4.00 | -3.66 | -0.98
GO:0031349 | GO Biological Processes | positive regulation of defense response | 7 | 7.00 | -3.33 | -0.84
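
As a reading aid for the "Log10(P)" and "%" columns, the following sketch converts a logged value back to a plain p-value and recomputes a percentage. The function names are hypothetical, and the denominator of 100 annotated input genes is an assumption that is consistent with a count of 8 being reported as 8.00%.

```python
def p_from_log10(log10_p):
    """Convert a Log10(P) or Log10(q) table entry back to a plain probability."""
    return 10.0 ** log10_p

def percent_of_input(count, annotated_genes):
    """Percentage of annotated input genes that belong to a term."""
    return 100.0 * count / annotated_genes

# Values taken from the first row of Table 3 (Vitamin D receptor pathway).
top_p = p_from_log10(-6.68)
top_percent = percent_of_input(8, 100)
print(f"p = {top_p:.2e}, % = {top_percent:.2f}")
```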

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape [5], where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be interactively viewed in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser by clicking on the web icon. For clarity, term labels are only shown for one term per cluster, so it is recommended to use Cytoscape or a browser to visualize the network in order to inspect all node labels. We can also export the network into a PDF file within Cytoscape, and then edit the labels using Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
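
The similarity-based edges described above can be sketched with Cohen's kappa computed over gene membership. This is an illustration under stated assumptions, not Metascape's implementation: the gene universe over which agreement is measured is assumed to be the input gene list, and the function names and toy gene sets are hypothetical.

```python
from itertools import combinations

def kappa(genes_a, genes_b, universe):
    """Cohen's kappa for two terms, treating each gene as in/out of each term."""
    n = len(universe)
    a = genes_a & universe
    b = genes_b & universe
    both = len(a & b)
    neither = n - len(a | b)
    observed = (both + neither) / n
    p_a, p_b = len(a) / n, len(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 0.0  # degenerate case: both terms empty or both cover everything
    return (observed - expected) / (1 - expected)

def similarity_edges(terms, universe, threshold=0.3):
    """Connect every pair of terms whose kappa exceeds the threshold."""
    return [
        (name_a, name_b)
        for (name_a, set_a), (name_b, set_b) in combinations(terms.items(), 2)
        if kappa(set_a, set_b, universe) > threshold
    ]

# Toy example: genes are integers, terms A and B overlap heavily, C is disjoint.
universe = set(range(10))
terms = {"A": {0, 1, 2, 3}, "B": {0, 1, 2, 4}, "C": {7, 8, 9}}
edges = similarity_edges(terms, universe)
print(edges)
```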

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: STRING [6], BioGrid [7], OmniPath [8], and InWeb_IM [9]. Only physical interactions in STRING (physical score > 0.132) and BioGrid are used. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm [10] has been applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
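
The network construction step can be sketched as below. The sketch only covers the filtering to list members and the size check that gates MCODE; it does not implement MCODE itself. The function names and placeholder gene labels are hypothetical, and treating "between 3 and 500" as inclusive bounds is an assumption.

```python
def ppi_subnetwork(input_genes, interactions):
    """Keep interactions whose two partners are both distinct members of the list."""
    genes = set(input_genes)
    edges = {
        frozenset((a, b))
        for a, b in interactions
        if a in genes and b in genes and a != b
    }
    # Nodes are the proteins that interact with at least one other list member.
    nodes = set().union(*edges) if edges else set()
    return nodes, edges

def eligible_for_mcode(nodes, low=3, high=500):
    """Size gate applied before searching for densely connected components."""
    return low <= len(nodes) <= high

# Placeholder labels, not real interaction data.
interactions = [("G1", "G2"), ("G2", "G3"), ("G4", "X9"), ("G1", "G1")]
nodes, edges = ppi_subnetwork(["G1", "G2", "G3", "G4"], interactions)
print(sorted(nodes), eligible_for_mcode(nodes))
```

Here `G4` drops out because its only partner is outside the list, and the self-interaction of `G1` is ignored.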

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.
GO | Description | Log10(P)
M5885 | NABA MATRISOME ASSOCIATED | -7.0
M5883 | NABA SECRETED FACTORS | -6.2
GO:0050679 | positive regulation of epithelial cell proliferation | -6.1

Quality Control and Association Analysis

Gene list enrichments are identified in the following ontology categories: COVID, Cell_Type_Signatures, DisGeNET, PaGenBase, TRRUST, Transcription_Factor_Targets. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. The top few enriched clusters (one term per cluster) are shown in Figures 4-9. The algorithm used here is the same as that used for pathway and process enrichment analysis.
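
The filtering and "one term per cluster" selection can be sketched as follows. The record layout (`p`, `count`, `enrichment`, `cluster` keys), the function names, and the example values are hypothetical; only the three thresholds come from the text.

```python
def passes_thresholds(term, max_p=0.01, min_count=3, min_enrichment=1.5):
    """Apply the p-value, count, and enrichment-factor cutoffs from the text."""
    return (
        term["p"] < max_p
        and term["count"] >= min_count
        and term["enrichment"] > min_enrichment
    )

def cluster_representatives(terms):
    """Pick the most significant surviving term in each cluster."""
    best = {}
    for term in terms:
        if not passes_thresholds(term):
            continue
        cluster_id = term["cluster"]
        if cluster_id not in best or term["p"] < best[cluster_id]["p"]:
            best[cluster_id] = term
    return best

# Made-up records for illustration only.
terms = [
    {"id": "T1", "p": 1e-5, "count": 7, "enrichment": 4.0, "cluster": 1},
    {"id": "T2", "p": 1e-4, "count": 5, "enrichment": 3.0, "cluster": 1},
    {"id": "T3", "p": 1e-6, "count": 2, "enrichment": 9.0, "cluster": 2},
    {"id": "T4", "p": 5e-3, "count": 3, "enrichment": 2.0, "cluster": 2},
]
representatives = cluster_representatives(terms)
print({cid: term["id"] for cid, term in representatives.items()})
```

`T3` has the smallest p-value in its cluster but is excluded because its count is below 3, so `T4` represents cluster 2.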

Figure 4. Summary of enrichment analysis in COVID [11].

GO | Description | Count | % | Log10(P) | Log10(q)
COVID202 | RNA_Vanderheiden_pHAE_48h_Down | 7 | 7 | -5.50 | -2.20
COVID045 | RNA_Wyler_Caco-2_24h_Down | 4 | 4 | -5.40 | -2.20
COVID047 | RNA_Wyler_Calu-3_12h_Down | 6 | 6 | -4.40 | -1.70
COVID043 | RNA_Wyler_Caco-2_12h_Down | 5 | 5 | -4.30 | -1.60
COVID052 | RNA_Xiong_BALF_Up | 7 | 7 | -4.20 | -1.50
COVID009 | RNA_Blanco-Melo_A549-ACE2_Down | 6 | 6 | -4.20 | -1.50
COVID006 | RNA_Appelberg_Huh-7_72h_Up | 6 | 6 | -3.30 | -1.00
COVID035 | RNA_Sun_Calu-3_0h_Down | 4 | 4 | -2.60 | -0.59
COVID018 | RNA_Blanco-Melo_Lung_Up | 3 | 3 | -2.60 | -0.58
COVID257 | RNA_Wilk_CD14+Monocytes_patient-C6_Down | 3 | 3 | -2.50 | -0.55
COVID034 | RNA_Liao_BALF-severe-vs-mild_Up | 5 | 5 | -2.50 | -0.53
COVID209 | Proteome_Li_Urine-recovery-vs-healthy_Down | 3 | 3 | -2.10 | -0.24

Figure 5. Summary of enrichment analysis in Cell Type Signatures [12].

GO | Description | Count | % | Log10(P) | Log10(q)
M39171 | MURARO PANCREAS PANCREATIC POLYPEPTIDE CELL | 8 | 8 | -7.30 | -3.30
M39118 | AIZARANI LIVER C14 HEPATOCYTES 2 | 8 | 8 | -6.10 | -2.50
M39209 | HAY BONE MARROW STROMAL | 13 | 13 | -5.80 | -2.40
M39271 | HU FETAL RETINA RPE | 8 | 8 | -5.20 | -2.10
M39115 | AIZARANI LIVER C11 HEPATOCYTES 1 | 8 | 8 | -5.20 | -2.00
M40220 | DESCARTES FETAL KIDNEY URETERIC BUD CELLS | 8 | 8 | -5.10 | -2.00
M39161 | GAO LARGE INTESTINE ADULT CA ENTEROENDOCRINE CELLS | 8 | 8 | -5.00 | -2.00
M40175 | DESCARTES FETAL EYE CORNEAL AND CONJUNCTIVAL EPITHELIAL CELLS | 7 | 7 | -4.80 | -1.80
M40276 | DESCARTES FETAL PLACENTA AFP ALB POSITIVE CELLS | 6 | 6 | -4.60 | -1.80
M40027 | BUSSLINGER DUODENAL EARLY IMMATURE ENTEROCYTES | 5 | 5 | -4.40 | -1.70
M41708 | FAN OVARY CL6 PUTATIVE EARLY ATRETIC FOLLICLE THECAL CELL 2 | 7 | 7 | -4.20 | -1.50
M40252 | DESCARTES FETAL MUSCLE LYMPHATIC ENDOTHELIAL CELLS | 4 | 4 | -4.00 | -1.40
M39170 | MURARO PANCREAS DELTA CELL | 6 | 6 | -3.70 | -1.20
M40267 | DESCARTES FETAL PANCREAS ACINAR CELLS | 4 | 4 | -3.70 | -1.20
M40240 | DESCARTES FETAL LUNG NEUROENDOCRINE CELLS | 4 | 4 | -3.70 | -1.20
M40012 | BUSSLINGER GASTRIC IMMATURE PIT CELLS | 5 | 5 | -3.70 | -1.20
M39069 | MANNO MIDBRAIN NEUROTYPES HDA2 | 8 | 8 | -3.50 | -1.10
M41675 | TRAVAGLINI LUNG ADVENTITIAL FIBROBLAST CELL | 6 | 6 | -3.30 | -1.00
M40015 | BUSSLINGER GASTRIC LYZ POSITIVE CELLS | 4 | 4 | -3.30 | -1.00
M39147 | GAO SMALL INTESTINE 24W C3 ENTEROCYTE PROGENITOR SUBTYPE 1 | 3 | 3 | -3.30 | -0.99

Figure 6. Summary of enrichment analysis in DisGeNET [13].

GO | Description | Count | % | Log10(P) | Log10(q)
C0269102 | Endometrioma | 10 | 10 | -7.50 | -3.30
C0028259 | Nodule | 9 | 9 | -6.40 | -2.70
C0334579 | Anaplastic astrocytoma | 8 | 8 | -6.40 | -2.70
C0024232 | Lymphatic Metastasis | 11 | 11 | -6.40 | -2.70
C4699512 | Large-artery atherosclerosis (embolus/thrombosis) | 5 | 5 | -6.30 | -2.60
C0162820 | Dermatitis, Allergic Contact | 7 | 7 | -6.10 | -2.50
C0555198 | Malignant Glioma | 13 | 13 | -6.10 | -2.50
C0005940 | Bone Diseases | 9 | 9 | -6.00 | -2.50
C4552766 | Miscarriage | 10 | 10 | -5.80 | -2.40
C0205699 | Carcinomatosis | 7 | 7 | -5.50 | -2.20
C0007095 | Carcinoid Tumor | 8 | 8 | -5.50 | -2.20
C0021368 | Inflammation | 10 | 10 | -5.50 | -2.20
C1527336 | Sjogren's Syndrome | 10 | 10 | -5.30 | -2.10
C0205698 | Undifferentiated carcinoma | 8 | 8 | -5.30 | -2.10
C0010278 | Craniosynostosis | 10 | 10 | -5.30 | -2.10
C0008479 | Chondrosarcoma | 9 | 9 | -5.30 | -2.10
C0521158 | Recurrent tumor | 12 | 12 | -5.20 | -2.10
C0024668 | Mammary Neoplasms, Experimental | 7 | 7 | -5.10 | -2.00
C1704272 | Benign Prostatic Hyperplasia | 12 | 12 | -5.00 | -2.00
C0032310 | Pneumonia, Viral | 3 | 3 | -5.00 | -2.00

Figure 7. Summary of enrichment analysis in PaGenBase [14].

GO | Description | Count | % | Log10(P) | Log10(q)
PGB:00014 | Cell-specific: DRG | 12 | 12 | -7.40 | -3.30
PGB:00050 | Tissue-specific: trachea | 5 | 5 | -4.90 | -1.90
PGB:00081 | Cell-specific: Bronchial Epithelial Cells | 5 | 5 | -4.00 | -1.40
PGB:00004 | Tissue-specific: kidney | 8 | 8 | -3.70 | -1.20
PGB:00018 | Tissue-specific: lung | 8 | 8 | -3.60 | -1.20
PGB:00034 | Cell-specific: OVR278E | 4 | 4 | -3.00 | -0.83
PGB:00011 | Tissue-specific: spleen | 7 | 7 | -2.90 | -0.74
PGB:00072 | Tissue-specific: uterus | 3 | 3 | -2.60 | -0.58
PGB:00031 | Cell-specific: HUVEC | 5 | 5 | -2.10 | -0.26
PGB:00001 | Tissue-specific: liver | 7 | 7 | -2.10 | -0.25
PGB:00057 | Tissue-specific: ovary | 4 | 4 | -2.00 | -0.22

Figure 8. Summary of enrichment analysis in TRRUST.

GO | Description | Count | % | Log10(P) | Log10(q)
TRR00877 | Regulated by: NFKBIA | 3 | 3 | -4.40 | -1.60
TRR00342 | Regulated by: FOS | 4 | 4 | -4.20 | -1.50
TRR01158 | Regulated by: RELA | 7 | 7 | -3.90 | -1.30
TRR01277 | Regulated by: STAT3 | 5 | 5 | -3.70 | -1.20
TRR00645 | Regulated by: JUN | 5 | 5 | -3.60 | -1.20
TRR00075 | Regulated by: BRCA1 | 3 | 3 | -3.00 | -0.81
TRR01256 | Regulated by: SP1 | 7 | 7 | -2.70 | -0.66
TRR01512 | Regulated by: USF1 | 3 | 3 | -2.70 | -0.62
TRR00280 | Regulated by: ETS1 | 3 | 3 | -2.60 | -0.57
TRR00125 | Regulated by: CREB1 | 3 | 3 | -2.30 | -0.39
TRR00875 | Regulated by: NFKB1 | 5 | 5 | -2.30 | -0.37
TRR00780 | Regulated by: MYC | 3 | 3 | -2.20 | -0.34

Figure 9. Summary of enrichment analysis in Transcription Factor Targets.

GO | Description | Count | % | Log10(P) | Log10(q)
M3150 | TATA 01 | 6 | 6 | -3.70 | -1.20
M3339 | STAT4 01 | 6 | 6 | -3.60 | -1.20
M17512 | HNF3 Q6 | 5 | 5 | -3.30 | -1.00
M9129 | HMEF2 Q6 | 4 | 4 | -2.90 | -0.76
M6994 | TBP 01 | 5 | 5 | -2.90 | -0.73
M9364 | SRF Q6 | 5 | 5 | -2.90 | -0.73
M30159 | SIX1 TARGET GENES | 6 | 6 | -2.80 | -0.70
M4061 | T3R Q6 | 5 | 5 | -2.80 | -0.69
M4644 | DR1 Q3 | 5 | 5 | -2.70 | -0.66
M10704 | NKX25 02 | 5 | 5 | -2.70 | -0.64
M9645 | GGGYGTGNY UNKNOWN | 8 | 8 | -2.70 | -0.64
M9955 | TGAYRTCA ATF3 Q6 | 7 | 7 | -2.60 | -0.61
M13187 | GGARNTKYCCA UNKNOWN | 3 | 3 | -2.60 | -0.61
M13279 | HFH8 01 | 4 | 4 | -2.30 | -0.39
M4372 | STAT5A 04 | 4 | 4 | -2.30 | -0.37
M9937 | RSRFC4 Q2 | 4 | 4 | -2.20 | -0.35
M2998 | LBP1 Q6 | 4 | 4 | -2.20 | -0.32
M30323 | ZNF436 TARGET GENES | 6 | 6 | -2.20 | -0.32
M551 | TEF1 Q6 | 4 | 4 | -2.20 | -0.30
M14782 | VDR Q3 | 4 | 4 | -2.10 | -0.28

References

  1. Zhou et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications (2019) 10(1):1523.
  2. Zar, J.H. Biostatistical Analysis, 4th edn. Prentice Hall, NJ (1999), p. 523.
  3. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine (1990) 9:811-818.
  4. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:27-46.
  5. Shannon P. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498-2504.
  6. Szklarczyk D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2019) 47:D607-613.
  7. Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-539.
  8. Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods. (2016) 13:966-967.
  9. Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods. (2017) 14:61-64.
  10. Bader, G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics (2003) 4:2.
  11. https://metascape.org/COVID.
  12. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-15550 (2005).
  13. Pinero J, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research 45, D833-D839 (2017).
  14. Pan JB, et al. PaGenBase: a pattern gene database for the global and dynamic understanding of gene function. PLoS One 8, e80747 (2013).