Gene Set Analysis Using Spatial Statistics

Riffo-Campos, Angela L.; Ayala, Guillermo; Montes, Francisco

doi:10.3390/math9050521

by Angela L. Riffo-Campos ¹,², Guillermo Ayala ²,* and Francisco Montes ²

¹ Centro de Excelencia de Modelación y Computación Científica, Universidad de La Frontera, Temuco 4780000, Chile

² Departamento de Estadística e Investigación Operativa, Universidad de Valencia, Avda. Vicent Andrés Estellés, 1, 46100 Burjasot, Spain

* Author to whom correspondence should be addressed.

Mathematics 2021, 9(5), 521; https://doi.org/10.3390/math9050521

Submission received: 26 January 2021 / Revised: 18 February 2021 / Accepted: 23 February 2021 / Published: 3 March 2021

(This article belongs to the Special Issue Spatial Statistics with Its Application)

Abstract

Gene differential expression consists of the study of the possible association between gene expression, evaluated using different types of data such as DNA microarrays or RNA-Seq technologies, and the phenotype. This can be performed marginally for each gene (differential gene expression) or using a gene set collection (gene set analysis). A previous (marginal) per-gene analysis of differential expression is usually performed in order to obtain a set of significant genes or marginal p-values that are used later in the study of the association between phenotype and gene expression. This paper proposes the use of methods of spatial statistics for testing gene set differential expression using paired samples of RNA-Seq counts. This approach is not based on a previous per-gene differential expression analysis. Instead, we compare the paired counts within each sample/control pair using a binomial test. Each pair per gene produces a p-value, so the gene expression profile is transformed into a vector of p-values, which is considered as an event belonging to a point pattern. This is the first component of a bivariate point pattern. The second component is generated by applying two different randomization distributions to the correspondence between samples and treatment. The self-contained null hypothesis considered in gene set analysis can be formulated in terms of the associated point pattern as a random labeling of the considered bivariate point pattern. The gene sets were defined by the Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The proposed methodology was tested on four RNA-Seq datasets of colorectal cancer (CRC) patients and the results were contrasted with those obtained using the edgeR-GOseq pipeline. The proposed methodology has proved to be consistent at the biological and statistical level, in particular using the Cuzick and Edwards test with one realization of the second component and the between-pair distribution.

Keywords: colorectal cancer; RNA-Seq; paired samples; spatial point pattern

1. Introduction

Global gene expression (transcriptome) can be quantified using DNA microarrays [1] and RNA sequencing (RNA-Seq) technologies [2]. RNA-Seq is widely used to understand and describe the biological mechanisms involved in transcription observed under different experimental conditions. The statistical comparison of the means of the gene expression is known as differential gene expression. This comparison can be performed at the gene level, i.e., a marginal analysis of each gene. There are many pipelines to perform an RNA-Seq data analysis [3]. The result is a list of differentially expressed genes, the significant genes. Usually, one expects to find a relationship between these significant genes and the biological mechanisms that underlie the observed phenotype. Such a biological mechanism is controlled by a gene set. This justifies analyzing the differential expression of gene sets by considering them from the very first step. This is called gene set (enrichment) analysis [4]. For gene set analysis, the choice of the statistical method, the type of null hypothesis, and the gene-association measure are the most important considerations. In addition, the set of genes can be biologically defined using, among others, GO [5] and/or the KEGG Ontology (KO) [6]. GO includes three categories: the first one is the biological objective to which the gene or its product contributes and is called Biological Process; the second one is the biochemical activity of a gene product, called Molecular Function; the third is called Cellular Component and refers to the place in the cell where a gene product is active [5]. On the other hand, KO includes all molecular networks in the categories of Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases, Brite Hierarchies and Not Included in Pathway or Brite [6].

The methods for gene set analysis can be classified, according to the null hypothesis tested, into self-contained or competitive tests [7]. The basic question can be formulated as: Is there any association between a set of genes and the phenotype? This is a very vague question, and a more precise formulation is required. Different interpretations of the question are possible. The authors of [8] formulate two null hypotheses that specify the previous question in two different ways. We reproduce them here. The competitive hypothesis is formulated as “The genes in a gene set show the same pattern of associations with the phenotype compared with the rest of the genes”. The self-contained hypothesis is formulated as “The gene set does not contain any genes whose expression levels are associated with the phenotype of interest.” We are concerned in this paper with the testing of the self-contained null hypothesis using RNA-Seq data observed under two conditions (case-control) using a paired design. An extensive literature on gene set analysis exists. A large part of it was developed and implemented for microarray data and later adapted to RNA-Seq data. Good reviews can be found in [9,10]. Gene set over-representation analysis (ORA) consists of evaluating whether a previously defined gene set, such as a GO term, is more represented than the others in the list of genes previously selected as differentially expressed, and whether this over-representation is unlikely to be due to chance. This is used, for instance, by GOseq [11]. This approach assumes independent gene expressions, i.e., it starts from a marginal analysis of each gene. Nonetheless, it is well known that, when an alteration occurs, it is not an individual gene, but a gene set that is affected. To take this into account, gene set enrichment analysis (GSEA) [12,13] offers a modification over the previous methods. The GSEA method does not start from a previously selected list of differentially expressed genes, but, instead, it uses the gene set as the initial unit; an example is the SeqGSEA method [14]. However, these tools first apply a gene-level test to the original data.

An alternative is to perform the analysis by first considering the gene relationships and focusing directly on the gene set differential expression. This idea has been exploited by some methods previously proposed for microarray data [15]. Nevertheless, gene set differential expression for RNA-Seq data is less studied. Some interesting examples are [16,17]. The method proposed in this paper is designed to test the self-contained hypothesis by using statistical analysis of spatial point patterns.

The study of spatial aggregation or clustering has a long history in the spatial statistics literature [18]. Nonetheless, this is, to our knowledge, the first time that spatial statistical methods, i.e., the Cuzick and Edwards test, the Diggle, Morris, and Morton–Jones test and the Diggle test, are applied as an effective approach for gene set analysis. The methodology is implemented in the R package OMICfpp2 available at http://www.uv.es/ayala/software/OMICfpp2_1.0.tar.gz (accessed on 25 February 2021) and https://github.com/JdMDE/OMICfpp2 (accessed on 25 February 2021).

2. Methodology

2.1. General Notation

Let us introduce the basic notation needed later. This paper is concerned with paired design and the notation is given accordingly.

Let $N$ be the number of genes and $n$ the number of pairs of samples. The value of $N$ is much larger than $n$, $N \gg n$. Let $x_{ijk}$ denote the count corresponding to the $i$-th gene ($i = 1, \dots, N$), $j$-th pair ($j = 1, \dots, n$) and $k$ the element within the pair ($k = 1, 2$). The total number of counts for the sample $(j, k)$, its library size, will be denoted by $m_{jk} = \sum_{i = 1}^{N} x_{ijk}$. We assume the values $x_{ijk}$ for a given $i$ are in the $i$-th row of the expression matrix in such a way that each column is associated with a $(j, k)$ sample, so we have an $N \times 2n$ matrix. If we consider the random expression matrix instead of the observed one, then its rows would be dependent random vectors (the expressions of the genes are dependent), and its columns would be independent random vectors (corresponding to different individuals).

We are interested in a gene set collection $S_{1}, \dots, S_{T}$, where $S_{t} \subset G$ for $t = 1, \dots, T$ and $G = \{1, \dots, N\}$ is the universe of genes. These gene sets do not need to be disjoint. Given the observed expression matrix $x$ and a gene set $S_{t}$, we can extract the matrix corresponding to the rows in $S_{t}$, i.e., $x_{S_{t}}$. Let $ϕ(S_{t})$ be the set composed of the columns of $x_{S_{t}}$, $ϕ(S_{t}) \subset \mathbb{R}^{|S_{t}|}$. This set $ϕ(S_{t})$ can be considered as a point pattern (it will be called the sample point pattern), and the corresponding point process will be denoted by $Φ(S_{t})$. A formal presentation of point process theory can be found in [19].

For the $(j, k)$ sample, we have a phenotypic covariable, $y_{jk}$ ($\in \mathbb{R}$), for instance an experimental factor indicating case or control. The previous point pattern, $ϕ(S_{t})$, and this covariable can be put together in a so-called marked point pattern.

This simple idea is used later to connect two different topics: statistical analysis of marked point patterns and gene set analysis.

2.2. Paired RNA-Seq Samples

The most common setup consists of two groups of samples to be compared. Our data are pairs of RNA-Seq counts quantifying the gene expression, i.e., the samples are grouped in pairs corresponding to two conditions observed on the same individual. We will have a binary phenotypic covariable $y_{jk}$, where $y_{jk} = 0$ (respectively 1) corresponds to a control (respectively case).

A simple procedure was proposed in [20] to test the null hypothesis of no association between condition and expression. This procedure assumes that, under the null hypothesis, the random count $X_{ij1}$ has a binomial distribution $X_{ij1} \sim Bi\left(x_{ij1} + x_{ij2}, \frac{m_{j1}}{m_{j1} + m_{j2}}\right)$, where $x_{ij1} + x_{ij2}$ is the count for the $i$-th gene and the $j$-th pair. This count and the sum of library sizes, $m_{j1} + m_{j2}$, are considered given. The p-value will be calculated as the sum of the probabilities less than or equal to the observed $x_{ij1}$ value and will be denoted as $p_{ij}$. If we have $n$ pairs of samples, then a p-value will be observed per pair and gene, so we will have $p_{i} = (p_{i1}, \dots, p_{in})$ for the $i$-th gene. Our original $N \times 2n$ gene expression matrix is transformed into an $N \times n$ p-value matrix where the $(i, j)$ entry corresponds to the p-value of the $j$-th pair of the $i$-th gene. Let the observed matrix of p-values be ${\tilde{p}}_{0}$. The corresponding random p-value matrix will be denoted ${\tilde{P}}_{0}$. Note that the columns of the random matrix ${\tilde{P}}_{0}$ are independent but its rows are not. Under the null hypothesis of no expression-phenotype association for all genes, the random p-value ${\tilde{P}}_{0}(i, j)$ follows approximately a uniform distribution.
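The per-pair binomial p-value described above can be sketched in a few lines of Python. This is only an illustration, not the OMICfpp2 implementation: the function names are hypothetical, and because the wording of the p-value admits two readings, both are offered. With `two_sided=False` the function returns the lower-tail probability $P(X_{ij1} \leq x_{ij1})$; with `two_sided=True` it sums the probabilities of all outcomes that are no more likely than the observed one. Which of the two matches [20] is an assumption left open here.

```python
import math

def binom_pmf(k, size, prob):
    """Binomial probability P(X = k), computed in log space so that large
    RNA-Seq counts do not overflow."""
    log_p = (math.lgamma(size + 1) - math.lgamma(k + 1) - math.lgamma(size - k + 1)
             + k * math.log(prob) + (size - k) * math.log1p(-prob))
    return math.exp(log_p)

def paired_binomial_pvalue(x1, x2, m1, m2, two_sided=False):
    """p-value for one gene and one pair.

    x1, x2: counts of the gene in the two samples of the pair.
    m1, m2: library sizes of the two samples (both assumed positive).
    """
    size = x1 + x2                 # total count, considered given
    prob = m1 / (m1 + m2)          # success probability under the null
    if two_sided:
        # Assumption: sum the probabilities of outcomes at most as likely as x1.
        observed = binom_pmf(x1, size, prob)
        total = sum(p for p in (binom_pmf(k, size, prob) for k in range(size + 1))
                    if p <= observed * (1 + 1e-7))
    else:
        # Assumption: lower tail, P(X <= x1).
        total = sum(binom_pmf(k, size, prob) for k in range(x1 + 1))
    return min(1.0, total)
```

Applying the function to every gene $i$ and pair $j$ fills the $N \times n$ matrix of p-values.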

2.3. Gene Set Point Pattern

We are interested in the study of the differential expression between conditions for a given (previously defined) gene set. As before, $G = \{1, \dots, N\}$ is the universe of genes, and the $i$-th gene has its expression profile in the $i$-th row of the expression matrix. Let $S = \{i_{1}, \dots, i_{|S|}\}$ be a given gene set with $S \subset G$. Our approach will test the self-contained null hypothesis of no differential expression for the gene set. The most common approach consists of two steps. Firstly, the null hypothesis of no differential expression is tested for each gene. Secondly, the statistics (or p-values) of these (marginal) tests are aggregated, ignoring the dependencies between them. These dependencies have to be incorporated into the analysis, and we deal with them by using point processes.

Our gene set of interest is $S = \{i_{1}, \dots, i_{|S|}\}$. The vector $v_{j} = (p_{i_{1}, j}, \dots, p_{i_{|S|}, j})^{'} \in [0, 1]^{|S|}$ contains all the observed p-values for the genes in $S$ corresponding to the $j$-th pair. We can consider all samples jointly in $ϕ_{S} = \{v_{1}, \dots, v_{n}\}$. It is a finite set of $n$ points contained in the unit hypercube $[0, 1]^{|S|}$, a point pattern. Each event corresponds to a pair of samples and each dimension of the point corresponds to a gene. As we are working with a paired design and p-values, this point pattern is a natural description of the differential expression of the gene set in both conditions. No independence between genes is assumed. Let $V_{1}, \dots, V_{n}$ be a random sample of $n$ random vectors, distributed as $V$, whose corresponding observed values are $v_{1}, \dots, v_{n}$. Analogously, $ϕ_{S} = \{v_{1}, \dots, v_{n}\}$ is the point pattern and $Φ_{S} = \{V_{1}, \dots, V_{n}\}$ is the point process. In our approach, each event of the point pattern $ϕ_{S}$ corresponds to a pair of samples, more precisely to the p-values of this pair of samples.
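As an illustration of how the point pattern is assembled, the following Python sketch extracts the points $v_{1}, \dots, v_{n}$ from a p-value matrix. The function name and the data layout (a list of rows, one per gene, with 0-based indices) are assumptions made for the example.

```python
def gene_set_point_pattern(pvalues, gene_set):
    """Build the point pattern of a gene set.

    pvalues:  N rows (genes) by n columns (pairs) of p-values.
    gene_set: 0-based row indices of the genes in S.
    Returns a list of n points, each a tuple of |S| p-values.
    """
    n_pairs = len(pvalues[0])
    return [tuple(pvalues[i][j] for i in gene_set) for j in range(n_pairs)]

# Three genes, two pairs; the gene set contains the first and third gene.
p = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
pattern = gene_set_point_pattern(p, [0, 2])
```

Each point has one coordinate per gene in $S$, so the dimension of the pattern grows with the size of the gene set while the number of points stays equal to the number of pairs.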

2.4. Testing Differential Expression

We are going to test the gene set differential expression by using a bivariate point process. Each realization will have two components with n points each. The first component is the point process corresponding to the p-values obtained using the original sample classification. This first component will be called cases. The second component is generated by applying a randomization to the original sample classification. The second component will be called controls.

This bivariate point process, under the null hypothesis, would be just a random labeling (with n points per component) of the union of both processes.

This random labeling hypothesis will be tested using different statistical procedures taken from the literature of spatial point processes, and they were designed in order to look for some characteristic of the joint distribution. We provide next a short description of the tests.

We consider a given gene set S. The outline of our procedure is as follows:

Using the original pairs, we obtain the first point pattern corresponding to the original pairs or cases, $ϕ_{0}$ .
We choose a randomization distribution, between-pair or complete distribution, and a number of randomizations B to be performed.
Using the chosen distribution in the previous step, we generate new pairs and a new sample point pattern associated with these pairs $ϕ_{1}, \dots, ϕ_{B}$ .
For the i-th bivariate point pattern $(ϕ_{0}, ϕ_{i})$ , it is tested if it can be considered a random labeling of the point pattern $ϕ_{0} \cup ϕ_{i}$ , and the corresponding p-value is obtained.

2.4.1. Generation of Controls

These are the randomization distributions used to generate the random points.

Between-pair (BP) distribution. The first element of each pair is maintained as the original one. The second element of each pair is obtained by randomly permuting the second components of all pairs among themselves. We have $(y_{i, 1}, y_{γ(i), 2})$ for $i = 1, \dots, n$, where $γ$ is now a permutation of $(1, \dots, n)$. The number of possible permutations is $n!$.

Complete (C) distribution. Let us choose $I = \{i_{1}, \dots, i_{n}\}$ as a random subset of $\{1, \dots, 2n\}$. The indices of $\{1, \dots, 2n\}$ not in $\{i_{1}, \dots, i_{n}\}$ can be denoted $J = \{j_{1}, \dots, j_{n}\}$. A random correspondence between $I$ and $J$ will produce the pairs. The number of possible values is $\frac{(2n)!}{n!}$.
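The two randomization distributions can be sketched as follows. This is a hedged illustration with hypothetical function names; a sample is represented by the tuple `(j, k)` with a 0-based pair index `j` and `k` in `{1, 2}`. For the complete distribution, shuffling all $2n$ samples and pairing the first half with the second half is used as one way of drawing a random subset $I$ together with a random correspondence between $I$ and $J$.

```python
import math
import random

def between_pair(n, rng):
    """Keep the first element of each pair and permute the second elements."""
    gamma = list(range(n))
    rng.shuffle(gamma)                       # random permutation of the pairs
    return [((i, 1), (gamma[i], 2)) for i in range(n)]

def complete(n, rng):
    """Ignore the paired design: split the 2n samples at random and re-pair them."""
    samples = [(j, k) for j in range(n) for k in (1, 2)]
    rng.shuffle(samples)
    return list(zip(samples[:n], samples[n:]))   # I paired with J

def count_between_pair(n):
    return math.factorial(n)                          # n!

def count_complete(n):
    return math.factorial(2 * n) // math.factorial(n)  # (2n)! / n!

rng = random.Random(0)
new_pairs_bp = between_pair(5, rng)
new_pairs_c = complete(5, rng)
```

For each new pairing, the binomial p-values are recomputed and a control point pattern is obtained in the same way as for the original pairs.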

2.4.2. Statistical Tests

The Cuzick and Edwards test (CET) was proposed in [21]. We have a bivariate point pattern corresponding to cases and controls. The objective is to detect spatial clustering of cases. It is assumed that their spatial distribution is not homogeneous, as in our problem. The test is based on distances between nearest neighbor (NN) pairs of points. Let $\{z_{j}\}_{j = 1, \dots, 2n}$ be the locations of the combined sample, where the indices have been randomly permuted. We define, for $i = 1, \dots, 2n$,

$$δ_{i} = \begin{cases} 1 & \text{if } z_{i} \text{ is a case} \\ 0 & \text{if } z_{i} \text{ is a control} \end{cases}$$

and

$$d_{i} = \begin{cases} 1 & \text{if the NN to } z_{i} \text{ is a case} \\ 0 & \text{if the NN to } z_{i} \text{ is a control} \end{cases}$$

The statistic is $T = \sum_{i = 1}^{2n} δ_{i} d_{i}$, i.e., we are counting the number of cases with a case as the nearest neighbor. It is clear that large values of the statistic are associated with clusters of cases, corresponding to the alternative hypothesis of a phenotype-expression association.
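A minimal Python sketch of the nearest-neighbor count follows. It assumes Euclidean distance in the unit hypercube and, when several points are equally near, takes the first one found; both choices, like the function name, are assumptions of the example rather than details taken from [21] or from OMICfpp2.

```python
import math

def cuzick_edwards(points, labels):
    """Number of cases whose nearest neighbor is also a case.

    points: list of coordinate tuples (cases and controls pooled).
    labels: 1 for a case, 0 for a control, in the same order as points.
    """
    total = 0
    for i, (z_i, delta_i) in enumerate(zip(points, labels)):
        if delta_i != 1:
            continue                     # only cases contribute to the sum
        # index of the nearest neighbor of z_i, excluding z_i itself
        nn = min((j for j in range(len(points)) if j != i),
                 key=lambda j: math.dist(z_i, points[j]))
        total += labels[nn]              # d_i: 1 if the neighbor is a case
    return total

pooled = [(0.0,), (0.1,), (0.9,), (1.0,)]
clustered = cuzick_edwards(pooled, [1, 1, 0, 0])   # cases next to each other
dispersed = cuzick_edwards(pooled, [1, 0, 0, 1])   # cases far apart
```

In the toy example the clustered labeling gives the larger count, which is the behavior the test exploits.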

The Diggle, Morris, and Morton–Jones test (DMMT) [22] was proposed to investigate an elevated risk around a specified point. Consider again the variables $\{δ_{i}\}_{i = 1, \dots, 2n}$ previously defined. Let $γ_{i} = P(δ_{i} = 1)$ and let $γ_{(i)}$ be the corresponding values ordered according to the distance to the origin. Under the null hypothesis of no differential expression, all the $γ_{(i)}$ are equal to $1/2$. Under the alternative hypothesis that the $γ_{(i)}$ are non-increasing in the distance to the origin, the maximum likelihood estimators of the $γ_{(i)}$ are easily obtained by the pool-adjacent-violators algorithm:

$${\hat{γ}}_{(i)} = \min_{s \leq i} \max_{t \geq i} \frac{\sum_{r = s}^{t} δ_{(r)}}{t - s + 1}.$$

The maximum likelihood ratio test statistic is given by

$$T_{D} = 2 \sum_{i = 1}^{2n} \left\{ δ_{(i)} \log \frac{{\hat{γ}}_{(i)}}{1/2} + (1 - δ_{(i)}) \log \frac{1 - {\hat{γ}}_{(i)}}{1/2} \right\}.$$
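The two displayed formulas can be sketched in Python as follows. The code is a hedged illustration: it uses the pool-adjacent-violators algorithm to obtain a non-increasing fit, which is the fit given by the min-max formula, and then evaluates the likelihood ratio statistic against the value 1/2. Function names are hypothetical, and ties in the distance to the origin are left in their input order.

```python
import math

def pava_nonincreasing(values):
    """Least-squares non-increasing fit by pooling adjacent violators."""
    blocks = []                                   # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the previous block mean is smaller than the last one
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def dmmt_statistic(points, labels):
    """Likelihood ratio statistic with points ordered by distance to the origin."""
    order = sorted(range(len(points)), key=lambda i: sum(c * c for c in points[i]))
    delta = [labels[i] for i in order]
    gamma = pava_nonincreasing(delta)
    stat = 0.0
    for d, g in zip(delta, gamma):
        # only the term with a non-zero coefficient is evaluated, so log(0) never occurs
        stat += math.log(g / 0.5) if d == 1 else math.log((1.0 - g) / 0.5)
    return 2.0 * stat

pts = [(0.1,), (0.2,), (0.8,), (0.9,)]
separated = dmmt_statistic(pts, [1, 1, 0, 0])     # cases closest to the origin
alternating = dmmt_statistic(pts, [0, 1, 0, 1])   # no monotone trend
```

When the cases are the points nearest to the origin the fitted probabilities separate completely and the statistic is large; when the labels alternate the fit is constant at 1/2 and the statistic is zero.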

The Diggle test (DT) [23] was proposed to study the possible raised incidence of certain types of cancer near nuclear installations. The test is based on fitting a particular class of non-homogeneous Poisson point process models to data. Let $λ(x)$ be the intensity function of $Φ_{1}(S)$ under the alternative hypothesis $H_{1}$. We can assume that $λ(x)$ has a multiplicative decomposition as $λ(x; γ) = ρ λ_{0}(x) f(x^{t} x; θ)$, where $x^{t}$ is the transpose of $x$ and $λ_{0}(x)$ represents the spatial variation in intensity under the self-contained null hypothesis. This null intensity function can be estimated using a kernel estimator from the control sample point process $Φ_{2}(S)$ of p-values. For a given realization of $Φ_{2}(S)$, i.e., $ϕ_{2}(S) = \{v_{j}\}_{j = 1}^{n}$, the kernel estimator using a Gaussian kernel is given by

$${\hat{λ}}_{0}(x) = \frac{\sum_{j = 1}^{n} \exp\left\{\frac{-1}{2h} (x - v_{j})^{t} (x - v_{j})\right\}}{2πh^{2}}.$$

The function $f$ can be quite general. However, the following function permits an easy computation of the maximum likelihood estimator of the parameters $θ = (α, β)^{t}$: $f(x; α, β) = 1 + α \exp(-β x^{t} x)$. Using the Gaussian kernel estimate for the function $λ_{0}$ and the proposed $f$, it is easy to obtain the formulas needed to compute the maximum pseudo-likelihood estimator of $θ = (α, β)^{t}$. Note that the null hypothesis of no clustering around the origin corresponds to $α = β = 0$. In order to test this null hypothesis, we compare $D = 2(L(\hat{α}, \hat{β}) - L(0, 0))$ with critical values of $χ_{2}^{2}$. Details can be found in [23]. We have implemented it in the n-dimensional case in our R package OMICfpp2.
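The two building blocks of the model, the kernel estimate of the null intensity and the function $f$, can be evaluated as in the Python sketch below. It only transcribes the two displayed formulas, including the normalizing constant as printed; the pseudo-likelihood $L$ and its maximization are described in [23] and are not reproduced here. Function names are hypothetical.

```python
import math

def kernel_intensity(x, controls, h):
    """Gaussian kernel estimate of the null intensity at x, as displayed in the text.

    x:        point at which the intensity is evaluated (tuple of coordinates).
    controls: list of control points v_1, ..., v_n.
    h:        bandwidth (positive).
    """
    total = 0.0
    for v in controls:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, v))
        total += math.exp(-squared_distance / (2.0 * h))
    return total / (2.0 * math.pi * h ** 2)

def risk_function(x, alpha, beta):
    """f(x; alpha, beta) = 1 + alpha * exp(-beta * x'x)."""
    return 1.0 + alpha * math.exp(-beta * sum(a * a for a in x))

at_control = kernel_intensity((0.2, 0.2), [(0.2, 0.2)], 1.0)
no_effect = risk_function((0.3, 0.4), 0.0, 0.0)
at_origin = risk_function((0.0, 0.0), 2.0, 5.0)
```

With $α = 0$ the function $f$ is identically one, so the intensity of the cases is proportional to the null intensity; a positive $α$ raises the intensity near the origin, where small p-values concentrate.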

2.4.3. Testing the Self-Contained Hypothesis

Under this hypothesis, the original point process, $Φ_{0}(S)$, would be a non-homogeneous Poisson point process in the hypercube $[0, 1]^{|S|}$. Note that, under the alternative hypothesis, the point process will tend to produce clustering around the origin. Many other statistical tests could be used and this could be a good line of future research. The three previous tests have been taken from an epidemiological context.

The points correspond to the p-values for the different genes in our gene set of interest $S$. If there is no gene set differential expression, then the $2n$ points are independent and identically distributed following a common unknown distribution. We keep a label for each point indicating whether it corresponds to an original pair or to a randomly generated pair. No differential expression means that the cases are just a random selection of $n$ points from the total $2n$ points, i.e., the bivariate point pattern is just a random labeling of the original point set. We thus reformulate the test of no gene set differential expression as a test of random labeling in a bivariate point process.

This random labeling has been tested using a Monte Carlo test. Let us give a short description. Let $(ϕ_{0}, ϕ_{i})$ be a bivariate point pattern where its first component, $ϕ_{0}$, is the original point pattern and its second component, $ϕ_{i}$, is a control. Let $t_{0}$ be any of the three previous statistics evaluated for this bivariate point pattern. The set $ϕ_{0} \cup ϕ_{i}$ is randomly partitioned into two new sets of $n$ points. The same statistic is evaluated for this new bivariate point pattern. We repeat the selection process $B$ times independently, obtaining the statistics $t_{1}, \dots, t_{B}$. Under the null hypothesis, any ordering of the vector $(t_{0}, t_{1}, \dots, t_{B})$ has the same probability. The Monte Carlo p-value [24] is given by $p = \frac{|\{b : |t_{b}| > |t_{0}|, b = 1, \dots, B\}|}{B + 1}$. In the experimental study, $B = 100$ will be used. This Monte Carlo p-value will be used to test the self-contained null hypothesis.
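The random relabeling step can be sketched in Python as follows. The sketch implements the p-value formula exactly as written above, so the result can be zero when no relabeled statistic exceeds the observed one; a common variant in the Monte Carlo testing literature adds one to the numerator, which is not what is coded here. The function names and the toy statistic are hypothetical and stand in for any of the three tests.

```python
import random

def monte_carlo_pvalue(cases, controls, statistic, B=100, rng=None):
    """Monte Carlo test of random labeling.

    cases, controls: lists of n points each.
    statistic:       callable taking (cases, controls) and returning a number.
    """
    rng = rng or random.Random()
    n = len(cases)
    t0 = statistic(cases, controls)
    pooled = list(cases) + list(controls)       # copy, inputs are left untouched
    exceed = 0
    for _ in range(B):
        rng.shuffle(pooled)                     # random partition into n + n points
        t_b = statistic(pooled[:n], pooled[n:])
        if abs(t_b) > abs(t0):
            exceed += 1
    return exceed / (B + 1)

def toy_statistic(cases, controls):
    """Hypothetical statistic: large when cases lie closer to the origin than controls."""
    def mean_sq(pts):
        return sum(sum(c * c for c in p) for p in pts) / len(pts)
    return mean_sq(controls) - mean_sq(cases)

cases = [(0.01, 0.02), (0.03, 0.01), (0.02, 0.02)]
controls = [(0.5, 0.6), (0.7, 0.4), (0.9, 0.8)]
p_value = monte_carlo_pvalue(cases, controls, toy_statistic, B=100, rng=random.Random(2))
```

Passing the statistic as a function keeps the relabeling logic independent of the test that is used.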

The sample point pattern for cases is unique, but many sample point patterns corresponding to controls can be generated. For each bivariate point pattern generated, a Monte Carlo p-value is obtained. It is clear that the computational time is greater when more than one control sample point pattern is generated. More than one realization produces different p-values that will be aggregated using meta-analysis techniques for p-values such as Fisher’s method. It is interesting to evaluate whether just one realization of controls is enough or whether more than one realization produces better results. This is evaluated in Section 3.
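Fisher's method combines $k$ p-values through the statistic $-2 \sum \log p_{i}$, which is compared with a chi-square distribution with $2k$ degrees of freedom. The Python sketch below is a hedged illustration with a hypothetical function name. It uses the closed form of the chi-square upper tail for even degrees of freedom, so no external library is needed. Two caveats are assumptions worth stating: the reference distribution relies on independent p-values, whereas realizations sharing the same cases are not independent, and every p-value must be strictly positive for the logarithm to exist.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine p-values with Fisher's method.

    The statistic -2 * sum(log p) is referred to a chi-square distribution with
    2k degrees of freedom, whose upper tail at x is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)   # half of Fisher's statistic
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i                        # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

single = fisher_combined_pvalue([0.5])
combined = fisher_combined_pvalue([0.05, 0.05])
```

A single p-value is returned unchanged, and two p-values of 0.05 combine into a smaller value, as expected when the evidence accumulates.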

2.5. Data

A total of four RNA-Seq data sets with paired (tumor/adjacent normal tissue) samples from CRC patients have been analyzed. The first three data sets correspond to the Bioprojects PRJNA411984 [25], PRJNA413956 [26] and PRJNA218851 [27] with 2, 7, and 18 raw data pairs, respectively. The fourth data set includes 50 pairs of preprocessed data (count files) from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/ (accessed on 26 January 2021)).

We are going to evaluate if just one or multiple realizations are needed to generate the second component of the bivariate point process using the TCGA dataset. The gene sets were defined using GO terms and KEGG ontology. The GO gene set collection uses only the biological process category and gene sets with ten or more genes in the set, resulting in a total of 2815. The KEGG gene set collections have been entirely used and there are a total of 340.

The TCGA dataset has been analyzed using the three tests proposed: CET, DMMT, and DT. One realization (OR) of the between-pair (BP) randomization distribution has been used to generate the second component of bivariate point process. The random labeling hypothesis has been tested using 100 simulations.

The four datasets were analyzed using edgeR-GOseq pipeline. The method edgeR can be found in [28,29]. The GOseq R package [11] allows us to analyze gene sets from GO using RNA-Seq data and also KEGG pathways analysis.

The whole code needed to reproduce our paper can be found in the supplementary file SupplementaryMaterialMethods_pointgene.pdf.

3. Results

3.1. One or Multiple Realizations?

Out of all GO gene sets, 8% reported as significant (p-value < 0.05) using one realization (OR) were also reported using multiple realizations (MR) with all tests (Figure 1A). For KEGG, 143 unique gene sets were reported with OR and 52 unique gene sets were reported with MR, which corresponds to a decrease of 64% (Figure 1B).

3.2. Analyzing the Tests

A total of 80, 0, 1 GO terms and 63, 5, 6 KEGG pathways were reported (p-value < 0.05) by CET, DMMT, and DT, respectively. The DMMT and DT were more conservative than CET at reporting differentially expressed gene sets. No common gene sets were reported between the spatial tests.

The first five gene sets with the lowest p-values (the top gene sets) obtained by each test are compared in order to identify the most appropriate approach according to biological relevance. However, the number of papers dealing with CRC that are associated with the gene sets reported as significant by a test may not be a sufficient criterion to evaluate them, because this number is closely related to the detection method used and its age.

Thus, as a complement to these results, we include a list of biological pathways that have been shown to be associated with CRC (see [30,31,32,33,34]), including the EGFR, MAPK, Notch, PI3K, P53, Ras, TGF-β, Wnt/β-catenin, JAK-STAT, VEGF, and NF-kappaB signaling pathways, which could therefore serve as (but are not necessarily) a gold standard. In this sense, 14 KEGG biological pathways and 156 GO terms that represent these signaling pathways were selected to evaluate which tests reported these gene sets more frequently in their results. Of the 156 GO terms, many were made up of a single gene, being subsets of the signaling pathways, so a short list of 15 GO gene sets was used, which includes only the general signaling pathways (see tables in Supplementary Material).

The top GO and KEGG gene sets reported by CET were highly related to CRC and other cancer types, as reported by the Comparative Toxicogenomics Database and bibliographic databases (Table 1). Additionally, canonical pathways involved in CRC, such as the PI3K-Akt, JAK-STAT, and Ras signaling pathways, were reported in the first places by CET. The DMMT did not report differentially expressed gene sets among the GO terms. For KEGG, the gene sets reported by DMMT were less associated with CRC than those reported by CET in terms of the number of articles, but they also include canonical pathways such as EGFR tyrosine kinase inhibitor resistance. In general, the KEGG results obtained by the DT method were less associated (but also related) with CRC than those obtained by CET and DMMT. This was in concordance with the p-values reported for the gene sets by each test.

3.3. Choosing the Randomization Distribution

The between-pair randomization distribution is the most natural one for paired designs, but the complete randomization distribution (forgetting the paired design) could be applied too. Thus, the more appropriate randomization distribution to generate the second component of the point process has been evaluated. A total of 9, 27, and 216 GO terms and 63, 5, and 6 KEGG pathways were reported (p-value < 0.05) by CET, DMMT, and DT, respectively, using complete distribution.

No common gene sets were reported between the spatial tests using either the complete distribution or the between-pair distribution. The results obtained using the complete and between-pair distributions for each test were compared: a total of 9, 0, and 1 GO terms and 11, 1, and 1 KEGG pathways were reported (p-value < 0.05) in common using CET, DMMT, and DT, respectively. These data are not shown and can be found in the Supplementary Material. Thus, most of the gene sets reported using CET with the complete distribution were also reported using the between-pair distribution. With the complete distribution, the number of GO gene sets reported decreased at least eightfold for CET and increased by one hundred percent for DMMT and DT. Regarding the biological assertiveness of the results, fewer articles associated with CRC were included in the groups reported using the complete distribution, although canonical pathways were included in the results in all spatial tests (Table 2).

3.4. Effect of Sample Size

The methodology has been evaluated for different sample sizes: an extreme case of two pairs (PRJNA411984); two intermediate cases of 7 (PRJNA413956) and 18 pairs of samples (PRJNA218851) and the TCGA dataset with 50 pairs.

The number of gene sets reported in all methods increases with the number of samples (Figure 1C).

All spatial tests using the between-pair or complete distribution reported results from 50 pairs of samples (TCGA dataset), with the exception of DMMT using the BP distribution. The DMMT and DT report fewer significant gene sets when using the between-pair distribution, while, for CET, the opposite occurs for all sample sizes.

In the PRJNA411984 dataset (2 pairs), only DMMT and DT using the complete distribution reported significant KEGG gene sets, 18 and 7, respectively. Grouping genes using the KEGG ontology seems to be more appropriate for the spatial tests than using GO categories.

Regarding the consistency in the results reported by each test across datasets, the BP CET reported more results in common between the datasets than the other tests, for both GO (Figure 1D) and KEGG (Figure 1E).

3.5. Comparison with the GOseq Method

The four datasets were analyzed using edgeR-GOseq pipeline. A total number of 1000 permutations has been used, and we have restricted the analysis to Biological process category in GO. The KEGG pathways analysis was done using the default values for the arguments in the package GOseq. Note that the package GOseq uses its own GO and KEGG gene set collections.

A total of 1318, 2834, 2605, and 1613 GO terms are differentially regulated in PRJNA411984 (2 pairs), PRJNA413956 (7 pairs), PRJNA218851 (18 pairs) and TCGA (50 pairs) datasets, respectively, and 415 were reported for all datasets (Figure 2A).

If we use the KEGG pathways: 27, 79, 77, and 62 gene sets are reported in PRJNA411984 (2 pairs), PRJNA413956 (7 pairs), PRJNA218851 (18 pairs), and TCGA (50 pairs) datasets, respectively. Of these, 11 gene sets are shared for all datasets (Figure 2B).

When comparing our results with those obtained by GOseq, we observe a large number of gene sets in common for GO terms when using the CET method with the BP distribution (Figure 2C). For the KEGG ontology, we also include in the comparison the results obtained with DMMT and DT using the complete distribution, because these tests were appropriate for small sample datasets. The results indicate that there is high agreement between the implemented methodologies (Figure 2D).

4. Discussion

One realization proved to be enough to generate the second component of the bivariate point process because, in the case of GO groups, only 8% of the results differ when using one or multiple realizations (Figure 1A). For KEGG, increasing the number of realizations decreased the number of gene sets declared as significant; even so, most of the gene sets reported with MR were also reported with OR. Furthermore, the biological results were consistent using only one realization. At the computational level, the use of one realization reduces the computing time. Regarding the randomization distribution, both between-pair and complete were tested, and the results indicate that the gene sets reported as significant change depending on the spatial test used and also on the criteria used to group genes (Table 1 and Table 2).

If the Cuzick and Edwards test (CET) is used, then a greater agreement is observed, because all GO terms and most of the KEGG gene sets reported with the complete distribution were also reported with the between-pair distribution. Furthermore, the number of gene sets reported as significant decreases when using the complete distribution (Figure 1C). This could be expected because we forget the original design, and the complete distribution produces a greater variability of the realizations: the same signal is evaluated against a distribution with a higher variability. The biological results were consistent, reporting gene sets highly associated with CRC (Table 1 and Table 2). The same results were observed when reducing the sample size to 18 and 7 pairs (Figure 1D,E), showing a high coincidence between the results obtained in each dataset. However, no significant gene sets were reported when using the 2-pair dataset. The power of our test is really small with such a small sample size.

For small samples (such as just two pairs), DMMT and DT seem appropriate, since the biological results were consistent, particularly when genes are grouped using the KEGG ontology (Figure 1C). For instance, significant KEGG pathways such as Rap1 signaling pathway (hsa04015), Hepatocellular carcinoma (hsa05225), Thyroid cancer (hsa05216), Bladder cancer (hsa05219), or Acute myeloid leukemia (hsa05221) were reported by DMMT and DT using the complete distribution on the PRJNA411984 dataset (see Supplementary Material). Thus, the results obtained with the proposed methodology were consistent at the biological level even with only two pairs of samples.

When comparing the results obtained in all datasets using BP-CET for GO terms, and additionally C-DT and C-DMMT for KEGG pathways, with the results obtained by GOseq (Figure 2), we observed a high degree of agreement between both methods. This indicates that, in biological terms, they are comparable. However, our approach is completely different and admits many possible generalizations. The reason for this agreement is not clear to us and could be a future line of research.

Spatial statistics is a well-developed field of research. In this paper, we have tried to show how simple procedures taken from spatial statistics can be used, with good results, to test null hypotheses in the omics data field. Many different possibilities can be explored. No independence between genes needs to be assumed. Apart from very small sample sizes such as two pairs, the results are good with seven pairs and, as expected, better with fifty pairs. The method therefore seems to perform well with small sample sizes.

More complex experimental designs with more than one covariate (categorical or continuous) could be considered, and the methodology could be adapted without great difficulty.

In our opinion, the complexity of the original data makes a valid simulation study difficult. However, a simulation study is included in the Supplementary Material. It shows the good performance of our methods. Additional comments can be found in the file.

5. Conclusions

Our method performs gene set analysis without a prior marginal (per-gene) differential expression analysis, while taking into account the dependencies between genes.
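As stated in the abstract, the per-gene analysis is replaced by a binomial test on each sample/control pair, so that the expression profile of a gene becomes a vector of p-values. The sketch below illustrates that transformation under loudly stated assumptions: a two-sided exact test, a null proportion of 0.5 (that is, comparable library sizes in both members of the pair), and hypothetical function names. The paper's actual test settings may differ.

```python
from math import exp, lgamma, log


def binomial_pmf(i, n, p):
    """Binomial probability computed in log space to avoid overflow."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1 - p))


def binomial_two_sided(x, n, p=0.5):
    """Exact two-sided p-value: total probability of the outcomes
    that are no more likely than the observed count x."""
    if n == 0:
        return 1.0  # no reads in either sample: no evidence either way
    pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def gene_profile_to_pvalues(case_counts, control_counts):
    """One p-value per pair for a single gene (assumes p = 0.5 under the null)."""
    return [binomial_two_sided(a, a + b)
            for a, b in zip(case_counts, control_counts)]


# One gene observed in two pairs, with toy counts.
print(gene_profile_to_pvalues([8, 5], [2, 5]))  # approximately [0.109, 1.0]
```

Each gene then contributes one point whose coordinates are these p-values, which is the kind of event the spatial tests operate on.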

The three tests (CET, DMMT and DT) were applied for the first time in the omics data context to evaluate gene set differential expression. The proposed methodology is able to efficiently report the biological processes associated with the pathology studied.

It is important to note that the statistical tests yield complementary biological results, so it is advisable to use all of them and to evaluate all results. An important contribution of this paper is to show how these spatial tests can deal with the well-known problem of small sample sizes while allowing for interdependence between genes in the context of gene set analysis.

Supplementary Materials

The following are available online at https://www.mdpi.com/2227-7390/9/5/521/s1: data, gene set collections, proportion test, global analysis, edgeR-GOseq pipeline, and group and test names.

Author Contributions

G.A. and F.M. proposed the original idea. The theoretical setup was developed by G.A. The biological analysis was carried out by A.L.R.-C. The software was implemented and the R package was built by G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This paper has been partially supported by the Spanish grant DPI2017-87333-R (GA) and by the Chilean ANID/FONDECYT-POSTDOCTORADO No. 3180486 (ALRC).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are thankful for previous discussions with Amelia Simó (simo@uji.es).

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Overlap between significant gene sets across the methods. (A) Venn diagram of significant GO gene sets obtained using the TCGA dataset and OR or MR with all tests. (B) Venn diagram of significant KEGG gene sets obtained using the TCGA dataset and OR or MR with all tests. (C) Number of significant GO and KEGG gene sets (using one realization) for the three tests (CET, DMMT, DT) and the between-pair and complete randomization distributions. (D) Venn diagram of significant GO gene sets for all datasets using BP and CET. (E) Venn diagram of significant KEGG gene sets for all datasets using BP and CET.

Figure 2. GO terms and KEGG pathways reported in common between datasets using GOseq and spatial tests. (A) GO terms reported by GOseq; (B) KEGG pathways reported by GOseq; (C) GO terms reported in common by GOseq and between-pair CET; (D) KEGG pathways reported in common by GOseq and between-pair CET, Complete DT, and Complete DMMT.

Table 1. Five gene sets with the lowest p-values reported by each test (CET, DMMT, and DT) using the between-pair distribution. The column headed “n” gives the number of genes in the gene set. The column “n rep” gives the number of papers reporting an association of the gene set with Colorectal or Colonic Neoplasms, obtained from the Comparative Toxicogenomics Database (GO terms) and the MEDLINE bibliographic database (KEGG ID AND “colorectal cancer”). The asterisk * indicates that the gene set has been described as related to other cancer types.

ID Gene Set	Name	n	Test	p-Value	n rep
GO:0035195	Gene silencing by miRNA	577	CET	<0.00001	4
GO:0007186	G protein-coupled receptor signaling pathway	868	CET	<0.00001	18
GO:0045944	Positive regulation of transcription by RNA polymerase II	975	CET	<0.00001	111
GO:0006357	Regulation of transcription by RNA polymerase II	917	CET	<0.00001	76
GO:0050911	Detection of chemical stimulus involved in sensory perception of smell	427	CET	<0.00001	0 *
GO:0045190	Isotype switching	17	DT	0.0450	7
hsa05200	Pathways in cancer	530	CET	<0.001	170
hsa04014	Ras signaling pathway	232	CET	<0.001	12
hsa04020	Calcium signaling pathway	193	CET	<0.001	46
hsa04151	PI3K-Akt signaling pathway	354	CET	<0.001	59
hsa04630	JAK-STAT signaling pathway	162	CET	<0.001	35
hsa05340	Primary immunodeficiency	38	DMMT	0.01	12
hsa01212	Fatty acid metabolism	57	DMMT	0.02	7
hsa00071	Fatty acid degradation	44	DMMT	0.03	13
hsa05167	Kaposi sarcoma-associated herpesvirus infection	186	DMMT	0.03	8
hsa01521	EGFR tyrosine kinase inhibitor resistance	79	DMMT	0.04	7
hsa04512	ECM-receptor interaction	88	DT	<0.001	91
hsa04071	Sphingolipid signaling pathway	119	DT	<0.001	5
hsa03410	Base excision repair	33	DT	0.02	9
hsa05033	Nicotine addiction	40	DT	0.03	8
hsa04659	Th17 cell differentiation	107	DT	0.04	7

Table 2. Top five gene sets reported by each statistical test (CET, DMMT, and DT) using the complete distribution. The column “n” gives the number of genes in the gene set. The column “n rep” gives the number of papers reporting an association of the gene set with Colorectal Neoplasms, obtained from the Comparative Toxicogenomics Database (GO terms) and the MEDLINE bibliographic database (KEGG ID AND “colorectal cancer”). The asterisk * indicates that the gene set has been described as related to other cancer types.

ID Gene Set	Name	n	Test	p-Value	n rep
GO:0050911	Detection of chemical stimulus involved in sensory perception of smell	427	CET	0.0001	0 *
GO:0035195	Gene silencing by miRNA	577	CET	0.0005	4
GO:0006396	RNA processing	544	CET	0.0036	4
GO:0018149	Peptide cross-linking	26	CET	0.0159	0 *
GO:0070268	Cornification	112	CET	0.0294	0 *
GO:0071320	Cellular response to cAMP	52	DMMT	0.0309	6
GO:1990403	Embryonic brain development	13	DMMT	0.0323	14
GO:0071392	Cellular response to estradiol stimulus	34	DMMT	0.0338	13
GO:0051965	Positive regulation of synapse assembly	61	DMMT	0.0364	3
GO:0031145	Anaphase-promoting complex-dependent catabolic process	81	DMMT	0.0368	2
GO:0071363	Cellular response to growth factor stimulus	45	DT	0.0019	14
GO:0050808	Synapse organization	38	DT	0.0019	15
GO:0045190	Isotype switching	17	DT	0.0029	7
GO:0015721	Bile acid and bile salt transport	19	DT	0.0030	2
GO:0002931	Response to ischemia	50	DT	0.0033	33
hsa04630	JAK-STAT signaling pathway	162	CET	<0.001	35
hsa04740	Olfactory transduction	448	CET	<0.001	10
hsa05206	MicroRNAs in cancer	310	CET	<0.001	31
hsa05218	Melanoma	72	CET	<0.001	39
hsa05224	Breast cancer	147	CET	<0.001	5
hsa04215	Apoptosis multiple species	32	DMMT	<0.001	5
hsa04660	T cell receptor signaling pathway	104	DMMT	<0.001	37
hsa04150	mTOR signaling pathway	153	DMMT	<0.001	32
hsa04934	Cushing syndrome	155	DMMT	<0.001	0
hsa04928	Parathyroid hormone synthesis, secretion and action	106	DMMT	<0.001	0
hsa01521	EGFR tyrosine kinase inhibitor resistance	79	DT	<0.001	7
hsa05120	Epithelial cell signaling in Helicobacter pylori infection	70	DT	<0.001	15
hsa03030	DNA replication	36	DT	<0.001	25
hsa04724	Glutamatergic synapse	114	DT	<0.001	14
hsa04012	ErbB signaling pathway	85	DT	<0.001	59

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
