A Statistical Methodology for Evaluating Asymmetry after Normalization with Application to Genomic Data

Leiva, Víctor; Corzo, Jimmy; Vergara, Myrian E.; Ospina, Raydonal; Castro, Cecilia

doi:10.3390/stats7030059

by Víctor Leiva ¹,*, Jimmy Corzo ², Myrian E. Vergara ³, Raydonal Ospina ⁴,⁵ and Cecilia Castro ⁶

¹ Escuela de Ingeniería Industrial, Pontificia Universidad Católica de Valparaíso, Valparaíso 2362807, Chile

² Departamento de Estadística, Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá 111321, Colombia

³ Escuela de Ciencias Básicas y Aplicadas, Universidad de La Salle, Bogotá 110231, Colombia

⁴ Departamento de Estatística, LInCa, Universidade Federal da Bahia, Salvador 40170-110, Brazil

⁵ Departamento de Estatística, CASTLab, Universidade Federal da Pernambuco, Recife 50670-901, Brazil

⁶ Centre of Mathematics, Universidade do Minho, 4710-057 Braga, Portugal

* Author to whom correspondence should be addressed.

Stats 2024, 7(3), 967-983; https://doi.org/10.3390/stats7030059

Submission received: 2 August 2024 / Revised: 6 September 2024 / Accepted: 7 September 2024 / Published: 9 September 2024

Abstract: This study evaluates the symmetry of data distributions after normalization, focusing on various statistical tests, including a little-explored test named Rp. We apply normalization techniques, such as variance stabilizing transformations, to ribonucleic acid sequencing data with varying sample sizes to assess their effectiveness in achieving symmetric data distributions. Our findings reveal that while normalization generally induces symmetry, some samples retain asymmetric distributions, challenging the conventional assumption of post-normalization symmetry. The Rp test, in particular, shows superior performance when there are variations in sample size and data distribution, making it a preferred tool for assessing symmetry when applied to genomic data. This finding underscores the importance of validating symmetry assumptions during data normalization, especially in genomic data, as overlooked asymmetries can lead to potential inaccuracies in downstream analyses. We analyze postmortem lateral temporal lobe samples to explore normal aging and Alzheimer’s disease, highlighting the critical role of symmetry testing in the accurate interpretation of genomic data.

Keywords: differential gene expression; genomic data normalization; RNA sequencing; Rp test; statistical tests; symmetry assessment; variance stabilization

1. Introduction

The analysis of genomic data, particularly through ribonucleic acid sequencing (RNA-seq), has advanced our understanding of the genetic foundations of various diseases and conditions [1,2]. RNA-seq is renowned for its ability to quantify gene expression with high accuracy [3,4], enabling comprehensive transcriptional profiling across a wide range of biological contexts. The role of RNA-seq is critical in differential gene expression analysis, proving invaluable across diverse conditions, from non-small-cell lung cancer [5] to avian influenza in mallards [6]. However, the reliability of RNA-seq data is influenced by numerous technical and biological factors that necessitate careful preprocessing [7]. For this preprocessing, normalization plays a pivotal role in ensuring that gene expression comparisons are accurate and free from systematic biases [8]. Normalization techniques are designed to adjust for technical variability while preserving the true biological signals in the data. Recent advancements in RNA-seq normalization, such as the RUV-III method, have enhanced the accuracy of downstream analyses by effectively mitigating unwanted variations [7]. Moreover, exploring the distributional characteristics of genomic data has become increasingly important, especially as it relates to extending normality assumptions in statistical models [9]. This exploration underscores the evolving landscape of genomic data analysis and the critical need for methods with good statistical properties.

A common assumption in the normalization of genomic data is that they are symmetrically distributed around the median. This assumption underpins many statistical tests and models used in genomics and proteomics [10,11]. However, recent studies have begun to critically assess this assumption, particularly in the context of RNA-seq data, where asymmetric distributions may introduce biases that affect the accuracy of differential gene expression analyses [12,13,14,15]. Asymmetric distributions can lead to erroneous conclusions in differential expression studies by inflating or deflating the observed gene expression differences [16]. Despite the recognized importance of data symmetry, there has been limited exploration of post-normalization symmetry in RNA-seq data, highlighting a gap in the current literature [17,18,19].

To bridge this gap, the primary objective of our research is to evaluate the impact of different normalization techniques on the symmetry of data distributions, with an application to RNA-seq data derived from postmortem lateral temporal lobe samples related to Alzheimer’s disease and aging [20]. By comparing various normalization methods, we aim to determine their effectiveness in producing symmetric data distributions, thereby enhancing the precision and reliability of gene expression analyses.

To achieve this objective, we employ the Rp test, a little-explored statistical tool designed to evaluate symmetry in data distributions. While the Rp test itself is not a novel contribution, having been introduced in [21] and based on earlier work in [22], its application to RNA-seq data analysis, together with an evaluation of standardization techniques and simulation studies, constitutes a novel methodology in the field. Our methodology integrates insights from single-cell analysis and spatial transcriptomics [23,24,25], incorporating established methods such as DESeq2 and edgeR [26,27] along with innovative techniques like DiffChIPL [28,29]. We use cross-validation to rigorously assess the performance of these normalization techniques, focusing on achieving symmetric RNA-seq data distributions. We validate the effectiveness of these methods using the lawstat package of the R software [30,31,32], ensuring the reliability of the data [33]. Additionally, we conduct Monte Carlo simulations to evaluate the sensitivity and specificity of the Rp test and other tests in detecting asymmetry across various scenarios.

In summary, this study contributes to the field of genomic data processing by evaluating the impact of normalization techniques. Through the application of these techniques to RNA-seq data in the context of Alzheimer’s disease research, we aim to improve the precision of gene expression analyses and critically assess the assumption of data symmetry post-normalization.

The remainder of this article is structured as follows. Section 2 outlines our methodology for assessing symmetry in data distributions. In Section 3, we present the results of our simulation studies evaluating the robustness of the Rp test. In Section 4, the application of symmetry tests to real genomic data is discussed. Section 5 provides a comprehensive discussion and conclusions, highlighting the implications of our findings for future research.

2. Methodology

This section outlines our methodology for assessing symmetry in data distributions.

2.1. Statistical Methods for Symmetry Evaluation

To ensure rigorous analysis, our study employs various statistical methods to evaluate symmetry in data distributions. Among these methods, the Bonferroni correction is applied during symmetry evaluation to adjust for multiple testing. With an initial significance level of 0.05 (5%) and four tests conducted, the adjusted significance threshold for each individual test is 0.0125. This adjustment stabilizes the family-wise type I error rate at the nominal level, thereby controlling the probability of committing at least one type I error across all tests [34,35]. We utilize cross-validation to rigorously evaluate the performance of different normalization techniques, partitioning the data into training and test sets to ensure the generalizability of our findings.
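As a quick check of the arithmetic above, the following minimal Python sketch divides the family-wise significance level by the number of tests. The function name `bonferroni_threshold` is introduced here only for illustration and is not part of the study's code.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Four symmetry tests (CM, M, MGG, Rp) at a family-wise level of 0.05
per_test_alpha = bonferroni_threshold(0.05, 4)
print(per_test_alpha)  # 0.0125
```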

The common hypotheses for testing symmetry about the median of the distribution are stated as

$H_{0} : F (x) = 1 - F (- x), \quad \forall x \in \mathbb{R},$

against the alternative given by

$H_{1} : F (x) \neq 1 - F (- x), \quad \text{for some } x \in \mathbb{R}.$

Here, $F (x)$ represents the cumulative distribution function of the data. The test statistics are detailed as follows:

• Cabilio–Masaro (CM) test: This test employs the sample mean ( $\bar{X}$ ), median ( $\hat{θ}$ ), and standard deviation (S) [34]. The CM test statistic is computed as

$CM = \frac{\sqrt{n} (\bar{X} - \hat{θ})}{S},$

where n is the sample size. Under the null hypothesis $H_{0}$ of symmetry, the CM statistic follows a standard normal distribution. The CM test is particularly effective for large sample sizes, where the sample mean and standard deviation are estimators with good statistical properties.
• Mira (M) test for symmetry: The M test evaluates symmetry by comparing the sample mean ( $\bar{X}$ ) with the median ( $\hat{θ}$ ) [35]. The M test statistic is defined as

$M = 2 (\bar{X} - \hat{θ}) .$

This statistic amplifies any deviation from symmetry. Under the null hypothesis of symmetry, the M statistic follows a standard normal distribution in large samples. For small samples, bootstrapping is employed to estimate accurate p-values, extending the applicability of the test across various sample sizes.
• Miao–Gel–Gastwirth (MGG) test: This test, known for its robustness against outliers, replaces the standard deviation in the denominator with a quantity based on absolute deviations to mitigate the impact of extreme values [36]. The MGG test statistic is defined as

$MGG = \frac{\sqrt{n} (\bar{X} - \hat{θ})}{J},$

where the modified denominator J is calculated as

$J = \sqrt{\frac{π}{8}} \sum_{i = 1}^{n} | X_{i} - \hat{θ} | .$

This denominator J reduces the influence of outliers, making the MGG test especially suitable for datasets with extreme variations.
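To make the three definitions above concrete, the following Python sketch computes the CM, M, and MGG statistics exactly as the formulas are written in this section. It is an illustration only: the function name `symmetry_statistics` is hypothetical, the use of the sample standard deviation with an n - 1 denominator is an assumption, and the code does not reproduce the internals of the lawstat package.

```python
import math
import statistics


def symmetry_statistics(x):
    """Return the CM, M, and MGG statistics as written in the text above."""
    n = len(x)
    mean = statistics.fmean(x)
    median = statistics.median(x)
    # Sample standard deviation S (n - 1 denominator; an assumption)
    s = statistics.stdev(x)
    cm = math.sqrt(n) * (mean - median) / s
    m = 2.0 * (mean - median)
    # Denominator J as written above: sqrt(pi / 8) times the sum of
    # absolute deviations from the median
    j = math.sqrt(math.pi / 8.0) * sum(abs(xi - median) for xi in x)
    mgg = math.sqrt(n) * (mean - median) / j
    return {"CM": cm, "M": m, "MGG": mgg}


# A right-skewed toy sample: the mean (4) exceeds the median (3)
stats_skewed = symmetry_statistics([1.0, 2.0, 3.0, 4.0, 10.0])
# A perfectly symmetric toy sample: all three statistics are zero
stats_symmetric = symmetry_statistics([-2.0, -1.0, 0.0, 1.0, 2.0])
print(stats_skewed)
print(stats_symmetric)
```

All three statistics share the numerator $\bar{X} - \hat{θ}$, so they vanish together for a perfectly symmetric sample and differ only in how the mean-median gap is scaled.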

Complementing these analytical methods, we present below the Rp test introduced in [21,22]. The Rp test is specifically designed to address the complex distribution patterns often encountered in RNA sequencing data.

2.2. Rp Test in RNA-Sequencing

In our methodology for RNA-seq data analysis, the Rp test is particularly effective for datasets with asymmetric distributions, where “p” denotes the expected proportion of runs under symmetry. This test identifies and quantifies asymmetry, being especially valuable when the probability $P (X > 0)$ deviates significantly from 0.5. Here, X represents the normalized expression values for each gene, where normalization is crucial to ensure that the data accurately reflect biological differences without technical biases.

The core of the Rp test is the trimmed test statistic $R_{k}$, which measures the symmetry of the data. The statistic of the Rp test is calculated as

$R_{k} = \frac{1}{R_{n}} \sum_{j = n - k + 1}^{n} δ_{j} (R_{j} - ⌊ p R_{n} ⌋),$

where p is the expected proportion of runs under symmetry, $R_{n}$ is the number of runs, $δ_{j}$ is the sign function, and $⌊ a ⌋$ is the floor function of a, that is, its integer part. The $R_{k}$ statistic of the Rp test accounts for both the frequencies and sequences of positive and negative observations, providing a refined measure of symmetry. The statistical significance of $R_{k}$ is evaluated by calculating its p-value. By comparing the p-value with a predefined significance level ( $α$ ), researchers can decide whether or not to reject the null hypothesis of symmetry. The steps of the Rp test are summarized in Algorithm 1 and illustrated in the flowchart in Figure 1, which outlines the entire process from the initial input of normalized gene expression values to the final output of the test statistic and p-value.

Algorithm 1 Rp test for assessing symmetry in RNA-seq data.
	Input: Normalized gene expression values $x_{1}, \dots, x_{n}$
	Output: Statistic $R_{k}$ of the Rp test and its p-value
	1. Sort the absolute values $| x_{1} |, \dots, | x_{n} |$ in ascending order to obtain the sequence $| x_{(1)} |, \dots, | x_{(n)} |$
	2. Compute the anti-rank $D_{j}$ for each $| x_{(j)} |$ , where $D_{j}$ is the index of that observation in the original dataset
	3. Construct a binary sequence $S_{j}$ using the sign of $x_{D_{j}}$ , with 1 indicating non-negative values and 0 indicating negative values
	4. Initialize the run indicator as $I_{1} = 1$
	for each subsequent j do
		if $S_{j}$ differs from $S_{j - 1}$ then
			Set $I_{j} = 1$ to mark the start of a new run
		end if
	end for
	5. Calculate the partial number of runs $R_{j}$ for each observation, counting the runs up to observation j
	6. Define the trimmed statistic $R_{k}$ of the Rp test
	7. Evaluate the statistical significance of $R_{k}$ by calculating its p-value
	8. Compare the p-value with a predefined significance level ( $α$ ) to decide whether or not to reject the null hypothesis of symmetry
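Steps 1 to 6 of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the reference implementation of [21,22]: the function name `rp_statistic` is hypothetical, $δ_{j}$ is coded as +1 for non-negative and -1 for negative observations, $I_{j} = 0$ whenever no new run starts, the trimming parameter k is supplied by the user, and ties in absolute value are broken by original order. The p-value of step 7 is not computed here.

```python
import math


def rp_statistic(x, k, p):
    """Trimmed runs statistic R_k following steps 1-6 of Algorithm 1.

    Assumptions (not specified in the text): delta_j is +1 / -1,
    and k is a user-chosen trimming parameter with 1 <= k <= n.
    """
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Steps 1-2: anti-ranks D_j, the original indices ordered by |x|
    anti_ranks = sorted(range(n), key=lambda i: abs(x[i]))
    # Step 3: binary sign sequence S_j (1 = non-negative, 0 = negative)
    signs = [1 if x[d] >= 0 else 0 for d in anti_ranks]
    # Step 4: run indicators, I_1 = 1 and I_j = 1 when the sign changes
    indicators = [1] + [int(signs[j] != signs[j - 1]) for j in range(1, n)]
    # Step 5: partial number of runs R_j (cumulative sum of indicators)
    partial_runs = []
    total = 0
    for indicator in indicators:
        total += indicator
        partial_runs.append(total)
    r_n = partial_runs[-1]  # total number of runs
    threshold = math.floor(p * r_n)
    # Step 6: trimmed sum over the k largest absolute values
    # (positions j = n - k + 1, ..., n in 1-based indexing)
    delta = [1 if s == 1 else -1 for s in signs]  # assumed coding
    r_k = sum(
        delta[j] * (partial_runs[j] - threshold) for j in range(n - k, n)
    ) / r_n
    return r_k, partial_runs, signs


r_k, partial_runs, signs = rp_statistic([0.5, -1.0, 2.0, -0.2, 3.0], k=2, p=0.5)
print(signs)         # [0, 1, 0, 1, 1]
print(partial_runs)  # [1, 2, 3, 4, 4]
print(r_k)           # 1.0
```

In this toy sample the two largest absolute values are both positive, so the trimmed sum is positive, which is the direction the text associates with an excess of positive observations.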

The Rp test evaluates two primary scenarios within the framework of the alternative hypothesis $H_{1}$. First, when $P (X > 0) ≫ 0.5$, suggesting an excess of positive observations in the normalized gene expression data, $R_{k}$ is expected to yield positive values, leading to rejection of the null hypothesis $H_{0}$ in favor of positive asymmetry. Conversely, if $P (X > 0) ≪ 0.5$, indicating a predominance of negative observations, a very low value of $R_{k}$ would suggest rejecting $H_{0}$ in favor of negative asymmetry. The practical implications of the Rp test in RNA-seq data analysis are highly relevant. By analyzing the distribution of normalized gene expression data, the Rp test aids in uncovering underlying patterns and potential asymmetries, which are critical in interpreting RNA-seq studies, where data distributions can impact the interpretation of gene expression levels.

2.3. Integration of the Rp Test in the Broader Study Context

Following the detailed explanation of the statistical tests under consideration, we now contextualize their application within the broader research framework. Figure 2 illustrates the research process from data acquisition and normalization to the application of statistical tests such as CM, M, MGG, and Rp. This ensures methodological precision and rigor throughout the research process, leading to reliable outcomes and facilitating a comprehensive analysis that supports the conclusions of the study.

3. Simulation Studies for Evaluating the Robustness of the Rp Test

To rigorously assess the robustness of the Rp test in detecting asymmetry in RNA-seq data distributions, we conduct an extensive set of Monte Carlo simulations. The aim of these simulations is twofold: first, to assess the effectiveness of the Rp test in detecting asymmetry in RNA-seq data distributions; and second, to compare its performance against other established symmetry tests, including the CM, M, and MGG tests. These simulations are designed to evaluate the performance of the tests across a range of symmetric and asymmetric distributions, focusing particularly on the control of the type I error.

3.1. Simulation Setup

The simulations involve generating samples from both symmetric and asymmetric distributions: specifically, normal, lognormal, and generalized lambda (GL) distributions. The normal and lognormal distributions used in our simulations are parameterized with a mean of zero and a standard deviation of one (for the lognormal distribution, these parameters refer to the logarithmic scale). This choice is made to standardize the comparison across different distributional forms, ensuring that the scales and locations of the distributions do not introduce confounding effects when evaluating the performance of the Rp test. The normal distribution is used as a benchmark for symmetric distributions, while the lognormal distribution, characterized by its positive skewness, serves as a representative asymmetric distribution. These distributions are commonly encountered in many practical applications, making them relevant for our analysis.

The GL family is particularly helpful for assessing the effectiveness of the Rp test across various symmetric and asymmetric scenarios because of its flexibility in modeling distribution shapes that cover a wide range of skewness and kurtosis values. A GL distribution is defined by its quantile function, which is given by

$Q (p; λ_{1}, λ_{2}, λ_{3}, λ_{4}) = λ_{1} + \frac{1}{λ_{2}} (p^{λ_{3}} - {(1 - p)}^{λ_{4}}),$

where $p \in (0, 1)$ here denotes a probability level (not the parameter p of the Rp test), $λ_{1}$ is a location parameter, $λ_{2}$ is a scale parameter, and $λ_{3}, λ_{4}$ are shape parameters that control the skewness and kurtosis of the distribution. By varying these parameters, the GL distribution can mimic a wide range of shapes, including those that are symmetric, heavy-tailed, or asymmetric. This diversity of shapes is illustrated in Figure 3, which presents the density plots for the GL1 through GL12 distributions, each generated using the parameters specified in Table 1. These plots show the differences in shape, skewness, and kurtosis across the GL family.
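Because the GL family is defined through its quantile function, samples can be generated by inverse-transform sampling: draw a uniform value on (0, 1) and pass it through Q. The Python sketch below illustrates this. The function names and the parameter values are hypothetical choices for illustration only; they are not the GL1 to GL12 settings of Table 1.

```python
import random


def gl_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the GL distribution at probability level u."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def gl_sample(n, lam1, lam2, lam3, lam4, rng):
    """Draw n values by inverse-transform sampling (positive shapes assumed)."""
    return [gl_quantile(rng.random(), lam1, lam2, lam3, lam4) for _ in range(n)]


rng = random.Random(2024)
# Illustrative parameters only (not taken from Table 1):
# equal shape parameters give a distribution symmetric about lam1
symmetric_draws = gl_sample(100, 0.0, 1.0, 0.5, 0.5, rng)
# unequal shape parameters give an asymmetric distribution
asymmetric_draws = gl_sample(100, 0.0, 1.0, 0.1, 0.9, rng)
print(len(symmetric_draws), len(asymmetric_draws))
```

When $λ_{3} = λ_{4}$, the quantile function satisfies $Q(u) + Q(1 - u) = 2 λ_{1}$, which is why equal shape parameters yield a symmetric benchmark.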

The sample sizes examined are $n \in \{20, 30, 50, 100\}$, reflecting typical sample sizes encountered in RNA-seq studies. These sample sizes are chosen to evaluate the test performance across small to moderately large datasets, which are common in genomic research.

3.2. Test Implementation and Metrics

The Rp test was implemented for three different values of the parameter p, specifically $p \in \{0.9, 0.8, 0.7\}$, which controls the threshold for detecting asymmetry in the data distribution. These values allow for a comprehensive analysis of the sensitivity of the test to different levels of asymmetry in the data distribution. In addition to the Rp test, three other symmetry tests were included for comparison: the CM, M, and MGG tests. The p-values for the Rp test were calculated using a bootstrap method. This method involves generating a large number of resampled datasets under the null hypothesis of symmetry, calculating the Rp test statistic for each resampled dataset, and then comparing the observed test statistic to the distribution of bootstrapped statistics. Bootstrapping provides a more accurate assessment of the p-value by capturing the sampling variability, particularly in cases where the asymptotic normality assumption may not hold due to small sample sizes or other factors. The simulation involved generating 10,000 samples for each combination of sample size and distribution. For each sample, the Rp test and the three other tests were applied, and the proportion of rejections of the null hypothesis of symmetry (type I error or empirical power) was recorded. The significance level was set at $α = 0.05$. For each test, we evaluated its control of the type I error (under the null hypothesis of symmetry) and its empirical power ( $1 - β$ , under asymmetric alternatives, with $β$ being the probability of a type II error).
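The simulation loop described above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the resampling scheme (random sign flips of the deviations from the sample median) is one possible way to generate data under the null hypothesis of symmetry and is an assumption, the statistic used is the simple mean-minus-median difference rather than the Rp statistic, all function names are hypothetical, and the numbers of simulations and resamples are kept small for speed.

```python
import random
import statistics


def mean_minus_median(x):
    """Stand-in symmetry statistic (any statistic could be plugged in)."""
    return statistics.fmean(x) - statistics.median(x)


def symmetry_pvalue(x, statistic, n_resamples, rng):
    """Two-sided resampling p-value under symmetry about the sample median."""
    center = statistics.median(x)
    deviations = [xi - center for xi in x]
    observed = abs(statistic(x))
    exceed = 0
    for _ in range(n_resamples):
        # Under symmetry, each deviation is equally likely to be + or -
        resample = [center + d * rng.choice((-1.0, 1.0)) for d in deviations]
        if abs(statistic(resample)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)


def rejection_rate(sampler, statistic, n, n_sim, n_resamples, alpha, rng):
    """Proportion of simulated samples for which symmetry is rejected."""
    rejections = 0
    for _ in range(n_sim):
        x = sampler(n, rng)
        if symmetry_pvalue(x, statistic, n_resamples, rng) <= alpha:
            rejections += 1
    return rejections / n_sim


def normal_sampler(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def lognormal_sampler(n, rng):
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]


rng = random.Random(7)
# Symmetric case: the rejection rate estimates the type I error
type_one_error = rejection_rate(normal_sampler, mean_minus_median, 30, 100, 99, 0.05, rng)
# Asymmetric case: the rejection rate estimates the empirical power
power = rejection_rate(lognormal_sampler, mean_minus_median, 30, 100, 99, 0.05, rng)
print(type_one_error, power)
```

Under a symmetric distribution the returned proportion estimates the type I error, whereas under an asymmetric distribution the same proportion estimates the empirical power $1 - β$.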

3.3. Simulation Results

The simulation results are summarized in Table 2. The Rp test showed strong control of the type I error across all sample sizes and distributions, with rejection rates close to the nominal level of 0.05 when applied to symmetric distributions. The focus on $α$ for symmetric distributions verifies that the test controls the type I error, avoiding false positives when the data are actually symmetric. The empirical power of the Rp test varied depending on the level of asymmetry in the distribution and the value of p, with higher values of p generally leading to lower power. For asymmetric distributions, measuring $1 - β$ is crucial, as it indicates the ability of the test to correctly identify true asymmetries.

The Monte Carlo simulations indicate that the Rp test, particularly with $p = 0.9$, effectively controls the type I error across all sample sizes and distributions, with rejection rates close to the nominal level of 0.05 when applied to symmetric distributions such as the normal and GL1-GL4 distributions. For a sample size of $n = 20$, the Rp test at $p = 0.9$ showed a type I error rate that was practically zero ( $α$ close to zero) for all of the symmetric distributions, and this value was maintained even with larger sample sizes, such as $n = 100$.

The empirical power ( $1 - β$ ) of the Rp test varied depending on the asymmetry level and the value of p. In highly asymmetric distributions like GL6, the Rp test with $p = 0.9$ exhibited lower power compared to tests with $p = 0.7$ or $p = 0.8$. Nevertheless, even with $p = 0.9$, the Rp test achieved high power with larger sample sizes. For instance, with a sample size of $n = 100$, the empirical power for the GL6 distribution was 0.6107 for $p = 0.9$, 0.5890 for $p = 0.8$, and 0.5214 for $p = 0.7$. In contrast, the other symmetry tests (CM, M, and MGG) also controlled the type I error adequately but generally exhibited lower power to detect asymmetry compared to the Rp test, especially for the GL6 distribution.

The results in Table 2 emphasize the effectiveness of the Rp test in controlling the type I error while retaining empirical power. Although the test with $p = 0.9$ is slightly more conservative, leading to lower power, the Rp test remains effective in detecting asymmetries, particularly when using $p = 0.7$ or $p = 0.8$. These results suggest that the Rp test is a versatile and reliable tool for symmetry testing in RNA-seq data, consistently performing well across a wide range of distributions.

In summary, the Monte Carlo simulations confirm that the Rp test is well-suited for practical applications where detecting subtle asymmetries in genomic data is crucial. This makes the Rp test a valuable tool in the downstream analysis of RNA-seq data, where understanding the distributional properties of gene expression is essential for drawing meaningful biological inferences.

4. Application to Real Genomic Data

This section details the data sources, preprocessing procedures, and characteristics of the dataset used in our application with real genomic data. Additionally, it discusses the usage of symmetry tests to evaluate RNA-seq datasets, with an emphasis on evaluating their significance, sensitivity, and robustness.

4.1. Data Source and Preprocessing Overview

The RNA-seq data utilized in this research were obtained from [37] and encompassed postmortem lateral temporal lobe samples from 15 Alzheimer’s disease patients and 15 age-matched controls. This dataset is particularly valuable due to its public accessibility and the detailed records it provides on genetic expression variations associated with Alzheimer’s disease, allowing for a comprehensive comparison between affected and healthy tissues to uncover underlying gene expression patterns.

The preprocessing of the data began with normalization to mitigate technical variance while preserving the inherent biological variation, following established RNA-seq practices [7,8]. After this step, our dataset was refined to 18,347 genes across 30 samples by filtering out low-expression genes, specifically those with counts below 10 in at least 20 samples. Normalization was performed using the variance stabilizing transformation (VST) method as implemented in [26].
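The filtering rule stated above (drop genes with counts below 10 in at least 20 samples) can be illustrated with a short Python sketch. The function name, the dictionary-based data layout, and the toy counts are assumptions made for illustration; the thresholds are lowered in the example only because the toy data have four samples.

```python
def filter_low_expression(counts, min_count=10, min_low_samples=20):
    """Drop genes whose count is below min_count in at least min_low_samples samples."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c < min_count for c in row) < min_low_samples
    }


# Toy count matrix: gene name -> counts across four samples (hypothetical data)
toy_counts = {
    "geneA": [50, 42, 61, 38],  # well expressed in every sample: kept
    "geneB": [0, 3, 12, 1],     # below 10 in three samples: dropped
    "geneC": [9, 15, 20, 11],   # below 10 in one sample: kept
}
kept = filter_low_expression(toy_counts, min_count=10, min_low_samples=2)
print(sorted(kept))  # ['geneA', 'geneC']
```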

The VST method ensures that the variance remains approximately constant across different expression levels, thereby improving comparability across samples and reducing the impact of technical artifacts. Additionally, we used the trimmed mean of M-values (TMM) method in edgeR to correct for compositional biases and normalize the counts, making them comparable across all samples. The TMM method computes, for each gene, the log ratio of its expression level relative to a reference sample (the M-value), trims the most extreme values, and takes a weighted mean of the remaining ratios to derive a scaling factor. Furthermore, we applied the DiffChIPL method [28,29] to integrate multi-omic data, which further enhanced our analysis.
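The idea behind the TMM scaling factor can be conveyed with a deliberately simplified Python sketch. It is not the edgeR implementation: it trims only on the log ratios, uses an unweighted mean instead of the precision weights, and the function name `simplified_tmm_factor` and the toy counts are hypothetical.

```python
import math


def simplified_tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    Simplifications (assumptions): unweighted mean, trimming on M-values only.
    """
    n_s, n_r = sum(sample), sum(reference)
    # M-values: log2 ratios of library-size-scaled counts,
    # computed for genes observed in both samples
    m_values = sorted(
        math.log2((ys / n_s) / (yr / n_r))
        for ys, yr in zip(sample, reference)
        if ys > 0 and yr > 0
    )
    n_trim = int(len(m_values) * trim)  # number trimmed from each tail
    kept = m_values[n_trim:len(m_values) - n_trim]
    return 2.0 ** (sum(kept) / len(kept))


reference = [10, 20, 30, 40, 100]
# Same composition at twice the depth: no compositional bias
factor_same = simplified_tmm_factor([20, 40, 60, 80, 200], reference)
# One gene is strongly over-expressed and inflates the library size
factor_shift = simplified_tmm_factor([10, 20, 30, 40, 1000], reference)
print(factor_same, factor_shift)
```

In the second example the trimming discards the over-expressed gene, so the factor reflects only the genes whose raw counts are unchanged; multiplying the library size of 1100 by the factor of about 2/11 gives an effective library size of about 200, matching the reference.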

To validate the symmetry of the RNA-seq distributions post-normalization, we utilized the lawstat package of R [30,31]. This package provides essential tools for assessing the distributional properties of data, which is crucial for ensuring the reliability of the analysis and minimizing biases in downstream analyses.

After normalization, we assessed mean–variance independence using a dispersion plot, which displayed the variance of gene expression against the mean expression level for each gene. As shown in Figure 4, the dispersion or scatter plot revealed that the majority of the data points clustered within a range, which suggests homogeneity of variance across different levels of mean expression. The lack of a clear trend in this plot indicates that the data distributions do not show obvious signs of heteroscedasticity, which is an important consideration when evaluating the suitability of data for further statistical analysis [1].

To assess the symmetry in gene expression levels, we performed kernel density estimates on the samples. These estimates, illustrated in Figure 5, provide a smoothed visualization of the distribution of normalized expression values, aiding in the identification of any deviations from symmetry. The x-axis represents the normalized expression levels, while the y-axis indicates their relative frequency.
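A kernel density estimate of the kind described above can be sketched in a few lines of Python. The sketch assumes a Gaussian kernel and a normal-reference rule-of-thumb bandwidth; the kernel and bandwidth actually used for Figure 5 are not stated in the text, and the function name `gaussian_kde` is introduced here only for illustration.

```python
import math


def gaussian_kde(data, grid, bandwidth=None):
    """Gaussian kernel density estimate of `data` evaluated on `grid`."""
    n = len(data)
    if bandwidth is None:
        mean = sum(data) / n
        sd = math.sqrt(sum((d - mean) ** 2 for d in data) / (n - 1))
        # Normal-reference rule of thumb (an assumed default)
        bandwidth = 1.06 * sd * n ** (-0.2)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - d) / bandwidth) ** 2) for d in data)
        for g in grid
    ]


# Symmetric toy data give a density that mirrors around zero
toy_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
left, right = gaussian_kde(toy_values, [-1.5, 1.5])
print(left, right)
```

Comparing the estimated density at points equidistant from the center, as in the last lines, is a simple numerical counterpart to the visual inspection of symmetry.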

The results from both the dispersion plot and the plot of kernel density estimates suggest that the normalization process effectively stabilized the variance and maintained the overall symmetry of the data distributions, supporting the reliability of subsequent genomic analyses.

4.2. Evaluation of Symmetry of the Data Distribution

As described in Section 4.1, RNA-seq datasets were obtained from postmortem lateral temporal lobe samples of Alzheimer’s disease patients and age-matched controls. These datasets underwent normalization using the VST provided by the DESeq2 method.

Additionally, normalization methods from edgeR were employed to reduce technical variances while preserving biological variations [26,27,38]. This normalization was crucial for ensuring that the resulting data accurately reflected underlying biological differences and minimized the influence of technical artifacts.

In the subsequent analysis, we applied several symmetry tests to each dataset, including the CM, M, MGG, and Rp tests, to assess the symmetry of gene expression distributions within individual samples. Identifying notable asymmetries is critical, as such patterns could indicate disproportionate gene expression associated with Alzheimer’s disease or age-related changes. The results of these symmetry tests are summarized in Table 3. The p-values reported in this table were derived directly from real RNA-seq data by evaluating the symmetry after normalization. These reported values provide insight into the true distributional characteristics of the gene expression data within the context of Alzheimer’s disease.

In contrast, Section 3 detailed the effectiveness assessment of the Rp test through extensive Monte Carlo simulations. These simulations were designed to evaluate the performance of the Rp test across various symmetric and asymmetric distributions, focusing on its control of type I error and empirical power under different scenarios. While these simulations do not reflect the specific characteristics of the real RNA-seq data, they serve to validate the general applicability of the Rp test across different scenarios.

Our analysis of the 30 datasets indicated that only four datasets (specifically datasets #11, #15, #17, and #28) exhibited high asymmetry according to the Rp test. This result suggests that asymmetry is not a predominant characteristic in the majority of the datasets analyzed. The decision to use a significance threshold of 0.10, although less conventional than the standard of 0.05, was intentional and tailored to the context of the present study to detect subtler patterns that might otherwise go unnoticed.

In contrast, the CM, M, and MGG tests uniformly indicated symmetry across all datasets, with p-values consistently equal to 1. This raises concerns regarding the sensitivity of these tests and their potential susceptibility to type II error, where the null hypothesis of symmetry is not rejected even though it is false. The consistent p-value of 1, particularly after applying the Bonferroni correction, suggests that these tests may be overly stringent or insensitive under the specific conditions of our RNA-seq data. This behavior might result from the correction making the significance threshold more stringent, thereby reducing the probability of detecting true asymmetries, especially in datasets with subtle deviations.

The divergent results between the Rp test and the other tests illustrate the importance of employing several statistical methods when assessing symmetry in RNA-seq data. Relying solely on traditional tests may fail to detect deviations from symmetry and potentially overlook important biological signals. The detection of asymmetry by the Rp test in specific datasets suggests that factors such as variability in gene expression, technical noise, or inherent biological diversity could influence deviations from expected symmetry. This finding emphasizes the need for a thorough understanding of the specific characteristics of each dataset and the application of reliable statistical tools in the analysis of large-scale sequencing data.

In summary, our results show the importance of careful statistical evaluation in the model assumptions and in interpreting information obtained from RNA-seq data. The results suggest that asymmetries may persist even after data processing and normalization, presenting challenges to traditional analytical methods. Our results reinforce the need for comprehensive and multifaceted statistical testing in genomic data.

4.3. Robustness Assessment through Subsampling

To further evaluate the statistical properties of the symmetry tests, a detailed subsampling strategy was implemented. This strategy involved randomly selecting subsets from each RNA-seq dataset, with each subset comprising 100 data points. The selection of this subset size was carefully considered to strike a balance between computational efficiency and the need for a representative sample of the larger dataset. The primary rationale behind the subsampling strategy was to test the stability and consistency of the hypothesis tests under conditions of reduced data volume, which can resemble the conditions faced in smaller experimental studies.

In this study, a total of 10,000 subsamples were drawn without replacement from each dataset, simulating different scenarios within the same dataset in a manner similar to cross-validation. This process ensured a broad and stochastic representation of the original data distributions. Such a sampling method allows for a thorough exploration of the variance within the sample space, providing insights into the reliability of statistical outcomes in the face of data sampling variability. Each subsample was evaluated with the CM, M, MGG, and Rp tests, and the results were recorded. The key metric of interest was the rate of rejection of the symmetry hypothesis, as this rate indicated the degree to which each test was able to detect asymmetry under the subsampling conditions. The recorded rejection rates across all subsamples are presented in Table 4. Notably, these rates are not uniform, suggesting that departures from symmetry within the datasets are not merely random. The observed variability in rejection rates highlights the impact that different dataset characteristics, such as sample size and inherent variability, can have on the outcomes of the mentioned tests. This variability underscores the importance of understanding the specific context of each RNA-seq dataset when interpreting the results of symmetry tests, as it can provide critical insights into the applicability of the statistical methods used.
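The subsampling procedure can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the data are synthetic, the number of subsamples is reduced for speed, and the example p-value uses the CM statistic with the standard normal reference stated in Section 2.1 rather than the Rp test.

```python
import math
import random
import statistics


def cm_pvalue(x):
    """Two-sided p-value of the CM statistic under a standard normal reference."""
    n = len(x)
    cm = math.sqrt(n) * (statistics.fmean(x) - statistics.median(x)) / statistics.stdev(x)
    return math.erfc(abs(cm) / math.sqrt(2.0))


def subsample_rejection_rate(values, pvalue_fn, subsample_size, n_subsamples, alpha, rng):
    """Share of subsamples (drawn without replacement) that reject symmetry."""
    rejections = 0
    for _ in range(n_subsamples):
        subsample = rng.sample(values, subsample_size)  # without replacement
        if pvalue_fn(subsample) <= alpha:
            rejections += 1
    return rejections / n_subsamples


rng = random.Random(11)
# Synthetic stand-in for one dataset of normalized expression values
dataset = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
rate = subsample_rejection_rate(dataset, cm_pvalue, 100, 200, 0.05, rng)
print(rate)
```

Any other test can be evaluated in the same way by passing a different p-value function, which is how rejection rates for several tests can be compared on identical subsampling conditions.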

To visualize the heterogeneity of the rejection rates, Figure 6 shows a bar graph comparing the rejection percentages of each test applied to the RNA-seq datasets. This figure illustrates how the outcomes of symmetry testing vary across tests and datasets. A more granular analysis follows in Section 4.4 and considers the practical consequences that this variation may have for experimental studies.

4.4. Analysis of Symmetry Rejection across RNA-Seq Datasets

A detailed examination of the RNA-seq datasets reveals that all of the symmetry tests applied challenge the symmetry hypothesis, albeit to varying degrees. The Rp test, in particular, consistently shows a higher tendency to reject this hypothesis. Notably, in dataset #21, the Rp test rejects symmetry in 40% of the subsamples. Several other datasets, including #1, #3, #9, and #30, exhibit similarly high rejection rates of 39% (Table 4). The CM, M, and MGG tests reject less often, although their results indicate that the symmetry of dataset #9 may warrant further scrutiny. This indication implies that some data distributions might deviate from symmetry post-normalization. While initial analyses of the full datasets may have suggested symmetry, closer inspection through subsampling can uncover potential asymmetries. This underscores the critical role of sample size in statistical evaluations, particularly for determining the symmetry of distributions after normalization.

The heightened sensitivity of the Rp test for detecting asymmetries across subsamples highlights the need for careful interpretation of this detection. The consistent rejection of symmetry in specific datasets, especially in dataset #21, raises questions about possible biological factors contributing to the observed asymmetry. It also suggests the potential necessity for more advanced normalization techniques. The detailed analysis of RNA-seq data in this study exposes important nuances in interpreting asymmetries and their implications.

As emphasized in [39,40], the choice of normalization method can substantially affect the data distribution and may introduce asymmetries. Even minor asymmetries may signal critical biological processes or anomalies that lead to important changes in cellular functions, as demonstrated in [41]. Additionally, the quality and processing of RNA-seq data, as discussed in [42], can introduce variations that affect symmetry and, consequently, the interpretation of gene expression results. Therefore, as shown in [43,44], it is crucial to consider the effect size and statistical power of the tests when evaluating asymmetry and its implications for gene expression. Taken together, these points show that the normalization and analysis of RNA-seq data require meticulous approaches.

5. Discussion and Conclusions

This study introduced and evaluated the Rp test as a robust tool for assessing symmetry in genomic data, particularly in RNA-seq datasets. Through extensive Monte Carlo simulations and its application to real genomic data, we demonstrated the effectiveness of the Rp test in detecting asymmetry across a variety of distributional scenarios, including those encountered after normalization processes. Our findings highlight the importance of carefully selecting statistical methods that are sensitive to the unique characteristics of genomic data.

The conducted simulations revealed that the Rp test consistently controlled type I error while maintaining high empirical power in detecting asymmetry, even for the small sample sizes typically found in RNA-seq studies. This control across different sample sizes and distributions underscores the suitability of the Rp test in contexts where data may exhibit complex non-normal characteristics. Simulations based on generalized lambda distributions further emphasized the test’s ability to handle a wide range of asymmetry levels, making it a versatile tool in genomic research.

In our application to RNA-seq data from studies of Alzheimer’s disease, the Rp test identified high asymmetries in a subset of data, suggesting that asymmetry is a feature that may persist even after normalization. The choice of normalization method, as shown in our analysis, can affect the symmetry of gene expression distributions, potentially impacting downstream analyses such as differential expression and pathway analysis [26,27]. The detection of asymmetry in specific datasets highlights the need for further exploration of its biological implications, particularly how the asymmetry might relate to disease mechanisms or to technical artifacts introduced during data processing. Methods such as the rank-based inverse normal transform, as implemented in the RNOmni package, could be applied to address asymmetry before further analysis [45].
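To make the rank-based inverse normal transform concrete, a minimal sketch is given below. It is not the RNOmni implementation (which is an R package): the function name `rank_inverse_normal` is hypothetical, the Blom-type offset k = 3/8 is assumed as the default, and ties are broken by position rather than averaged, which is a simplification.

```python
from statistics import NormalDist


def rank_inverse_normal(values, k=0.375):
    """Map values to normal scores via their ranks.

    Each observation with rank r (1..n) is sent to
    Phi^{-1}((r - k) / (n - 2k + 1)), where Phi is the standard normal CDF.
    Assumption: no ties, or ties broken by original position.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    normal = NormalDist()
    return [normal.inv_cdf((r - k) / (n - 2 * k + 1)) for r in ranks]


if __name__ == "__main__":
    skewed = [1.0, 2.0, 4.0, 8.0, 100.0]  # hypothetical right-skewed values
    print(rank_inverse_normal(skewed))
```

Because the transformed values depend only on ranks, the output is symmetric about zero by construction when there are no ties, regardless of how skewed the input is. The price is that the original spacing between observations is discarded.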

A subsampling strategy, which involved the generation of 10,000 subsamples of 100 data points from each RNA-seq dataset, further validated the effectiveness of the Rp test. Unlike other tests, which showed variability in detecting asymmetry across different subsamples, the Rp test consistently identified deviations from symmetry, even with reduced data volumes. This identification is crucial for studies involving small samples or subsets of larger samples, as it ensures that key asymmetries are not overlooked.

Our findings have implications for genomic research, extending beyond Alzheimer’s disease to fields such as cancer genomics and environmental genomics, where data heterogeneity and sample size variability are common. The demonstrated sensitivity of the Rp test to asymmetry across different contexts reinforces the need for careful test selection based on the specific characteristics of the data being analyzed. This is particularly relevant in studies where accurate detection of gene expression patterns can influence critical decisions, such as in the development of targeted therapies [43,44].

The methodology presented in this study could be extended to emerging technologies such as single-cell RNA sequencing, which allows gene expression profiles to be analyzed at the level of individual cells. Unlike bulk RNA-seq, which measures gene expression across populations of cells, single-cell RNA sequencing captures the heterogeneity within cell populations, providing a more detailed understanding of cellular processes [46,47]. Applying the Rp test in this context could offer valuable insights into the symmetry or asymmetry of gene expression distributions at the single-cell level and might reveal biological mechanisms that are obscured in bulk analyses.

While our study lays a solid foundation for utilizing the Rp test in symmetry assessment, certain limitations remain. Although the simulation studies have enhanced our understanding and have offered a more controlled evaluation of the Rp test under various asymmetry conditions, our focus on specific RNA-seq datasets and subsamples that were limited to 100 data points may constrain the broader applicability of our findings. Future research should extend these investigations to a wider range of datasets that vary in size, conditions, and transformation techniques to further validate and refine our conclusions. Additionally, expanding the scope of simulations to encompass more diverse scenarios could deepen insights into how asymmetry impacts downstream analyses. These additional simulations could include differential expression studies [39,40]. For the asymmetries identified in genomic data distributions, methods of quantile regression [48] might be explored. Researchers might also use other types of asymmetric distributions for the studied tests [49]. Moreover, exploration using machine learning techniques [50] in the analysis of genomic data is a promising avenue for research.

In conclusion, our research highlights the critical role of effective statistical tools like the Rp test in gene expression data analysis. Ensuring that these tools are well-aligned with the characteristics of genomic data is crucial for deriving reliable insights in genomic research. Our work contributes to a more nuanced understanding of data normalization and statistical analysis in genomics and serves as a valuable resource for researchers across diverse fields, including agriculture, environmental science, and medical genomics.

Author Contributions

Conceptualization, V.L., J.C., M.E.V., R.O. and C.C.; data curation, J.C., M.E.V., R.O. and C.C.; formal analysis, V.L., J.C., M.E.V., R.O. and C.C.; investigation, V.L., J.C., M.E.V., R.O. and C.C.; methodology, V.L., J.C., M.E.V., R.O. and C.C.; writing—original draft, J.C., M.E.V. and R.O.; writing—review and editing, V.L. and C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Vice-rectorate for Research, Creation, and Innovation (VINCI) of the Pontificia Universidad Católica de Valparaíso (PUCV), Chile, under grants VINCI 039.470/2024 (regular research), VINCI 039.493/2024 (interdisciplinary associative research), VINCI 039.309/2024 (PUCV centenary), and FONDECYT 1200525 (V.L.) from the National Agency for Research and Development (ANID) of the Chilean government under the Ministry of Science, Technology, Knowledge, and Innovation. This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), No. 303192/2022-4 and Fundação de Amparo a Ciência e Tecnologia do Estado da Bahia (FAPESB), No. APP0021/20223 (R.O.). This work also was part of HERMES 51031 (J.C.). The research was in addition funded by Portuguese funds through the CMAT—Research Centre of Mathematics of University of Minho, Portugal, within projects UIDB/00013/2020 (https://doi.org/10.54499/UIDB/00013/2020) and UIDP/00013/2020 (https://doi.org/10.54499/UIDP/00013/2020) (C.C.).

Data Availability Statement

The data and code used in this study are available upon request from the authors.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable comments and suggestions, which helped us to improve the quality of this article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

References

1. García-Sancho, M.; Lowe, J. A History of Genomics across Species, Communities and Projects; Springer: New York, NY, USA, 2023.
2. Deng, D.; Chowdhury, M.H. Quantile regression approach for analyzing similarity of gene expressions under multiple biological conditions. Stats 2022, 5, 583–605.
3. Zhang, S. A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC Bioinform. 2007, 8, 230.
4. Huang, J.; Yang, J.; Gu, Z.; Zhu, W.; Wu, S. A constrained generalized functional linear model for multi-loci genetic mapping. Stats 2021, 4, 550–577.
5. Hiremath, N.B.; Dayananda, P. Differential gene expression analysis of non-small cell lung cancer samples to classify candidate genes. Eng. Technol. Appl. Sci. Res. 2023, 13, 10571–10577.
6. Dolinski, A.C.; Homola, J.J.; Jankowski, M.D.; Robinson, J.D.; Owen, J.C. Differential gene expression reveals host factors for viral shedding variation in mallards (Anas platyrhynchos) infected with low-pathogenic avian influenza virus. J. Gen. Virol. 2022, 103, 001724.
7. Fletcher, M. Improved RNA-seq normalization. Nat. Genet. 2022, 54, 1584.
8. Corchete, L.A.; Rojas, E.A.; Alonso-López, D.; De Las Rivas, J.; Gutiérrez, N.C.; Burguillo, F.J. Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Sci. Rep. 2020, 10, 19737.
9. Concha-Aracena, M.S.; Barrios-Blanco, L.; Elal-Olivero, D.; da Silva, P.H.F.; Nascimento, D.C.D. Extending normality: A case of unit distribution generated from the moments of the standard normal distribution. Axioms 2022, 11, 666.
10. Dubois, E.; Galindo, A.N.; Dayon, L.; Cominetti, O. Assessing normalization methods in mass spectrometry-based proteome profiling of clinical samples. Biosystems 2022, 215, 104661.
11. Ghandi, M.; Beer, M.A. Group normalization for genomic data. PLoS ONE 2012, 7, e38695.
12. Konishi, S. Normalizing and variance stabilizing transformations for intraclass correlations. Ann. Inst. Stat. Math. 1985, 37, 87–94.
13. Cortés-Ciriano, I.; Gulhan, D.C.; Lee, J.J.K.; Melloni, G.E.; Park, P.J. Computational analysis of cancer genome sequencing data. Nat. Rev. Genet. 2022, 23, 298–314.
14. Leiva, V.; Sanhueza, A.; Kelmansky, S.; Martinez, E. On the glog-normal distribution and its association with the gene expression problem. Comput. Stat. Data Anal. 2009, 53, 1613–1621.
15. Abrams, Z.B.; Johnson, T.S.; Huang, K.; Payne, P.R.; Coombes, K. A protocol to evaluate RNA sequencing normalization methods. BMC Bioinform. 2019, 20, 679.
16. Vilca, F.; Rodrigues-Motta, M.; Leiva, V. On a variance stabilizing model and its application to genomic data. J. Appl. Stat. 2013, 40, 2354–2371.
17. Tai, K.Y.; Dhaliwal, J.; Balasubramaniam, V. Leveraging Mann–Whitney U test on large-scale genetic variation data for analysing malaria genetic markers. Malar. J. 2022, 21, 79.
18. Hafemeister, C.; Satija, R. Normalization and variance stabilization of single-cell RNA-sequencing data using regularized negative binomial regression. Genome Biol. 2019, 20, 296.
19. Kelmansky, D.; Martinez, E.; Leiva, V. A new variance stabilizing transformation for gene expression data analysis. Stat. Appl. Genet. Mol. Biol. 2013, 12, 653–666.
20. Li, L.; Yu, X.; Sheng, C.; Jiang, X.; Zhang, Q.; Han, Y.; Jiang, J. A review of brain imaging biomarker genomics in Alzheimer’s disease: Implementation and perspectives. Transl. Neurodegener. 2022, 11, 42.
21. Corzo-Salamanca, J.A.; Vergara-Morales, M.E.; Babativa-Márquez, J.G. A runs test for the hypothesis of symmetry with one sided alternative. Univ. Sci. 2019, 24, 295–305.
22. Corzo, J.; Babativa, G. A modified runs test for symmetry. J. Stat. Comput. Simul. 2013, 83, 984–991.
23. Luecken, M.D.; Theis, F.J. Current best practices in single-cell RNA-seq analysis: A tutorial. Mol. Syst. Biol. 2019, 15, e8746.
24. Heumos, L.; Schaar, A.C.; Lance, C.; Litinetskaya, A.; Drost, F.; Zappia, L.; Lücken, M.D.; Strobl, D.C.; Henao, J.; Curion, F. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 2023, 24, 550–572.
25. Fan, Y.; Andrusivová, Ž.; Wu, Y.; Chai, C.; Larsson, L.; He, M.; Luo, L.; Lundeberg, J.; Wang, B. Expansion spatial transcriptomics. Nat. Methods 2023, 20, 1179–1182.
26. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
27. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
28. Chen, Y.; Chen, S.; Lei, E.P. DiffChIPL: A differential peak analysis method for high-throughput sequencing data with biological replicates based on Limma. Bioinformatics 2022, 38, 4062–4069.
29. McManus, C. Cerebral polymorphisms for lateralisation: Modelling the genetic and phenotypic architectures of multiple functional modules. Symmetry 2022, 14, 814.
30. Hui, W.; Gel, Y.R.; Gastwirth, J.L. lawstat: An R package for law, public policy and biostatistics. J. Stat. Softw. 2008, 28, 1–26.
31. Gastwirth, J.L.; Gel, Y.R.; Hui, W.W.; Lyubchich, V.; Miao, W.; Noguchi, K.; Lyubchich, M.V. Package ‘Lawstat’; R Foundation for Statistical Computing: Vienna, Austria, 2019.
32. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
33. Nayak, D.S.K.; Das, J.; Swarnkar, T. Quality control pipeline for next generation sequencing data analysis. In Proceedings of Intelligent and Cloud Computing; Springer: Singapore, 2021; pp. 215–225.
34. Cabilio, P.; Masaro, J. A simple test of symmetry about an unknown median. Can. J. Stat. 1996, 24, 349–361.
35. Mira, A. Distribution-free test for symmetry based on Bonferroni’s measure. J. Appl. Stat. 1999, 26, 959–972.
36. Miao, W.; Gel, Y.; Gastwirth, J. A new test of symmetry about an unknown median. In Random Walk, Sequential Analysis and Related Topics—A Festschrift in Honor of Yuan-Shih Chow; World Scientific: Singapore, 2006; pp. 1–19.
37. Nativio, R.; Lan, Y.; Donahue, G.; Sidoli, S.; Berson, A.; Srinivasan, A.R.; Shcherbakova, O.; Amlie-Wolf, A.; Nie, J.; Cui, X.; et al. An integrated multi-omics approach identifies epigenetic alterations associated with Alzheimer disease. Nat. Genet. 2020, 52, 1024–1035.
38. McCaw, Z.R.; Lane, J.M.; Saxena, R.; Redline, S.; Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 2020, 76, 1262–1272.
39. Modarres, R.; Gastwirth, J.L. Hybrid test for the hypothesis of symmetry. J. Appl. Stat. 1998, 25, 777–783.
40. Dillies, M.A.; Rau, A.; Aubert, J.; Hennequet-Antier, C.; Jeanmougin, M.; Servant, N.; Keime, C.; Marot, G.; Castel, D.; Estelle, J.; et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 2013, 14, 671–683.
41. The Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 2013, 368, 2059–2074.
42. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 2014, 32, 903–914.
43. Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szcześniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X.; et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016, 17, 13.
44. Yu, L.; Fernandez, S.; Brock, G. Power analysis for RNA-seq differential expression studies. BMC Bioinform. 2017, 18, 234.
45. McCaw, Z. RNOmni: Rank Normal Transformation Omnibus Test. Version 1.0.1.2. 2023. Available online: https://CRAN.R-project.org/package=RNOmni (accessed on 25 August 2024).
46. Tang, F.; Barbacioru, C.; Wang, Y.; Nordman, E.; Lee, C.; Xu, N.; Wang, X.; Bodeau, J.; Tuch, B.B.; Siddiqui, A.; et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 2009, 6, 377–382.
47. Andrews, T.S.; Kiselev, V.Y.; McCarthy, D.; Hemberg, M. Tutorial: Guidelines for the computational analysis of single-cell RNA sequencing data. Nat. Protoc. 2021, 16, 1–9.
48. Sanchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic data. Appl. Stoch. Model. Bus. Ind. 2021, 37, 53–73.
49. Marchant, C.; Leiva, V.; Cavieres, M.F.; Sanhueza, A. Air contaminant statistical distributions with application to PM10 in Santiago, Chile. Rev. Environ. Contam. Toxicol. 2013, 223, 1–31.
50. Palacios, C.A.; Reyes-Suarez, J.A.; Bearzotti, L.A.; Leiva, V.; Marchant, C. Knowledge discovery for higher education student retention based on data mining: Machine learning algorithms and case study in Chile. Entropy 2021, 23, 485.

Figure 1. Flowchart for the Rp test process.

Figure 2. Flowchart depicting the RNA-seq analysis process within the broader research framework.

Figure 3. Density plots for the GL1 through GL12 distributions with the parameters used in the simulations.

Figure 4. Scatter plot of variance versus mean expression levels.

Figure 5. Kernel density estimate of gene expression levels post-filtering.

Figure 6. Bar plots of the percentage of symmetry hypothesis rejection rates for the indicated test and RNA-seq dataset.

Table 1. The parameters of the indicated GL distributions used in the simulations.

Distribution	$λ_{1}$	$λ_{2}$	$λ_{3}$	$λ_{4}$	Type
GL1	0	0.197454	0.134915	0.134915	Symmetric
GL2	0	−1	−0.08	−0.08	Symmetric
GL3	0	−0.397912	−0.16	−0.16	Symmetric
GL4	0	−1	−0.24	−0.24	Symmetric
GL5	−0.116734	−0.351663	−0.13	−0.16	Asymmetric
GL6	0	−1	−0.1	−0.18	Asymmetric
GL7	3.586508	0.04306	0.025213	0.094029	Asymmetric
GL8	0	−1	−0.0075	−0.03	Asymmetric
GL9	0	1	1.4	0.25	Asymmetric
GL10	0	1	0.00007	0.1	Asymmetric
GL11	0	−1	−0.001	−0.13	Asymmetric
GL12	0	−1	−0.0001	−0.17	Asymmetric
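As an illustration of how samples from the distributions in Table 1 might be generated, the sketch below uses inverse-transform sampling. It assumes the Ramberg-Schmeiser form of the generalized lambda quantile function, Q(u) = λ1 + (u^λ3 − (1 − u)^λ4)/λ2. That parameterization is an assumption made here, suggested by the GL1 parameters, which match the familiar approximation to the standard normal under this form. The function names are hypothetical.

```python
import random

# Parameters copied from Table 1: (lambda1, lambda2, lambda3, lambda4).
GL1 = (0.0, 0.197454, 0.134915, 0.134915)  # listed as symmetric
GL6 = (0.0, -1.0, -0.1, -0.18)             # listed as asymmetric


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function Q(u) for 0 < u < 1 under the assumed RS form."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def gld_sample(n, params, seed=0):
    """Draw n values by applying Q to uniform(0, 1) variates."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # skip the endpoint 0, where negative exponents fail
            draws.append(gld_quantile(u, *params))
    return draws


def quantile_asymmetry(params, u=0.1):
    """Q(u) + Q(1 - u) - 2 Q(0.5); zero when the distribution is symmetric."""
    return (gld_quantile(u, *params) + gld_quantile(1.0 - u, *params)
            - 2.0 * gld_quantile(0.5, *params))


if __name__ == "__main__":
    print("GL1 asymmetry:", quantile_asymmetry(GL1))
    print("GL6 asymmetry:", quantile_asymmetry(GL6))
    print("first GL6 draws:", gld_sample(5, GL6))
```

Under this form, setting λ3 = λ4 makes Q(u) + Q(1 − u) constant, which is why GL1 through GL4 are symmetric while GL5 through GL12 are not.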

Table 2. Rejection rates for type I error ($α$) and empirical power ($1 - β$) for the indicated distributions and sample sizes (n).

n	Empirical	Distribution	$R_{0.9}$	$R_{0.8}$	$R_{0.7}$	MGG	CM	M
20	$α$	normal	0	0.0632	0.0507	0.0338	0.0265	0.0336
		GL1	0	0.0636	0.0512	0.0322	0.0273	0.0355
		GL2	0	0.0473	0.0357	0.0417	0.0243	0.0337
		GL3	0	0.0485	0.0389	0.0574	0.0239	0.0409
		GL4	0	0.0425	0.0334	0.0743	0.0258	0.0392
	$1 - β$	GL5	0	0.0724	0.0571	0.0614	0.0266	0.0478
		GL6	0	0.1646	0.1282	0.1291	0.0564	0.0979
		GL7	0	0.2759	0.2195	0.1197	0.0827	0.1315
		GL8	0	0.3538	0.2804	0.2100	0.1284	0.1879
		GL9	0	0.5512	0.4405	0.1654	0.1617	0.1691
		GL10	0	0.7404	0.6089	0.4286	0.3212	0.3849
		GL11	0	0.8366	0.7177	0.6353	0.4479	0.4859
		GL12	0	0.8567	0.7418	0.6771	0.4714	0.5026
		lognormal	0	0.8525	0.7395	0.7403	0.4864	0.4982
30	$α$	normal	0.1375	0.0535	0.5500	0.0392	0.0353	0.3800
		GL1	0.1339	0.0522	0.0573	0.0346	0.0302	0.0357
		GL2	0.1133	0.0403	0.0414	0.0495	0.0296	0.3600
		GL3	0.1097	0.0375	0.0378	0.0635	0.0325	0.0397
		GL4	0.1133	0.3800	0.0354	0.8200	0.0311	0.0418
	$1 - β$	GL5	0.1917	0.0705	0.7300	0.0747	0.0403	0.0496
		GL6	0.4198	0.2093	0.1942	0.1869	0.0999	0.1341
		GL7	0.6306	0.3952	0.3671	0.2009	0.1619	0.2070
		GL8	0.7589	0.5124	0.4737	0.3428	0.246	0.3074
		GL9	0.847	0.682	0.6069	0.232	0.2431	0.2211
		GL10	0.9773	0.8874	0.8158	0.6024	0.5274	0.5654
		GL11	0.9929	0.9471	0.8956	0.8165	0.6912	0.7087
		GL12	0.9956	0.956	0.9157	0.8473	0.7308	0.7311
		lognormal	0.9953	0.958	0.9182	0.8951	0.7543	0.7396
50	$α$	normal	0.0758	0.0651	0.0586	0.0418	0.0402	0.0408
		GL1	0.0699	0.0591	0.0509	0.0411	0.0398	0.0379
		GL2	0.5600	0.0447	0.0359	0.0508	0.0314	0.0389
		GL3	0.0547	0.4200	0.0369	0.0699	0.0327	0.0397
		GL4	0.0567	0.0399	0.0325	0.0908	0.0361	0.0418
	$1 - β$	GL5	0.1263	0.1039	0.0858	0.0927	0.0486	0.0618
		GL6	0.3778	0.3444	0.2819	0.2959	0.1857	0.2245
		GL7	0.6718	0.6146	0.5172	0.3286	0.2883	0.3395
		GL8	0.8080	0.7644	0.6690	0.5349	0.4495	0.5132
		GL9	0.9320	0.8344	0.7139	0.3512	0.3740	0.3190
		GL10	0.9983	0.9813	0.9213	0.824	0.7859	0.8159
		GL11	0.9997	0.9940	0.9655	0.9564	0.9206	0.9323
		GL12	0.9999	0.9965	0.9695	0.9707	0.9400	0.9438
		lognormal	0.9998	0.9968	0.9784	0.9840	0.9560	0.9479
100	$α$	normal	0.0669	0.0634	0.0586	0.0483	0.0487	0.0423
		GL1	0.0659	0.0566	0.0529	0.0438	0.0437	0.0374
		GL2	0.0487	0.0429	0.0398	0.0548	0.0377	0.0423
		GL3	0.0485	0.0375	0.0324	0.0715	0.0409	0.0436
		GL4	0.0481	0.0377	0.0322	0.0961	0.0406	0.0452
	$1 - β$	GL5	0.1591	0.1404	0.1201	0.1359	0.8500	0.0933
		GL6	0.6107	0.5890	0.5214	0.5250	0.4111	0.4451
		GL7	0.9141	0.8776	0.8014	0.6032	0.5711	0.6200
		GL8	0.9774	0.9621	0.9145	0.8348	0.7908	0.8286
		GL9	0.9892	0.9517	0.8759	0.5529	0.581	0.5179
		GL10	1	0.9994	0.9929	0.9786	0.9753	0.9807
		GL11	1	1	0.999	0.9988	0.9977	0.9986
		GL12	1	1	0.9994	0.9996	0.9994	0.9993
		lognormal	1	1	0.9996	0.9999	0.9999	0.9996

Table 3. Test statistics and p-values for RNA-seq datasets.

Dataset	Rp Test	p-Value	MGG Test	p-Value	CM Test	p-Value	M Test	p-Value
1	−2.6950	0.996	−16.8080	1	−17.0262	1	−17.2877	1
2	−0.1227	0.549	−18.1936	1	−18.5582	1	−18.4214	1
3	0.7926	0.214	−18.3478	1	−18.5783	1	−18.7437	1
4	−0.5547	0.710	−16.8183	1	−17.0739	1	−17.2347	1
5	0.3419	0.366	−16.9436	1	−17.2710	1	−17.0738	1
6	−3.2899	0.999	−17.0684	1	−17.3118	1	−16.8988	1
7	0.0887	0.465	−15.1808	1	−15.3896	1	−15.4853	1
8	−3.0902	0.999	−17.9734	1	−18.1465	1	−18.5181	1
9	−1.7155	0.957	−14.6008	1	−14.8674	1	−15.0311	1
10	−1.8128	0.965	−14.3613	1	−14.5475	1	−14.5453	1
11	1.3358	0.091	−17.8693	1	−18.1581	1	−18.3331	1
12	1.1736	0.120	−17.6595	1	−17.9535	1	−17.8061	1
13	0.3982	0.345	−18.2651	1	−18.5777	1	−18.5486	1
14	−2.9219	0.998	−12.2041	1	−12.3863	1	−12.2216	1
15	2.5531	0.005	−15.8770	1	−16.1026	1	−15.9890	1
16	0.9928	0.160	−19.3036	1	−19.6125	1	−20.0872	1
17	3.0181	0.001	−13.8194	1	−14.0199	1	−13.9879	1
18	−0.4531	0.675	−16.7694	1	−17.1088	1	−17.2301	1
19	−1.8638	0.969	−17.6307	1	−17.9916	1	−17.7343	1
20	0.4921	0.311	−15.6244	1	−15.8887	1	−15.6862	1
21	−2.1837	0.986	−17.8140	1	−18.2165	1	−17.6224	1
22	−2.5579	0.995	−22.1083	1	−22.4859	1	−23.0092	1
23	−5.3044	1	−21.4939	1	−21.9260	1	−22.4050	1
24	−4.9969	1	−22.0210	1	−22.4440	1	−22.4021	1
25	−4.2857	1	−17.9044	1	−18.2910	1	−18.0124	1
26	1.2033	0.114	−22.9884	1	−23.4014	1	−23.0682	1
27	−3.8619	1	−15.8874	1	−16.1360	1	−16.1700	1
28	2.9055	0.002	−18.3866	1	−18.5923	1	−18.8854	1
29	−1.5599	0.941	−19.6007	1	−19.8957	1	−19.7064	1
30	−0.6445	0.740	−16.5039	1	−16.7500	1	−16.9535	1
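The Rp p-values in Table 3 appear consistent with an upper-tail probability of a standard normal reference distribution. This reading is an inference from the tabulated numbers and is an assumption of the sketch below, as is the function name `upper_tail_pvalue`. The sketch recomputes three of the tabulated entries from their test statistics.

```python
from statistics import NormalDist


def upper_tail_pvalue(stat):
    """P(Z >= stat) for a standard normal Z (assumed reference distribution)."""
    return 1.0 - NormalDist().cdf(stat)


if __name__ == "__main__":
    # (dataset, Rp statistic, p-value reported in Table 3)
    rows = [(3, 0.7926, 0.214), (15, 2.5531, 0.005), (1, -2.6950, 0.996)]
    for dataset, stat, reported in rows:
        computed = upper_tail_pvalue(stat)
        print(f"dataset {dataset}: computed {computed:.3f}, reported {reported}")
```

Under the same one-sided reading, the large negative MGG, CM, and M statistics in Table 3 correspond to p-values that round to 1, as tabulated.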

Table 4. Proportion of subsamples in which the symmetry hypothesis H₀ was rejected, for the indicated RNA-seq dataset and test.

RNA-Seq Dataset	Rp Test	MGG Test	M Test	CM Test
1	0.39	0.09	0.11	0.05
2	0.38	0.09	0.11	0.05
3	0.39	0.09	0.11	0.05
4	0.38	0.09	0.11	0.05
5	0.38	0.09	0.11	0.05
6	0.38	0.10	0.12	0.05
7	0.38	0.09	0.11	0.05
8	0.37	0.09	0.12	0.05
9	0.39	0.10	0.12	0.05
10	0.38	0.09	0.11	0.05
11	0.39	0.09	0.11	0.05
12	0.38	0.09	0.11	0.05
13	0.39	0.09	0.12	0.05
14	0.38	0.09	0.11	0.05
15	0.39	0.09	0.11	0.05
16	0.38	0.08	0.11	0.04
17	0.38	0.09	0.11	0.05
18	0.38	0.09	0.11	0.04
19	0.39	0.09	0.11	0.05
20	0.38	0.09	0.12	0.05
21	0.40	0.09	0.11	0.05
22	0.38	0.09	0.11	0.05
23	0.38	0.09	0.11	0.05
24	0.38	0.09	0.11	0.05
25	0.38	0.09	0.11	0.05
26	0.38	0.09	0.11	0.05
27	0.39	0.09	0.11	0.05
28	0.38	0.09	0.12	0.05
29	0.38	0.09	0.12	0.05
30	0.39	0.09	0.11	0.05
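
Because each entry in Table 4 is estimated from a finite number of subsamples, it carries Monte Carlo error. A rough gauge is the binomial standard error, assuming that the $B = 10{,}000$ subsamples of a dataset behave as independent Bernoulli trials with rejection probability $p$. The symbols $B$, $p$, and $\hat{p}$ are notation introduced here, and the independence model is an assumption.

$$
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{B}}, \qquad
\hat{p} = 0.39:\ \sqrt{\frac{0.39 \times 0.61}{10{,}000}} \approx 0.0049, \qquad
\hat{p} = 0.05:\ \sqrt{\frac{0.05 \times 0.95}{10{,}000}} \approx 0.0022 .
$$

Under that assumption, differences of about 0.01 between datasets for the same test are of the same order as the sampling error of the estimates, whereas the gap between the Rp test and the other tests is far larger.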

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
