Statistical Analysis of Microbiome Data: From Methods to Application

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Technologies and Resources for Genetics".

Deadline for manuscript submissions: closed (25 October 2023) | Viewed by 23324

Special Issue Editor


Dr. Yi-Juan Hu
Guest Editor
Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
Interests: microbiome analysis; statistical genetics

Special Issue Information

Dear Colleagues,

We would like to invite you to participate in this Special Issue on “Statistical Analysis of Microbiome Data: From Methods to Application”.

Microbiome studies with high-throughput sequencing data have proliferated in the last decade and have greatly outpaced the development of proper analytical methods that can best exploit these rich data. There is an urgent need for statistical research directed toward microbiome studies for discovering microbial biomarkers for disease diagnosis and prognosis, understanding the inter-relationships among microbiota, understanding the crosstalk between the microbiota and human physiological (e.g., digestive, immune, metabolic) systems, and so on. Meanwhile, the complexity of microbiome data (e.g., high dimensionality, sparsity, overdispersion, compositionality, phylogenetic structure, experimental bias) presents profound challenges to the statistical community.

The purpose of this Special Issue is to host new statistical methodologies, applications, and review papers on how to analyze complex microbiome data. Simulation tools, benchmarking studies, and perspectives will also be considered for publication.

Dr. Yi-Juan Hu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • microbiome data analysis
  • statistical models
  • compositional analysis
  • community-level test, test of individual taxa, differentially abundant taxa
  • clustered data
  • presence-absence analysis, rarefaction
  • mediation analysis
  • multi-omics analysis

Published Papers (11 papers)


Research

23 pages, 723 KiB  
Article
Robust Differential Abundance Analysis of Microbiome Sequencing Data
by Guanxun Li, Lu Yang, Jun Chen and Xianyang Zhang
Genes 2023, 14(11), 2000; https://doi.org/10.3390/genes14112000 - 26 Oct 2023
Viewed by 1261
Abstract
It is well known that microbiome data are riddled with outliers and have heavy distribution tails, but the impact of outliers and heavy-tailedness has yet to be examined systematically. This paper investigates the impact of outliers and heavy-tailedness on differential abundance analysis (DAA) using the linear models for differential abundance analysis (LinDA) method and proposes effective strategies to mitigate their influence. The presence of outliers and heavy-tailedness can significantly decrease the power of LinDA. We investigate various techniques to address outliers and heavy-tailedness, including generalizing LinDA into a more flexible framework that allows for the use of robust regression and winsorizing the data before applying LinDA. Our extensive numerical experiments and real-data analyses demonstrate that robust Huber regression has the best overall performance in addressing outliers and heavy-tailedness.
(This article belongs to the Special Issue Statistical Analysis of Microbiome Data: From Methods to Application)
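
As a rough illustration of two ideas named in the abstract above, the sketch below shows one-sided winsorization and the Huber loss in plain Python. It is not the LinDA implementation: the function names, the 97th-percentile default, the nearest-rank percentile rule, and the tuning constant 1.345 are all assumptions made for this example.

```python
def winsorize_upper(values, upper_percent=97):
    """Cap every value above the empirical upper_percent-th percentile.

    Hypothetical helper; uses the nearest-rank percentile definition.
    """
    if not values:
        return []
    ordered = sorted(values)
    # Nearest-rank index, computed with integers to avoid float rounding.
    rank = max(1, (upper_percent * len(ordered) + 99) // 100)
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


def huber_loss(residual, delta=1.345):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * size * size
    return delta * (size - 0.5 * delta)


# One extreme count among otherwise small counts (made-up data).
counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 1000]
print(winsorize_upper(counts, upper_percent=90))
print(huber_loss(0.5), huber_loss(10.0))
```

Winsorizing changes the data before any model is fitted, whereas the Huber loss changes how large residuals are penalized during fitting; the abstract reports that the latter route performed best overall in the authors' experiments.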

13 pages, 1498 KiB  
Article
Impact of Data and Study Characteristics on Microbiome Volatility Estimates
by Daniel J. Park and Anna M. Plantinga
Genes 2023, 14(1), 218; https://doi.org/10.3390/genes14010218 - 14 Jan 2023
Cited by 2 | Viewed by 1715
Abstract
The human microbiome is a dynamic community of bacteria, viruses, fungi, and other microorganisms. Both the composition of the microbiome (the microbes that are present and their relative abundances) and the temporal variability of the microbiome (the magnitude of changes in its composition across time, called volatility) have been associated with human health. However, the effect of unbalanced sampling intervals and differential read depth on the estimates of microbiome volatility has not been thoroughly assessed. Using four publicly available gut and vaginal microbiome time series, we subsampled the datasets to several sampling intervals and read depths and then compared additive, multiplicative, centered log ratio (CLR)-based, qualitative, and distance-based measures of microbiome volatility between the conditions. We find that longer sampling intervals are associated with larger quantitative measures of change (particularly for common taxa), but not with qualitative measures of change or distance-based volatility quantification. A lower sequencing read depth is associated with smaller multiplicative, CLR-based, and qualitative measures of change (particularly for less common taxa). Strategic subsampling may serve as a useful sensitivity analysis in unbalanced longitudinal studies investigating clinical associations with microbiome volatility.
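
To make the CLR-based notion of change mentioned above more concrete, here is a minimal sketch of the centered log-ratio transform and one possible change measure between two time points. The abstract does not give the authors' exact definitions, so the pseudocount of 0.5, the function names, and the use of Euclidean distance between CLR vectors (the Aitchison distance) are assumptions for illustration only.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform: log of each part minus the mean log.

    The pseudocount is an assumed device for handling zero counts; with
    pseudocount=0 any zero count raises ValueError from math.log.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def clr_change(before, after, pseudocount=0.5):
    """Euclidean distance between the CLR vectors of two time points."""
    a = clr(before, pseudocount)
    b = clr(after, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up samples from the same subject at consecutive visits.
visit_1 = [120, 30, 0, 50]
visit_2 = [90, 45, 5, 60]
print(clr_change(visit_1, visit_2))
```

Without a pseudocount the CLR transform is unaffected by rescaling a sample, so doubling every count leaves the distance at zero; adding a pseudocount breaks that invariance slightly, which is one way read depth can leak into such measures.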

10 pages, 297 KiB  
Article
What Can We Learn about the Bias of Microbiome Studies from Analyzing Data from Mock Communities?
by Mo Li, Robert E. Tyx, Angel J. Rivera, Ni Zhao and Glen A. Satten
Genes 2022, 13(10), 1758; https://doi.org/10.3390/genes13101758 - 28 Sep 2022
Viewed by 1517
Abstract
It is known that data from both 16S and shotgun metagenomics studies are subject to biases that cause the observed relative abundances of taxa to differ from their true values. Model community analyses, in which the relative abundances of all taxa in the sample are known by construction, seem to offer the hope that these biases can be measured. However, it is unclear whether the bias we measure in a mock community analysis is the same as the bias we measure in a sample in which taxa are spiked in at known relative abundance, or whether the biases we measure in spike-in samples are the same as the bias we would measure in a real (e.g., biological) sample. Here, we consider these questions in the context of 16S rRNA measurements on three sets of samples: the commercially available Zymo cells model community; the Zymo model community mixed with Swedish Snus, a smokeless tobacco product that is virtually bacteria-free; and a set of commercially available smokeless tobacco products. Each set of samples was subject to four different extraction protocols. The goal of our analysis is to determine whether the patterns of bias observed in each set of samples are the same, i.e., can we learn about the bias in the commercially available smokeless tobacco products by studying the Zymo cells model community?

18 pages, 3539 KiB
Article
MicrobiomeGWAS: A Tool for Identifying Host Genetic Variants Associated with Microbiome Composition
by Xing Hua, Lei Song, Guoqin Yu, Emily Vogtmann, James J. Goedert, Christian C. Abnet, Maria Teresa Landi and Jianxin Shi
Genes 2022, 13(7), 1224; https://doi.org/10.3390/genes13071224 - 9 Jul 2022
Cited by 20 | Viewed by 2960
Abstract
The microbiome is the collection of all microbial genes and can be investigated by sequencing highly variable regions of 16S ribosomal RNA (rRNA) genes. Evidence suggests that environmental factors and host genetics may interact to impact human microbiome composition. Identifying host genetic variants associated with human microbiome composition not only provides clues for characterizing microbiome variation but also helps to elucidate biological mechanisms of genetic associations, prioritize genetic variants, and improve genetic risk prediction. Since a microbiota functions as a community, it is best characterized by β diversity; that is, a pairwise distance matrix. We develop a statistical framework and a computationally efficient software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome β diversity, with or without interaction with an environmental factor. We show that the score statistics have positive skewness and kurtosis due to the dependent nature of the pairwise data, which makes p-value approximations based on asymptotic distributions unacceptably liberal. By correcting for skewness and kurtosis, we develop accurate p-value approximations, whose accuracy was verified by extensive simulations. We exemplify our methods by analyzing a set of 147 genotyped subjects with 16S rRNA microbiome profiles from non-malignant lung tissues. Correcting for skewness and kurtosis eliminated the dramatic deviation in the quantile–quantile plots. We provided preliminary evidence that six established lung cancer risk SNPs were collectively associated with microbiome composition for both unweighted (p = 0.0032) and weighted (p = 0.011) UniFrac distance matrices. In summary, our methods will facilitate analyzing large-scale genome-wide association studies of the human microbiome.
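
The phrase "a pairwise distance matrix" can be illustrated with a small sketch. The abstract uses UniFrac distances, which require a phylogenetic tree; the example below substitutes the simpler Bray-Curtis dissimilarity so that it stays self-contained. The function names and the toy counts are hypothetical and are not part of the microbiomeGWAS package.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # convention assumed here for two empty samples
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(samples):
    """Symmetric matrix of pairwise dissimilarities with a zero diagonal."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = bray_curtis(samples[i], samples[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Three made-up samples measured over the same four taxa.
samples = [
    [10, 0, 5, 1],
    [8, 2, 6, 0],
    [0, 12, 1, 9],
]
for row in distance_matrix(samples):
    print([round(d, 3) for d in row])
```

Because each sample appears in many pairs, the entries of such a matrix are not independent, which is the dependence the abstract cites as the source of skewness and kurtosis in the score statistics.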

16 pages, 1375 KiB  
Article
A Distribution-Free Model for Longitudinal Metagenomic Count Data
by Dan Luo, Wenwei Liu, Tian Chen and Lingling An
Genes 2022, 13(7), 1183; https://doi.org/10.3390/genes13071183 - 1 Jul 2022
Cited by 1 | Viewed by 1639
Abstract
Longitudinal metagenomics has been widely studied in the past decade to provide valuable insight for understanding microbial dynamics. The correlation within each subject can be observed across repeated measurements. However, previous methods that assume an independence correlation structure may suffer from incorrect inferences. In addition, methods that do account for intra-sample correlation may not be applicable to count data. We proposed a distribution-free approach, namely CorrZIDF, which extends the current method to model correlated zero-inflated metagenomic count data, offering a powerful and accurate solution for detecting significant features. This method can handle different working correlation structures without specifying each marginal distribution of the count data. Through simulation studies, we have shown the robustness of CorrZIDF when selecting a working correlation structure for repeated measures studies to enhance the efficiency of estimation. We also compared four methods using two real datasets, and the newly proposed method identified more unique features that were reported previously in the relevant research.

17 pages, 1272 KiB  
Article
Principal Amalgamation Analysis for Microbiome Data
by Yan Li, Gen Li and Kun Chen
Genes 2022, 13(7), 1139; https://doi.org/10.3390/genes13071139 - 24 Jun 2022
Cited by 2 | Viewed by 2071
Abstract
In recent years, microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high-dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrograms, scree plots, and ordination plots are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.

17 pages, 619 KiB  
Article
MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data
by Quran Wu, James O’Malley, Susmita Datta, Raad Z. Gharaibeh, Christian Jobin, Margaret R. Karagas, Modupe O. Coker, Anne G. Hoen, Brock C. Christensen, Juliette C. Madan and Zhigang Li
Genes 2022, 13(6), 1049; https://doi.org/10.3390/genes13061049 - 11 Jun 2022
Cited by 3 | Viewed by 2005
Abstract
Background: The human microbiome can contribute to the pathogenesis of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis methods are not adequate for analyzing the microbiome as a mediator because of the excessive number of zero-valued sequencing reads in the data and because the relative abundances have to sum to one. The two main challenges raised by the zero-inflated data structure are: (a) disentangling the mediation effect induced by the point mass at zero; and (b) identifying the observed zero-valued data points that are not zero (i.e., false zeros). Methods: We develop a novel marginal mediation analysis method under the potential-outcomes framework to address the issues. We also show that the marginal model can account for the compositional structure of microbiome data. Results: The mediation effect can be decomposed into two components that are inherent to the two-part nature of zero-inflated distributions. With probabilistic models to account for observing zeros, we also address the challenge with false zeros. A comprehensive simulation study and the application in a real microbiome study showcase our approach in comparison with existing approaches. Conclusions: When analyzing the zero-inflated microbiome composition as the mediator, the MarZIC approach performs better than standard causal mediation analysis approaches and an existing competing approach.

17 pages, 434 KiB  
Article
Extension of PERMANOVA to Testing the Mediation Effect of the Microbiome
by Ye Yue and Yi-Juan Hu
Genes 2022, 13(6), 940; https://doi.org/10.3390/genes13060940 - 25 May 2022
Cited by 6 | Viewed by 2876
Abstract
Recently, we have seen a growing volume of evidence linking the microbiome and human diseases or clinical outcomes, as well as evidence linking the microbiome and environmental exposures. Now comes the time to assess whether the microbiome mediates the effects of exposures on the outcomes, which will enable researchers to develop interventions to modulate outcomes by modifying microbiome compositions. Use of distance matrices is a popular approach to analyzing complex microbiome data that are high-dimensional, sparse, and compositional. However, the existing distance-based methods for mediation analysis of microbiome data, MedTest and MODIMA, only work well in limited scenarios. PERMANOVA is currently the most commonly used distance-based method for testing microbiome associations. Using the idea of inverse regression, here we extend PERMANOVA to test microbiome-mediation effects by including both the exposure and the outcome as covariates and basing the test on the product of their F statistics. This extension of PERMANOVA, which we call PERMANOVA-med, naturally inherits all the flexible features of PERMANOVA, e.g., allowing adjustment of confounders, accommodating continuous, binary, and multivariate exposure and outcome variables including survival outcomes, and providing an omnibus test that combines the results from analyzing multiple distance matrices. Our extensive simulations indicated that PERMANOVA-med always controlled the type I error and had compelling power over MedTest and MODIMA. Frequently, MedTest had diminished power and MODIMA had inflated type I error. Using real data on melanoma immunotherapy response, we demonstrated the wide applicability of PERMANOVA-med through 16 different mediation analyses, only 6 of which could be performed by MedTest and 4 by MODIMA.

Review

23 pages, 629 KiB  
Review
Methodological Considerations in Longitudinal Analyses of Microbiome Data: A Comprehensive Review
by Ruiqi Lyu, Yixiang Qu, Kimon Divaris and Di Wu
Genes 2024, 15(1), 51; https://doi.org/10.3390/genes15010051 - 28 Dec 2023
Cited by 1 | Viewed by 2251
Abstract
Biological processes underlying health and disease are inherently dynamic and are best understood when characterized in a time-informed manner. In this comprehensive review, we discuss challenges inherent in time-series microbiome data analyses and compare available approaches and methods to overcome them. Appropriate handling of longitudinal microbiome data can shed light on important roles, functions, patterns, and potential interactions between large numbers of microbial taxa or genes in the context of health, disease, or interventions. We present a comprehensive review and comparison of existing microbiome time-series analysis methods, for both preprocessing and downstream analyses, including differential analysis, clustering, network inference, and trait classification. We posit that the careful selection and appropriate utilization of computational tools for longitudinal microbiome analyses can help advance our understanding of the dynamic host–microbiome relationships that underlie health-maintaining homeostases, progressions to disease-promoting dysbioses, as well as phases of physiologic development like those encountered in childhood.

Other

9 pages, 337 KiB  
Brief Report
Impact of Experimental Bias on Compositional Analysis of Microbiome Data
by Yingtian Hu, Glen A. Satten and Yi-Juan Hu
Genes 2023, 14(9), 1777; https://doi.org/10.3390/genes14091777 - 8 Sep 2023
Viewed by 939
Abstract
Microbiome data are subject to experimental bias that is caused by DNA extraction and PCR amplification, among other sources, but this important feature is often ignored when developing statistical methods for analyzing microbiome data. McLaren, Willis, and Callahan (2019) proposed a model for how such biases affect the observed taxonomic profiles; this model assumes the main effects of bias without taxon–taxon interactions. Our newly developed method for testing the differential abundance of taxa, LOCOM, is the first method to account for experimental bias and is robust to the main effect biases. However, there is also evidence for taxon–taxon interactions. In this report, we formulated a model for interaction biases and used simulations based on this model to evaluate the impact of interaction biases on the performance of LOCOM as well as other available compositional analysis methods. Our simulation results indicate that LOCOM remained robust to a reasonable range of interaction biases. The other methods tended to have an inflated FDR even when there were only main effect biases. LOCOM maintained the highest sensitivity even when the other methods could not control the FDR. We thus conclude that LOCOM outperforms the other methods for compositional analysis of microbiome data considered here.

10 pages, 4118 KiB  
Technical Note
Determination of Effect Sizes for Power Analysis for Microbiome Studies Using Large Microbiome Databases
by Gibraan Rahman, Daniel McDonald, Antonio Gonzalez, Yoshiki Vázquez-Baeza, Lingjing Jiang, Climent Casals-Pascual, Daniel Hakim, Amanda Hazel Dilmore, Brent Nowinski, Shyamal Peddada and Rob Knight
Genes 2023, 14(6), 1239; https://doi.org/10.3390/genes14061239 - 9 Jun 2023
Cited by 3 | Viewed by 2805
Abstract
Herein, we present a tool called Evident that can be used for deriving effect sizes for a broad spectrum of metadata variables, such as mode of birth, antibiotics, socioeconomics, etc., to provide power calculations for a new study. Evident can be used to mine existing databases of large microbiome studies (such as the American Gut Project, FINRISK, and TEDDY) to analyze the effect sizes for planning future microbiome studies via power analysis. For each metavariable, the Evident software can flexibly compute effect sizes for many commonly used measures of microbiome analyses, including α diversity, β diversity, and log-ratio analysis. In this work, we describe why effect size and power analysis are necessary for computational microbiome analysis and show how Evident can help researchers perform these procedures. Additionally, we describe how Evident is easy for researchers to use and provide an example of efficient analyses using a dataset of thousands of samples and dozens of metadata categories.
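
For readers unfamiliar with effect sizes, the sketch below computes Cohen's d, a common standardized effect size for comparing two groups. The abstract does not say which effect-size measures Evident computes or how, so this is not Evident's API; the function name and the α-diversity values are made up for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    mean_gap = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_gap / math.sqrt(pooled_var)


# Hypothetical alpha-diversity values for two levels of a metadata variable.
vaginal_birth = [4.1, 3.8, 4.4, 4.0, 3.9]
cesarean_birth = [3.6, 3.9, 3.5, 3.7, 3.4]
print(cohens_d(vaginal_birth, cesarean_birth))
```

An effect size of this kind, estimated from a large existing database, is the input a power calculation needs in order to suggest how many samples a new study should collect.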