Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Mixture Model
2.2. Distance Measures
2.2.1. Norms Distances
2.2.2. Other Distances
2.3. Clustering Validation Indices
2.3.1. Internal Validation Indices
2.3.2. External Validation Assessment Measures
2.4. Partitioning Algorithms for Clustering
- Initialize: randomly choose K of the n data points as the initial cluster medoids.
- Assign: assign each data point to its closest medoid according to the chosen distance.
- Refine: for each medoid m and each non-medoid data point j, swap m and j and compute the total cost with j as the new medoid. Keep the swap that gives the minimum cost.
- Repeat the Assign and Refine steps until the medoids no longer change.
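The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: it assumes a precomputed n x n distance matrix, accepts any swap that lowers the total cost instead of searching for the single best swap, and the function name `pam` and its parameters are hypothetical.

```python
import numpy as np

def pam(dist, k, seed=0, max_iter=100):
    """Naive partitioning around medoids on a precomputed n x n distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # Initialize: pick k distinct points as the starting medoids.
    medoids = rng.choice(n, size=k, replace=False)

    def total_cost(meds):
        # Assign: each point contributes its distance to the closest medoid.
        return dist[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Refine: try swapping each medoid with each non-medoid point.
        for i in range(k):
            for j in range(n):
                if j in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = j
                cost = total_cost(candidate)
                if cost < best:  # keep the swap only if it lowers the cost
                    medoids, best, improved = candidate, cost, True
        # Stop once a full pass produces no improving swap.
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, best
```

Because the function only sees the distance matrix, any of the distances compared in this paper could in principle be plugged in without changing the clustering code.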
Algorithm 1: Clustering based on joint mixture distribution
2.5. Simulation Studies
- Determine the number of rates in each sub-class using
- Generate the rate distribution in each sub-class from a mixture distribution with M components including a zero point mass and Gamma distributions. The number M is randomly chosen between 5 and 15.
- Sample the number of rates from where is the sample size for sub-class c.
- Sample for each subject from a .
- Generate the observed count .
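A rough sketch of the rate-then-count part of this procedure is given below. The formulas in the steps above are not available in this text, so several pieces are assumptions: the observed count is taken to be Poisson given the subject's rate, the Gamma shape and scale ranges and the Dirichlet component weights are made up for illustration, and the steps that determine the number of rates per sub-class are skipped. The function name `simulate_subclass` and the `zero_prob` parameter are hypothetical.

```python
import numpy as np

def simulate_subclass(n_subjects, zero_prob, rng):
    """Simulate rates and counts for one OTU in one sub-class (illustrative only)."""
    # M components: one zero point mass plus M - 1 Gamma distributions,
    # with M drawn uniformly from 5..15 inclusive.
    m = int(rng.integers(5, 16))
    shapes = rng.uniform(0.5, 5.0, size=m - 1)   # hypothetical Gamma shapes
    scales = rng.uniform(1.0, 50.0, size=m - 1)  # hypothetical Gamma scales
    # Hypothetical weights: zero_prob on the point mass, the rest split at random.
    weights = rng.dirichlet(np.ones(m - 1)) * (1.0 - zero_prob)
    probs = np.concatenate(([zero_prob], weights))

    # Draw a mixture component for each subject; component 0 is the zero mass.
    comp = rng.choice(m, size=n_subjects, p=probs)
    rates = np.zeros(n_subjects)
    nonzero = comp > 0
    rates[nonzero] = rng.gamma(shapes[comp[nonzero] - 1], scales[comp[nonzero] - 1])

    # Assumed observation model: count ~ Poisson(rate), so a zero rate gives a zero count.
    counts = rng.poisson(rates)
    return rates, counts
```

Calling `simulate_subclass(200, 0.6, np.random.default_rng(0))` would give one zero-heavy sub-class; sub-classes with different mixture settings could then be stacked to form a dataset with known labels for computing external validation measures.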
3. Results
Simulation Results
4. Real Data Implementation
5. Discussion
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
OTU | operational taxonomic unit |
BC | Bray-Curtis distance |
JS | Jensen-Shannon distance |
PAM | partitioning around medoids |
DI | Dunn index |
MA | mean accuracy |
MJI | mean Jaccard index |
PD | Parkinson’s disease |
Appendix A. Mixture Model
Appendix A.1. Individual Mixture Distribution Estimation
Appendix B. Simulation
Appendix B.1. Mixture Distribution Estimation
- Low rate part: five Gamma distributions and five models, where the first model includes all five distributions, the second includes the last four, and so on until the fifth, which includes only the last Gamma distribution;
- Medium rate part: one model with four Gamma distributions;
- Higher rate part: one model with three Gamma distributions; each value in the Gamma distributions is chosen by uniformly binning the OTUs on a log-transformed scale from the 8% to the 85% quantile;
- High count part: a point mass that combines all the counts greater than the 85% quantile.
Two-Subclass Scenarios
Distance | High ZP, MA | High ZP, MJI | Medium ZP, MA | Medium ZP, MJI | Low ZP, MA | Low ZP, MJI |
---|---|---|---|---|---|---|
-D PDF | 0.608 | 0.435 | 0.547 | 0.365 | 0.534 | 0.359 |
-D CDF | 0.591 | 0.419 | 0.547 | 0.374 | 0.534 | 0.363 |
-C CDF | 0.600 | 0.428 | 0.556 | 0.371 | 0.505 | 0.327 |
Manhattan | 0.530 | 0.357 | 0.518 | 0.340 | 0.516 | 0.331 |
Euclidean | 0.530 | 0.357 | 0.518 | 0.341 | 0.516 | 0.331 |
Bray-Curtis | 0.467 | 0.288 | 0.434 | 0.258 | 0.427 | 0.253 |
Weighted UniFrac | 0.445 | 0.260 | 0.437 | 0.258 | 0.451 | 0.270 |
Generalized UniFrac | 0.605 | 0.420 | 0.407 | 0.241 | 0.441 | 0.274 |
Manhattan_log | 0.534 | 0.360 | 0.520 | 0.334 | 0.499 | 0.317 |
Euclidean_log | 0.534 | 0.360 | 0.516 | 0.333 | 0.502 | 0.320 |
Bray-Curtis_log | 0.467 | 0.287 | 0.431 | 0.254 | 0.427 | 0.252 |
Three-Subclass Scenarios
Distance | High ZP, MA | High ZP, MJI | Medium ZP, MA | Medium ZP, MJI | Low ZP, MA | Low ZP, MJI |
---|---|---|---|---|---|---|
-D PDF | 0.456 | 0.281 | 0.386 | 0.230 | 0.373 | 0.223 |
-D CDF | 0.427 | 0.261 | 0.375 | 0.226 | 0.381 | 0.228 |
-C CDF | 0.452 | 0.277 | 0.383 | 0.226 | 0.386 | 0.228 |
Manhattan | 0.366 | 0.222 | 0.364 | 0.217 | 0.367 | 0.217 |
Euclidean | 0.375 | 0.229 | 0.364 | 0.220 | 0.384 | 0.234 |
Bray-Curtis | 0.379 | 0.211 | 0.348 | 0.193 | 0.369 | 0.207 |
Weighted UniFrac | 0.376 | 0.210 | 0.351 | 0.198 | 0.374 | 0.217 |
Generalized UniFrac | 0.470 | 0.274 | 0.346 | 0.197 | 0.374 | 0.219 |
Manhattan_log | 0.383 | 0.234 | 0.379 | 0.227 | 0.404 | 0.243 |
Euclidean_log | 0.371 | 0.225 | 0.377 | 0.224 | 0.390 | 0.228 |
Bray-Curtis_log | 0.378 | 0.212 | 0.355 | 0.199 | 0.378 | 0.215 |
Distance | Two-Subclass, High ZP | Two-Subclass, Medium ZP | Two-Subclass, Low ZP | Three-Subclass, High ZP | Three-Subclass, Medium ZP | Three-Subclass, Low ZP |
---|---|---|---|---|---|---|
-D PDF | 2.47 | 2.73 | 2.30 | 2.48 | 2.74 | 2.35 |
-D CDF | 2.26 | 2.28 | 2.28 | 2.16 | 2.43 | 2.30 |
-C CDF | 2.40 | 2.46 | 2.65 | 2.26 | 2.88 | 2.63 |
Manhattan | 2.87 | 2.73 | 2.84 | 2.48 | 2.68 | 2.63 |
Euclidean | 2.86 | 2.72 | 2.85 | 2.93 | 2.85 | 2.91 |
Bray-Curtis | 3.08 | 3.47 | 3.44 | 3.31 | 3.57 | 3.44 |
Weighted UniFrac | 3.38 | 3.39 | 3.23 | 3.35 | 3.36 | 3.29 |
Generalized UniFrac | 2.88 | 3.39 | 3.16 | 3.01 | 3.34 | 3.20 |
Manhattan_log | 2.72 | 2.95 | 3.11 | 2.82 | 2.91 | 2.98 |
Euclidean_log | 2.76 | 2.94 | 3.05 | 2.46 | 2.80 | 2.78 |
Bray-Curtis_log | 3.08 | 3.50 | 3.44 | 2.80 | 3.48 | 3.17 |
Distance | Dunn | Silhouette Index | Wemmert-Gancarski | Xie-Beni |
---|---|---|---|---|
discrete PDF | 2 | 7 | 2 | 2 |
discrete CDF | 3 | 2 | 2 | 3 |
continuous CDF | 2 | 5 | 3 | 3 |
Manhattan | 10 | 4 | 4 | 10 |
Euclidean | 3 | 3 | 3 | 3 |
Bray-Curtis | 7 | 10 | 2 | 10 |
Manhattan_log | 9 | 2 | 10 | 10 |
Euclidean_log | 5 | 2 | 5 | 5 |
Bray-Curtis_log | 9 | 9 | 2 | 10 |
OTU | Full Sample (n = 197) | Cluster 1 (n = 166) | Cluster 2 (n = 31) |
---|---|---|---|
g_Akkermansia | |||
Mean (sd) | 404.3 (956.8) | 479.7 (1025.2) | 0.9 (1.4) |
Median (Min,Max) | 1 (0,5284) | 3.5 (0,5284) | 1 (0,7) |
g_Anaerotruncus | |||
Mean (sd) | 4.5 (11.8) | 4.2 (12.5) | 6.2 (7.2) |
Median (Min,Max) | 0 (0,120) | 0 (0,120) | 3 (0,27) |
g_Bacteroides | |||
Mean (sd) | 0.6 (1.5) | 0.5 (1.4) | 1.3 (1.8) |
Median (Min,Max) | 0 (0,9) | 0 (0,9) | 0 (0,5) |
g_Anaerococcus | |||
Mean (sd) | 8.8 (40.5) | 8.8 (43.6) | 8.8 (16.3) |
Median (Min,Max) | 0 (0,352) | 0 (0,352) | 2 (0,79) |
g_Akkermansia_ | |||
Mean (sd) | 198.8 (811.5) | 0.7 (1.9) | 1259.8 (1709.5) |
Median (Min,Max) | 0 (0,6278) | 0 (0,17) | 378 (53,6278) |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Yang, D.; Xu, W. Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model. Microorganisms 2020, 8, 1612. https://doi.org/10.3390/microorganisms8101612