*3.4. Data Processing*

Raw FASTQ files were demultiplexed using the 'Integrated Microbial Next Generation Sequencing' (IMNGS) pipeline [24], which is based on the UPARSE approach [25]. A maximum of two errors was allowed in barcode sequences. Reads were trimmed to the position of the first base with a quality score <3 and then paired. The resulting sequences were size-filtered, excluding those with an assembled length <300 or >600 nucleotides. Paired reads with an expected error >3 were also removed, and the remaining reads were trimmed by 10 nucleotides at each end to avoid analyzing the regions of distorted base composition at the start of the sequences. Operational taxonomic units (OTUs) were clustered at 97% sequence similarity, keeping only those with a relative abundance >0.25% in at least one of the 412 samples (352 fecal samples and 60 cecal samples). OTU tables of all study groups are provided in Appendix A. The Rhea pipeline [26], a set of scripts for the statistical computing software R, was used for data processing. In brief, samples with fewer than 2500 read counts were excluded [26], and sequence counts were normalized to the minimum number of sequences observed across samples. Microbial diversity between groups (beta-diversity) was calculated from generalized UniFrac distances [27]. The Ribosomal Database Project (RDP v.9) classifier (Wang et al., 2007) was used to assign taxonomies at an 80% confidence level. Important unidentified OTUs were classified using EzTaxon. *p* values were corrected for multiple comparisons according to the Benjamini-Hochberg method. Only taxa with a prevalence of at least 10% (proportion of samples positive for the given taxon) in at least one group were considered for statistical testing.
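The read-level filters described above (length window, expected-error cutoff, end trimming) can be illustrated with a minimal Python sketch. It is not the IMNGS implementation: the function names are hypothetical, quality values are assumed to be already decoded Phred scores, and the expected error is taken as the sum of per-base error probabilities, as in the UPARSE approach.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def filter_and_trim(seq, phred_scores, min_len=300, max_len=600,
                    max_ee=3.0, trim=10):
    """Return the trimmed (sequence, scores) pair, or None if the read is rejected.

    Thresholds mirror the values stated in the text; the function itself
    is an illustrative assumption, not part of IMNGS.
    """
    # Reject assembled reads shorter than min_len or longer than max_len.
    if not (min_len <= len(seq) <= max_len):
        return None
    # Reject reads whose expected number of errors exceeds the cutoff.
    if expected_errors(phred_scores) > max_ee:
        return None
    # Remove `trim` nucleotides from each end of the retained read.
    return seq[trim:-trim], phred_scores[trim:-trim]
```

For example, a 400-nucleotide read with a uniform quality score of 30 has an expected error of about 0.4 and is retained at 380 nucleotides, whereas the same read at a uniform score of 20 has an expected error of about 4 and is discarded.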
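The sample exclusion and normalization step can be sketched as follows. This is a simplified Python illustration rather than the Rhea R scripts: the function name and the nested-dictionary layout are hypothetical, and normalization is assumed here to mean proportional scaling of each sample to the smallest retained library size (not random subsampling).

```python
def normalize_to_minimum(otu_counts, min_reads=2500):
    """Scale every retained sample to the smallest retained read total.

    otu_counts: dict mapping sample name -> dict of OTU name -> raw count
    (a hypothetical layout chosen for this sketch).
    """
    # Exclude samples with fewer than min_reads total read counts.
    kept = {s: c for s, c in otu_counts.items()
            if sum(c.values()) >= min_reads}
    if not kept:
        return {}
    # Smallest library size among the retained samples.
    min_total = min(sum(c.values()) for c in kept.values())
    # Rescale each sample so that its counts sum to min_total.
    return {
        s: {otu: n * min_total / sum(c.values()) for otu, n in c.items()}
        for s, c in kept.items()
    }
```

After scaling, all retained samples share the same total count, so differences in sequencing depth do not drive the downstream diversity estimates.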
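The prevalence filter and the Benjamini-Hochberg correction can likewise be sketched in a few lines of Python. The function names and input layout are hypothetical and the code is only an illustration of the two rules stated in the text; the analysis itself was run with the Rhea scripts in R.

```python
def passes_prevalence(counts_by_group, cutoff=0.10):
    """True if the taxon is detected in at least `cutoff` of the samples
    of at least one group.

    counts_by_group: dict mapping group name -> list of per-sample counts
    (a hypothetical layout chosen for this sketch).
    """
    return any(
        sum(1 for c in counts if c > 0) / len(counts) >= cutoff
        for counts in counts_by_group.values() if counts
    )


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Applying the prevalence filter before testing reduces the number of comparisons, which in turn makes the multiple-testing correction less severe for the taxa that remain.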
