2.1. Creating a Database
The database of TA systems that we built previously consisted only of the MazEF and RelBE superfamilies found in lactobacilli and bifidobacteria. To expand the database, we conducted an advanced search for type II TAS based on literature reports [8,19,20,21,22]. To find literature sources, we used the following search engines and databases: Google, Google Scholar, Scopus and Medline. The primary keywords were “toxin–antitoxin systems” and “toxin–antitoxin systems type II”. The search yielded 18 additional type II TAS, for a total of 20 including MazEF and RelBE. All of them were added to the catalog and are listed in Table 1.
To identify the main domains in the proteins of the TA systems reported in the literature (Table 1), we used the Pfam (http://pfam.xfam.org/) [23] and UniProt (http://www.uniprot.org/) databases. For each domain, we recorded its function as indicated in the Pfam database and retrieved its GenInfo Identifier (GI) number from the NCBI database.
After identifying all the domains, we ran BLAST searches of the domains of unknown function (DUFs) against the NCBI database to find any possible homology with toxins and antitoxins. The maximum homology with TAS domains did not exceed 40%, which led us to exclude DUFs from further analysis. We also rejected the domains that belonged to antitoxins involved in binding to DNA: Arc, MetJ, OmegaRepress, parG, PSK trans fac, REGB T4, RepB RCR reg, RHH 1, RHH 3, RHH 4, RHH 5, RHH 7, SeqA N, TraY, VirC2, DndE, MatP C, Plasmid stab B, Repressor Mnt and RepB-RCR req. The Pfam protein families of toxins and antitoxins used in this study are presented in Table 2. Each domain belonged to one of the following four Pfam clans (Table 2):
Plasmid-antitox (CL0136)—this clan includes mainly antitoxins participating in the preservation of plasmid-harboring bacteria.
Met_repress (CL0057)—this clan includes domains of only the antitoxins containing ribbon-helix-helix (RHH) motifs in the DNA-binding region.
Ccdb_PemK (CL0624)—this superfamily includes cell growth inhibitors and toxin components.
AbrB (CL0132)—this superfamily includes the DNA-binding domain of AbrB and the putative DNA-binding protein MraZ.
To conduct further analysis, we used the domains of toxin and antitoxin proteins from the two main superfamilies, MazEF and RelBE, and the family HicA-HicB (Table S1).
We assembled a gene catalog from the retrieved GI numbers using a Python script built on Biopython. The script automatically opens all folders containing the initial GI numbers, collects them, and uses Entrez.efetch to retrieve the corresponding protein sequences as FASTA files. We then selected only the sequences corresponding to 55 of the most common bacterial genera of the human gastrointestinal tract [24]. We removed repeated sequences (some of the domains belonged to a single TAS) using the CD-HIT software (http://weizhongli-lab.org/cd-hit/).
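The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the folder layout (one identifier per line in *.txt files), the function names and the batch size are assumptions. Entrez.efetch is the Biopython call named in the text; it is imported lazily so that the helper functions can be used without Biopython installed.

```python
from pathlib import Path
from typing import Iterator, List


def read_gi_numbers(root: Path) -> List[str]:
    """Collect unique GI numbers from every *.txt file under root.

    Assumed layout: one identifier per line; blank lines are ignored.
    The order of first appearance is preserved.
    """
    seen = {}
    for path in sorted(root.rglob("*.txt")):
        for line in path.read_text().splitlines():
            gi = line.strip()
            if gi:
                seen.setdefault(gi, None)
    return list(seen)


def batches(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_fasta(ids: List[str], email: str, batch_size: int = 200) -> str:
    """Download protein FASTA records for the given identifiers.

    Requires Biopython and network access. The batch size is a
    hypothetical choice, not a value taken from the paper.
    """
    from Bio import Entrez  # imported here so the helpers work without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    chunks = []
    for chunk in batches(ids, batch_size):
        handle = Entrez.efetch(
            db="protein", id=",".join(chunk), rettype="fasta", retmode="text"
        )
        chunks.append(handle.read())
        handle.close()
    return "".join(chunks)
```

The resulting FASTA text could then be written to disk and passed to CD-HIT to remove redundant sequences.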
In addition, we created another catalog consisting of files from the GenBank database (gb files), which contain information about the proteins and their coding regions; each record is assigned a unique accession number. Next, we selected those sequences whose gb files contained information about the coding region (CDS). For each such sequence, we downloaded the whole genome of the corresponding bacterium and used it to create a FASTA file containing the characteristics of the bacterium and the nucleotide sequence of the gene. Following these steps, we obtained a catalog of 5299 nucleotide sequences.
Since TA genes are prone to horizontal gene transfer, we had to exclude those localized on plasmids. To this end, we created an additional catalog of all the plasmids submitted to the NCBI database, which comprised 51,988 sequences. Next, we identified the TA genes localized on plasmids using BLAST with the following settings: query coverage > 80% and identity > 70%. In this way, we identified and removed from the catalog 525 sequences containing TA genes, which were found in the genera Anaerobaculum, Bryantella, Leuconostoc, Mollicutes and Turicibacter.
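The plasmid filter can be illustrated with a short sketch that applies the two cutoffs to tabular BLAST output. It assumes the hits were written in tab-separated form with the columns qseqid, sseqid, pident and qcovs (for example with BLAST+ and -outfmt "6 qseqid sseqid pident qcovs"); the column layout and the function name are assumptions, since the source gives only the two thresholds.

```python
from typing import Iterable, Set


def plasmid_borne_queries(
    lines: Iterable[str],
    min_identity: float = 70.0,
    min_query_cov: float = 80.0,
) -> Set[str]:
    """Return the IDs of query genes with at least one qualifying plasmid hit.

    Assumed columns (tab-separated): qseqid, sseqid, pident, qcovs.
    A hit qualifies when identity > min_identity and
    query coverage > min_query_cov (both strict, as in the text).
    """
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, qcovs = line.split("\t")[:4]
        if float(pident) > min_identity and float(qcovs) > min_query_cov:
            hits.add(qseqid)
    return hits
```

Genes whose identifiers are returned by this function would be the ones removed from the catalog.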
The final catalog consisted of 4239 nucleotide sequences belonging to 49 different genera, 489 species and 1346 strains (Table 3). The average gene length was 306 base pairs. The number of genes per species varied from one to three.
2.3. Optimization of Control Parameters of the TAGMA Software on Metagenome Samples
We simulated nine metagenome samples that differed in the number of reads, genera and species and in genome size (Tables S2 and S3). Reads were simulated from genomes of both similar and different sizes. To simulate the reads, we used DWGSIM (https://github.com/nh13/DWGSIM). The number of reads varied between 10⁶ and 5 × 10⁷ per sample. The genome coverage varied between three and seven reads per segment of the genome. The total number of species varied between 18 and 37.
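The relationship between read count and coverage can be made concrete with the standard estimate coverage = (number of reads × read length) / genome size. The sketch below inverts it to obtain the number of reads for a target coverage. The read length and genome size in the usage example are hypothetical, as the source does not report them.

```python
import math


def reads_for_coverage(genome_size: int, coverage: float, read_length: int) -> int:
    """Number of reads needed to reach a mean coverage over one genome.

    Uses coverage = reads * read_length / genome_size, rounded up.
    """
    if genome_size <= 0 or read_length <= 0 or coverage < 0:
        raise ValueError("genome_size and read_length must be positive, coverage non-negative")
    return math.ceil(coverage * genome_size / read_length)


# Hypothetical example: a 5 Mb genome, 100 bp reads, five-fold coverage.
n_reads = reads_for_coverage(genome_size=5_000_000, coverage=5, read_length=100)
```

Summing this quantity over all genomes in a sample gives a rough estimate of the total number of reads to simulate.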
TAGMA generates files containing information about markers, mapped genes and detected strains. For our analysis, we used the files summary.txt and test_results_short.txt. The file summary.txt contains information about the identified strains, their uniqueness, their distinguishing markers, the overall number of markers and the marker coverage. The file test_results_short.txt contains information about the identified strains, their distinguishing markers, the marker coverage and the positions of the significant single nucleotide polymorphisms (SNPs) unique to those markers.
As shown by the validation on the simulated data, TAGMA generated a high false positive rate (FPR). To readjust the software, we sought the threshold values that allowed the identification of the maximum number of strains while keeping the number of false results low. We chose the Jaccard index (JI) as an indicator of congruency between the input data used to simulate the metagenomes and the output data generated by TAGMA. Provided that the number of false results is minimal, a higher JI indicates higher congruency between the input data and the results. We varied the parameters listed in Table 4 on a simulated genome sample using a script written in Python.
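The two metrics can be sketched as set operations on the taxa used to build a simulated sample and the taxa reported by the tool. The Jaccard index follows its standard definition, |A ∩ B| / |A ∪ B|. The source does not give its formula for the FPR, so the second function assumes it is the share of reported taxa that were absent from the input; both function names are illustrative.

```python
from typing import Iterable


def jaccard_index(expected: Iterable[str], observed: Iterable[str]) -> float:
    """Jaccard index between the input taxa and the reported taxa."""
    expected, observed = set(expected), set(observed)
    union = expected | observed
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(expected & observed) / len(union)


def false_positive_share(expected: Iterable[str], observed: Iterable[str]) -> float:
    """Share of reported taxa that are absent from the input (assumed FPR definition)."""
    expected, observed = set(expected), set(observed)
    if not observed:
        return 0.0
    return len(observed - expected) / len(observed)
```

For an input of three species of which two are recovered along with one spurious call, the Jaccard index is 2/4 and the false positive share is 1/3.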
We narrowed the candidates down to the 12 best thresholds (Table S4), which offered the best trade-off between a high JI and a low FPR. We calculated the JI and the FPR for these 12 thresholds (Tables S5 and S6). As can be inferred from Tables S5 and S6, the values of the JI and the FPR varied between the simulated metagenome samples depending on the threshold. To select the optimal threshold, we calculated the mean values of the JI and the FPR for each threshold (see Table S7).
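One way to formalize this selection is to average both metrics per threshold and then pick the threshold with the lowest mean FPR among those whose mean JI stays above a floor. The sketch below does this under stated assumptions: the data layout, the function name and the min_mean_ji floor are hypothetical, since the source describes the criterion only as the lowest FPR with an acceptable JI.

```python
from statistics import mean
from typing import Dict, Hashable, Sequence


def pick_threshold(
    ji: Dict[Hashable, Sequence[float]],
    fpr: Dict[Hashable, Sequence[float]],
    min_mean_ji: float,
) -> Hashable:
    """Choose the threshold with the lowest mean FPR and an acceptable mean JI.

    ji and fpr map a threshold ID to its per-sample values.
    Ties on mean FPR are broken in favor of the higher mean JI.
    """
    candidates = []
    for tid in ji:
        mean_ji, mean_fpr = mean(ji[tid]), mean(fpr[tid])
        if mean_ji >= min_mean_ji:
            candidates.append((mean_fpr, -mean_ji, tid))
    if not candidates:
        raise ValueError("no threshold reaches the required mean JI")
    return min(candidates)[2]
```

The floor makes the trade-off explicit: a threshold that reports almost nothing would have a near-zero FPR but would be rejected for its low JI.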
As shown in Table S7, threshold 1 was deemed optimal because it generated the lowest FPR while maintaining an acceptable JI. The other parameters were:
Uniqueness = 1—only unique results (cannot be confused with others) were admissible.
Coverage = 170 b.p.—the minimal length in b.p. of the marker covering a gene.
Coverage = 98%—the minimal coverage rate for each gene.
Number of significant SNPs = 3—the minimal number of SNPs used for the mapping of a particular gene.
Threshold number of TAS = 5—the minimal required number of TAS used for the identification of a species.
Number of TAS in summary.txt = 1—the minimal required number of TAS used for the identification of a species.
Number of TAS in test_results_short.txt = 2—the minimal required number of TAS used for the identification of a species.
The JI and the FPR obtained with the optimal threshold are listed in Table S8. For some samples, the FPR was zero while the JI was high. Moreover, even though the JI was lowest in the fourth and eighth samples, their error rate was low too. It is noteworthy that TAGMA detected all the species in the input file. Since the generated file included some noise, we had to set the threshold so as to eliminate as much of it as possible. Consequently, the chosen threshold excluded some of the valid species. Some sequences in the TAS database were assigned only to the genus level, which led to their exclusion from the results, thereby lowering the JI.
2.4. Comparing TAGMA to MetaPhlAn2
The generated reads were simultaneously processed using MetaPhlAn2. Table 5 shows that TAGMA surpassed MetaPhlAn2 by generating results with a higher JI, both before and after adjusting the thresholds.
MetaPhlAn2 yielded its best result for the fourth sample (JI = 0.79), whereas its JI was lowest for the eighth sample. TAGMA yielded maximal results in two cases before adjusting its thresholds and in six cases after adjusting them. The FPR of the results generated by MetaPhlAn2 was lowest for the fourth sample. Before adjusting the thresholds, the results generated by TAGMA displayed a high FPR for all samples, including sample number four with a low JI (Table 6). To compare the efficiency of the software, we calculated the mean values of the JI and the FPR (Table 7).
After its thresholds were adjusted, TAGMA generated the highest mean JI of 0.70, an improvement of 0.06 over its mean JI before the adjustment and of 0.11 over MetaPhlAn2. Moreover, the average FPR of TAGMA after the adjustment reached 32%, which exceeded the average FPR yielded by MetaPhlAn2 by 7%, an overall improvement of 18%.
2.5. Taxonomic Profiling of Metagenomes Using TAGMA
To validate the TAGMA (Toxin Antitoxin Genes for Metagenomes Analyses) software with the created database, we selected 20 metagenomes from children aged between one and nine years and living in the central region of Russia (see Materials and Methods, Table 8). TAGMA is a pipeline consisting of existing published software and in-house scripts (https://github.com/LabGenMO/TAGMA). The algorithm first scans all-to-all BLASTN alignments of TAS genes and identifies the markers (substitutions and indels) that distinguish gene variants. Then, it aligns the metagenomic reads against the TAS genes using Bowtie2. Finally, the software reports the significant hits [18].
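The idea of a distinguishing marker can be illustrated with a deliberately simplified sketch. It is not TAGMA's implementation: TAGMA derives its markers from pairwise BLASTN alignments, whereas this example assumes the gene variants are already given as one multiple alignment of equal-length strings, with '-' standing for an indel. A position counts as a marker for a variant when its character there is shared by no other variant.

```python
from typing import Dict, List, Tuple


def distinguishing_markers(aligned: Dict[str, str]) -> Dict[str, List[Tuple[int, str]]]:
    """Find, for each variant, the alignment columns where it is unique.

    aligned maps a variant name to its aligned sequence (equal lengths,
    '-' for gaps). Returns 0-based (position, character) pairs per variant.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    markers: Dict[str, List[Tuple[int, str]]] = {name: [] for name in aligned}
    for pos in range(length):
        column = {name: seq[pos] for name, seq in aligned.items()}
        for name, base in column.items():
            # A marker is a character seen in exactly one variant at this column.
            if sum(1 for other in column.values() if other == base) == 1:
                markers[name].append((pos, base))
    return markers
```

Reads that map to a gene and carry a variant's marker characters would then support the presence of that variant in the sample.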
The samples were subjected to both whole genome sequencing (WGS) and 16S rRNA gene sequencing to obtain a better representation of the bacterial composition. The 16S rRNA gene sequencing data were analyzed using the RDP software (http://rdp.cme.msu.edu/). The WGS data were analyzed with MetaPhlAn2 [25], Kraken2 [26], Centrifuge [27] and TAGMA (Table S9). 16S rRNA sequencing allowed us to characterize the bacterial diversity of the samples at the genus level. The WGS data gave us a fair quantitative representation of each genus and species in the metagenomes. Table S9 shows the bacterial diversity of the metagenomes (over 0.01%).
After identifying the main genera, we analyzed the WGS data for species identification (Table S9). Compared to MetaPhlAn2, Kraken2 and Centrifuge, TAGMA showed similar bacterial diversity at the species level within the 55 genera. For further comparison, we chose MetaPhlAn2 because it is the core of a large number of taxonomic classification tools. Moreover, the TAGMA software was able to analyze the WGS data at the level of strains or groups of strains (Table 9 and Table S10). In cases where TAGMA failed to identify the strains, only the species or genus is shown. These results support the use of TAS II as markers for the phylogenetic profiling of the GM.