1. Introduction
Metagenomics is an emerging research method that avoids the difficulties of traditional microbial culture techniques. It can sequence all the genetic information extracted from environmental samples, providing a new way to study microbial communities [
1]. Microbial communities are inextricably linked to human diseases. When microbial communities are disturbed, certain diseases can arise, such as asthma [
2], allergies [
3], autoimmune diseases [
4], etc. The application of metagenomics to the microbiology field can help us better understand the relationship between microbial communities and human diseases and provide prevention or treatment tools to improve human health. Metagenomic binning is a critical step in metagenomics that allows genome sequences from the same microorganism to be placed into a bin, reconstructing a more complete genome and playing a pivotal role in analyzing the diversity and function of microbial communities.
Current metagenomic binning methods can be divided into three categories based on the information they use: nucleotide frequency-based, abundance-based, and combined nucleotide frequency- and abundance-based. Nucleotide frequency-based binning exploits the similarity of nucleotide frequencies among genome sequences of the same species. TETRA [
5] calculates the tetranucleotide frequencies of the DNA sequences, then calculates their Pearson correlation coefficients, and uses this information for binning. CompostBin [
6] applies a weighted PCA to the nucleotide frequencies, mapping the data into a low-dimensional space. Differences in species abundance can lead to poor binning results for the two tools mentioned above. AbundanceBin [
7] also uses nucleotide frequency but models the reads obtained from sequencing with separate Poisson distributions, which allows it to classify short sequences sampled from species with different abundances and achieve higher binning accuracy. MetaClusterTA [
8], a tool for annotating metagenomic data, introduces binning technology and annotates the binning results based on the composition of tetranucleotides, resulting in more accurate and efficient annotations.
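To make the nucleotide frequency idea concrete, the TETRA-style computation can be sketched as follows; this is a minimal Python sketch of the general technique (count tetranucleotides, normalize, compare profiles by Pearson correlation), not the tool's actual implementation:

```python
from itertools import product
import numpy as np

def tetranucleotide_freq(seq):
    """Normalized counts of all 256 tetranucleotides in a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    index = {k: i for i, k in enumerate(kmers)}
    counts = np.zeros(256)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        k = seq[i:i + 4]
        if k in index:  # skip windows containing N or other ambiguous symbols
            counts[index[k]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# Pearson correlation between two sequences' tetranucleotide profiles:
# sequences from the same species tend to have highly correlated profiles.
seq_a = "ACGTACGTTGCAACGTGGCATGCA" * 10
seq_b = "ACGTACGTTGCAACGTGGCATGCA" * 10
r = np.corrcoef(tetranucleotide_freq(seq_a), tetranucleotide_freq(seq_b))[0, 1]
```

Tools in this family typically work on such correlation (or distance) matrices rather than on the raw counts.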
Abundance refers to how prevalent each microbial species is in the environment. The abundance of each species in a metagenomic sample differs, and genome sequences from the same species have similar abundance profiles. Genome sequences can be effectively classified based on this characteristic when the number of samples is large. Common binning methods based on abundance features include MetaGen [
9] and Canopy [
10]. MetaGen uses the relative abundance information of multiple samples to cluster contigs into different bins and relies on the Bayesian Information Criterion (BIC) to determine the number of genomes in the sample. Canopy uses co-abundance information from multiple samples to reconstruct high-quality microbial genomes.
Binning methods based on nucleotide frequency and abundance features combine the above two kinds of information, achieve better binning performance, and are currently the most commonly used in the field of metagenomic binning. Such methods mainly include CONCOCT [
11], MetaBAT2 [
12], Maxbin2.0 [
13], BMC3C [
14], VAMB [
15], CLMB [
16], AVAMB [
17], SemiBin [
18], and GraphMB [
19]. CONCOCT uses a Gaussian mixture model to cluster contigs based on the nucleotide frequency and abundance features of multiple samples and applies an improved Bayesian model to automatically determine the number of clusters. MetaBAT2 computes empirical probability distances from nucleotide frequency and abundance information and then applies graph clustering methods and label propagation algorithms. Maxbin2.0 also uses both features, employing the expectation-maximization algorithm to estimate the probability that each sequence belongs to different bins. BMC3C was the first to add codon features to the nucleotide frequency and abundance information; it uses an ensemble method to optimize the binning results, dividing the clustering results obtained over multiple runs into different subgraphs through a graph segmentation method, each of which represents a different bin. VAMB uses a variational autoencoder to encode nucleotide frequency and abundance features before clustering, obtaining a latent representation of the DNA sequence, which it then clusters with an iterative medoid clustering algorithm. CLMB uses deep contrastive learning, adding simulated noise to the training data to improve the model's ability to learn features from noisy data and achieving better robustness and stability. In contrast to VAMB, AVAMB uses an adversarial autoencoder to encode features and combines it with VAMB; although the computation time is longer, it reconstructs a greater number of nearly complete genomes. SemiBin uses a deep Siamese neural network to implement a semi-supervised method that exploits reference databases while retaining the ability to produce high-quality bins without a reference. GraphMB uses a graph neural network to integrate the assembly graph into the binning process; the combination of graph structure and sequence information improves the quality of metagenomic binning.
Although the use of nucleotide frequency and abundance information has clear advantages, most methods pay no attention to the semantic information, positional information, or length of the DNA sequence itself and apply no further processing to the nucleotide frequencies. This leads to limited, redundant feature information being extracted for binning, making binning inefficient. Therefore, this study proposes CedtBin, a metagenomic binning method based on contig embedding and decomposed tetranucleotide frequency. It uses the masked training task of the BERT model to obtain a latent representation of contigs and concatenates it with the tetranucleotide frequency decomposed by non-negative matrix factorization to form a new feature for subsequent clustering. The DBSCAN algorithm is improved: the Annoy algorithm and a grid search strategy are used to adaptively determine the two critical parameters Eps and MinPts. Experimental results show that CedtBin's binning results on both simulated and real datasets, at the strain, species, and genus levels, are significantly better than those of the current mainstream binning methods VAMB and MetaBAT2, and that it can reconstruct more genomes.
3. Results
3.1. Experimental Instructions
The experiments use an NC A100 v4 series virtual machine provided by the Microsoft Azure platform, equipped with 1 NVIDIA A100 PCIe GPU, 80 GB of GPU memory, and 24 non-multithreaded AMD EPYC Milan processor cores. The BERT-base model (total parameter size 110 M) is pre-trained using the Airways, GI, Skin, and Urog datasets as training sets and the Oral dataset as the test set. During training, the batch size is set to 64, the initial learning rate is 2 × 10⁻⁵, the multi-scale masking length is randomly selected from {1, 2, 4}, and the other parameters are left at their defaults. The AdamW optimizer is used, and a total of 10 epochs are trained.
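The paper does not spell out the masking procedure in detail, but one plausible sketch of multi-scale span selection (hypothetical helper `multiscale_mask`; the actual tokenization and masking pipeline may differ) is:

```python
import random

def multiscale_mask(num_tokens, mask_ratio=0.15, span_lengths=(1, 2, 4), seed=0):
    """Pick token positions to mask in spans whose length is drawn from span_lengths."""
    rng = random.Random(seed)
    target = int(num_tokens * mask_ratio)  # aim to mask ~15% of tokens
    masked = set()
    while len(masked) < target:
        span = rng.choice(span_lengths)        # multi-scale length: 1, 2, or 4
        start = rng.randrange(num_tokens)
        for pos in range(start, min(start + span, num_tokens)):
            masked.add(pos)
    return sorted(masked)

# For a 512-token contig representation, roughly 15% of positions are masked.
positions = multiscale_mask(512)
```

The masked positions would then be replaced by the model's mask token before the BERT masked-prediction objective is applied.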
When the NMF algorithm decomposes the original tetranucleotide frequency matrix, the choice of the k value (the number of components) is also critical. We therefore examine the reconstruction error under different k values.
Figure 3 shows that the reconstruction error decreases noticeably more slowly when k is between about 6 and 15. Within this range, k = 10 is an appropriate choice, and in the following experiments the k value defaults to 10. In addition, the NMF algorithm uses the initialization parameter init = nndsvd to accelerate convergence and improve the stability and quality of the results.
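The k-selection procedure can be sketched with scikit-learn's NMF; the random matrix below is a synthetic stand-in for the real N × 136 tetranucleotide frequency matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic stand-in for the N x 136 tetranucleotide frequency matrix
tnf = rng.random((200, 136))

# Reconstruction error (Frobenius norm) for a range of candidate k values
errors = {}
for k in (2, 6, 10, 15, 20):
    model = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
    model.fit_transform(tnf)
    errors[k] = model.reconstruction_err_

# Final decomposition with the chosen k = 10: the W factor is the Dec_TNF feature
dec_tnf = NMF(n_components=10, init="nndsvd", max_iter=500,
              random_state=0).fit_transform(tnf)
```

In practice one would plot `errors` against k (as in Figure 3) and pick a value near the elbow, where the error curve flattens.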
Prior to feature concatenation, the dimension of the contig embedding features is reduced to 24 dimensions using the UMAP algorithm (
https://umap-learn.readthedocs.io, accessed on 26 July 2024) to simplify the computation and improve efficiency. After concatenation, an N × 34 dimensional feature matrix is obtained for the clustering algorithm. When determining the parameters of the DBSCAN algorithm, the Annoy algorithm is used to perform an approximate nearest-neighbor search. To balance query accuracy and efficiency, the number of trees is set to n_trees = 15, and distances between data points are computed using the Euclidean distance. The MinPts values for the grid search are {5, 10, 15, 20, 25}. Notably, unless otherwise specified, all experiments in this study use the Annoy-DBSCAN algorithm for the clustering process.
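The feature construction up to this point can be sketched as follows; PCA stands in for UMAP here purely to keep the sketch dependency-light, and both input matrices are synthetic stand-ins for the real BERT embeddings and NMF-reduced frequencies:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 300
embeddings = rng.random((n, 768))   # stand-in for BERT contig embeddings
dec_tnf = rng.random((n, 10))       # stand-in for NMF-reduced TNF (k = 10)

# PCA stands in for the UMAP reduction to 24 dimensions used in the paper
reduced = PCA(n_components=24, random_state=0).fit_transform(embeddings)

# Concatenation yields the N x 34 feature matrix fed to Annoy-DBSCAN
features = np.hstack([reduced, dec_tnf])
```

With the real pipeline, `reduced` would come from `umap.UMAP(n_components=24)` instead of PCA.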
3.2. Results and Analysis of Different Features
3.2.1. Evaluation Metrics
This study uses three metrics to measure the effectiveness of metagenomic binning: accuracy, recall, and F1 score, all commonly used in the field. Accuracy measures the ability of the binning method to correctly assign contigs to specific genomic bins; it is the proportion of contigs predicted to belong to a bin that actually belong to it. A high accuracy means that most of the contigs assigned to a bin are assigned correctly. Recall measures the ability of the binning method to identify all contigs that actually belong to a bin; it is the proportion of the contigs truly belonging to a bin that are correctly assigned to it. A high recall means that the binning method identifies most of the contigs that actually belong to a bin. The F1 score is the harmonic mean of accuracy and recall, striking a balance between the two; it is particularly suitable for the uneven data distributions common in metagenomics.
This study constructs a 2 × 2 confusion matrix to help calculate these indicators and classifies the binning results of contigs into four situations: TP, FP, TN, and FN. TP refers to the number of contigs that actually belong to a particular bin and are correctly assigned to that bin. FP refers to the number of contigs that do not actually belong to a specific bin but are incorrectly assigned to that bin. TN refers to the number of contigs that do not actually belong to a specific bin and are correctly identified as not belonging to that bin. FN refers to the number of contigs that actually belong to a bin but are incorrectly assigned to other bins. The calculation formulas for the three indicators are as follows:

Accuracy = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 × Accuracy × Recall / (Accuracy + Recall)
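Using the confusion-matrix counts described above (the study's "accuracy" corresponds to what is often called precision), the three indicators can be computed as:

```python
def binning_metrics(tp, fp, fn):
    """Accuracy (precision), recall, and F1 from confusion-matrix counts."""
    accuracy = tp / (tp + fp) if tp + fp else 0.0   # fraction of predictions that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of true members recovered
    f1 = (2 * accuracy * recall / (accuracy + recall)
          if accuracy + recall else 0.0)            # harmonic mean of the two
    return accuracy, recall, f1

# Illustrative counts for one bin
p, r, f1 = binning_metrics(tp=80, fp=20, fn=40)
# p = 0.8, r = 2/3
```

The zero-guards handle bins with no predictions or no true members, where the ratios would otherwise divide by zero.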
3.2.2. Binning Performance of Different Features
We perform the following ablation experiments to compare the performance of different features. Four features are compared: original tetranucleotide frequency (TNF), decomposed tetranucleotide frequency (Dec_TNF), contig embedding, and contig embedding + Dec_TNF. Their accuracy, recall, and F1 score are calculated at the species level on the five subsets of the CAMI2-HMP simulated dataset.
As can be seen in
Figure 4, on the five simulated datasets, the method using only TNF as a feature performs worst, with an accuracy between 13.55% and 26.72%, a recall between 7.65% and 15.68%, and an F1 score between 9.95% and 18.02%. This is because unprocessed tetranucleotide frequencies provide some local features of the sequence, but this information is limited and may not be sufficient to distinguish complex or similar genomes. Dec_TNF improves significantly over TNF, with an accuracy between 19.28% and 48.74%, a recall between 7.74% and 23.59%, and an F1 score between 11.04% and 26.23%. NMF extracts the main components of the tetranucleotide frequencies, capturing the main patterns in the sequence and making the features more compact and representative. In addition, NMF reduces noise and redundant information in the data, improving the discriminability of the features.
The most significant improvement comes from contig embedding, with an accuracy between 76.70% and 82.52%, a recall between 64.39% and 73.65%, and an F1 score between 70.38% and 76.68%. All three indicators are far superior to those of TNF and Dec_TNF, with F1 score increases of 60.43 and 53.54 percentage points, respectively. This demonstrates that the embedding vectors generated by the BERT model capture the context and semantic information of contigs, making the features more comprehensive and effective, making different genomes easier to distinguish, and improving binning performance.
Contig embedding provides global context information about contigs, while Dec_TNF provides the main components of local patterns. Concatenating contig embedding and Dec_TNF, as CedtBin does, allows the method to perform better on different types of contigs, improving overall performance. The accuracy of the concatenated features is between 80.79% and 89.12%, the recall between 71.57% and 82.87%, and the F1 score between 75.90% and 85.62%, an improvement in all indicators over using contig embedding alone.
3.3. Results and Analysis of Different Binning Methods
3.3.1. Simulated Dataset CAMI2-HMP Binning Results
To demonstrate the effectiveness of CedtBin, the binning methods CedtBin, VAMB, and MetaBAT2 are compared on the simulated dataset CAMI2-HMP. Binning performance is measured by the number of reconstructed near-complete (NC, recall > 90% and precision > 95%) genomes. Unlike the previous definitions, recall and precision are calculated here in terms of base-pair coverage between the genome and the bins: recall is the proportion of base pairs in a genome that are correctly assigned to a bin, and precision is the proportion of base pairs in a bin that correctly belong to a genome. The benchmark.py script in VAMB is used to obtain the number of reconstructed genomes.
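The base-pair-based recall/precision and the NC criterion can be sketched as follows (hypothetical helper names; the actual benchmark.py logic is more involved, aggregating over many genome-bin pairs):

```python
def genome_bin_metrics(overlap_bp, genome_bp, bin_bp):
    """Recall/precision of one bin against one genome, measured in base pairs."""
    recall = overlap_bp / genome_bp    # genome base pairs correctly placed in the bin
    precision = overlap_bp / bin_bp    # bin base pairs that truly come from the genome
    return recall, precision

def is_near_complete(overlap_bp, genome_bp, bin_bp):
    """NC criterion used in the comparison: recall > 90% and precision > 95%."""
    r, p = genome_bin_metrics(overlap_bp, genome_bp, bin_bp)
    return r > 0.90 and p > 0.95

# A 4 Mbp genome, of which 3.8 Mbp ended up in a 3.9 Mbp bin
ok = is_near_complete(overlap_bp=3_800_000, genome_bp=4_000_000, bin_bp=3_900_000)
```

A benchmark would evaluate every genome against its best-matching bin and count how many genomes pass the NC thresholds.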
Binning methods VAMB and MetaBAT2 both use default parameters for metagenome binning. The number of NC genomes reconstructed from the strain level on the five datasets of Airways, GI, Oral, Skin, and Urog is shown in
Figure 5, and the specific values are given in
Table 2. Compared to VAMB and MetaBAT2, the CedtBin method performs best on all datasets, with the largest number of reconstructed NC genomes. On the Airways dataset, CedtBin reconstructs 85 NC genomes, significantly higher than VAMB's 77 and MetaBAT2's 38, an increase of approximately 10.39% over VAMB and 123.68% over MetaBAT2. On the GI dataset, CedtBin and VAMB perform similarly with 93 and 92, respectively, both better than MetaBAT2's 80. On the Oral dataset, CedtBin reconstructs 133 NC genomes, slightly better than VAMB's 130 but far better than MetaBAT2's 76, an increase of 75%. On the Skin dataset, CedtBin reconstructs 100 NC genomes, exceeding the 94 reconstructed by VAMB and the 68 reconstructed by MetaBAT2, improvements of 6.38% and 47.06%, respectively. On the Urog dataset, CedtBin again performs best, reconstructing 95 NC genomes, exceeding the 87 reconstructed by VAMB and the 67 reconstructed by MetaBAT2, improvements of 9.20% and 41.79%, respectively. In terms of the total number of NC genomes, CedtBin reconstructs 506, VAMB 480, and MetaBAT2 329, improvements of about 5.42% over VAMB and about 53.80% over MetaBAT2.
In addition, CedtBin outperforms MetaBAT2 and VAMB at the species and genus level in most cases, also indicating that our method is more suitable for binning on these complex datasets. As shown in
Table 3, CedtBin reconstructs a greater number of genomes when reconstructing lower-quality genomes (precision greater than 95% and recall greater than 50%) at the species level. Compared to MetaBAT2, CedtBin reconstructs 122 more genomes, an increase of 28.97%; compared to VAMB, 51 more, an increase of 10.37%. When the precision is greater than 95% and the recall is 99%, CedtBin reconstructs 67 and 36 more genomes than MetaBAT2 and VAMB, respectively, increases of 33.50% and 15.58%. Compared to the species level, the number of reconstructed genomes at the genus level is generally lower, which is expected because genus is a higher taxonomic level. As shown in
Table 4, the CedtBin method reconstructs one to two fewer genomes than the VAMB method on the Oral dataset. In addition, in most cases, the CedtBin method still shows excellent performance at the genus level, reconstructing more genomes than the other two methods.
3.3.2. Real Dataset MetaHIT Binning Results
The three methods CedtBin, MetaBAT2, and VAMB are run on the real dataset MetaHIT to obtain the number of genomes reconstructed with an accuracy greater than 95% at the strain level. The results are shown in
Figure 6 and
Table 5.
After analysis, we find that CedtBin performs better than VAMB and MetaBAT2 at all recall rates. In the range of 0.50 to 0.95 recall rates, CedtBin is able to reconstruct more genomes and shows more robust performance. At a recall rate of 0.5, CedtBin reconstructs 9 more genomes than VAMB and 33 more than MetaBAT2. At a recall rate of 0.9, CedtBin reconstructs 2 NC genomes more than VAMB and 22 NC genomes more than MetaBAT2. At a recall rate of 0.99, none of the methods reconstructs a genome, indicating that the task is challenging at this recall rate on real datasets.
3.4. Memory Usage and Runtime
To investigate how much memory and time the NMF and Annoy algorithms consume, we run experiments on the simulated dataset CAMI2-HMP and the real dataset MetaHIT (see
Table 1 for dataset details) and record the running time and maximum memory consumption, as shown in
Table 6. The column labeled "TNF" records the memory and time used for the entire binning process when only the raw tetranucleotide frequency is used as the feature. The column "Dec_TNF" records the memory and time used for binning after first applying non-negative matrix factorization (NMF) to the tetranucleotide frequency (TNF) feature. The "CedtBin" column uses the concatenation of contig embeddings, obtained from the pre-trained BERT model, and Dec_TNF features; it records the complete time and memory usage from obtaining encoded representations of the input sequences to clustering. The last two columns represent the time and memory usage of the standalone clustering step within CedtBin, using the DBSCAN algorithm and the Annoy-DBSCAN algorithm, respectively.
Using TNF for binning, the dataset dimension is N × 136; it takes the least time, between 42.91 s and 97.96 s, but the maximum memory usage is the largest, between 9959.82 MiB and 55,506.55 MiB. This is because the features are not processed, resulting in high memory usage during clustering. Using Dec_TNF for binning, the dataset dimension is N × 10. Although it takes about 38.93 s more on average than TNF (decomposing the original feature matrix takes some time), the maximum memory usage falls to between 3002.84 MiB and 5246.63 MiB, a reduction of about 83.34% compared to TNF. This shows that tetranucleotide frequency features processed by the NMF algorithm can effectively improve the binning results while significantly reducing memory usage. The dataset dimension of CedtBin is N × 34. Due to the addition of contig embedding, the time and memory requirements increase: the whole binning process takes between 28 min 38 s and 55 min 55 s, and the maximum memory usage is on average about 2474.05 MiB more than Dec_TNF.
In CedtBin, we compare DBSCAN with Annoy-DBSCAN. When using DBSCAN, we set the two parameters Eps and MinPts empirically and then run it. The average time for this part is 4.09 s, and the average maximum memory usage is 396.55 MiB; clustering with DBSCAN is thus very fast when the feature dimension is small. Annoy-DBSCAN combines the Annoy algorithm with a grid search strategy to determine these two parameters. The average time for the whole process is 28.44 s, and the average maximum memory usage is 488.49 MiB. Although Annoy-DBSCAN increases the average time by about 24.35 s and the average maximum memory usage by about 91.94 MiB compared to DBSCAN, 24.35 s is insignificant within the entire CedtBin binning process; if manual parameter input can be avoided, this cost is worthwhile. In addition, the search time and search accuracy can be balanced by adjusting the parameters of the Annoy algorithm.
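The adaptive parameter search can be sketched as below. Scikit-learn's exact NearestNeighbors stands in for Annoy here, and the k-distance percentile plus silhouette scoring are illustrative assumptions, not CedtBin's exact selection criterion:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the N x 34 feature matrix
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

best = None
for min_pts in (5, 10, 15, 20, 25):          # grid search over MinPts
    # k-distance heuristic: derive Eps from the MinPts-th neighbor distances
    nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
    dists, _ = nn.kneighbors(X)
    eps = float(np.percentile(dists[:, -1], 90))

    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters >= 2:
        score = silhouette_score(X, labels)  # keep the best-scoring configuration
        if best is None or score > best[0]:
            best = (score, eps, min_pts, n_clusters)
```

In the real pipeline, the exact neighbor search would be replaced by an Annoy index (with n_trees = 15) to keep the search fast on large datasets.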
4. Discussion
This study proposes a new metagenomic binning method, CedtBin, which uses contig embedding and decomposed tetranucleotide frequency for binning. Metagenomic binning has been studied for many years, but little attention has been paid to the semantic information of contigs. The experimental results show that contig embedding is a very effective feature for metagenomic binning. Tetranucleotide frequency is the most commonly used feature, but without further processing of the raw values, the information it provides is limited and the binning results are poor. We optimize it with the NMF algorithm; although the improvement in binning performance is limited, the processed features are more discriminative and significantly reduce time and memory usage. We chose the DBSCAN algorithm for clustering because it clusters according to the density of the data and can detect noise; it is well suited to metagenomic binning tasks and has been used in many studies. However, DBSCAN requires manual input of the Eps and MinPts parameters and is very sensitive to them. For this reason, we use the Annoy algorithm combined with a grid search strategy to adaptively determine these two parameters.
The CAMI2-HMP simulation dataset was chosen due to its role in providing a standardized benchmark for evaluating metagenomic binning methods [
20], as CAMI was designed to offer a unified evaluation standard for binning performance and is widely used in the field. Simulation datasets generally have high-quality contigs and often result in good binning performance. However, real environments are more complex and variable, with poorer sequence quality, which can lead to differences in binning performance compared to simulation datasets. Although CedtBin performs better on the real MetaHIT dataset compared to other methods, it may face challenges in more complex environments. Therefore, it is necessary to train deep learning models on a broader range of datasets to enhance their applicability to real metagenomic data.
CedtBin has achieved good results in metagenome binning, but some work remains. First, research on the BERT model was limited by resource and performance constraints. Regarding masking strategies, the proportion of masked k-mers could in the future be raised from 15% to 40%, and the effect of different masking lengths could be explored. Second, regarding feature selection, features such as codon usage, GC content, and abundance could be added. We believe these features can further improve binning performance.