1. Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative virus of coronavirus disease 2019 (COVID-19), emerged at the end of 2019, burdening both the global economy and public health [
1,
2,
3,
4]. Next-generation sequencing has provided an unprecedented opportunity to monitor the COVID-19 pandemic in real-time [
5,
6]. During the pandemic, vast numbers of SARS-CoV-2 genome sequences have accumulated at ever-growing rates and been shared in public databases. As of 12 July 2022, more than 10 million SARS-CoV-2 genome sequences from around the world were available to researchers in the online database of the Global Initiative on Sharing All Influenza Data (GISAID) [
7] (available at
https://www.gisaid.org/ (accessed on 12 July 2022)). This rapidly growing collection of genome sequences supports surveillance of this fast-spreading pathogen and helps to distinguish emerging lineages [
8,
9,
10,
11,
12]. In particular, lineage classification is a critical tool for monitoring variants of concern (VOCs) or variants of interest (VOIs) with reduced susceptibility to neutralizing antibodies or having higher transmissibility [
13]. Research has indicated that distinct SARS-CoV-2 lineages may alter pathogenesis in infected hosts or virus tropism, and could therefore play a pivotal role in drug development and vaccine design [
14,
15]. Therefore, the rapid identification of SARS-CoV-2 lineages, associated with different medical conditions and symptoms, has assisted in the long-term surveillance of this pathogen and is of utmost importance for updating SARS-CoV-2 vaccines [
16,
17,
18,
19].
Viral classification, allowing precise and unambiguous communication between researchers in different fields, is a challenging problem [
20,
21]. At present, many scientists are working on effectively categorizing SARS-CoV-2. The World Health Organization (WHO) recommended using letters of the Greek alphabet, such as Alpha, Beta, Gamma, Delta, and Omicron, to label SARS-CoV-2 variants [
22]. An early work by Chinese researchers identified two major lineages, L and S, based on two highly linked single nucleotides [
14]. In addition, other sequence typing tools, such as Nextstrain [
23], GISAID [
7], Phylogenetic Assignment of Named Global Outbreak Lineages (Pangolin) [
11,
24], COVID-19 Genotyping Tool [
15], and Genome Detective Coronavirus Typing Tool [
25], are critical for tracking emerging diversity and spread of certain lineages. There are 25 Nextstrain clades, 1725 Pango lineages, and 11 GISAID clades as of 20 April 2022. Nonetheless, the phylogeny-based classification methods, such as GISAID [
7] and Pangolin [
11], demand substantial computation time and memory [
18]. Moreover, those methods rely heavily on genetic distance thresholds when determining the maximal genetic differentiation among closely related viruses [
18,
26]. The single nucleotide polymorphism (SNP)-based classification methods, including Chinese lineage [
14] and Nextstrain [
23], are not sufficient to fully address the complex genetic diversity of SARS-CoV-2, because these two methods depend on marker mutations or on mutations with significant geographic distribution and frequency [
27]. Since the genetic diversity of SARS-CoV-2 challenges the current classification methods of SARS-CoV-2 variants [
6], a less expensive, rapid, effective, and robust classification method is needed to identify the lineage of the virus, making it possible to quantitatively partition and describe the diversity of SARS-CoV-2 lineages [
8,
12,
28,
29,
30]. Given that an impressive amount of sequencing data is being generated, we intend to adopt supervised learning-based approaches, which attempt to learn directly from the data, to classify SARS-CoV-2 genome sequences.
As shown in Figure 1, the proposed system in this study focuses on the rapid classification of SARS-CoV-2 genome sequences through supervised learning methods. Unlike previous work, the focus of this study is not to discover new evolutionary branches, but to provide a model with improved efficiency and accuracy based on the existing Nextstrain, GISAID, and Pangolin classification standards. In summary, the main contributions of this study are as follows: (1) Supervised learning-based identification models are constructed for each of the three typing strategies (Nextstrain, GISAID, and Pangolin), achieving rapid and accurate SARS-CoV-2 genome sequence typing. (2) A multilayer template matching algorithm is proposed for SARS-CoV-2 genome sequence typing, achieving ideal results for the Nextstrain and GISAID clades. (3) Based on the template matching algorithm, a matching score-based method is proposed to quantify the difference between clades. (4) The lightweight data structure proposed in this study reduces the computational resource requirements of the model. (5) Finally, the ensemble model achieves higher accuracy by fusing the prediction results of different methods. Extensive tests on a large number of SARS-CoV-2 genome sequences show that the classification model constructed in this study has high accuracy and robustness. Furthermore, by introducing sub-models, this study can efficiently construct an extended model that identifies newly emerging clades.
4. Discussion
To address the SARS-CoV-2 genome sequence typing problem, this study built classifiers for the three typing strategies of GISAID, Nextstrain, and Pangolin. In addition to the machine learning-based methods, this study proposed a template matching-based method for GISAID and Nextstrain. From the template matching algorithm, we obtained the difference matrix between viral clades and applied it as one of the classifier evaluation indicators. To achieve a fast and accurate classifier, two improvements were made. First, two data structures, one based on one-hot coding and one based on site mutation, were used for nucleotide sequence transformation. Second, a weighted fusion strategy was applied to obtain an ensemble model. Overall, our study achieved the highest accuracy on Nextstrain clade typing (precision: 99.879%, recall: 99.879%, F-score: 99.879%), followed by Pangolin (precision: 97.889%, recall: 97.732%, F-score: 97.766%) and GISAID (precision: 96.433%, recall: 96.291%, F-score: 96.235%).
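The difference between the two sequence representations named above can be illustrated with a minimal sketch. It is not the encoding used in this study: the function names, the toy reference sequence, the treatment of ambiguous bases, and the `{position: base}` layout are all assumptions made for illustration, and the sequences are assumed to be already aligned to the reference.

```python
# Illustrative sketch only; names and layout are hypothetical, not the study's code.
ALPHABET = "ACGT"

def one_hot_encode(seq):
    """Dense encoding: four indicator values per site (ambiguous bases become all zeros)."""
    features = []
    for base in seq.upper():
        features.extend(1 if base == letter else 0 for letter in ALPHABET)
    return features

def site_mutation_encode(seq, reference):
    """Sparse encoding: keep only sites that differ from the reference, as {position: base}.

    Positions are 1-based; the sequence is assumed to be aligned to the reference.
    """
    return {
        pos: base
        for pos, (base, ref_base) in enumerate(zip(seq.upper(), reference.upper()), start=1)
        if base != ref_base and base in ALPHABET
    }

reference = "ACGTACGTAC"   # toy reference, not a real genome
sample = "ACGTATGTAA"      # differs from the reference at sites 6 and 10

dense = one_hot_encode(sample)
sparse = site_mutation_encode(sample, reference)
print(len(dense))  # 40 values for 10 sites
print(sparse)      # {6: 'T', 10: 'A'}
```

For a genome of roughly 30,000 sites, a dense representation of this kind grows with genome length, whereas a mutation-based representation grows only with the number of differing sites, which is one plausible reason a lightweight structure can reduce computation.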
(1) Nextstrain: This study addressed the classification of 25 Nextstrain clades, using seven machine learning-based methods and a template matching-based method. The ensemble model achieved the highest classification precision, recall, and F-score. The template matching algorithm achieved a classification performance comparable to that of any machine learning-based classifier. In addition, the difference matrix obtained from the matching algorithm can intuitively represent the distance between different clades.
Figure 4 and
Figure 9 show that the misclassified samples are mainly distributed between clades with small differences. Furthermore, data structure
has a better classification performance in SARS-CoV-2 genome sequence typing. Although the accuracy is slightly lower than that of
, the computational efficiency is improved by more than five times (as shown in
Table 2).
(2) GISAID: Research on the classification of 11 GISAID clades has been carried out in this work. The ensemble model on data structure
achieved the best results. Compared with the Nextstrain clade typing, the template matching (TM) method performs worse in the GISAID clade classification, with an F-score lower than 85%.
Figure 10 shows that except for GRA, GRY, and GK, the GISAID clades are less diverse (
). Furthermore, a total of 13 (23.6%) elements in
Figure 10 are less than 0.03, while those in
Figure 4 equal zero. This indicates that the separability between GISAID clades is lower than that of Nextstrain clades. The ensemble model on
obtained the highest typing accuracy with an ideal computational speed.
(3) Pangolin: A total of 710 Pango lineages are included in this study. The classification accuracy of RF and Catboost is very close, and the ensemble of the two methods can obtain higher precision, recall, and F-score. More interestingly, the performance of the ensemble model on is better than that on , with higher accuracy and less computation time.
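The template matching idea discussed for the Nextstrain and GISAID clades can be sketched as follows. This is a simplified stand-in rather than the algorithm of this study: the clade names, template sites, the two-layer (coarse then fine) scheme, and the template-disagreement measure used for the difference matrix are all assumptions chosen for illustration.

```python
# Illustrative sketch only; templates, sites, and scoring are hypothetical.
TEMPLATES = {
    "A": {3: "T", 7: "G"},
    "B": {3: "T", 7: "A", 9: "C"},
    "C": {3: "C", 5: "G"},
}

def matching_score(seq, template):
    """Fraction of template sites (1-based) at which the sequence carries the expected base."""
    hits = sum(1 for pos, base in template.items()
               if pos <= len(seq) and seq[pos - 1] == base)
    return hits / len(template)

def classify(seq, templates, coarse_sites=(3,)):
    """Two-layer matching on an aligned, full-length sequence.

    Layer 1 shortlists clades whose template agrees with the sequence on a few
    coarse sites; layer 2 scores only the shortlisted templates in full.
    """
    shortlist = [
        clade for clade, tpl in templates.items()
        if all(tpl.get(pos) is None or seq[pos - 1] == tpl[pos] for pos in coarse_sites)
    ] or list(templates)  # fall back to all clades if nothing survives layer 1
    scores = {clade: matching_score(seq, templates[clade]) for clade in shortlist}
    return max(scores, key=scores.get), scores

def clade_difference(tpl_a, tpl_b):
    """Assumed difference measure: share of sites (union) where two templates disagree."""
    sites = set(tpl_a) | set(tpl_b)
    return sum(1 for pos in sites if tpl_a.get(pos) != tpl_b.get(pos)) / len(sites)

best, scores = classify("AATAAAAAC", TEMPLATES)
difference_matrix = {
    (a, b): clade_difference(TEMPLATES[a], TEMPLATES[b])
    for a in TEMPLATES for b in TEMPLATES
}
print(best, scores)
print(difference_matrix[("A", "B")])
```

In a sketch of this kind, clade pairs with a small difference value share most of their template sites, which is consistent with the observation above that misclassified samples concentrate between clades with small differences.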
Compared with existing SARS-CoV-2 typing studies, our results have both improvements and limitations. The Genome Detective Coronavirus Typing Tool [
25] can only identify the SARS-CoV-2 clades of several VOCs. In addition, this method is computationally inefficient, taking an average of 30 ms per genome. UShER [
43] places sequences on a comprehensive tree, and supplied sequences need to be uploaded to UShER’s servers, where processing takes place. In addition, it takes an average of 18 ms to place one sample onto the reference tree using 16 threads and achieves an accuracy of 98.5% for samples with one parsimony-optimal placement. On the other hand, Nextclade [
44] is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks, and phylogenetic placement. Although its web version can provide comprehensive and up-to-date sequence analysis results, its offline version performs clade assignment based on a small number of valid nucleotide sites, with low accuracy, and partial sequences cannot be effectively identified. Compared with Nextclade and UShER, this study does not construct the evolutionary tree but focuses on the typing of genomes. In addition, the methods proposed in this study (the template matching and the ensemble model) are computationally efficient (<20 ms for one sample) with higher accuracy (>99.85%). A limitation of our work is that it can only identify existing clades and cannot discover new SARS-CoV-2 clades. However, the proposed extended model can identify newly emerging clades by training sub-models with only a small amount of work.
For GISAID clade typing, the classification accuracy is relatively low. GISAID classification is based more on several marker variants than on strictly phylogenetic relationships [
18]. Moreover, clade O refers to other clades that do not meet the GISAID clade definition [
45]. This may further explain why the typing model has its lowest accuracy on clade O (recall: 77.249%, F-score: 86.625%). The PhenoGraph [
46] classification identifies 303 SARS-CoV-2 clades and is consistent with, but more detailed and precise than, the known GISAID clades [
18]. It provides an unsupervised clustering method for SARS-CoV-2 clades. In contrast, we provide supervised models for a different classification density. Although the weighted recall of the proposed model is about 96%, clades corresponding to VOCs, such as GK (Delta) and GRA (Omicron), are classified with an accuracy of over 99%.
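The gap between a weighted recall of about 96% and per-clade values above 99% or below 80% follows from how support-weighted averages are formed. The sketch below shows one common way to compute them; it assumes weighting by the number of true samples per clade, which the text does not spell out, and the labels and counts are toy values, not data from this study.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f_score, support)} for each label."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        recall = correct[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_score, true_count[label])
    return metrics

def weighted_average(metrics):
    """Average precision, recall, and F-score, weighting each class by its support."""
    total = sum(m[3] for m in metrics.values())
    return tuple(sum(m[i] * m[3] for m in metrics.values()) / total for i in range(3))

# Toy labels (assumed): one catch-all clade "O" sample is misassigned to "GK".
y_true = ["GK"] * 6 + ["GRA"] * 2 + ["O"] * 2
y_pred = ["GK"] * 6 + ["GRA"] * 2 + ["O", "GK"]

metrics = per_class_metrics(y_true, y_pred)
print(metrics["O"])               # high precision, low recall for the catch-all clade
print(weighted_average(metrics))  # weighted recall equals overall accuracy
```

In this toy case the catch-all clade has perfect precision but only 50% recall, the same qualitative pattern as the clade O figures reported above, while the well-defined clades keep a recall of 100%.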
The Pangolin classification tool [
24] provides the basis for the research in this study. Unlike PangoLEARN, this study tried a lightweight data structure with higher efficiency. The classification accuracy has been improved through model integration. A limitation of our method is that only 710 SARS-CoV-2 lineages are included in this study because of constraints on computational resources. This limitation can be addressed by using more capable hardware and downloading more data. In addition, GNU-based Virus IDentification (GNUVID) is applied to assign sequence type profiles to all high-quality SARS-CoV-2 genomes [
28]. The overall prediction statistics of GNUVID on high-quality genomes are precision (94.7%), recall (96.4%), and F-score (95.0%), which are lower than those of the classifier proposed in this study. In addition, this study adopts the lightweight data structure to improve the classification efficiency, and the average time per sequence is about 10 ms, which is much lower than the 31 ms of GNUVID [
28].
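The model integration mentioned above can be pictured as a weighted fusion of class probabilities from several classifiers. The sketch below is an assumption-laden illustration, not the ensemble of this study: the function names are hypothetical, the probabilities are invented toy values standing in for the outputs of two classifiers such as Random Forest and Catboost, and the weights are arbitrary rather than optimized.

```python
def fuse_probabilities(prob_maps, weights):
    """Weighted average of per-class probabilities from several classifiers.

    prob_maps: list of {label: probability} dicts, one per classifier.
    weights:   one non-negative weight per classifier (normalized here).
    """
    total = sum(weights)
    fused = {}
    for probs, weight in zip(prob_maps, weights):
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + weight * p / total
    return fused

def predict(prob_maps, weights):
    """Return the label with the highest fused probability."""
    fused = fuse_probabilities(prob_maps, weights)
    return max(fused, key=fused.get)

# Toy outputs (assumed) from two classifiers that disagree on one sequence.
rf_probs = {"B.1.1.7": 0.55, "B.1.617.2": 0.45}
cb_probs = {"B.1.1.7": 0.30, "B.1.617.2": 0.70}

print(predict([rf_probs, cb_probs], [0.4, 0.6]))  # the more confident model prevails
print(predict([rf_probs, cb_probs], [0.9, 0.1]))  # a heavy weight on the first model flips the call
```

The example shows why the choice of weights matters: the same pair of classifier outputs yields different lineage calls under different weightings, so the weights would normally be tuned on held-out data.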
5. Conclusions
This study presents a SARS-CoV-2 genome sequence classification system based on supervised learning methods. Overall, the system aims to achieve rapid and accurate SARS-CoV-2 genome sequence typing for each of the three typing strategies: Nextstrain, GISAID, and Pangolin. When SARS-CoV-2 genome sequences are obtained from COVID-19 patients, the proposed system can be applied to type them efficiently and accurately, which would support epidemiological analysis and provide a reliable typing and tracing basis for blocking the spread of the virus. For Nextstrain and GISAID, this study proposed a method based on template matching, and the multi-layer matching strategy improved the efficiency of the matching algorithm. The template matching method achieved satisfactory results in Nextstrain clade typing. A template matching-based difference metric is proposed to quantify the difference between two clades and to serve as an evaluation factor for classifier performance. Furthermore, we proposed an ensemble model that integrates a combination of machine learning methods (such as Random Forest and Catboost) with optimized weights. In addition to the one-hot coding method, this study proposed a data structure based on nucleotide site mutation, which obtains good results in SARS-CoV-2 genome sequence typing and greatly reduces the computational resources required while maintaining classification accuracy. Finally, as verified on a large number of testing datasets, the ensemble model proposed in this study helps to improve the accuracy of the classification system (Nextstrain: 99.879%, Pangolin: 97.732%, GISAID: 96.291%). This study provides a comprehensive and efficient method for SARS-CoV-2 genome sequence typing, which helps to monitor the diversity of SARS-CoV-2 and thereby supports global efforts to control the epidemic.
In addition, by introducing sub-models, this study can rapidly construct an extended model that accurately identifies newly emerging clades without retraining the main model constantly. Future work will focus on the discovery of new clades and the identification of recombination.