Article

Exploring the Potential of GANs in Biological Sequence Analysis

Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
*
Author to whom correspondence should be addressed.
Biology 2023, 12(6), 854; https://doi.org/10.3390/biology12060854
Submission received: 29 April 2023 / Revised: 3 June 2023 / Accepted: 12 June 2023 / Published: 14 June 2023
(This article belongs to the Section Bioinformatics)

Simple Summary

This work addresses the class imbalance commonly found in biological sequence datasets by employing a generative adversarial network (GAN) to improve machine-learning-based classification performance. The GAN generates synthetic sequence data that closely resemble the real data, and these synthetic data are used to counter the class imbalance. The experimental results on four distinct datasets demonstrate that GANs can improve the overall classification performance. Such classification results can improve our understanding of the viruses associated with the sequences, which in turn can inform prevention mechanisms that limit the impact of those viruses.

Abstract

Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to limit their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for analyzing the functions and structures of biological sequences effectively. However, these ML-based methods face challenges with data imbalance, which is common in biological sequence datasets and hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, but they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resemble real data; these generated data can then be employed to enhance the ML models’ performance by reducing the class imbalance problem in biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDJdb, Host), and our results illustrate that GANs can improve the overall classification performance.

1. Introduction

Biological sequences usually refer to nucleotide- or amino-acid-based sequences, and their analysis can provide detailed information about the functional and structural behaviors of the corresponding viruses, which are usually responsible for causing diseases, for example, flu [1] and COVID-19 [2]. This information is very useful for building prevention mechanisms, such as drugs [3] and vaccines [4], as well as for controlling the disease spread, eliminating the negative impacts, and performing virus spread surveillance.
Influenza A virus (IAV) is one such example; it causes a highly contagious respiratory illness that can significantly threaten global public health. As the Centers for Disease Control and Prevention (CDC) (https://www.cdc.gov/flu/weekly/index.htm, accessed on 20 April 2023) reports, so far this season, there have been at least 25 million illnesses, 280,000 hospitalizations, and 17,000 deaths from flu in the United States. Therefore, identifying and tracking the evolution of IAV accurately is a vital step in the fight against this virus. The classification of IAV is an essential task in this respect, as it can provide valuable information on the origin, evolution, and spread of the virus. Similarly, coronaviruses are also known to infect multiple hosts and create global health crises [5] by causing pandemics, for instance, COVID-19, which is caused by the SARS-CoV-2 coronavirus. Therefore, determining the infected host information of this virus is essential for understanding the genetic diversity and evolution of the virus. Because the spike protein region of the coronavirus genome is used to attach to the host cell membrane, the spike region alone provides sufficient information to determine the corresponding host. Moreover, the identification of the viral taxonomy can further enrich this understanding, e.g., the viral polymerase palmprint sequence of a virus is utilized to determine its taxonomy (species generally) [6]. A polymerase palmprint is a unique sequence of amino acids located at the thumb subunit of the viral RNA-dependent polymerase. Furthermore, examining the antigen specificities based on the T-cell receptor sequences can provide beneficial information for solving numerous problems in both basic and applied immunology research.
Many traditional sequence analysis methods follow phylogeny-based techniques [7,8] to identify sequence homology and predict disease transmission. However, the volume of available sequence data exceeds the computational limits of such techniques. As a result, the application of ML approaches to biological sequence analysis is a popular research topic these days [9,10]. The ability of ML methods to determine a sequence’s biological functions makes them attractive for sequence analysis. Additionally, ML models can determine the relationship between the primary structure of the sequence and its biological functions. For example, ref. [9] built a random forest-based algorithm to classify the sucrose transporter (SUT) protein, ref. [10] designed a novel tool for protein–protein interaction data and functional analysis, and ref. [11] developed a new ML model to identify RNA pseudo-uridine modification sites. ML-based biological sequence analysis approaches can be categorized into feature-engineering-based methods [12,13], kernel-based methods [14], neural network-based techniques [15,16], and pre-trained deep learning models [17,18]. However, extrinsic factors limit the performance of ML-based techniques, and one such major factor is data imbalance: biological sequence data are generally imbalanced because the number of negative samples is much larger than that of positive samples [19]. ML models tend to obtain their best results when the dataset is balanced, while imbalanced data can greatly affect both the training of ML models and their application in real-world scenarios [20].
In this paper, we explore the idea of improving the performance of ML methods for biological sequence analysis by eradicating the data imbalance challenge using generative adversarial networks (GANs). Our method leverages the strengths of GANs to effectively analyze these sequences, with the potential to have significant implications for virus surveillance and tracking, as well as the development of new antiviral strategies. By accurately classifying the viral sequences, our study contributes to the field of virus surveillance and tracking. The ability to effectively identify and track viral strains can assist in monitoring the spread of infectious diseases, understanding the evolution of viruses, and informing public health interventions. Moreover, the accurate classification of the viral sequences has significant implications for the development of antiviral strategies. By better understanding the genetic diversity and relatedness of viral strains, researchers can identify potential targets for antiviral therapies, design effective vaccines, and predict the emergence of drug-resistant strains. We discuss how our study’s findings can contribute to these areas, emphasizing the importance of accurate sequence analysis in guiding the development of new antiviral strategies.
Our contributions in this work are as follows:
  • We explore the idea of classifying biological sequences using generative adversarial networks (GANs).
  • We show that the use of GANs improves predictive performance by reducing the data imbalance challenge.
  • We demonstrate the potential implications of the proposed approach for virus surveillance and tracking, and for the development of new antiviral strategies.
The rest of the paper is organized as follows: Section 2 contains the related work. The details of the proposed approach are discussed in Section 3. The datasets used in the experiments, along with the ML models and evaluation metrics, are described in Section 4. Section 5 presents and discusses the experimental results. Finally, the paper is concluded in Section 6.

2. Related Work

The combination of biological sequence analysis and ML models has gained quite a lot of attention among researchers in recent years [9,10]. As a biological sequence consists of a long string of characters corresponding to either nucleotides or amino acids, it needs to be transformed into a numerical form to make it compatible with the ML model. Various numerical embedding generation mechanisms are proposed to extract features from the biological sequences [12,15,18].
Some of the popular embedding generation techniques use the underlying concept of k-mers to compute the embeddings. For example, ref. [21] uses the k-mer frequencies to obtain the vectors, while refs. [13,22] combine position distribution information with k-mer frequencies to obtain the embeddings. Other approaches [15,16] employ neural networks to obtain the feature vectors. Moreover, kernel-based methods [14] and pre-trained deep-learning-model-based methods [17,18] also play a vital role in generating the embeddings. Although all these techniques report promising analysis results, they do not address the data imbalance issue, which, if handled properly, could yield a performance improvement.
Furthermore, another set of methods tackles the class imbalance challenge with the aim of enhancing overall analytical performance. They use resampling techniques at the data level by either oversampling the minority class or undersampling the majority class. For instance, ref. [9] uses the borderline-SMOTE algorithm [23], an oversampling approach, to balance the feature set of the sucrose transporter (SUT) protein dataset. However, because it relies on the k-nearest neighbor algorithm, borderline-SMOTE has high time complexity, is susceptible to noisy data, and is unable to make good use of the information in the majority samples [24]. Similarly, ref. [25] performs protein classification by handling the data imbalance using a hybrid sampling algorithm that combines both ensemble classifier and over-sampling techniques, and KernelADASYN [26] employs a kernel-based adaptive synthetic over-sampling approach to deal with data imbalance. However, these methods do not utilize the overall data distribution; they are based only on local information [27].

3. Proposed Approach

In this section, we discuss in detail our idea of exploring GANs to improve analytical performance for biological sequences. As our input data consist of strings representing amino acid sequences, they need to be transformed into numerical representations before GANs can operate on them. For that purpose, we use three distinct and effective numerical feature generation methods, which are described below.

3.1. Spike2Vec [21]

Spike2Vec generates the feature embedding by computing the k-mers of a sequence, as k-mers are known to preserve the ordering information of the sequence. K-mers are the set of consecutive substrings of length k derived from a sequence. For a sequence of length N, the total number of its k-mers is N - k + 1. This method builds the feature vector for a sequence by capturing the frequencies of its k-mers. To further deal with the curse of dimensionality, Spike2Vec uses random Fourier features (RFF) to map the data to a randomized low-dimensional feature space. We use k = 3 to obtain the embeddings.
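The k-mer counting step can be illustrated with a minimal Python sketch. It covers only the frequency vector, not the random Fourier feature mapping, and it is not the authors' implementation: the 20-letter amino acid alphabet, the function names, and the toy sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(seq, k=3):
    """Return the N - k + 1 overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequency_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Count every possible k-mer; the vector has len(alphabet)**k entries."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for kmer, count in Counter(kmers(seq, k)).items():
        if kmer in index:  # k-mers with characters outside the alphabet are skipped
            vec[index[kmer]] = count
    return vec


example = "MKTAYIAKQR"  # hypothetical toy sequence, N = 10
vec = kmer_frequency_vector(example, k=3)
print(len(kmers(example)), len(vec), sum(vec))
```

With k = 3 the vector already has 20^3 = 8000 entries, which shows why a dimensionality reduction step such as RFF is useful.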

3.2. PWM2Vec [22]

This method also uses the concept of k-mers to obtain the numerical form of the biological sequences. However, rather than utilizing the raw frequency values of the k-mers, it assigns a weight to each amino acid of the k-mers and employs these weights to generate the embeddings. The position weight matrix (PWM) is used to determine the weights. PWM2Vec considers the relative importance of amino acids while preserving the ordering information. The workflow of this method is illustrated in Figure 1, which uses k = 5, while our experiments use k = 3 to obtain the embeddings for the classification tasks.
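A rough sketch of the PWM-based weighting idea follows. The text above does not spell out the exact computation, so several details here are assumptions: the Laplace pseudocount of 0.1, the uniform background probability, the log2 odds form of the matrix, and scoring a k-mer as the sum of its per-position weights. The function name is hypothetical, and the sequence is assumed to contain only the 20 standard amino acids.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet


def pwm_kmer_scores(seq, k=3, alphabet=AMINO_ACIDS, pseudocount=0.1):
    """Score each k-mer of seq with a position weight matrix built from seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]

    # Position frequency matrix: counts[c][j] is how often character c
    # appears at position j across all k-mers of the sequence.
    counts = {c: [0] * k for c in alphabet}
    for kmer in kmers:
        for j, c in enumerate(kmer):
            counts[c][j] += 1

    # Convert counts to smoothed probabilities, then to log-odds weights
    # against a uniform background (both choices are assumptions).
    background = 1.0 / len(alphabet)
    denom = len(kmers) + pseudocount * len(alphabet)
    pwm = {
        c: [math.log2(((counts[c][j] + pseudocount) / denom) / background)
            for j in range(k)]
        for c in alphabet
    }

    # One weight per k-mer: the sum of its per-position PWM entries.
    return [sum(pwm[c][j] for j, c in enumerate(kmer)) for kmer in kmers]


scores = pwm_kmer_scores("MKTAYIAKQR", k=3)
print(len(scores))
```

Unlike a plain frequency vector, the resulting score depends on which amino acid occurs at which position inside the k-mer.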

3.3. Minimizer

This approach uses minimizers [28] (m-mers) to obtain the feature vectors of sequences. A minimizer is extracted from a k-mer: it is the lexicographically smallest substring of m consecutive characters in the k-mer, considering both the forward and backward order. Note that m < k. The workflow of computing minimizers for a given input sequence is shown in Figure 2. This approach is intended to eliminate the redundancy associated with k-mers, hence reducing the storage and computation cost. Our experiments used k = 9 and m = 3 to generate the embeddings.
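The minimizer extraction can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; in particular, "backward order" is interpreted here as the reversed k-mer string, which is an assumption, and the function names are hypothetical.

```python
def minimizer(kmer, m=3):
    """Lexicographically smallest m-mer over the k-mer and its reverse."""
    candidates = []
    for s in (kmer, kmer[::-1]):  # forward and (assumed) reversed order
        candidates.extend(s[i:i + m] for i in range(len(s) - m + 1))
    return min(candidates)


def minimizers(seq, k=9, m=3):
    """One minimizer per k-mer of the sequence."""
    return [minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)]


print(minimizers("MKTAYIAKQR", k=9, m=3))  # prints ['AIY', 'AIY']
```

In this toy example both overlapping 9-mers share the same minimizer, which illustrates how minimizers reduce the redundancy among neighboring k-mers.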
After obtaining the numerical embeddings of the biological sequences using the methods mentioned above, we utilize these embeddings to train our GAN model. The embeddings of the annotated groups serve as input to the GAN. The model has two parts, a generator and a discriminator. The generator and the discriminator each consist of two inner dense layers with ReLU activation functions (each followed by a batch-normalization layer) and a final dense layer. In the discriminator, the final dense layer has a Sigmoid activation function, while in the generator it has a SoftMax activation function. The generator’s output has the same dimensions as the input data, as it synthesizes the data, while the discriminator yields a binary scalar value indicating whether its input is fake or real.
The GAN model is trained using the cross-entropy loss function and the ADAM optimizer, with a batch size of 32 and 1000 iterations. The steps followed to train the GAN model and obtain the synthetic data are given in Algorithm 1. First, the generator and discriminator models are created in steps 1–2. Then, the discriminator model is compiled for training with the cross-entropy loss and the ADAM optimizer in step 3. After that, the length of the sequences, the number of training iterations, and the batch size are set in steps 4–6; the count of synthetic sequences is given as an input. The training of the models occurs in steps 7–12, where each of the models is fine-tuned for the given number of iterations. Once the GAN model is trained, its generator part is employed to synthesize new embedding data that resemble real-world data. These synthesized data can reduce the data imbalance problem, improving the analytical performance. Moreover, the overall workflow of training the GAN model is shown in Figure 3. The figure illustrates the training procedure, in which the parameters of the generator and discriminator modules are fine-tuned. It starts by obtaining the numerical embeddings of the input sequences and passing them to the discriminator along with the synthetic data produced by the generator. The discriminator is trained to identify whether the data are real or synthetic, and based on this information, we fine-tune the generator. The overall goal is to train the generator to the extent that the discriminator can no longer distinguish its synthetic data from real data, which means that the synthetic data are very close to the real data.
Algorithm 1 Training GAN model
       Input: Set of sequences S, ganCnt
       Output: GAN-based sequences S′
1:  m_gen ← generator() ▹ generator model
2:  m_dis ← discriminator() ▹ discriminator model
3:  m_dis.compile(loss = CE, opt = ADAM)
4:  seqLen ← len(S[0]) ▹ length of each sequence in S
5:  iter ← 1000
6:  batch_size ← 32
7: for i in iter do
8:        noise ← random(ganCnt, seqLen)
9:        S′ ← m_gen.predict(noise) ▹ get GAN sequences
10:      m_dis.backward(m_dis.loss) ▹ fine-tune m_dis
11:      m_gen.backward(m_gen.loss) ▹ fine-tune m_gen
12: end for
13: return(S′)
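For reference, the cross-entropy losses that steps 10 and 11 would minimize can be written in the standard GAN form. This is the textbook formulation, not something stated explicitly in the algorithm: here x denotes a real embedding, z a noise vector, G(z) the generator output, and D(·) the Sigmoid output of the discriminator. The non-saturating generator loss shown is an assumption about how the generator is updated.

```latex
\begin{aligned}
\mathcal{L}_D &= -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                 - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big], \\
\mathcal{L}_G &= -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big].
\end{aligned}
```

The discriminator loss is a binary cross-entropy with label 1 for real embeddings and label 0 for generated ones; the generator loss is the same cross-entropy evaluated on generated embeddings with the label set to 1, so the generator improves when the discriminator is fooled.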

4. Experimental Setup

This section highlights the details of the datasets used to conduct the experiments along with the information about the classification models and their respective evaluation metrics to report the performance. All experiments were carried out on an Intel (R) Core i5 system with a 2.40 GHz processor and 32 GB memory. We use Python to run the experiments. Our code and preprocessed datasets are available online for reproducibility (https://github.com/taslimmurad-gsu/GANs-Bio-Seqs/tree/main, accessed on 20 April 2023).

4.1. Dataset Statistics

We use four different datasets to evaluate the proposed method. A detailed description of each dataset is given below.

4.1.1. Influenza A Virus

We use the influenza A virus sequence dataset, which covers two subtypes, “H1N1” and “H3N2”, and was extracted from the website in [29]. These data contain 222,450 sequences in total, with 119,100 sequences belonging to the H1N1 subtype and 103,350 to the H3N2 subtype. The detailed statistics for this dataset are shown in Table 1. We use these two subtypes as labels to classify the Influenza A virus in our experiments.

4.1.2. PALMdb

The PALMdb [6,30] dataset consists of viral polymerase palmprint sequences, which can be classified species-wise. This dataset is created by mining the public sequence databases using the palmscan [6] algorithm. It has 124,908 sequences corresponding to 18 different virus species. The distribution of these species is given in Table 2 and more detailed statistics are shown in Table 1. We use the species name as a label to do the classification of the PALMdb sequences.

4.1.3. VDJdb

VDJdb is a curated dataset of T-cell receptor (TCR) sequences with known antigen specificities [31]. This dataset consists of 58,795 human TCRs and 3353 mouse TCRs. More than half of the examples are TRBs (n = 36,462), with the remainder being TRAs (n = 25,686). The T-cell receptor alpha chain (TRA) and T-cell receptor beta chain (TRB) are the chains that make up the TCR complex. The TRB chain plays a crucial role in antigen recognition and is involved in T-cell immune responses. The dataset has 78,344 total sequences belonging to 17 unique antigen species. The distribution of the sequences among the antigen species is shown in Table 3, and further details of the dataset are given in Table 1. We use these data to perform the antigen species classification.

4.1.4. Coronavirus Host

The host dataset consists of spike sequences of coronavirus corresponding to various infected hosts. These data are extracted from ViPR [32] and GISAID [33]. They contain 5558 total sequences belonging to 21 unique hosts and their detailed distribution is shown in Table 4.

4.2. ML Classifiers and Evaluation Metrics

To perform classification tasks, we employed the following ML models: naive Bayes (NB), multilayer perceptron (MLP), k-nearest neighbor (k-NN) (where k = 3), random forest (RF), logistic regression (LR), and decision tree (DT). For each classification task, the data are split into 30–70% train–test sets using stratified sampling to preserve the original data distribution. Furthermore, our experiments were conducted by averaging the performance results of 5 runs for each combination of dataset and classifier to obtain more stable results.
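Stratified sampling, as used for the split above, can be sketched with the Python standard library. This is an illustrative sketch only: the function name, the toy labels, and the training fraction of 0.7 in the example are assumptions and are not meant as a reading of the exact split used in the experiments.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Split indices so that each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                         # shuffle within each class
        cut = round(train_fraction * len(idxs))   # per-class cut point
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced toy labels (80 vs. 20 samples).
labels = ["H1N1"] * 80 + ["H3N2"] * 20
train_idx, test_idx = stratified_split(labels, train_fraction=0.7)
print(len(train_idx), len(test_idx))
```

Because the cut is made per class, the minority class keeps the same share in both sets, so the test set reflects the original class distribution.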
We evaluated the classifiers using the following performance metrics: accuracy, precision, recall, weighted F1, F1 macro, and ROC AUC macro. Since we are doing multi-class classification in some cases, we utilized the one-vs-rest approach for computing the ROC AUC score for them. Moreover, the reason for reporting many metrics is to obtain more insight into the classifiers’ performance, especially in the class imbalance scenario where reporting only accuracy does not provide sufficient performance information.
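The point about accuracy being insufficient under class imbalance can be made concrete with a small standard-library sketch. The labels and predictions are made-up toy data, and the helper names are hypothetical; the metric definitions are the usual ones for accuracy, macro F1, and support-weighted F1.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each class that appears in y_true."""
    scores = {}
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def summarize(y_true, y_pred):
    """Return (accuracy, macro F1, weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    macro = sum(f1.values()) / len(f1)                      # classes count equally
    weighted = sum(f1[c] * support[c] for c in f1) / n      # weighted by class size
    return accuracy, macro, weighted


# Toy case: a classifier that always predicts the majority class.
y_true = ["major"] * 95 + ["minor"] * 5
y_pred = ["major"] * 100
accuracy, macro_f1, weighted_f1 = summarize(y_true, y_pred)
print(round(accuracy, 3), round(macro_f1, 3), round(weighted_f1, 3))
```

Here the accuracy is 0.95 even though the minority class is never predicted, while the macro F1 drops to roughly 0.49, which is why the macro-averaged metrics are informative in the imbalanced setting.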

5. Results and Discussion

This section discusses the experimental results comprehensively. The subtype classification results of the Influenza A virus dataset are given in Table 5, along with the species-wise classification results of the PALMdb dataset. The antigen species-wise classification results of the VDJdb data and the host-wise classification results of the coronavirus host data are shown in Table 6. All reported results were obtained on the test set.
We have compared the classification performance of three embedding generation methods (Spike2Vec, PWM2Vec, Min2Vec) using four datasets (Influenza A virus, PALMdb, VDJdb, Host) under three different settings (without-GANs, with-GANs, only-GANs). Without-GANs indicates the scenario where the original embeddings from the three embedding generation methods are used to perform the classifications, while with-GANs shows the performance achieved using the original embeddings with the addition of the GAN-based synthetic data to reduce the class imbalance. The only-GANs setting illustrates the performance obtained by using only the synthetic data without the original data. It provides an overview of the effectiveness of the synthetic data in terms of classification predictive performance.
In the with-GANs scenario, for each dataset, GAN-based synthetic data are added to the classes with fewer instances to increase their counts and make them comparable with the most frequent classes. This addition reduces the data imbalance, and the newly created dataset is then used for the classification tasks. Note that the synthetic data are only added to the training set, while the test set contains only original data, so the test set retains the actual imbalanced data distribution. A more detailed discussion of the results for each embedding method across the combinations of datasets and settings is given below.

5.1. Performance of without-GANs Data

These results illustrate the classification performance achieved with the embeddings generated by the Spike2Vec, PWM2Vec, and minimizer strategies for each dataset. We can observe that for the Influenza A virus dataset, Spike2Vec and minimizer exhibit similar performance for almost all the classifiers and are better than PWM2Vec. However, the NB model yields the lowest predictive performance for all the embeddings. Similarly, on the VDJdb dataset, Spike2Vec and minimizer show similar performance for all evaluation metrics, while PWM2Vec has a very low predictive performance. Moreover, all the embeddings achieve the same performance in terms of all the evaluation metrics for every classifier on the PALMdb dataset. For the host dataset, all three embeddings yield very similar results, with NB exhibiting the lowest and RF the highest performance.

5.2. Performance of with-GANs Data

To view the impact of GAN-based data on the predictive performance for all the datasets, we evaluate the performance using the original embeddings with GAN-based synthetic data added to them. The GAN-based data are used only for training the classifiers (together with the original training data), while only original data are used as test data. For a given dataset and embedding generation method, the GAN model is first trained with the original embeddings, and new data are then synthesized for that embedding. Each label receives a different amount of synthetic data depending on its count in the original embedding data. The aim is to make the class distribution of the dataset balanced.
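The per-label amount of synthetic data can be illustrated with a short standard-library sketch. It assumes that every class is raised exactly to the size of the most frequent class, which is one simple way to balance the distribution and not necessarily the exact rule used in the experiments; the function name and the toy host labels are hypothetical.

```python
from collections import Counter


def synthetic_counts(train_labels):
    """Number of synthetic samples each class needs to match the largest class."""
    counts = Counter(train_labels)
    target = max(counts.values())  # assumption: balance up to the majority class
    return {label: target - n for label, n in counts.items()}


# Hypothetical training labels for a host-classification task.
train_labels = ["human"] * 50 + ["bat"] * 12 + ["camel"] * 3
needed = synthetic_counts(train_labels)
print(needed)  # prints {'human': 0, 'bat': 38, 'camel': 47}
```

Only the training labels are passed in, matching the description that the test set keeps its original, imbalanced distribution.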
For the Influenza A virus data, the results show that in some cases the addition of GAN-based synthetic data improves the performance compared to the original data, such as for the KNN, RF, and NB classifiers with the PWM2Vec method. Similarly, on the VDJdb dataset, a GAN-based improvement is also seen in some cases, such as for all the classifiers with the PWM2Vec method except NB. Moreover, as the performance on the original PALMdb data is already at its maximum, the addition of GAN embeddings retains that performance. Furthermore, on the host dataset, combining the synthetic data with the original data shows a performance improvement in some scenarios; for instance, PWM2Vec-based classification using the NB, KNN, and RF classifiers, and Spike2Vec- and Min2Vec-based classification using the NB and KNN classifiers.
Generally, we can observe that the inclusion of GAN synthetic data in the training set can improve the overall classification performance. This is because the training set size increases and the data imbalance issue is resolved by adding the respective synthetic data.

5.3. Performance of Only-GANs Data

We also studied the classification performance obtained by using only GAN-based embeddings without the original data. The results show that for all four datasets, this setting has the lowest predictive performance for all combinations of classifiers and embeddings, compared to the performance on the original data and on the original data with GANs. Because the classifiers are trained only on synthetic data but tested on the original data, their performance is lower than in the other settings.

5.4. Data Visualization

We visualize our datasets using the popular visualization technique t-SNE [34] to view the internal structure of each dataset under the various embeddings. The plots for the Influenza A virus dataset are reported in Figure 4. We can observe that for the Spike2Vec- and minimizer-based plots, the addition of GAN-based features produces two big clusters along with small scattered clusters, unlike the original t-SNE plots, which consist only of small scattered groups. The PWM2Vec-based plots, in contrast, show similar structures with and without GANs. In general, adding GAN-based embeddings to the original ones can improve the t-SNE structure.
Similarly, the t-SNE plots for the PALMdb dataset corresponding to different embeddings are shown in Figure 5. We can observe that this dataset shows similar kinds of cluster patterns corresponding to both without-GANs- and with-GANs-based embeddings. As the original dataset already shows clear and distinct clusters for various species, adding GAN-based embedding to it does not affect the cluster structure much.
Moreover, the t-SNE plots for the VDJdb dataset are given in Figure 6. We can observe that the addition of GAN-based features to the minimizer-based embedding yields clearer and more distinct clusters in the visualization. The GAN-based Spike2Vec plot also shows more clusters than the original Spike2Vec one. However, PWM2Vec shows similar patterns both with and without GAN-based embeddings. Overall, this indicates that adding GAN-based features enhances the t-SNE cluster structures.
Furthermore, the t-SNE plots for the host dataset are illustrated in Figure 7. We can see that for PWM2Vec, the addition of GAN-based embeddings further refines the structure by reshaping the clusters, while the structures of Spike2Vec and Min2Vec remain almost the same with and without GANs.
We also investigated the t-SNE structures generated by using only the GANs-based embeddings and Figure 8 illustrates the results. It can be seen that for all the datasets only-GAN embeddings are yielding non-overlapping distinct clusters corresponding to each group with respect to the dataset. It is because, for each group, the only-GAN embeddings are synthesized after training the GAN model with the original data of the respective group. Note that for host data, some of the clusters are very tiny because of the corresponding number of instances in the dataset belonging to that group being very small.

6. Conclusions

In conclusion, this work explores a novel approach to improve the predictive performance of the biological sequence classification task by using GANs. It generates synthetic data with the help of GANs to eliminate the data imbalance challenge, hence improving the performance. In the future, we would like to extend this study by investigating more advanced variations of GANs to synthesize the biological sequences and their impacts on the biological sequence analysis. We also want to examine additional genetic data, such as hemagglutinin and neuraminidase gene sequences, with GANs to improve their classification accuracy.

Author Contributions

Conceptualization, T.M., S.A. and M.P.; Methodology, T.M. and S.A.; Formal analysis, T.M. and S.A.; Data curation, T.M.; Writing—original draft, T.M.; Writing—review and editing, S.A. and M.P.; Visualization, T.M.; Supervision, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

T.M. was partially supported by an HEC fellowship, S.A. was partially supported by an MBD fellowship and M.P. was partially supported by a GSU/Computer Science startup.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Das, K. Antivirals targeting influenza A virus. J. Med. Chem. 2012, 55, 6263–6277.
  2. Pedersen, S.F.; Ho, Y.C. SARS-CoV-2: A storm is raging. J. Clin. Investig. 2020, 130, 2202–2205.
  3. Rognan, D. Chemogenomic approaches to rational drug design. Br. J. Pharmacol. 2007, 152, 38–52.
  4. Dong, G.; Pei, J. Sequence Data Mining; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007; Volume 33.
  5. Majumder, J.; Minko, T. Recent Developments on Therapeutic and Diagnostic Approaches for COVID-19. AAPS J. 2021, 23, 1–22.
  6. Babaian, A.; Edgar, R. Ribovirus classification by a polymerase barcode sequence. PeerJ 2022, 10, e14055.
  7. Hadfield, J.; Megill, C.; Bell, S.; Huddleston, J.; Potter, B.; Callender, C.; Sagulenko, P.; Bedford, T.; Neher, R. Nextstrain: Real-time tracking of pathogen evolution. Bioinformatics 2018, 34, 4121–4123.
  8. Minh, B.Q.; Schmidt, H.A.; Chernomor, O.; Schrempf, D.; Woodhams, M.D.; von Haeseler, A.; Lanfear, R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 2020, 37, 1530–1534.
  9. Chen, D.; Li, S.; Chen, Y. ISTRF: Identification of sucrose transporter using random forest. Front. Genet. 2022, 13, 1012828.
  10. Yang, L.; Zhang, Y.H.; Huang, F.; Li, Z.; Huang, T.; Cai, Y.D. Identification of protein–protein interaction associated functions based on gene ontology and KEGG pathway. Front. Genet. 2022, 13, 1011659.
  11. Zhang, X.; Wang, S.; Xie, L.; Zhu, Y. PseU-ST: A new stacked ensemble-learning method for identifying RNA pseudouridine sites. Front. Genet. 2023, 14, 1121694.
  12. Kuzmin, K.; Adeniyi, A.E.; DaSouza, A.K.; Lim, D.; Nguyen, H.; Molina, N.R.; Xiong, L.; Weber, I.T.; Harrison, R.W. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem. Biophys. Res. Commun. 2020, 533, 553–558.
  13. Ma, Y.; Yu, Z.; Tang, R.; Xie, X.; Han, G.; Anh, V.V. Phylogenetic analysis of HIV-1 genomes based on the position-weighted k-mers method. Entropy 2020, 22, 255.
  14. Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 2014, 10, e1003711.
  15. Shen, J.; Qu, Y.; Zhang, W.; Yu, Y. Wasserstein distance guided representation learning for domain adaptation. Proc. Conf. AAAI Artif. Intell. 2018, 32, 11784.
  16. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. arXiv 2016, arXiv:1511.06335.
  17. Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 2019, 20, 723.
  18. Strodthoff, N.; Wagner, P.; Wenzel, M.; Samek, W. UDSMProt: Universal deep sequence models for protein classification. Bioinformatics 2020, 36, 2401–2409.
  19. Zhang, Y.; Qiao, S.; Lu, R.; Han, N.; Liu, D.; Zhou, J. How to balance the bioinformatics data: Pseudo-negative sampling. BMC Bioinform. 2019, 20, 695.
  20. Abd Elrahman, S.M.; Abraham, A. A review of class imbalance problem. J. Netw. Innov. Comput. 2013, 1, 332–340.
  21. Ali, S.; Patterson, M. Spike2vec: An efficient and scalable embedding approach for covid-19 spike sequences. arXiv 2021, arXiv:2109.05019.
  22. Ali, S.; Bello, B.; Chourasia, P.; Punathil, R.T.; Zhou, Y.; Patterson, M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022, 11, 418.
  23. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing, Proceedings of the International Conference on Intelligent Computing, ICIC 2005, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
  24. Xiaolong, X.; Wen, C.; Yanfei, S. Over-sampling algorithm for imbalanced data classification. J. Syst. Eng. Electron. 2019, 30, 1182–1191.
  25. Zhao, X.M.; Li, X.; Chen, L.; Aihara, K. Protein classification with imbalanced data. Proteins Struct. Funct. Bioinform. 2008, 70, 1125–1132.
  26. Tang, B.; He, H. KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning. In Proceedings of the 2015 IEEE Congress on Evolutionary Computation (CEC), Sendai, Japan, 25–28 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 664–671.
  27. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471.
  28. Roberts, M.; Haynes, W.; Hunt, B.; Mount, S.; Yorke, J. Reducing storage requirements for biological sequence comparison. Bioinformatics 2004, 20, 3363–3369.
  29. Bacterial and Viral Bioinformatics Resource Center. Available online: https://www.bv-brc.org/ (accessed on 20 April 2023).
  30. Edgar, R.C.; Taylor, J.; Lin, V.; Altman, T.; Barbera, P.; Meleshko, D.; Lohr, D.; Novakovsky, G.; Buchfink, B.; Al-Shayeb, B.; et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 2022, 602, 142–147.
  31. Bagaev, D.V.; Vroomans, R.M.; Samir, J.; Stervbo, U.; Rius, C.; Dolton, G.; Greenshields-Watson, A.; Attaf, M.; Egorov, E.S.; Zvyagin, I.V.; et al. VDJdb in 2019: Database extension, new analysis infrastructure and a T-cell receptor motif compendium. Nucleic Acids Res. 2020, 48, D1057–D1062.
  32. Pickett, B.E.; Sadat, E.L.; Zhang, Y.; Noronha, J.M.; Squires, R.B.; Hunt, V.; Liu, M.; Kumar, S.; Zaremba, S.; Gu, Z.; et al. ViPR: An open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012, 40, D593–D598.
  33. GISAID Website. Available online: https://www.gisaid.org/ (accessed on 29 December 2021).
  34. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Figure 1. The workflow of PWM2Vec method for a given sequence.
Figure 2. The workflow of obtaining minimizers from an input sequence.
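The minimizer workflow named in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that the minimizer of a k-mer is the lexicographically smallest m-mer found in that k-mer or in its reverse; the function name `minimizers` and the parameters `k` and `m` are hypothetical and chosen only for this example.

```python
def minimizers(sequence, k, m):
    """Return one minimizer per k-mer window of `sequence`.

    Assumption (not taken from the article): the minimizer of a k-mer is the
    lexicographically smallest m-mer among all m-mers of the k-mer and of its
    reverse.
    """
    if not 0 < m <= k <= len(sequence):
        raise ValueError("require 0 < m <= k <= len(sequence)")
    result = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        reverse = kmer[::-1]
        # Collect every m-mer from the forward and the reversed k-mer.
        candidates = [
            strand[i:i + m]
            for strand in (kmer, reverse)
            for i in range(k - m + 1)
        ]
        result.append(min(candidates))
    return result


example = minimizers("ACGTAC", k=4, m=2)
print(example)  # ['AC', 'AT', 'AC']
```

Each window contributes exactly one short m-mer, so the resulting list is much more compact than the full set of k-mers while still reflecting the order of the sequence.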
Figure 3. The workflow of training the GAN model. It shows the process followed to fine-tune the parameters of the generator and discriminator models during training.
Figure 4. t-SNE plots for Influenza A virus dataset without GANs (a–c) and with GANs (d–f). The figure is best seen in color.
Figure 5. t-SNE plots for PALMdb dataset without GANs (a–c) and with GANs (d–f). The figure is best seen in color.
Figure 6. t-SNE plots for VDjDB dataset without GANs (a–c) and with GANs (d–f). The figure is best seen in color.
Figure 7. t-SNE plots for Host dataset without GANs (a–c) and with GANs (d–f). The figure is best seen in color.
Figure 8. t-SNE plots of the GAN-only embeddings for the Influenza A virus, PALMdb, VDjDB, and Host datasets. The figure is best seen in color.
Table 1. Statistics of each dataset used in our experiments.

| Name | Sequences | Classes | Goal | Min Sequence Length | Max Sequence Length | Average Sequence Length |
|---|---|---|---|---|---|---|
| Influenza A Virus | 222,450 | 2 | Virus Subtypes Classification | 11 | 71 | 68.60 |
| PALMdb | 124,908 | 18 | Virus Species Classification | 53 | 150 | 130.83 |
| VDjDB | 78,344 | 17 | Antigen Species Classification | 7 | 20 | 12.66 |
| Host | 5558 | 21 | Coronavirus Host Classification | 9 | 1584 | 1272.36 |
Table 2. Species-wise distribution of PALMdb dataset.

| Species Name | Count | Species Name | Count |
|---|---|---|---|
| Avian orthoavulavirus 1 | 2353 | Chikungunya virus | 2319 |
| Dengue virus | 1627 | Hepacivirus C | 29,448 |
| Human orthopneumovirus | 3398 | Influenza A virus | 47,362 |
| Influenza B virus | 8171 | Lassa mammarenavirus | 1435 |
| Middle East respiratory syndrome-related coronavirus | 1415 | Porcine epidemic diarrhea virus | 1411 |
| Porcine reproductive and respiratory syndrome virus | 2777 | Potato virus Y | 1287 |
| Rabies lyssavirus | 4252 | Rotavirus A | 4214 |
| Turnip mosaic virus | 1109 | West Nile virus | 5452 |
| Zaire ebolavirus | 4821 | Zika virus | 2057 |
Table 3. Antigen species-wise distribution of VDjDB dataset.

| Antigen Species Name | Count | Antigen Species Name | Count |
|---|---|---|---|
| CMV | 37,357 | DENV1 | 180 |
| DENV3/4 | 177 | EBV | 11,026 |
| HCV | 840 | HIV-1 | 3231 |
| HSV-2 | 154 | HTLV-1 | 232 |
| HomoSapiens | 4646 | InfluenzaA | 14,863 |
| LCMV | 141 | MCMV | 1463 |
| PlasmodiumBerghei | 243 | RSV | 125 |
| SARS-CoV-2 | 758 | SIV | 2119 |
| YFV | 789 | | |
Table 4. Coronavirus host-wise distribution of Host dataset.

| Host Name | Count | Host Name | Count |
|---|---|---|---|
| Bat | 153 | Bird | 374 |
| Bovine | 88 | Camel | 297 |
| Canis | 40 | Cat | 123 |
| Cattle | 1 | Dolphin | 7 |
| Environment | 1034 | Equine | 5 |
| Fish | 2 | Hedgehog | 15 |
| Human | 1813 | Monkey | 2 |
| Pangolin | 21 | Python | 2 |
| Rat | 26 | Swine | 558 |
| Turtle | 1 | Unknown | 2 |
| Weasel | 994 | - | - |
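The degree of imbalance in Table 4 can be made concrete with a short calculation. The sketch below uses the host counts from the table; the imbalance ratio (largest class divided by smallest class) and the per-class deficit relative to the largest class are illustrative quantities chosen for this example, not a description of how many synthetic samples the authors actually generated.

```python
# Host counts copied from Table 4.
host_counts = {
    "Bat": 153, "Bird": 374, "Bovine": 88, "Camel": 297, "Canis": 40,
    "Cat": 123, "Cattle": 1, "Dolphin": 7, "Environment": 1034, "Equine": 5,
    "Fish": 2, "Hedgehog": 15, "Human": 1813, "Monkey": 2, "Pangolin": 21,
    "Python": 2, "Rat": 26, "Swine": 558, "Turtle": 1, "Unknown": 2,
    "Weasel": 994,
}

total = sum(host_counts.values())
largest = max(host_counts.values())
smallest = min(host_counts.values())

# Ratio between the most and least frequent class.
imbalance_ratio = largest / smallest

# Hypothetical balancing target: bring every class up to the largest class.
deficit = {host: largest - count for host, count in host_counts.items()}

print(total)             # 5558
print(imbalance_ratio)   # 1813.0
print(deficit["Bat"])    # 1660
```

The total matches the 5558 sequences reported for the Host dataset in Table 1, and the ratio shows that the largest class (Human) is 1813 times the size of the smallest classes (Cattle and Turtle).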
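Table 5. Subtype classification results for the Influenza A virus dataset and species-wise classification results for the PALMdb dataset. Results are averaged over five runs. The best value for each metric is shown in bold.

| Method | Embedding | Algo. | Influenza A Virus | | | | | | | PALMdb | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 (Weig.) ↑ | F1 (Macro) ↑ | ROC AUC ↑ | Train Time (Sec.) ↓ | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 (Weig.) ↑ | F1 (Macro) ↑ | ROC AUC ↑ | Train Time (Sec.) ↓ |
| Without GANs | Spike2Vec [21] | NB | 0.538 | 0.673 | 0.538 | 0.382 | 0.358 | 0.503 | 96.851 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 453.961 |
| | | MLP | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 742.551 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1446.421 |
| | | KNN | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 2689.320 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1274.75 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 433.459 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 166.087 |
| | | LR | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 0.965 | 24.467 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 31,564.898 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 54.024 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 163.827 |
| | PWM2Vec [22] | NB | 0.563 | 0.745 | 0.563 | 0.435 | 0.414 | 0.530 | 60.155 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 562.922 |
| | | MLP | 0.644 | 0.785 | 0.644 | 0.579 | 0.566 | 0.617 | 1471.086 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1675.896 |
| | | KNN | 0.644 | 0.785 | 0.644 | 0.579 | 0.566 | 0.617 | 2665.538 | 0.999 | 0.999 | 0.999 | 0.998 | 0.999 | 0.999 | 1514.240 |
| | | RF | 0.644 | 0.785 | 0.644 | 0.579 | 0.566 | 0.618 | 1514.979 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 284.450 |
| | | LR | 0.644 | 0.784 | 0.644 | 0.579 | 0.565 | 0.617 | 388.235 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 41,029.833 |
| | | DT | 0.644 | 0.785 | 0.644 | 0.579 | 0.566 | 0.617 | 78.525 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 233.533 |
| | Minimizer | NB | 0.679 | 0.682 | 0.679 | 0.673 | 0.669 | 0.670 | 57.469 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 474.482 |
| | | MLP | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 1864.844 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 3958.188 |
| | | KNN | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 2818.292 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1357.673 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1039.824 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 399.507 |
| | | LR | 0.719 | 0.719 | 0.719 | 0.719 | 0.718 | 0.718 | 186.522 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 7270.111 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 72.510 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 223.215 |
| With GANs | Spike2Vec [21] | NB | 0.538 | 0.681 | 0.538 | 0.380 | 0.355 | 0.502 | 138.179 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 197.033 |
| | | MLP | 0.992 | 0.992 | 0.992 | 0.992 | 0.992 | 0.992 | 1604.287 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 491.182 |
| | | KNN | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 3546.211 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 689.672 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 784.393 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 243.646 |
| | | LR | 0.957 | 0.957 | 0.957 | 0.957 | 0.957 | 0.957 | 6810.398 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 2643.646 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 365.332 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 396.362 |
| | PWM2Vec [22] | NB | 0.565 | 0.748 | 0.565 | 0.437 | 0.416 | 0.532 | 107.617 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 569.510 |
| | | MLP | 0.644 | 0.784 | 0.644 | 0.579 | 0.566 | 0.617 | 1817.859 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1337.920 |
| | | KNN | 0.646 | 0.785 | 0.646 | 0.581 | 0.568 | 0.619 | 2965.701 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1524.009 |
| | | RF | 0.646 | 0.786 | 0.646 | 0.582 | 0.569 | 0.619 | 1837.425 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1802.577 |
| | | LR | 0.632 | 0.793 | 0.632 | 0.589 | 0.597 | 0.657 | 10,273.672 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 3549.095 |
| | | DT | 0.646 | 0.786 | 0.646 | 0.581 | 0.568 | 0.619 | 1264.188 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 2580.831 |
| | Minimizer | NB | 0.611 | 0.726 | 0.611 | 0.534 | 0.520 | 0.584 | 127.058 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 669.513 |
| | | MLP | 0.976 | 0.976 | 0.976 | 0.976 | 0.976 | 0.976 | 825.868 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1231.650 |
| | | KNN | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 3163.325 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1484.555 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1557.065 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1699.503 |
| | | LR | 0.711 | 0.712 | 0.711 | 0.711 | 0.710 | 0.711 | 2179.485 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 3482.345 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 481.232 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 2700.860 |
| Only GANs For Training | Spike2Vec [21] | NB | 0.443 | 0.318 | 0.443 | 0.296 | 0.317 | 0.476 | 69.293 | 0.056 | 0.005 | 0.056 | 0.009 | 0.014 | 0.523 | 172.517 |
| | | MLP | 0.499 | 0.506 | 0.499 | 0.498 | 0.499 | 0.503 | 279.364 | 0.104 | 0.260 | 0.104 | 0.148 | 0.039 | 0.486 | 264.306 |
| | | KNN | 0.586 | 0.623 | 0.586 | 0.523 | 0.510 | 0.561 | 4088.144 | 0.126 | 0.242 | 0.126 | 0.156 | 0.123 | 0.533 | 263.101 |
| | | RF | 0.464 | 0.215 | 0.464 | 0.294 | 0.317 | 0.500 | 386.409 | 0.011 | 0.000 | 0.011 | 0.000 | 0.001 | 0.500 | 8451.755 |
| | | LR | 0.523 | 0.523 | 0.523 | 0.523 | 0.520 | 0.520 | 469.512 | 0.001 | 0.000 | 0.001 | 0.000 | 0.001 | 0.500 | 1481.505 |
| | | DT | 0.535 | 0.286 | 0.535 | 0.373 | 0.348 | 0.500 | 308.698 | 0.042 | 0.001 | 0.042 | 0.003 | 0.004 | 0.499 | 2764.815 |
| | PWM2Vec [22] | NB | 0.468 | 0.508 | 0.468 | 0.331 | 0.351 | 0.500 | 60.008 | 0.034 | 0.004 | 0.034 | 0.003 | 0.004 | 0.499 | 370.330 |
| | | MLP | 0.471 | 0.503 | 0.471 | 0.369 | 0.385 | 0.500 | 333.503 | 0.400 | 0.335 | 0.400 | 0.355 | 0.080 | 0.534 | 577.936 |
| | | KNN | 0.520 | 0.575 | 0.520 | 0.470 | 0.480 | 0.542 | 4565.427 | 0.061 | 0.213 | 0.061 | 0.089 | 0.059 | 0.496 | 2475.871 |
| | | RF | 0.535 | 0.286 | 0.535 | 0.372 | 0.348 | 0.500 | 746.999 | 0.034 | 0.001 | 0.034 | 0.002 | 0.003 | 0.500 | 10,880.182 |
| | | LR | 0.534 | 0.603 | 0.534 | 0.482 | 0.492 | 0.557 | 975.877 | 0.001 | 0.012 | 0.001 | 0.009 | 0.009 | 0.490 | 278.851 |
| | | DT | 0.535 | 0.286 | 0.535 | 0.372 | 0.348 | 0.500 | 500.541 | 0.022 | 0.001 | 0.022 | 0.032 | 0.013 | 0.500 | 3078.085 |
| | Minimizer | NB | 0.523 | 0.529 | 0.523 | 0.523 | 0.523 | 0.526 | 65.955 | 0.062 | 0.194 | 0.062 | 0.048 | 0.055 | 0.525 | 497.483 |
| | | MLP | 0.477 | 0.495 | 0.477 | 0.447 | 0.455 | 0.494 | 499.569 | 0.005 | 0.003 | 0.005 | 0.003 | 0.008 | 0.475 | 707.236 |
| | | KNN | 0.539 | 0.538 | 0.539 | 0.538 | 0.535 | 0.536 | 5211.216 | 0.177 | 0.155 | 0.177 | 0.148 | 0.058 | 0.522 | 3116.525 |
| | | RF | 0.535 | 0.287 | 0.535 | 0.373 | 0.348 | 0.499 | 624.564 | 0.034 | 0.001 | 0.034 | 0.002 | 0.003 | 0.500 | 10,349.430 |
| | | LR | 0.548 | 0.548 | 0.548 | 0.548 | 0.546 | 0.546 | 771.273 | 0.201 | 0.120 | 0.201 | 0.228 | 0.102 | 0.501 | 3234.386 |
| | | DT | 0.464 | 0.215 | 0.464 | 0.294 | 0.317 | 0.500 | 576.693 | 0.003 | 0.002 | 0.003 | 0.002 | 0.003 | 0.500 | 346.660 |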
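Table 6. Antigen species-wise classification results for the VDjDB dataset and host-wise classification results for the coronavirus Host dataset. Results are averaged over five runs. The best value for each metric is shown in bold.

| Method | Embedding | Algo. | VDjDB | | | | | | | Host | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 (Weig.) ↑ | F1 (Macro) ↑ | ROC AUC ↑ | Train Time (Sec.) ↓ | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 (Weig.) ↑ | F1 (Macro) ↑ | ROC AUC ↑ | Train Time (Sec.) ↓ |
| Without GANs | Spike2Vec [21] | NB | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 87.948 | 0.664 | 0.752 | 0.664 | 0.647 | 0.578 | 0.787 | 9.735 |
| | | MLP | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 689.357 | 0.824 | 0.832 | 0.824 | 0.822 | 0.702 | 0.839 | 117.344 |
| | | KNN | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 0.999 | 167.426 | 0.776 | 0.820 | 0.776 | 0.791 | 0.642 | 0.824 | 2.421 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 152.581 | 0.849 | 0.851 | 0.849 | 0.851 | 0.701 | 0.851 | 22.611 |
| | | LR | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 882.695 | 0.843 | 0.850 | 0.843 | 0.840 | 0.683 | 0.880 | 2169.697 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 43.314 | 0.828 | 0.830 | 0.828 | 0.833 | 0.651 | 0.844 | 6.415 |
| | PWM2Vec [22] | NB | 0.179 | 0.926 | 0.179 | 0.250 | 0.305 | 0.634 | 84.292 | 0.436 | 0.625 | 0.436 | 0.396 | 0.433 | 0.709 | 2.782 |
| | | MLP | 0.525 | 0.685 | 0.525 | 0.399 | 0.315 | 0.626 | 1216.913 | 0.797 | 0.808 | 0.797 | 0.787 | 0.606 | 0.815 | 69.551 |
| | | KNN | 0.525 | 0.689 | 0.525 | 0.399 | 0.320 | 0.626 | 248.660 | 0.795 | 0.792 | 0.795 | 0.789 | 0.647 | 0.816 | 0.917 |
| | | RF | 0.525 | 0.690 | 0.525 | 0.400 | 0.320 | 0.626 | 736.583 | 0.833 | 0.834 | 0.833 | 0.827 | 0.691 | 0.853 | 10.166 |
| | | LR | 0.525 | 0.681 | 0.525 | 0.400 | 0.320 | 0.626 | 299.575 | 0.802 | 0.817 | 0.802 | 0.794 | 0.671 | 0.843 | 693.437 |
| | | DT | 0.525 | 0.690 | 0.525 | 0.400 | 0.320 | 0.626 | 39.697 | 0.803 | 0.804 | 0.803 | 0.800 | 0.625 | 0.826 | 7.063 |
| | Minimizer | NB | 0.930 | 0.972 | 0.930 | 0.940 | 0.838 | 0.935 | 98.159 | 0.480 | 0.690 | 0.480 | 0.452 | 0.557 | 0.753 | 18.977 |
| | | MLP | 0.934 | 0.952 | 0.934 | 0.928 | 0.782 | 0.882 | 1253.018 | 0.782 | 0.797 | 0.782 | 0.774 | 0.677 | 0.831 | 279.057 |
| | | KNN | 0.951 | 0.961 | 0.951 | 0.947 | 0.849 | 0.925 | 172.851 | 0.763 | 0.786 | 0.763 | 0.765 | 0.688 | 0.832 | 4.831 |
| | | RF | 0.953 | 0.962 | 0.953 | 0.948 | 0.847 | 0.927 | 468.139 | 0.835 | 0.840 | 0.835 | 0.827 | 0.705 | 0.843 | 60.184 |
| | | LR | 0.952 | 0.961 | 0.952 | 0.948 | 0.847 | 0.926 | 203.061 | 0.818 | 0.827 | 0.818 | 0.811 | 0.693 | 0.839 | 978.112 |
| | | DT | 0.952 | 0.962 | 0.952 | 0.948 | 0.847 | 0.926 | 25.392 | 0.818 | 0.824 | 0.818 | 0.813 | 0.683 | 0.841 | 4.959 |
| With GANs | Spike2Vec [21] | NB | 0.999 | 0.999 | 0.999 | 0.999 | 0.988 | 0.999 | 78.891 | 0.684 | 0.759 | 0.684 | 0.664 | 0.656 | 0.843 | 12.607 |
| | | MLP | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 1085.850 | 0.747 | 0.772 | 0.747 | 0.741 | 0.482 | 0.790 | 187.038 |
| | | KNN | 0.998 | 0.998 | 0.998 | 0.998 | 0.992 | 0.998 | 135.567 | 0.796 | 0.797 | 0.796 | 0.791 | 0.638 | 0.826 | 3.193 |
| | | RF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 186.662 | 0.849 | 0.851 | 0.849 | 0.842 | 0.707 | 0.852 | 30.672 |
| | | LR | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 5736.169 | 0.826 | 0.840 | 0.826 | 0.821 | 0.726 | 0.873 | 4897.412 |
| | | DT | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 143.618 | 0.814 | 0.819 | 0.814 | 0.811 | 0.678 | 0.865 | 32.644 |
| | PWM2Vec [22] | NB | 0.151 | 0.926 | 0.151 | 0.247 | 0.234 | 0.603 | 109.493 | 0.528 | 0.582 | 0.528 | 0.441 | 0.345 | 0.709 | 3.033 |
| | | MLP | 0.529 | 0.685 | 0.529 | 0.403 | 0.231 | 0.595 | 358.965 | 0.633 | 0.655 | 0.633 | 0.559 | 0.411 | 0.695 | 81.928 |
| | | KNN | 0.531 | 0.689 | 0.531 | 0.406 | 0.317 | 0.625 | 126.428 | 0.802 | 0.796 | 0.802 | 0.799 | 0.623 | 0.805 | 0.954 |
| | | RF | 0.532 | 0.691 | 0.532 | 0.408 | 0.319 | 0.625 | 1052.845 | 0.842 | 0.841 | 0.842 | 0.836 | 0.735 | 0.854 | 10.607 |
| | | LR | 0.528 | 0.690 | 0.528 | 0.403 | 0.321 | 0.626 | 5643.762 | 0.679 | 0.721 | 0.679 | 0.670 | 0.484 | 0.731 | 704.796 |
| | | DT | 0.528 | 0.691 | 0.528 | 0.403 | 0.321 | 0.626 | 142.579 | 0.819 | 0.822 | 0.819 | 0.816 | 0.714 | 0.839 | 7.636 |
| | Minimizer | NB | 0.916 | 0.989 | 0.916 | 0.943 | 0.801 | 0.916 | 90.476 | 0.490 | 0.728 | 0.490 | 0.430 | 0.505 | 0.734 | 17.223 |
| | | MLP | 0.952 | 0.961 | 0.952 | 0.948 | 0.851 | 0.927 | 440.944 | 0.712 | 0.752 | 0.712 | 0.702 | 0.441 | 0.709 | 160.668 |
| | | KNN | 0.951 | 0.960 | 0.951 | 0.947 | 0.844 | 0.926 | 149.858 | 0.794 | 0.798 | 0.794 | 0.784 | 0.576 | 0.773 | 3.979 |
| | | RF | 0.953 | 0.961 | 0.953 | 0.949 | 0.850 | 0.927 | 527.874 | 0.822 | 0.831 | 0.822 | 0.812 | 0.710 | 0.843 | 54.738 |
| | | LR | 0.952 | 0.961 | 0.952 | 0.948 | 0.849 | 0.927 | 4918.374 | 0.799 | 0.828 | 0.799 | 0.786 | 0.721 | 0.848 | 5240.159 |
| | | DT | 0.952 | 0.961 | 0.952 | 0.948 | 0.850 | 0.927 | 111.393 | 0.794 | 0.806 | 0.794 | 0.787 | 0.670 | 0.826 | 43.638 |
| Only GANs For Training | Spike2Vec [21] | NB | 0.002 | 0.000 | 0.002 | 0.000 | 0.001 | 0.491 | 98.736 | 0.167 | 0.035 | 0.167 | 0.058 | 0.016 | 0.498 | 12.501 |
| | | MLP | 0.022 | 0.032 | 0.022 | 0.0192 | 0.016 | 0.479 | 222.003 | 0.107 | 0.119 | 0.107 | 0.106 | 0.023 | 0.511 | 50.035 |
| | | KNN | 0.106 | 0.139 | 0.106 | 0.076 | 0.123 | 0.558 | 368.164 | 0.202 | 0.082 | 0.202 | 0.092 | 0.035 | 0.512 | 4.793 |
| | | RF | 0.010 | 0.000 | 0.010 | 0.000 | 0.001 | 0.500 | 665.565 | 0.184 | 0.033 | 0.184 | 0.057 | 0.016 | 0.500 | 26.686 |
| | | LR | 0.200 | 0.136 | 0.200 | 0.091 | 0.020 | 0.500 | 3497.008 | 0.053 | 0.158 | 0.053 | 0.048 | 0.0229 | 0.487 | 3809.357 |
| | | DT | 0.190 | 0.036 | 0.190 | 0.061 | 0.018 | 0.499 | 467.308 | 0.010 | 0.000 | 0.010 | 0.000 | 0.001 | 0.499 | 36.366 |
| | PWM2Vec [22] | NB | 0.026 | 0.068 | 0.026 | 0.003 | 0.003 | 0.499 | 93.458 | 0.097 | 0.009 | 0.097 | 0.017 | 0.010 | 0.500 | 2.699 |
| | | MLP | 0.392 | 0.389 | 0.392 | 0.295 | 0.056 | 0.499 | 250.162 | 0.021 | 0.190 | 0.021 | 0.026 | 0.012 | 0.364 | 39.795 |
| | | KNN | 0.140 | 0.205 | 0.140 | 0.040 | 0.016 | 0.500 | 343.585 | 0.092 | 0.009 | 0.092 | 0.016 | 0.010 | 0.498 | 1.421 |
| | | RF | 0.477 | 0.227 | 0.477 | 0.308 | 0.038 | 0.500 | 644.587 | 0.318 | 0.101 | 0.318 | 0.153 | 0.028 | 0.500 | 14.304 |
| | | LR | 0.012 | 2.070 | 0.012 | 4.020 | 0.001 | 0.500 | 4498.689 | 0.033 | 0.112 | 0.033 | 0.045 | 0.057 | 0.490 | 375.859 |
| | | DT | 0.002 | 4.170 | 0.002 | 8.324 | 0.000 | 0.500 | 498.689 | 0.318 | 0.101 | 0.318 | 0.153 | 0.028 | 0.500 | 8.850 |
| | Minimizer | NB | 0.023 | 0.215 | 0.023 | 0.035 | 0.033 | 0.510 | 115.915 | 0.072 | 0.005 | 0.072 | 0.009 | 0.008 | 0.500 | 11.483 |
| | | MLP | 0.420 | 0.597 | 0.420 | 0.448 | 0.081 | 0.514 | 274.471 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 63.847 |
| | | KNN | 0.551 | 0.690 | 0.551 | 0.600 | 0.152 | 0.599 | 382.306 | 0.176 | 0.033 | 0.176 | 0.056 | 0.019 | 0.500 | 3.665 |
| | | RF | 0.010 | 0.000 | 0.010 | 0.000 | 0.001 | 0.500 | 792.106 | 0.179 | 0.032 | 0.179 | 0.054 | 0.019 | 0.500 | 23.71 |
| | | LR | 0.514 | 0.235 | 0.514 | 0.385 | 0.047 | 0.500 | 3465.703 | 0.221 | 0.225 | 0.221 | 0.138 | 0.065 | 0.521 | 3603.909 |
| | | DT | 0.474 | 0.225 | 0.474 | 0.305 | 0.037 | 0.500 | 445.797 | 0.044 | 0.015 | 0.044 | 0.015 | 0.006 | 0.477 | 39.195 |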
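Tables 5 and 6 report both a weighted and a macro F1 score, and on the imbalanced Host dataset the two diverge noticeably (for example, 0.851 weighted versus 0.701 macro for RF on Spike2Vec without GANs). The following sketch shows, on a toy example, why that happens. It uses the standard definitions of the two averages; the function name `f1_macro_and_weighted` and the toy labels are made up for illustration and are not taken from the article's experiments.

```python
def f1_macro_and_weighted(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1 = {}
    support = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); defined as 0 when undefined.
        f1[label] = 2 * tp / denom if denom else 0.0
        support[label] = tp + fn  # number of true samples of this class
    # Macro: every class counts equally, regardless of its size.
    macro = sum(f1.values()) / len(labels)
    # Weighted: each class counts in proportion to its number of true samples.
    weighted = sum(f1[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Toy imbalanced case: the classifier always predicts the majority class.
y_true = ["Human"] * 9 + ["Bat"]
y_pred = ["Human"] * 10
macro, weighted = f1_macro_and_weighted(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))  # 0.474 0.853
```

Because the minority class is missed entirely, the macro score drops to roughly half, while the weighted score stays high. This is why the macro F1 column is the more sensitive indicator of minority-class performance in the tables above.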
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Murad, T.; Ali, S.; Patterson, M. Exploring the Potential of GANs in Biological Sequence Analysis. Biology 2023, 12, 854. https://doi.org/10.3390/biology12060854
