In this section, the applications of DL for different types of omics data are reviewed. The indicative literature was selected to cover the state-of-the-art methodologies in the most popular applications of recent years.
3.1. Genomics
The raw material of genomics research is the DNA sequences which, from a computational point of view, are strings that comprise four characters: A, C, G, and T. To serve as the input data for the predictive modeling, the DNA sequences are usually represented in either of the two following ways:
One-hot encoding: a 2D matrix with four rows (one for each of the four characters) and a number of columns equal to the DNA string length. In each column, the row of the character found at that position in the DNA sequence takes the value one, while the other three rows take the value 0, as shown in
Figure 3.
k-mers: A vector is generated, representing the possible permutations of the four nucleotides for a user-defined k (e.g., for k = 3, the permutations are AAA, AAC, AAG, …, TTC, TTG, TTT). Each element of the vector takes the value one if the corresponding permutation is present in the string; otherwise, it takes the value 0, as shown in
Figure 3.
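The two representations described above can be sketched in a few lines. This is a minimal illustration; the row order A, C, G, T and the lexicographic ordering of the k-mers are conventions, not requirements imposed by the reviewed works.

```python
import numpy as np
from itertools import product

NUCLEOTIDES = "ACGT"

def one_hot_encode(seq):
    """One-hot encode a DNA string into a 4 x len(seq) matrix.

    Each column has a single 1 in the row of the nucleotide found at
    that position; the other three rows stay 0.
    """
    mat = np.zeros((4, len(seq)), dtype=int)
    for col, base in enumerate(seq):
        mat[NUCLEOTIDES.index(base), col] = 1
    return mat

def kmer_presence_vector(seq, k=3):
    """Binary vector over all 4**k possible k-mers: element i is 1 if
    the i-th k-mer (in lexicographic order) occurs anywhere in seq."""
    all_kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [1 if kmer in present else 0 for kmer in all_kmers]
```

For example, `one_hot_encode("ACGT")` yields a 4 × 4 matrix with one 1 per column, and `kmer_presence_vector("TACGG", k=3)` yields a length-64 vector with three ones (for TAC, ACG, and CGG).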
In [
27], the DNA sequences of the synthetic plasmids were used to predict the lab of origin of the synthetic DNA. After the one-hot encoding, the sequences were used to train a CNN, which correctly identified the source lab 48% of the time, and for 70% of the time, the true origin appeared in the top-ten predicted labs. DL-based genomics techniques, which use DNA sequences as training data, have lately been proposed for the primer construction for PCR tests to detect COVID-19 [
28] from human samples. The problem was formulated as being both of the multiclass (what type of virus) and binary class type (COVID-19 or not). A CNN achieved 98.73% accuracy for the binary problem. One-hot encoded DNA strings were also used for a DL-based approach to automate part of the sequencing pipeline itself [
29]. The DNA sequencing methods produce clusters of short, overlapping DNA substrings called
contigs, which are then aligned to a reference genome to generate the consensus sequence. However, this is a computationally expensive process that combines dynamic programming and demanding algorithms to navigate the intractably vast search spaces. Researchers have automated this process via DL using contigs as the input data and the consensus sequences as the labels. After training a Bidirectional Long Short-Term Memory (BiLSTM) model that scanned the contigs with a window of size three, the research team in [
29] predicted the consensus sequences with up to 99.81% identity with the ground truth, thereby surpassing all of the state-of-the-art models, including a tool that was released by Oxford Nanopore, a leading company in sequencing technology.
Another example of deploying DL to solve a common bioinformatics challenge was reported in [
30], where the genomic sequences were classified as to whether the sequencing machine was reading in the forward or the reverse direction. The conventional bioinformatics techniques used to infer this are quite demanding, and they require a reference genome, which is problematic when one is sequencing a new organism for which no reference genome is available. The input data were long DNA reads, represented as k-mers with k ranging from one to five. A DNN and a CNN were trained, with the CNN performing slightly better. For validation, the researchers clustered similar sequences together and performed a type of majority voting, in which the majority orientation prediction of a cluster was applied to all of the sequences in it; thus, the models predicted correctly up to 96.2% of the human reads and up to 98% of the S. cerevisiae (yeast) reads.
In [
31], a CNN was trained on one-hot encoded DNA strings to predict the gene regulatory regions, i.e., the regions within a DNA sequence with activated genes. The annotations were a time-series signifying which parts of the sequences were inactive and which were active above a certain threshold, represented by vectors of 0s and 1s, respectively. The researchers observed that the window and stride sizes of the convolutional layers had a major effect on the model’s effectiveness.
Desai et al. [
32] used the DNA strings themselves without one-hot encoding them, and passed them through an embedding layer to extract the representations. The task was to identify bacteria from environmental samples. The classification was hierarchical, with bacteria being classified on three taxonomic levels: family, genus, and species. For comparison, the researchers trained a Recurrent Neural Network (RNN), an LSTM, a BiLSTM, a CNN, and a Combinatorial CNN. The LSTM outperformed the others in the family-level classification with 91.24% accuracy, and the BiLSTM performed best for the genus and species levels, with 85.63% and 70.78% accuracy, respectively.
Tahir et al. [
33] combined the one-hot encoded DNA strings with codon composition tables to predict whether the sequences exhibited a certain property, namely, whether they contained N6-methyladenine sites or not. Codons are k-mers with k = 3, and the number of possible permutations of the four nucleotides for k = 3 is 64. The codon composition vector of a DNA string is a vector of length 64, where each position takes a value of one if the corresponding codon is present in the DNA string; otherwise, it takes a value of 0. The architecture consisted of a CNN employing the one-hot DNA sequence. Its output vector was then concatenated with the codon composition vector and fed into a dense layer that performed the classification. This approach achieved an accuracy of up to 98.05%, surpassing other reported methods by at least 2%.
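The codon composition vector and its concatenation with the convolutional features can be sketched as follows; the 128-dimensional CNN output is a hypothetical placeholder for the vector that the convolutional branch would produce from the one-hot input.

```python
import numpy as np
from itertools import product

def codon_composition(seq):
    """Length-64 binary vector: element i is 1 if the i-th codon
    (AAA, AAC, ..., TTT in lexicographic order) occurs anywhere in
    the sequence, scanned with a sliding window of size three."""
    codons = ["".join(p) for p in product("ACGT", repeat=3)]
    present = {seq[i:i + 3] for i in range(len(seq) - 2)}
    return np.array([1 if c in present else 0 for c in codons])

# Hypothetical 128-dimensional CNN output for the same sequence; in
# the cited work this vector comes from the convolutional branch over
# the one-hot encoded input.
cnn_features = np.zeros(128)
dense_input = np.concatenate([cnn_features, codon_composition("ATGGCA")])
```

The concatenated `dense_input` is what the final dense layer would classify.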
In a similar way, Phuycharoen et al. [
34] combined the one-hot representation with the codon composition tables to train the CNNs to predict which sequences contained the TF (Transcription Factor) binding sites in a 3-class problem (increased binding, decreased binding, and non-differential binding). The DNA sequences were also combined with a dinucleotide composition matrix in a two-tier classification system that firstly predicted whether the DNA contained promoter regions and then, in the case that it did not, it predicted whether the regions were strong or weak promoters [
35]. Dinucleotides are k-mers with k = 2, and a dinucleotide composition matrix is a 2D matrix with rows corresponding to the samples, columns corresponding to all of the possible 2-nucleotide permutations (i.e., AA, AC, …, TG, TT), and entries that are the normalized frequency of each dinucleotide in each sample. First, a CNN processed the one-hot encoded DNA sequences, its output was concatenated with the dinucleotide frequency matrix, and a dense layer classified the resulting vectors as to whether they contained a promoter or not; for those that did, the same pipeline was repeated, with a second CNN taking the one-hot DNA sequences, its output concatenated with the dinucleotide frequency matrix, and a dense layer classifying whether the promoter was weak or strong. The proposed approach surpassed the accuracy of the previous benchmark methods by 2–10%.
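A minimal sketch of the normalized dinucleotide frequencies: overlapping 2-mers are counted with a sliding window and divided by the number of windows, so each row of the composition matrix sums to one.

```python
from itertools import product

def dinucleotide_frequencies(seq):
    """Normalized frequency of each of the 16 dinucleotides in a DNA
    string, counted over overlapping sliding windows of size two."""
    dinucs = ["".join(p) for p in product("ACGT", repeat=2)]
    n_windows = len(seq) - 1
    counts = {d: 0 for d in dinucs}
    for i in range(n_windows):
        counts[seq[i:i + 2]] += 1
    return [counts[d] / n_windows for d in dinucs]

# Stacking one row per sample yields the dinucleotide composition matrix.
matrix = [dinucleotide_frequencies(s) for s in ("ATATAT", "GGGGGG")]
```

For "ATATAT" the five windows are AT, TA, AT, TA, AT, giving frequencies of 0.6 for AT and 0.4 for TA.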
K-mers have been used not only in composition matrices, but also in their preliminary form, as vectors of strings of length k produced by sliding a window over the original sequence (e.g., with k = 3, TACGG becomes {TAC, ACG, CGG}). These lists of strings are treated as a corpus of texts, and natural language processing techniques are applied to learn the word embeddings. In [
36], the k-mers were transformed with the GloVe embedding method [
37]. A hybrid CNN–BiLSTM was trained to predict which DNA sequences contained the chromatin-accessible sites, thus signifying that these sequences play functional roles in a biological system. The hybrid model performed better than its component models did individually, and its accuracy surpassed that of the other previously used methods by 1–7%. Guo et al. [
38] compared the GloVe and word2vec [
39] embeddings of the k-mers. The task was the same as before: evaluating whether the sequences displayed chromatin accessibility or not. After training hybrid CNN-GRU models with an additional attention layer, the GloVe embedding method proved better than word2vec, and the overall performance was comparable to that of the other state-of-the-art methods. In [
40], word-embedding and FastText-transformed k-mers were used for the binary task of classifying DNA sequences as to whether they contained, or belonged to, essential or non-essential genes. The model was an ensemble of shallow ML and DL models, comprising a k-nearest neighbors (k-NN) classifier, a random forest (RF), a support vector machine (SVM), a DNN, and a CNN. The ensemble achieved 76.3% accuracy, 84.6% specificity, and 60.2% sensitivity, and it was comparable with the other state-of-the-art methods, surpassing most of them.
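The sliding-window tokenization underlying these embedding approaches can be sketched in one function; the resulting token lists form the corpus on which methods such as GloVe, word2vec, or FastText are trained.

```python
def kmer_tokens(seq, k=3):
    """Slide a window of size k and stride one over a DNA string,
    yielding the overlapping k-mer 'words', e.g.
    TACGG -> ['TAC', 'ACG', 'CGG']."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```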
In [
41], an automatic framework for designing CNNs in genomics, namely AMBER, was presented, based on a novel Neural Architecture Search (NAS). The pathology type, which served as the label, was one-hot encoded. AMBER was applied to modelling genomic regulatory features, resulting in better predictions of the disease-relevant variants compared to the basic non-NAS models. Li et al. [
42] proposed a DL genomics approach and applied it to a multitasking classification of Alzheimer’s disease progression, identifying novel genetic biomarkers that had gone unnoticed by the traditional genome-wide association studies (GWAS). The proposed DL genomics model achieved classification accuracies of up to 99.44%. Chalupová et al. [
43] developed an easy neural network tool for genomics (ENNGene) to bridge the gap between the need for DL models in genomics and the limited ability of researchers in the field to develop them. ENNGene can handle multiple input branches, and it can be fully customized by the user. The results obtained with ENNGene were similar to those of the state-of-the-art methods.
3.2. Transcriptomics
Gene expression profiles quantify the activity of each gene within a biological sample. The more active a gene is, the more it is transcribed into RNA, and RNA sequencing yields more reads mapping to the active genes than to the inactive ones. The formulas used to normalize these read counts do not consider simply the numbers of reads, but also take the read and gene lengths into account. More details on the full transcriptomics pipeline can be found in [
44,
45,
46]. The final result of the process, used as the data in predictive modeling, is a numerical vector with values between 0 and 1, denoting how much each gene is expressed within a sample (
Figure 4).
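As an illustration of such a length-aware normalization, the standard TPM (Transcripts Per Million) formula is sketched below. This is one common choice, not necessarily the exact formula used by the cited pipelines, and a further scaling step (e.g., min–max) can map the normalized values into the 0–1 interval.

```python
def tpm(read_counts, gene_lengths_kb):
    """Transcripts Per Million: divide each gene's read count by its
    length in kilobases (reads per kilobase), then scale so that the
    values for the sample sum to one million."""
    rpk = [c / l for c, l in zip(read_counts, gene_lengths_kb)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]
```

A gene with twice the reads but twice the length of another thus receives the same TPM value.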
Transcriptomics data have occasionally been used for regression tasks, such as inferring a patient’s age from the gene expression profiles of the protein-coding genes [
47], though most studies tackle classification problems. Being exceedingly high-dimensional, with thousands of features, gene expression data become more useful and tractable after dimensionality reduction. The most common methodologies apply feature engineering and feature selection. Feature selection on transcriptomics data, referred to as differential gene expression analysis (identifying genes that are expressed differentially depending on the class), employs a number of domain-specific techniques [
48,
49,
50].
Feature extraction and dimensionality reduction techniques have also been applied. In [
51], the researchers set out to determine which method was optimal for the gene expression data, concluding that DL-based extraction with the proposed DeepAE was best compared to four benchmark models: singular value decomposition, k-sparsity singular value decomposition, sparse non-negative matrix factorization, and the previous state of the art, a domain-specific method called CS-SMAF [
52]. The evaluation was performed by comparing the original with the reconstructed transcriptomics data using the Pearson correlation coefficient, the Euclidean distance, and the mean absolute error as the metrics for the comparison.
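The three comparison metrics named above are straightforward to compute; a minimal sketch for a pair of original and reconstructed expression vectors:

```python
import numpy as np

def reconstruction_metrics(original, reconstructed):
    """Pearson correlation coefficient, Euclidean distance, and mean
    absolute error between an original and a reconstructed
    expression vector."""
    x = np.asarray(original, dtype=float)
    y = np.asarray(reconstructed, dtype=float)
    pearson = np.corrcoef(x, y)[0, 1]
    euclidean = float(np.linalg.norm(x - y))
    mae = float(np.abs(x - y).mean())
    return pearson, euclidean, mae
```

A perfect reconstruction yields a Pearson correlation of one and zero for both distance metrics.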
A substantial portion of the DL-based transcriptomics research has been conducted in the field of oncology; in [
53], the gene expression profiles were utilized in a three-fold binary-class task: firstly, to identify the high-risk patients; secondly, to predict whether the patient would survive or not; and thirdly, to predict their event-free survival, i.e., whether the patient would survive without experiencing repercussions and side-effects from the treatment. The DL architecture comprised two integrated models. A DNN took the original dataset, while an Autoencoder (AE) took the dataset after it had been feature-selected for the High-Risk class. The AE-extracted features were concatenated at some point and integrated into the DNN, which generated the multi-output binary-class predictions. In the comparative experiments, this architecture was shown to consistently outperform the RF and linear-SVM.
In [
54], a simple binary prediction task was combined with unsupervised learning for the purpose of knowledge discovery. First, a DNN took cellular gene expression data and classified whether each cell was cancerous or not. The cancerous cells were singled out and grouped through k-means clustering, with the goal of discovering novel cancer subtypes that were not present in the input data labels. Compared with other clustering methods, k-means was shown to yield more biologically meaningful results. Lee et al. [
55] classified early- and late-stage cancer from the gene expression profiles, and they observed that with this type of data, an increase in the number of input samples raised the probability of bias due to outliers. The solution that they proposed was to use statistics to determine whether the differences in the gene expression values among different types of cancer were statistically significant. Since the t-test can be sensitive to large outliers, they used the Wilcoxon rank-sum test. After this statistics-based feature selection, they trained a DNN, which yielded a 94.2% accuracy.
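Such a rank-sum-based gene filter can be sketched as follows; the significance threshold of 0.05 is illustrative, and the gene-wise loop is a simplification of what the cited study may have implemented.

```python
import numpy as np
from scipy.stats import ranksums

def select_genes(expr_a, expr_b, alpha=0.05):
    """Keep the indices of genes whose expression differs
    significantly between two classes under the Wilcoxon rank-sum
    test, which is less sensitive to large outliers than the t-test.

    expr_a, expr_b: arrays of shape [samples, genes], one per class."""
    selected = []
    for g in range(expr_a.shape[1]):
        _, p_value = ranksums(expr_a[:, g], expr_b[:, g])
        if p_value < alpha:
            selected.append(g)
    return selected
```

A gene whose values are fully separated between the two classes is kept, while a gene with identical distributions in both classes is dropped.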
In [
56], a nested classification task was tackled: first, the tumor type was classified, and then the molecular tumor subtype; both problems were multi-class. The researchers normalized the gene expression data of each patient, applied a log2 transform, and performed a feature selection by comparing the median expression of each gene for the in-class samples with the out-of-class samples, the median being more robust to outliers than the mean. By applying ResNet, 1D-CNN, and 1D-Inception, they reached a maximum accuracy of 98.54% for the primary tumor type, and a maximum accuracy of 83.5% for the molecular tumor subtype.
In [
57], the gene expression profiles were combined with the splice junction data for the unsupervised discovery of novel cancer subtypes. The splice junctions are DNA subsequences (exons) that are left out during the transcription from DNA into RNA, thus changing the function of the gene. In this study, a frequentist estimate of the inclusion level of each junction in each sample was calculated, resulting in a matrix of shape [649 patient samples] × [34,425 skipping exons]. Conventional clustering is problematic in high-dimensional data; hence, in the proposed work, an AE-based pipeline reduced the dimensionality, and the learned latent-space representations were then clustered. First, two stacked AEs (SAE) took the input data: one SAE handled the gene expression matrix and the other the splice junction matrix. Then, their outputs were concatenated and fed into an AE that extracted the final features, which were then clustered through k-means. Compared to PCA-based clustering, the AE-based clustering yielded better results and identified more clinically meaningful cancer subtypes.
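The frequentist inclusion estimate mentioned above is commonly computed as the fraction of reads supporting inclusion of the exon out of all reads covering the event (often called "percent spliced-in"); the study's exact estimator may differ, so this is a sketch of the standard form.

```python
def inclusion_level(inclusion_reads, exclusion_reads):
    """Frequentist estimate of a splice junction's inclusion level:
    the fraction of reads supporting inclusion of the skipping exon
    out of all reads covering the event. Stacking one value per
    junction per patient yields a [samples] x [skipping exons]
    matrix."""
    total = inclusion_reads + exclusion_reads
    return inclusion_reads / total if total else 0.0
```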
Another field profiting from DL-based transcriptomics is drug research. In [
58], the gene expression profiles of tissues that were treated with drugs were used. Each sample was labeled by the properties of the chemical agent contained in the applied drug, e.g., antineoplastic, cardiovascular, central nervous system agents, etc. As a biologically relevant type of feature engineering, the researchers used OncoFinder [
59] to transform the gene expression data into a matrix representation of signaling pathway graphs. The matrix represented the regulatory interactions among the genes, with rows and columns signifying the genes, and the values indicating up-regulation, down-regulation, or no effect. A DNN was trained with these data, using the effect that a chemical agent has on regulatory gene interactions to infer the drug’s properties. The DNN proved better than an SVM. Notably, the model may have led to a novel discovery: a certain drug was misclassified because its biological effects contradicted its human-annotated label, and after reviewing the relevant literature, the researchers proposed the chemical agent as a candidate for drug repurposing.
In [
60], the gene expression profiles of both chemical agent perturbations and gene knockdown perturbations were used for predicting the protein–drug interactions from protein-coding genes. Chemical agent perturbations took place by applying chemicals to the tissues, with the researchers monitoring how the genome responded and which genes were up/down-regulated. The gene knockdown perturbations consisted of removing a gene through some gene editing technique and monitoring how the other genes responded. The data were in the form of real-valued matrices. For the chemical perturbation matrix, the shape was [number of genes] × [number of drugs], and for the gene knockdown matrix, the shape was [number of genes] × [number of landmark genes]. The labels were binary, representing whether a drug affected a gene or not. Thus, by using the two types of input data, it was possible to detect both the direct interactions, where a chemical agent directly affected a certain gene, and the indirect interactions, where an agent affected other genes which, in turn, affected the gene of interest. A DNN was used with an input layer that had two channels for the two datasets. The channels were then concatenated, producing the final binary classification with an accuracy of 90.53% and an F1-score of 86.38%. The model proved superior to Logistic Regression, Random Forest, and Gradient Boosted Tree.
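A toy forward pass illustrates the two-channel idea: the per-channel feature vectors are concatenated before the shared layers. The dimensions, random weights, and single dense layer are all illustrative placeholders, not the cited architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-gene profiles from the two input channels:
chemical_profile = rng.random(64)   # chemical-perturbation channel
knockdown_profile = rng.random(64)  # gene-knockdown channel

# Concatenate the two channels, then apply a single dense layer with
# a sigmoid output standing in for the shared classification head.
weights = rng.standard_normal(2 * 64) * 0.01
bias = 0.0
logit = np.concatenate([chemical_profile, knockdown_profile]) @ weights + bias
probability = 1.0 / (1.0 + np.exp(-logit))  # P(drug affects gene)
```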
In [
61], the transcriptomics data were combined with the gene signaling pathways and chemical structures of the drugs for the research of drug repurposing, i.e., finding drugs that can be used for diseases other than those which they were originally developed for. The idea was to train the models with the effect that the drugs had on the gene expression to identify the drugs that were misclassified by the human annotators, and then compare the chemical structure similarity and the drug effect on the gene signaling pathways. By finding similar drugs and evaluating how these approved drugs had been classified, hypotheses for experiments and discoveries could be generated. In that particular study, the gene signaling pathways were in a graph form, represented as a matrix of genes with topological weights. Through the in silico Pathway Activation Network Decomposition Analysis (iPANDA) algorithm [
62], they were transformed into an activation score matrix of shape [number of samples] × [number of signaling pathways]. The gene expression profiles were simplified by clustering them into groups of similar genes. The chemical structure data did not participate in the model training; they were only considered when an interesting finding occurred or when a drug looked promising, with the chemical structure comparison identifying the similar drugs. A DNN was trained for a 6-class prediction, classifying the drugs based on their therapeutic effect. The experimental tests showed that combining gene expression and signaling pathway data yielded better results than either of the individual data types alone, and the researchers reported a new discovery, a strong candidate drug for repurposing, which is awaiting experimental validation.
In [
63], the unsupervised inference of the regulatory networks was conducted on transcriptomics and gene ontology data. The gene expression profiles of the cells that were treated with chemical compounds were concatenated with one-hot encoded gene ontology annotations, which consisted of the categorical information of the attributes of the genes [
64]. The real-valued gene expression matrix was first thresholded and turned into a binary one, with a value of one if a gene was differentially expressed due to the chemical treatment and a value of 0 otherwise. A Deep Belief Network (DBN) performed a hierarchical-style clustering, with the gene clusters being seen as the modules of the gene regulatory network. The resulting findings were confirmed by the Gene Ontology data and a subsequent literature review, and the regulatory networks that were inferred from the DBN-based clustering had biological validity. The researchers highlighted that setting the parameters for the DBN was an empirical, trial-and-error process, with no theoretical foundation suggesting beforehand which parameters would be optimal. The choice of a DBN was made for its robustness to noise: k-means and hierarchical clustering, being distance-based, were affected by the randomness in the data, while the DBNs were better at generalizing beyond the noise. The goal was to decode the gene regulatory networks in an analysis pipeline that used the DBN as the starting point for the clustering, after which the researchers continued with statistical tests and other non-DL techniques.
In [
65], DL and joint supervised classification were used to characterize the molecular changes that are correlated with Alzheimer’s disease (AD). A cohort with heterogeneous diagnoses was mapped to the same transcriptomic space, and an unsupervised dimensionality reduction was applied to obtain a progressive trajectory associated with the severity of the AD. Finally, the model was applied to transcriptomic data from different areas of the brain and from blood monocyte samples to evaluate the reliability of the findings for different cohorts and tissues. In [
66], a machine learning technique used the molecular characteristics of tumors to generate personalized therapies. A cohort of cell lines, from which gene expression data were gathered, was employed; the cell lines could be classified into two groups with different pan-drug chemotherapeutic sensitivity. The Boruta feature selection algorithm was used, and a neural network with 10 hidden layers classified the pan-cancer cell lines into the two groups with 89% accuracy. The results indicated that cell lines with similar gene expression profiles had a comparable pan-drug chemotherapeutic sensitivity. Therefore, comparable biomarkers could be used to select effective drugs to increase the therapeutic response while reducing cytotoxicity.
Gene expression datasets are sometimes used in conjunction with gene interaction graphs to train the Graph Convolutional Networks (GCN). In [
67], the researchers wanted to classify the cell types from their gene expression. They procured the gene expression profiles for various cells, and then used them to construct a cell similarity matrix by measuring the cosine similarity among the expression levels of the different cells. That matrix was then turned into a graph for the GCN training procedure, and it included both labeled and unlabeled cells (semi-supervised). Training the GCN with this graph plus the labeled gene expression data resulted in a model that took a single-cell gene expression profile and output a probability matrix representing the probability that the cell belonged to each of a number of preset cell types. The GCNs have also contributed to drug research. In [
68], the GCNs were used to predict the drug response. The model was trained with the gene expression profiles of the cells that received a treatment with drugs plus a graph with the interactions among the genes. In [
69], the researchers trained the GCNs to predict the interactions among the genes that are associated with cancer, in a search for non-essential genes that could be targets for drugs. Drawing on the concept of Synthetic Lethality (SL), the researchers hoped that, by identifying within a cancer cell a non-essential gene that interacted with the cancer-causing gene, the non-essential gene could be targeted by a chemical treatment and the cancer cell would die without the healthy cells being affected. The SL interactions were sparse, making the trained ML models prone to overfitting, but the researchers managed to accurately capture the relationships among the genes using a Dual-Dropout GCN (DDGCN) that used both fine-grained and coarse-grained node dropout techniques, thereby achieving state-of-the-art results.
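The cosine-similarity cell graph construction described earlier in this paragraph can be sketched as follows; the similarity threshold is an illustrative value, not one taken from the cited work.

```python
import numpy as np

def cell_similarity_graph(expr, threshold=0.9):
    """Build an adjacency matrix for a GCN from cell expression
    profiles: compute cosine similarity between every pair of cells
    and connect pairs above the threshold, removing self-loops."""
    unit = expr / np.linalg.norm(expr, axis=1, keepdims=True)
    similarity = unit @ unit.T
    adjacency = (similarity >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)
    return adjacency
```

Two cells with proportional expression profiles (cosine similarity one) become neighbors in the graph, while orthogonal profiles remain unconnected.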
3.3. Proteomics
A protein or peptide sequence is a string consisting of the 20 amino acids, which are denoted with the letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, W, Y, and V. The simplest way to transform such a string into data for its predictive modeling is by using one-hot encoding, as shown in
Figure 5. However, this is often coupled with Natural Language Processing (NLP) or with the representations that integrate domain knowledge into the data.
Natural language processing-based encodings may also be coupled with additional preprocessing. Borrowing from NLP, a number of studies have applied the skip-gram-based [
39] encoding of protein sequences. In [
70], the protein strings were transformed into a 20 × 15 matrix, where each amino acid in the sequence was a “word”, and the skip-gram encoded the 20 possible amino acids into a 15-dimensional space. The skip-gram encoding was followed by an embedding layer. A CNN took the embeddings and classified whether a protein could bind to the HLA class I regions of a genome, meaning that the protein would be recognized by the cell as belonging to the organism and, therefore, no immune response would be triggered. The latter is a biomedical application that is useful for drug research and for understanding autoimmune disorders. Depending on the dataset, the CNN performed comparably to the other state-of-the-art models. In [
71], another encoding variation was proposed for the DL- and ML-based classification of protein sequences, which was tested with a DNN and an SVM, achieving high performance. The visualization of the encoded sequence space suggested that this encoding stored accurate information about the protein structure. The method encoded a sequence as a continuous-valued vector that characterized both the sequential structure and the physicochemical properties of the protein.
In [
72], a novel natural language processing method that is called SeqVec (Sequence-to-Vector) was proposed and it was based on ELMo (Embeddings from Language Models) [
73]. SeqVec was trained on unlabeled data, and it learned to predict, probabilistically, the next word/character given the previous words/characters. Learning these probability distributions was a process similar to understanding the syntax and grammar of the corpus, and the same word/character could have different embeddings depending on the words/characters that came before it. The researchers tested this novel embedding on a wide range of DL-based proteomics tasks, which fell into three categories depending on the type of output that the DL model produced: (1) a sequential output, where the model predicted a protein’s secondary structure (a string of length equal to the length of the amino acid sequence, denoting the three-dimensional structure of the physical protein), (2) classification, both 10-class (subcellular localization, i.e., whether the protein was located in the nucleus, ribosome, membrane, etc.) and binary (whether water-soluble or membrane-bound), and (3) regression, where the model generated a continuous value for the estimated protein disorder. A DNN was implemented for the classifications, and a CNN was implemented for the sequential and regression tasks. Although the reported results did not surpass those of the previous state-of-the-art approaches, SeqVec yielded a better performance than that obtained with other encodings, and, moreover, the SeqVec representation was produced faster.
In [
74], an encoding method called MOS (Matrix Of Sequence) transformed a protein string into a 2D matrix with values ranging from 0 to 1. It resulted in faster model training compared to a number of other encoding methods. The trained model was a DNN used for the classification of protein interactions, and it yielded an accuracy of 94.34%. The amino acid sequences and the physicochemical properties of the amino acids provide the necessary information to predict the protein structure [
75], and a number of encodings were proposed to bring domain knowledge into the representation, thus integrating additional information into the sequential context of the data. Chen et al. [
76] encoded the sequences with the Auto Covariance (AC) algorithm [
57], which transformed the protein sequences into numerical matrices of the same dimensions, regardless of the sequence length. The 20 amino acids were grouped into seven physicochemical property classes, and a normalized matrix was constructed to represent the information. Then, the matrix was transformed to fit a user-configurable shape/dimensionality, ensuring that all of the sequences were represented with matrices of uniform dimensions. The researchers used this transformation to evaluate the host–pathogen protein–protein interactions (HP PPI), predicting whether two proteins had a positive or a negative interaction. A stacked denoising AE extracted the features, followed by a dense layer for the final classification. The proposed architecture surpassed the traditional machine learning models.
In [
77], the proteins were classified as venomous or non-venomous, which is a task that is useful in antidote research. The protein representation was based on the Atchley factors [
78], whereby each amino acid was represented with a numerical vector of five elements, corresponding to five physicochemical and structural properties (polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge). Thus, each sequence was represented as a 2D matrix of shape 5 × [sequence length]. A Gated Recurrent Unit (GRU) was trained with the Atchley representations and the binary labels, and it surpassed the previous methods by up to 16% in accuracy, and up to 22% in F1-score. In [
79], four different encodings of protein sequences and their combinations were used to train a CNN. The results compared: (1) a simple one-hot encoding of the amino acid sequence, (2) the Informative Physico Chemical Properties (IPCP), an encoding that quantified the physicochemical properties of the amino acids, (3) the Composition of K-Spaced Amino Acid Pairs (CKSAAP), which encoded the normalized frequency of each possible pair of amino acids, where the two amino acids of a pair need not be adjacent but are separated by k intervening residues [
80], and (4) the Pseudo Amino Acid Composition (PseAAC), which used a set of discrete serial correlation factors, rather than the sequence’s actual composition. The researchers concluded that one-hot encoding concatenated with CKSAAP provided the best data representation, yielding 88.98% accuracy and an Area Under Curve (AUC) of 0.90, thus surpassing the previous methods. Ahmad et al. [
81] classified peptides according to whether they possessed antifungal properties by concatenating the one-hot encoded sequences with three types of extracted features: (1) the Composite Physicochemical Properties (CPP), an encoding that described the amino acid composition and eight physicochemical properties of each protein, (2) the Quasi Sequence Order (QSO), whereby the sequential information of the protein was encoded using the Grantham distance (chemical distance information) and the Schneider-Wrede distance-based matrix (distance based on physicochemical properties) among each pair of the twenty amino acids [
82], and (3) a reduced amino acid alphabet, which was an abstract concept, consisting of different methods and proposals to represent a peptide sequence in a dimensionality that is smaller than that of a one-hot encoded sequence. The experiments demonstrated that the combination of the three extracted features performed better than any single feature did alone.
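As a concrete illustration of one of the encodings above, the following sketches the CKSAAP idea; the choice of `k_max` is hypothetical, as the cited works tune the gap parameter differently:

```python
import numpy as np
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs
PAIR_INDEX = {p: i for i, p in enumerate(PAIRS)}

def cksaap(seq, k_max=2):
    """CKSAAP sketch: for each gap k = 0..k_max, the normalized frequency of
    every amino acid pair (a, b) with exactly k residues between them.
    Output length is 400 * (k_max + 1), independent of sequence length."""
    features = []
    for k in range(k_max + 1):
        counts = np.zeros(400)
        n_pairs = len(seq) - k - 1          # number of k-spaced pairs in seq
        for i in range(n_pairs):
            counts[PAIR_INDEX[seq[i] + seq[i + k + 1]]] += 1
        features.append(counts / max(n_pairs, 1))
    return np.concatenate(features)

vec = cksaap("ACDKACDK", k_max=2)
assert vec.shape == (1200,)
assert abs(vec.sum() - 3.0) < 1e-9   # each gap's frequencies sum to 1
```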
In addition to the amino acid sequence and the physicochemical properties, some researchers have aimed to enrich the proteomic data representations by integrating evolutionary similarity information. The latter was done with a Position Specific Scoring Matrix (PSSM), a 2D matrix of dimensions [20 amino acids] × [protein length], as shown in
Table 4.
The values of
Table 4 were computed using the online PSI-BLAST tool [
83], which searches the Swiss-Prot database and calculates, through multiple alignments, each sequence’s evolutionary similarity with the proteins from other lifeforms. On an abstract level, a PSSM can be seen as the location of a protein in the protein sequence space of all of the lifeforms that share a similar protein. Since one dimension of the PSSM is the sequence length, a transformation called PsePSSM (Pseudo-Position Specific Scoring Matrix) was developed so that protein sequences of varying lengths could be represented by matrices of uniform shape. Numerous variations of the aforementioned protein representations have been developed [
84,
85,
86,
87,
88,
89,
90,
91], and they share two common characteristics. Firstly, they integrate the domain knowledge into the data, which represent additional information that is not present in the amino acid sequences themselves; this information may consist of the physicochemical properties, or of the evolutionary similarities that are extracted through comparing the proteins across the lifeforms, thereby encoding some of these comparative insights into the data. Secondly, they often result in non-sequential data that can be utilized by any neural network or, for that matter, any conventional machine learning model.
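One simplified PsePSSM-style reduction can be sketched as follows; the lag-correlation term shown here is one common formulation, and the exact transform varies across the cited works:

```python
import numpy as np

def pse_pssm(pssm, max_lag=2):
    """PsePSSM sketch: collapse a variable-length [L x 20] PSSM into a
    fixed-length vector of 20 * (1 + max_lag) values -- the per-column
    means plus, for each lag g, the mean squared difference between
    scores g positions apart (one value per amino acid column)."""
    feats = [pssm.mean(axis=0)]                        # 20 column means
    for g in range(1, max_lag + 1):
        feats.append(((pssm[:-g] - pssm[g:]) ** 2).mean(axis=0))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
short = pse_pssm(rng.normal(size=(15, 20)))    # a short protein
long_ = pse_pssm(rng.normal(size=(300, 20)))   # a long protein
assert short.shape == long_.shape == (60,)     # uniform shape for any length
```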
When the protein and peptide sequences are not transformed into a representation that nullifies their sequential status, the DL models best able to handle them have proven to be CNNs and RNNs. Wen et al. [
92] used a genetic algorithm to find the optimal architecture of a hybrid CNN-Bidirectional GRU. The task was regression: predicting peptide retention times (RT) from a sequence, which can serve as a quality control in drug development. The researchers addressed the data shortage with a transfer learning procedure: first, they trained the model with various peptides and RTs; second, they trained it with the particular peptides of interest; and finally, they fine-tuned the pretrained model. Predicting the peptide RTs from a sequence was the focus of Ma et al. [
93] who used a CNN with capsule layers [
94], thus achieving a state-of-the-art performance. The CNNs with capsule layers were further validated by Du et al. [
95], where it outperformed the conventional ML models in the binary task of predicting whether a protein was saliva-secretory or not. The study was part of a larger project of analyzing cancer biomarkers in saliva, as saliva has certain advantages compared to blood and urine for evaluating disease biomarkers. Armenteros et al. [
96] asserted that a hybrid CNN with a BiLSTM was perhaps the best model for sequence classification, and this hybrid improved its performance with an attention layer. The hybrid CNN–BiLSTM processed the sequence, and the attention mechanism focused on the important regions within it. Both a binary and a 10-class task were studied: the binary task determined whether a protein was membrane-bound or soluble, while the 10-class task was subcellular localization, i.e., determining the protein’s location in relation to the cell (e.g., nucleus, cytoplasm, membrane, extracellular, etc.). The attention-augmented hybrid achieved 78% accuracy for the 10-class task and 92% accuracy for the binary one, thus outperforming the previous state-of-the-art methods.
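The attention mechanism described above can be illustrated with a minimal NumPy sketch of attention pooling over per-position hidden states (e.g., the outputs of a BiLSTM); the scoring vector `w` would be learned jointly with the network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Attention pooling over a sequence of hidden states.

    H : [T x d] matrix of per-position hidden states (e.g., BiLSTM outputs)
    w : [d] scoring vector (learned in a real model)
    Each position t gets a score w . H[t]; the softmax of the scores weights
    the states, so the pooled vector emphasizes 'important' regions.
    """
    scores = H @ w                 # [T] relevance score per position
    alpha = softmax(scores)        # attention weights, sum to 1
    return alpha @ H, alpha        # [d] context vector, [T] weights

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 8))       # 50 sequence positions, 8-dim states
context, alpha = attention_pool(H, rng.normal(size=8))
assert context.shape == (8,) and abs(alpha.sum() - 1.0) < 1e-9
```

The inspectable weights `alpha` are what allow such models to indicate which regions of the sequence drove a prediction.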
In [
97], the proteomics data were coupled with GCNs for drug repurposing. The scientists drew from various datasets to create one large multi-relational graph of drug–protein, disease–drug, disease–protein, and protein–protein interactions. A GCN was trained to predict whether a drug and a disease could be associated, i.e., whether the drug could be used as a treatment for the disease. Another data type that is common in computational life science is Mass Spectrometry (MS) data. A mass spectrum, which is taken from any molecular structure, chemical or biological, measures the mass-to-charge ratio (
m/
z) of the ions that are contained in the sample, as shown in
Figure 6. It is represented by a sparse 1D vector, denoting the relative abundance of the ions (y-axis, value in the vector) for each position in a discretized
m/
z spectrum (x-axis, position in the vector). Through MS, conclusions can be made regarding which molecules a sample consists of, and this can be used to determine its chemical composition.
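A fixed-length vector representation of a spectrum can be sketched as follows, with hypothetical m/z range and bin settings:

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_min=0.0, mz_max=2000.0, n_bins=2000):
    """Discretize a mass spectrum into a fixed-length sparse vector:
    intensities of all peaks falling into the same m/z bin are summed."""
    vec, _ = np.histogram(mz, bins=n_bins, range=(mz_min, mz_max),
                          weights=intensity)
    return vec

mz = np.array([150.2, 150.7, 422.1, 1305.9])    # peak positions (m/z)
intensity = np.array([10.0, 5.0, 80.0, 30.0])   # relative abundances
vec = bin_spectrum(mz, intensity)
assert vec.shape == (2000,)
assert vec[150] == 15.0        # 150.2 and 150.7 fall in the same 1-Da bin
assert vec.sum() == intensity.sum()
```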
Guan et al. [
99] used one-hot encoded sequences to infer various tandem mass spectrometry (MS/MS) properties: (1) the ion retention time (iRT), which was a continuous value, (2) the survey scan charge state distributions, which were histogram-type vectors, and (3) the sequence ion intensities of the spectra, which was a one-dimensional time-series vector. In that multi-output setting, the highest performance was reached with a BiLSTM, which surpassed the CNNs. In [
100], a hybrid CNN and BiLSTM technique inferred the MS/MS spectra from one-hot encoded peptide sequences, thereby achieving state-of-the-art results. In [
101], the same task was tackled with a seq2seq type of Residual CNN, with the difference that the one-hot encoded sequence was concatenated with a numerical vector of the monoisotopic mass of each amino acid in the sequence.
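The concatenated input described above can be illustrated by stacking a mass row onto the one-hot matrix; the residue masses below are approximate illustrative values, and only a subset of the twenty amino acids is shown:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
# Approximate monoisotopic residue masses (Da); illustrative subset only.
MONO_MASS = {"A": 71.037, "C": 103.009, "G": 57.021,
             "K": 128.095, "S": 87.032}

def encode_with_mass(seq):
    """One-hot [20 x L] matrix with an extra row carrying each residue's
    monoisotopic mass, giving a [21 x L] input as described above."""
    L = len(seq)
    onehot = np.zeros((20, L))
    for i, aa in enumerate(seq):
        onehot[AA.index(aa), i] = 1.0
    mass_row = np.array([[MONO_MASS[aa] for aa in seq]])
    return np.vstack([onehot, mass_row])

X = encode_with_mass("GASK")
assert X.shape == (21, 4)
assert X[:20].sum(axis=0).tolist() == [1.0, 1.0, 1.0, 1.0]  # one hot per column
```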
3.4. Metabolomics
Most of the metabolomics data types fall in the time-series category, which refers to the MS spectra (
Figure 6), the Nuclear magnetic resonance (NMR) spectra (
Figure 7), the liquid chromatograms (
Figure 7), etc. These techniques elucidate the chemical composition of the samples, resulting in histogram-type plots, which are represented as one-dimensional vectors.
With DL-based metabolomics, the classification of the biological systems does not rely on the genetic makeup and activity, e.g., the genomic sequence and gene expression, but on their chemistry. Probing into the chemical composition of the metabolites, the byproducts of cell metabolism, provides information and data for the classification tasks, thereby identifying the biomarkers for various traits or diseases, as well as for drug research.
Regarding the task of processing the spectrometry data for DL model training, Akyol et al. [
102] recommend two preprocessing practices: (1) replacing missing values with very small values, typically half of the minimum positive value in the data, since missing values likely represent metabolites whose levels were too low to be detected, and (2) applying quantile normalization to reduce the variability among the different samples. Klimczak et al. [
103] used NMR spectra to identify the different taxa of pollen from the metabolites contained in air-sampled pollen extracts. The latter could provide useful information to patients who are allergic to specific types of pollen, and predict allergy outbreaks during the spring, a capability that is particularly valuable for areas where a tangible part of the population is afflicted by a pollen allergy, such as Central Europe. Previous studies tried to classify pollen by using carefully prepared samples that contained purified pollen in very high concentrations. In contrast, this study aimed to be realistic and deployed a data-driven approach to identify the pollen species from natural samples. A CNN learned to classify the NMR spectra of three species of pollen, and it achieved an accuracy between 86% and 93%, depending on the plant type. The MS spectra of tumors were used for a six-class cancer type prediction in [
104]. A number of spectra were taken for each tumor, with the range of ion
m/
z scanning and the number of peaks varying in each spectrum. In order to make the spectra uniform, the researchers performed binning on the
m/
z range, and they summed up the peaks that fell within each bin. A DNN was trained with the resulting vectors, thus yielding significant performance gains over the traditional ML models. In that study, the batches of spectra were integrated manually into a single spectrum; however, in [
105], the integration was achieved through the use of DL. The labels were hand-crafted spectra, made by human experts from batches of spectra of the same peptide, where the peaks in the different spectra rarely coincided in the same regions on the x-axis. A BiLSTM took the spectra and learned to perform peak integration. After adequate training, the quality of the model’s output was slightly lower than that of the human annotators.
DL has also been deployed for metabolomics data annotation to automatically detect the peaks from spectra and chromatograms. In [
106], the peak detection in LC–MS (liquid chromatography—mass spectrometry) data was learned from three-class labels, where each position on the input vectors was labeled as peak, no-peak/noise, or uncertain. The peaks were scaled to a maximum value of one to limit the selection bias towards more abundant peaks, and a custom, simplified U-net was trained to segment the LC–MS images according to the three classes. The model achieved an intersection-over-union (IoU) of 0.88 for the separation regions and 0.85 for the peak regions. A U-net was also employed in [
107] for chromatogram peak detection. The researchers devised a method to generate synthetic chromatograms and annotations to train the U-net model, which identified the peaks with higher accuracy than the human experts did.
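A synthetic chromatogram generator of this kind can be sketched as a sum of Gaussian peaks plus baseline noise; the peak shapes and parameter ranges here are hypothetical simplifications, not the generation scheme of the cited work:

```python
import numpy as np

def synthetic_chromatogram(n_points=1024, n_peaks=5, noise=0.01, seed=0):
    """Generate a synthetic chromatogram as a sum of Gaussian peaks plus
    detector noise, together with the peak-apex positions usable as labels."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)
    centers = rng.uniform(0.1, 0.9, n_peaks)      # retention-time apexes
    widths = rng.uniform(0.005, 0.02, n_peaks)    # peak widths (sigma)
    heights = rng.uniform(0.2, 1.0, n_peaks)      # peak intensities
    signal = sum(h * np.exp(-0.5 * ((t - c) / w) ** 2)
                 for c, w, h in zip(centers, widths, heights))
    signal += rng.normal(0.0, noise, n_points)    # baseline noise
    return t, signal, np.sort(centers)

t, y, peaks = synthetic_chromatogram()
assert y.shape == (1024,) and peaks.shape == (5,)
```

Generating unlimited labeled pairs this way is what makes training a segmentation model feasible when hand-annotated chromatograms are scarce.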
Raw spectra and chromatograms may also be turned into metabolic profiles, as shown in
Table 5, representing the chemical composition of each sample. The columns represent various compounds, the rows represent the samples, and the values quantify the relative abundance or concentration levels of each compound in each sample.
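On such a profile table, the preprocessing practices recommended in [102] (half-minimum imputation and quantile normalization) can be sketched as follows; this is a minimal formulation, and implementations differ in how they break ties:

```python
import numpy as np

def half_min_impute(X):
    """Replace NaNs with half the minimum positive value in the table,
    treating missing entries as below-detection-limit metabolites."""
    fill = 0.5 * np.min(X[X > 0])          # NaN comparisons are False, so excluded
    return np.where(np.isnan(X), fill, X)

def quantile_normalize(X):
    """Quantile-normalize rows (samples): each sample's sorted values are
    replaced by the mean sorted profile, equalizing the distributions."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    mean_sorted = np.sort(X, axis=1).mean(axis=0)
    return mean_sorted[ranks]

X = np.array([[np.nan, 2.0, 4.0],
              [1.0,    3.0, np.nan]])
Xi = half_min_impute(X)
assert Xi[0, 0] == 0.5                 # half the minimum positive value (1.0)
Q = quantile_normalize(Xi)
assert np.allclose(np.sort(Q[0]), np.sort(Q[1]))  # identical distributions
```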
Date et al. [
13] used the metabolic profiles of yellowfin goby fish collected from rivers across Japan. Hypothesizing that organisms of the same species feeding in different locations would vary accordingly in their metabolic profiles, the researchers explored the possibility of inferring the habitat location via DL-based metabolomics. As it was more of a proof-of-concept study, they simplified the task by keeping it as a binary classification problem, and they trained a DNN to detect whether a fish originated from Kanto or not. The model reached a 97.8% accuracy, performing slightly better than RF and SVM did. In a similar Japanese study [
108], the researchers used metabolic profiles to infer the physiology of the subjects, applying the approach to fish of multiple species throughout Japan after noticing that fish size was highly correlated with metabolite composition. Using fish size as the dependent variable, they trained an ensemble of DNNs to perform regression. Each DNN was trained on bootstrapped data (samples randomly picked with replacement), and additionally, each DNN was trained on a random subset of the features/variables. The best models were combined for the final prediction, and the final result was consistently better than that of regular DNNs and, depending on the fish species, comparable with the RF and SVM ones. Guo et al. [
109] used the metabolic profiles of chronic kidney disease (CKD) patients and healthy control patients to identify the biomarkers for CKD, and this made it possible to predict CKD in its early stages. The labels consisted of five CKD stages plus one class for healthy individuals. After applying the feature selection with a Lasso regression [
110] on the dataset, a DNN and a CNN were trained. The DL models ended up performing worse than the RF did, which was attributed to the extensive feature selection; the models were trained on only five out of tens of thousands of features. ML and DL achieved comparable performance when dealing with low-dimensional datasets and a moderate number of samples.
In [
111], a platform combining metabolomics and DL was applied for the identification of pathogens and spoilage microorganisms. A CNN model based on three potential biomarkers for Listeria monocytogenes was developed, reaching prediction accuracies of up to 82.2%. Moreover, 29 metabolites were identified, and six common Listeria species were distinguished in a hierarchical cluster analysis. Finally, the binary and multi-class CNN classifiers identified Listeria monocytogenes and other pathogens, with accuracies of 96.7% and 96.3%, respectively. In [
112], the metabolic profiles of breast cancer tissue samples were used to classify the estrogen receptors as positive (ER+) or negative (ER-). After the quantile normalization, the log transform, and mean centering procedures, an AE was pretrained to reconstruct the data, which was then converted into a DNN and trained for the binary classification. The method reached an AUC of 0.93, and it surpassed the traditional ML models. The end goal of the study was not the binary classification itself, but the elucidation of the biological functions and metabolic pathways that lead to different types of cancer. Starting from the binary classification, the researchers ranked the features in terms of how much they contributed to the outputs, thus identifying the important metabolites. With the database searches, they mapped these metabolites to the chemical pathways and enzymes. They also used gene expression data, thereby locating the genes that were expressed differentially in the two cancer types, and they analyzed these data to infer the gene–metabolite networks. Thus, DL was used as a stepping stone, a component of a wider research methodology, to provide deeper biological insights.
3.5. Epigenomics
Epigenomic modifications and properties such as histone modification, DNA methylation, and chromatin accessibility can be seen as an additional level of information on top of the genomic level, thus adding to and modulating the information that is contained in a genomic sequence. A blood cell, neuron, or sperm cell of an organism all carry the same DNA sequence in their nucleus. It is the physical 3D structure and epigenetic modifications that define whether the DNA strand will result in a blood cell, neuron, or sperm cell. Predictive modeling, depending on the task that is performed, may require information that is not present in the DNA sequence, but this could be extracted from a higher level of biological organization, whether this is transcriptomic, proteomic, metabolomic, or epigenomic. From a computational point of view, the epigenomic data comprise time-series signals, with the x-axis spanning the length of an associated DNA sequence, thus revealing which parts of the sequence have received a certain modification and bear a certain property. These signals are represented as vectors, which are either real-valued or binary.
Figure 8 illustrates a high-level view of DL with epigenomics.
In [
113], a differential gene expression was inferred from the histone modification profiles of a gene in two cell types. The same gene was expressed differently in the different cells; that differential expression was not represented in the DNA strings, but it could be extracted from the epigenomic data, the additional layer of information about which modifications were applied to the DNA sequence. Each sample in the dataset consisted of two vectors, representing the histone modification profiles of a gene in two cell types. Each vector showed the amount of histone modification taking place in the gene throughout its sequence length. A multi-head LSTM with an attention layer was trained for the gene expression level regression, whereby a separate LSTM was trained for each input vector, followed by an attention layer and then an embedding layer. The embeddings were concatenated and fed into an LSTM and an attention layer before the final prediction was made. The model significantly outperformed the RF regression and the SVM regression, and it slightly outperformed the previous DL-based state-of-the-art methods. Similarly, histone modification profiles were used in [
114] to detect the genomic sequences that were enhancers (DNA sequences that stimulate gene expression) or that contained them. The dataset contained vectors that spanned the length of a DNA string, with values signifying how much a histone modification was applied at each point in the sequence. Each sample had a number of such vectors, stacked one over another, each documenting a specific histone modification. A hybrid CNN–LSTM was trained on a variety of datasets, reaching an accuracy ranging from 84.8% to 98.0%, which was comparable to, and mostly surpassed, that of the previously applied DL models. In [
14], the genetic predisposition for type-2 diabetes was detected with the help of a U-net. As part of a larger project aimed at exploring the mechanisms and genetic pathways contributing to type-2 diabetes, DL was deployed to infer, from the signal of a few samples, what the signal from many samples would look like. The epigenetic information came in the form of ATAC-Seq signals, which were time-series vectors taken from type-2 diabetic cells. The ATAC-Seq data were normally taken from multiple cells, the peaks were then integrated, and the signals were aggregated into one; the latter part was highly problematic when dealing with rare cells. The researchers trained a U-net to take the aggregate ATAC-Seq signal of 28 cells, upscale it, and predict the aggregate signal of 600 cells. The result was subsequently used in the following stages of the research project.
In [
115], the genomic and epigenomic information was combined to predict whether two DNA sequences interacted with each other. The genomic information came in the form of the DNA sequence pairs that were one-hot encoded. One string always belongs to a promoter, and the other belongs to an enhancer, and a binary label signifies whether the promoter and the enhancer interact with each other. Regarding the epigenomic information, each genomic sequence is associated with a stack of vectors, representing the epigenomic features (e.g., CpG methylation, histone modification, etc.). The vector length represents the length of the DNA sequences that are divided in bins, and the values represent the degree to which an epigenomic property applies to each part of a sequence. Three models were deployed in [
115]: a CNN, a CNN with an attention layer, and a ResNet. All of the models followed a general architecture of starting as two separate models, with one taking the DNA sequences and the other taking the epigenetic information; later, the outputs of the two models were concatenated and fed into further dense layers to produce the final predictions. The experimental tests showed that integrating sequences with the epigenomic data yielded better results than using just one input, and the epigenomic features were generally more informative than the DNA sequences were. The train–test split of the data was performed not randomly, but by chromosome; one chromosome provided the data for the training, and another provided the data for the testing. When dealing with enhancer–promoter interactions (EPI), random train–test splits may lead to overestimating the model’s accuracy, as there is great redundancy among the enhancer and promoter sequences, which may be present in both the training and the test sets. In [
116], an attention-based DL model was developed, namely eDICE, to impute the epigenomics tracks. The model reported a state-of-the-art overall performance, and it was able to correctly predict the individual and cell-type specific epigenetic patterns.
We have examined tasks, such as predicting differential gene expression and enhancer–promoter interactions (EPI), in which more useful and discriminative information is found at the epigenomic level than at the genomic one. However, the epigenetic data are generally hard to acquire, and the capability to predict the sites with epigenetic activity from the DNA sequences is being pursued [
117]. Such predictions give direction to experimental research or, ideally, they can even be a substitute for the experimentally produced epigenetic data. In [
15], a CNN takes the DNA sequences from different cell types and detects the histone modifications in an effort to identify the genes and pathways that correlate with aging. The volume of the DL-based epigenomic studies is still relatively low, but as the experimental techniques become cheaper and more mature, the production and availability of the epigenomic data increase, as shown in
Figure 1. With mentions of a recent “epigenomics data deluge” [
118], the research in this area is expected to take off.
3.6. Multi-Omics
Any two or more of the five different types of omics data that have been previously examined, e.g., genomics, transcriptomics, proteomics, metabolomics, and epigenomics, can be combined for predictive modeling tasks, thus taking advantage of the unique information and patterns that are encoded in the different levels of biological organization. By learning from the multi-omics data, DL extracts the informative patterns from one biological level that are absent from another, thus exploiting the multifaceted information in a synergistic, holistic manner. The basic multi-omics data integration architectures that are used for DL model training are the following three:
Different data modalities are preprocessed into a uniform type, then they are concatenated and fed into a model (
Figure 9a).
A separate model is trained for each data modality, their predictions are aggregated, and this process is followed by majority voting (
Figure 9b).
A multi-view model is used, starting as a separate model for each data modality, and then the outputs of these models are concatenated and fed into further dense layers, thus leading to the final prediction (
Figure 9c).
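As a minimal illustration, the multi-view architecture (c) can be sketched as a NumPy forward pass, with hypothetical layer sizes and random (untrained) weights standing in for the learned parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_view_forward(x_a, x_b, params):
    """Forward pass of architecture (c): a separate dense encoder per
    omics modality, concatenation of the two encodings, then a final
    dense layer producing class probabilities."""
    h_a = relu(params["Wa"] @ x_a)        # branch for modality A
    h_b = relu(params["Wb"] @ x_b)        # branch for modality B
    h = np.concatenate([h_a, h_b])        # fusion by concatenation
    return softmax(params["Wout"] @ h)

rng = np.random.default_rng(2)
params = {"Wa": rng.normal(size=(16, 100)),   # e.g., gene expression features
          "Wb": rng.normal(size=(16, 50)),    # e.g., DNA methylation features
          "Wout": rng.normal(size=(3, 32))}   # 3 hypothetical classes
p = multi_view_forward(rng.normal(size=100), rng.normal(size=50), params)
assert p.shape == (3,) and abs(p.sum() - 1.0) < 1e-9
```

Architecture (a) would instead concatenate `x_a` and `x_b` before any encoder, and architecture (b) would train two full models and aggregate their predictions.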
Due to their capacity to learn the representations at different levels of abstraction through their hidden layers, DL models are particularly useful for modeling complex multi-omics data [
119]. They are employed in multi-omics tasks, even as a small part of a pipeline that uses a plethora of computational tools. For example, in [
120], DL was used only for feature extraction and a multi-omics integration stage of a research project aiming to achieve the unsupervised identification of cancer subtypes and the classification of new patients into these subtypes with the end goal of providing better patient care. The data came from hepatocellular carcinoma (HCC) patients, and this included genomic, transcriptomic, epigenomic, and clinical data. K-means clustering assigned labels to the data, and an AE extracted the features. The feature selection took place via the Cox PH (Cox proportional hazards) [
121], and the ML models learned to classify the new samples according to the k-means-identified classes. In [
122], a similar pipeline used RNA expression, miRNA expression, DNA methylation, and clinical data from cancer patients who were labeled by k-means clustering. The features were extracted through the AEs and selected through the Cox PH. Then, the ML models classified the new samples, and the statistical analysis was utilized for a biological insight, thus elucidating a differential gene expression in the cancer patients.
In [
123], a wider variety of data was involved: mRNA expression, miRNA expression, protein expression, DNA methylation, somatic mutations, Copy Number Variations (CNVs), and clinical data. The DNA methylation data come in tabular form (
Table 6), with rows being the samples, and columns being the sites in the genome of the organism that the samples were taken from, while the values show the amount of methylation that was applied to the genome on these sites. The somatic mutation data (
Table 7) consist of binary vectors of length equal to the number of genes being evaluated, where a value of 1 indicates that the gene was mutated, and 0 otherwise. The CNV data come in the form of a table of shape [number of patients] × [number of genes], with the data taking one of three possible values (−1, 0, 1), showing, for each gene, whether the patient has no gene alteration, additional DNA regions that were copied, or regions that were deleted (
Table 8). The same architecture of AE, Cox PH, k-means, and ML-based classification was implemented, with the researchers demonstrating that the multi-omics data yielded better results than the single omics data did.
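The mutation and CNV tables can be illustrated with a toy pandas example (hypothetical patients; the concatenation at the end corresponds to the early-fusion integration of architecture (a)):

```python
import pandas as pd

genes = ["TP53", "BRCA1", "EGFR"]
# Somatic mutations: 1 if the gene is mutated in the patient, else 0 (Table 7 style)
mutations = pd.DataFrame([[1, 0, 0], [0, 1, 1]],
                         index=["patient_1", "patient_2"], columns=genes)
# CNVs: -1 = deleted region, 0 = no alteration, 1 = copied region (Table 8 style)
cnv = pd.DataFrame([[0, -1, 1], [1, 0, 0]],
                   index=["patient_1", "patient_2"], columns=genes)
# Early-fusion input: concatenate modality tables along the feature axis
fused = pd.concat([mutations.add_prefix("mut_"), cnv.add_prefix("cnv_")], axis=1)
assert fused.shape == (2, 6)
```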
In [
124], the same process was applied for ovarian cancer using a denoising AE instead of a regular or a stacked AE for the feature extraction procedure, which was followed by a feature selection procedure through logistic regression. In [
125], the feature selection was implemented with a traditional statistical analysis, a regular AE that extracted the features, and an RF that classified new samples after it was trained on the labels assigned in an unsupervised manner. In [
126], the feature extraction phase was implemented with two AE-based strategies using mRNA expression and DNA methylation data from Esophageal Squamous Cell Carcinoma (ESCC) patients. The first was an early-fusion strategy, where the two omics datasets were concatenated and fed into a regular AE. The second was a joint multi-modal strategy, where a distinct layer took each dataset, and their outputs were concatenated and fed into a common encoder layer. The outputs of the encoder layer were forked out into two distinct layers, which learned to reconstruct the two datasets. The k-means clustering, which was based on the survival rates, yielded two classes of high and low survival probability, the analysis of variance (ANOVA) selected the most important features, and an SVM learned to classify the data, thus showing that the joint multi-modal strategy led to better performance than the early-fusion one did.
DL may also play a more central role in the multi-omics modeling tasks, as well as in cancer research. In [
127], the classification of breast cancer subtypes was conducted using a CNN that took gene expression data and another CNN that took CNV data, and their outputs were then concatenated into additional dense layers that generated the final prediction. The model yielded a 79.2% accuracy, thereby surpassing the shallow ML models, and the combination of two omics data produced better results when they were compared to those of the individual omics. In [
128], four types of omics data were treated with two AE architectures. The data, which came from breast cancer samples, consisted of gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs). The AE implementations were applied to all of the possible pairs of the four omics data, i.e., the datasets were not combined, but different pairs were tested to find the best pairing. The first AE architecture, called ConcatAE, consisted of a separate AE for each omics type. The extracted features were concatenated and fed into a dense layer for the classification. The latter was performed for all of the possible pairs of the four datasets. The second architecture, called CrossAE, took a single omics dataset as the input but two datasets as the output, thus attempting to reconstruct both datasets from one. The extracted features were averaged element-wise and fed into dense layers for the classification. The classifications were both binary-class and multi-class, and they were used to identify the breast cancer subtypes. The best arrangement proved to be ConcatAE with DNA methylation and miRNA expression data.
In [
129], cancer classification was tackled with a GCN by integrating the multi-omics data along with the PPI networks, and this surpassed the other baseline methods to which it was compared. In [
130], the researchers used the multi-omics data of cancer patients to generate, via a Similarity Network Fusion (SNF) method [
131], a graph of the patient similarities. Then, by using that graph and AE-extracted features of the original multi-omics data, they trained a GCN to classify the types of cancer. Thus, the model could identify the cancer not only by the multi-omics data of a patient, but also by taking into account the diagnosis of similar patients. Similarly, in [
132], the researchers did not want to rely exclusively on the multi-omics data for the classification of the cancer, thus, by using Pearson correlation, they constructed a similarity graph between the samples, and by using that combined data, they trained a GCN-based system to differentiate among three cancer subtypes, taking into account the known classifications of the cells that are similar to those of the inputs.
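A sample-similarity graph of the kind described above can be sketched as follows, with a hypothetical correlation threshold:

```python
import numpy as np

def similarity_graph(X, threshold=0.5):
    """Build a sample-similarity adjacency matrix from multi-omics
    feature rows: an edge (i, j) exists when the Pearson correlation of
    samples i and j exceeds the threshold (self-loops removed)."""
    corr = np.corrcoef(X)                  # [n_samples x n_samples]
    adj = (corr > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)             # no self-loops
    return adj

rng = np.random.default_rng(3)
base = rng.normal(size=20)
X = np.stack([base + rng.normal(0, 0.1, 20),   # two highly similar samples
              base + rng.normal(0, 0.1, 20),
              rng.normal(size=20)])            # one unrelated sample
A = similarity_graph(X)
assert A[0, 1] == 1.0                  # correlated samples are connected
assert A.shape == (3, 3) and np.allclose(A, A.T)
```

The resulting adjacency matrix is what a GCN consumes alongside the per-sample feature vectors, so classification can take the diagnoses of similar patients into account.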
In [
133], two types of sequential data were combined: one-hot encoded DNA sequences and DNase-Seq signals. The latter was a real-valued vector of length equal to the DNA sequence, showing which regions of the DNA sequence displayed chromatin accessibility, and to what degree, an epigenomic property implying that these regions played functional and important roles in the specific cell from which the sample was procured. The task was binary multi-output, with each output representing an epigenetic marker that took a binary value depending on whether the sample possessed that property. One CNN took the one-hot encoded sequences, while another CNN took the peak signal, and their outputs were concatenated and fed into dense layers that produced the multi-output classifications. The proposed architecture surpassed the previous state-of-the-art DL models, and the coupling of the genomic sequence with the epigenomic signal resulted in a higher performance than any of the single omics techniques did alone.
In [
134], knowledge discovery in cardiovascular disease data was pursued through the unsupervised modeling of multi-omics data. Cardiac hypertrophy was induced in mice, whose protein and metabolite levels were monitored over time and recorded to form a dataset; data from healthy control subjects were also collected. The goal of the project was to identify the differences between the healthy and cardiac hypertrophy subjects, and to elucidate the pathways and interaction networks of the proteins and metabolites that play a role in cardiovascular disease. Two unsupervised approaches were utilized to extract patterns from the data. First, an LSTM-based variational AE extracted low-dimensional embeddings of the sequential data, followed by k-means clustering. Second, a DL-based clustering model, Deep Convolutional Embedded Clustering (DCEC) [
135], took the time-series vectors represented as line-plot images (a variety of line widths and image sizes were tested) and performed clustering on them. Additionally, conventional clustering algorithms were applied to the original data for comparison. The results were validated through the Reactome knowledgebase [
136], which was used to compare the proteins and metabolites that were clustered together against the known pathways and hierarchical relationships contributing to cardiac disease. The research revealed that the DL-based clustering yielded biologically meaningful results and surpassed all of the other approaches.
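The first pipeline (embed, then cluster) can be sketched compactly. Here PCA is used purely as a stand-in for the LSTM variational autoencoder's encoder, and the data, cluster count, and dimensions are hypothetical; only the two-step structure mirrors the study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 60 molecules x 24 time points (protein/metabolite
# levels), drawn from three latent temporal patterns so clusters exist.
centers = rng.standard_normal((3, 24))
X = np.repeat(centers, 20, axis=0) + 0.3 * rng.standard_normal((60, 24))

# Step 1 - dimensionality reduction (PCA here as a stand-in for the
# LSTM variational autoencoder used in the study).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                       # 2-D embeddings

# Step 2 - k-means clustering on the embeddings.
def kmeans(Z, k, iters=50):
    cent = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        cent = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                         else Z[rng.integers(len(Z))]  # reseed empty cluster
                         for j in range(k)])
    return labels

labels = kmeans(Z, 3)
print(labels.shape)
```

In the study, molecules falling into the same cluster were then checked against Reactome pathways, which is what made the clusters biologically interpretable.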
Another domain in the computational life sciences that exploits the use of deep neural networks and multi-omics data is drug development. In [
137], the cells were treated with various chemical compounds. Gene expression data were taken from the cells, and they were concatenated with one-hot encoded gene ontology annotations and categorical information that was based on the attributes of the genes. A DNN was trained to predict whether a chemical compound affected a gene to a statistically significant level or not, reaching an AUC of up to 0.84. In [
138], a model was designed that took omics data from a biological sample plus information on two drugs and predicted whether the two drugs would have a synergistic effect in treating the cancer type expressed in the omics data. The gene expressions, the CNVs, and the somatic mutations of various cancer cell lines were coupled with data on the physicochemical properties of drugs known to target the corresponding cancer cell lines. The drug profiles contained both real-valued and categorical data, and the labels consisted of synergy scores for the drug pairs on the corresponding cell lines. Each of the three types of omics data was fed into a separate AE for feature extraction. The extracted features were concatenated with the features of any two drugs and fed into a DNN that varied depending on the cancer type, and it consistently performed better than the previous state-of-the-art methods.
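The fusion step of such a synergy model can be sketched as a forward pass: one encoder per omics type compresses its input, the three embeddings are concatenated with the two drug profiles, and a dense head emits a synergy score. All feature sizes are hypothetical and the random weights stand in for trained AE encoders and DNN parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

def dense(n_in, n_out):
    """Random weights standing in for trained parameters."""
    return rng.standard_normal((n_in, n_out)) * 0.1

relu = lambda z: np.maximum(z, 0.0)

# Hypothetical features for one cancer cell line and two drugs.
expr, cnv, mut = rng.random(500), rng.random(300), rng.random(200)
drug_a, drug_b = rng.random(50), rng.random(50)  # physicochemical profiles

# One encoder per omics type (the compressive half of each autoencoder).
enc = {"expr": dense(500, 32), "cnv": dense(300, 32), "mut": dense(200, 32)}
z = np.concatenate([relu(expr @ enc["expr"]),
                    relu(cnv @ enc["cnv"]),
                    relu(mut @ enc["mut"]),
                    drug_a, drug_b])             # 96 + 100 = 196 features

# Dense head mapping the fused representation to a synergy score.
W1, W2 = dense(196, 64), dense(64, 1)
score = float(relu(z @ W1) @ W2)
print(type(score))
```

Training would regress this score against the measured synergy labels of drug pairs; the per-omics encoders keep each data type's structure intact before fusion.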
In [
139], the researchers concatenated four types of data and trained a DNN to predict drug–target interactions (DTIs), i.e., whether a drug interacts with a target or not. The four data types were: (1) the Drug-perturbed gene Expression Profiles (DEPs), where gene expression was measured for samples that received chemical treatments as well as for control, untreated samples, (2) the Gene-knockdown Expression Profiles (GEPs), where individual genes were knocked down and the resulting expression profiles showed how their elimination affected the other genes, with profiles of control samples again taken for comparison, (3) the Protein–Protein Interaction (PPI) networks, which were graph data embedded into vectors via Node2Vec [
140], and (4) the pathway membership data, which were embedded through GloVe into vectors that associated functionally related genes, thereby grouping them into biologically relevant clusters. The first type of data referred to the drugs, while the other three referred to the biological system; the concatenation of these was used by a DNN to model the drug–target interactions.
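The walk-generation step that underlies the Node2Vec embedding of a PPI network can be sketched in a few lines. With the return and in-out parameters p = q = 1, Node2Vec reduces to uniform random walks (DeepWalk-style); the toy graph below is hypothetical, and the full method would feed these walks to a skip-gram model to learn the node vectors.

```python
import random

random.seed(4)

# Hypothetical toy PPI graph as an adjacency list.
ppi = {
    "TP53": ["MDM2", "ATM"],
    "MDM2": ["TP53"],
    "ATM": ["TP53", "CHEK2"],
    "CHEK2": ["ATM"],
}

def random_walks(graph, walk_length=5, walks_per_node=3):
    """Uniform random walks (Node2Vec with p = q = 1); the walks are
    the 'sentences' a skip-gram model later embeds into vectors."""
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(random.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

walks = random_walks(ppi)
print(len(walks))  # 4 nodes x 3 walks each
```

Proteins that co-occur in many walks end up with similar vectors, which is how the graph structure is carried into the concatenated DNN input.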
In [
141], a DNN, along with conventional ML models, utilized multi-omics data to predict novel targets for therapeutic treatment in the field of oncology. Starting from lists of genes that are either targeted by FDA-approved drugs or that, when mutated, may cause cancer, the researchers collected data on gene expression, gene mutations (averaged over numerous patient samples for each cancer type), gene essentiality (real-valued, the mean sensitivity from knock-out experiments), and gene interaction networks embedded via AE-based diffusion graphs [
142]. The same data were also collected for genes present in neither the therapeutic-target nor the suspect-gene list, and these made up the negative samples of the dataset. Random forest feature selection reduced the dimensionality of the dataset, and the predictive models were trained, including a neural network with a softmax-activated output. After learning the positive and negative gene distributions and the interaction networks of the associated genes, the model could be given data for any gene and, with an AUC of 0.88, output the probability of that gene being a potential target for anticancer drugs. In [
143], a method was developed for the early prediction of COVID-19 patient survival by combining plasma multi-omics and DL. The precise concentrations of 100 proteins and metabolites in the plasma of hospitalized patients were determined; these appeared distinctly different from those of healthy controls and also distinguished the non-surviving from the surviving patients. A DL model was developed that learned from the concentrations of ten proteins and five metabolites to predict the early survival of COVID-19 patients, reporting a 92% accuracy and a 0.97 AUC on the day of hospitalization.
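The shape of such a survival predictor (fifteen plasma concentrations in, one survival probability out) can be sketched with a minimal stand-in classifier: a single logistic layer trained by gradient descent instead of the study's deeper DL model. The cohort, labels, and signal structure below are entirely synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical cohort: 200 patients x 15 plasma features (ten protein
# and five metabolite concentrations); label 1 = survived.
X = rng.standard_normal((200, 15))
y = (X[:, :4].sum(axis=1) + 0.5 * rng.standard_normal(200) > 0).astype(float)

# Minimal stand-in for the study's DL model: one logistic layer
# trained by gradient descent on the cross-entropy loss.
w, b = np.zeros(15), 0.0
for _ in range(800):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y                       # gradient of the cross-entropy
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # survival probabilities
acc = float(((p > 0.5) == y).mean())
print(acc > 0.7)
```

The attraction of such a small input panel (15 measurements) is clinical practicality: the prediction is available on the day of hospitalization.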
In [
144], the researchers drew on a wide variety of omics and drug data from cancer cell lines, such as protein–protein interactions (PPIs), differential gene expression, disease–gene association scores, kinase inhibitor profiling, and growth rate inhibition (GR), to construct a graph. They then applied biological knowledge to simplify the graph, i.e., to remove edges and nodes with insignificant influence. A GCN took the drug data and predicted the response across a variety of tumors. In [
145], a GCN learned a graph of cancer-related genes and PPIs; trained with drug chemical structures and the multi-omics data of cancer cells, it learned to predict the drug response and surpassed most of the existing methods.
The authors in [
146] used Generative Adversarial Networks (GANs) and Functional Interaction (FI) networks [
147] for the purpose of biologically informed feature extraction. The datasets comprised genomics, transcriptomics, and epigenomics data of seven cancer types. Instead of using a regular, fully connected neural network as the generator, the GAN learned a sparser, biologically inspired network that represented the interactions among the features. The latter scheme was shown to obtain more accurate predictions than the existing methods.
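The key structural idea, replacing a fully connected layer with one whose connectivity is dictated by a biological interaction network, can be sketched with a weight mask. The mask, sizes, and density below are hypothetical; in the study the non-zero pattern would come from the Functional Interaction network rather than from random sampling.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical sparsity mask derived from an interaction network:
# mask[i, j] = 1 only if input feature i is known to interact with
# the biological entity represented by hidden unit j.
n_in, n_hidden = 30, 8
mask = (rng.random((n_in, n_hidden)) < 0.2).astype(float)

W = rng.standard_normal((n_in, n_hidden)) * 0.1

def sparse_layer(x, W, mask):
    """Biologically informed layer: only the masked (known)
    interactions carry weights; all other connections are zeroed."""
    return np.maximum(x @ (W * mask), 0.0)

x = rng.standard_normal(n_in)
h = sparse_layer(x, W, mask)
print(h.shape)
```

Because gradients through `W * mask` vanish wherever the mask is zero, training never creates connections that lack biological support, which is what makes the learned network interpretable.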