1. Introduction
Despite improvements in antimicrobial therapy and vaccination, infectious diseases remain a major threat to public health worldwide. They cause significant morbidity across nations, impose a major economic burden, and cause a substantial number of deaths in less developed countries [1]. The majority of infectious diseases are caused by pathogenic bacteria and viruses. Pathogens interact with the host system from the point of their entry into the host, primarily to evade the host immune response and create their own niche for survival and growth [2]. Identifying the host proteins targeted by pathogens, and the pathogen–host protein–protein interactions (PPIs), is crucial for understanding the mechanisms underlying infectious diseases [3]. Differentiating between bacterial- and viral-targeted host proteins is critical for delineating the specific infection strategies of these two groups of pathogens. While this may help in diagnosing the etiology, it is particularly important from the treatment perspective, which is distinct for bacterial and viral infections: antibiotics kill bacterial pathogens but are ineffective against viruses. Finally, identification of the specific biological processes for the bacterial- and viral-targeted human proteins could improve disease prognosis and treatment.
Several studies have attempted to explore the mechanisms underlying infectious diseases by studying pathogen–host PPIs [4,5,6,7,8,9,10,11,12,13]. The availability of experimentally verified pathogen–host PPIs in the public domain has significantly helped these efforts [14,15,16,17,18,19,20]. However, only one study has compared pathogen–host PPIs for bacterial and viral infections [21]. That study addressed common as well as distinct infection strategies for bacterial and viral infections. To distinguish between bacterial- and viral-targeted human proteins, the authors used only the degree centrality, betweenness centrality, and gene ontology (GO) features of the proteins. They drew the general conclusion that viruses tend to interact with human proteins that have much higher connectivity and centrality values than those targeted by bacteria. They proposed that viral-targeted human proteins function in cellular processes, which the virus manipulates, while bacterial-targeted human proteins interact with the immune system. Here, we used more rigorous techniques, namely machine learning algorithms, to differentiate the bacterial-targeted human proteins from the viral-targeted human proteins. To this end, we made extensive use of the sequence, network, and gene ontology features of human proteins. We identified the best feature set for discriminating between bacterial- and viral-targeted proteins and listed the top predicted targets. Finally, the differences between the bacterial- and viral-targeted human proteins were validated by GO and pathway enrichment analysis.
2. Materials and Methods
2.1. Data Collection
All experimentally validated bacteria–human and virus–human protein–protein interaction (PPI) datasets were collected from PHISTO, a pathogen–host interaction search tool [22]. We found 8993 bacteria–human and 35,120 virus–human PPIs, involving 3673 bacterial- and 5887 viral-targeted human proteins. Of these, 1780 proteins were common targets of both bacteria and viruses (Figure 1) and were excluded from our analysis. We searched UniProt, a worldwide hub of protein knowledge [23], for the remaining 1893 bacterial- and 4107 viral-targeted human proteins. We found 1618 bacterial- and 3916 viral-targeted reviewed human proteins in UniProt (Supplementary Tables S1 and S2), which were considered for further analysis.
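The exclusion of common targets amounts to simple set arithmetic. The following minimal sketch illustrates the idea; the function name and the toy accessions are hypothetical and are not taken from PHISTO or UniProt.

```python
def split_targets(bacterial, viral):
    """Return (bacteria-only, virus-only, common) sets of human proteins."""
    bacterial, viral = set(bacterial), set(viral)
    common = bacterial & viral  # proteins targeted by both pathogen classes
    return bacterial - common, viral - common, common


# Toy, made-up accessions used only to illustrate the set arithmetic.
bacterial_targets = {"P00001", "P00002", "P00003"}
viral_targets = {"P00003", "P00004"}

bact_only, viral_only, common = split_targets(bacterial_targets, viral_targets)
```

With the counts reported above, the same arithmetic gives 3673 - 1780 = 1893 bacterial-only and 5887 - 1780 = 4107 viral-only proteins.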
2.2. Sequence Features
All the above human protein sequences were downloaded from the UniProt database. Sequence features such as the amino acid composition (AAC), dipeptide composition (DC), pseudo-amino acid composition (PAAC), and composition-transition-distribution (CTD) have been reported as important for the prediction of proteins and PPIs [24,25,26]. We computed AAC, DC, PAAC, and CTD using PyDPI, a freely available Python package for chemoinformatics, bioinformatics, and chemogenomics studies [27]. We used these sequence features to discriminate between the bacterial- and viral-targeted human proteins.
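To make the two simplest descriptors concrete, the sketch below computes AAC (20 values) and DC (400 values) in plain Python. It does not use or reproduce the PyDPI API; the function names are hypothetical, and the values are returned as fractions, whereas a given package may scale or round them differently.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Non-standard residues count toward the length but get no feature.
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each overlapping dipeptide in the sequence (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for i in range(len(seq) - 1):  # sliding window of width 2
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


aac = amino_acid_composition("ACAC")
dc = dipeptide_composition("ACAC")
```

For the toy sequence "ACAC", the overlapping dipeptides are AC, CA, and AC, so the AC and CA entries of the DC vector are 2/3 and 1/3.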
2.3. Network Features
To compute network features for human proteins, we retrieved expert-curated human PPIs from the Human Protein Reference Database (HPRD, Release 9) [28] and constructed a network from these PPIs. NetworkAnalyzer, a Cytoscape plugin, was used to compute network properties such as degree, closeness centrality, neighborhood connectivity, average shortest path length, betweenness centrality, clustering coefficient, topological coefficient, eccentricity, and radiality [29].
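As an illustration of what some of these node-level properties measure, the following sketch computes degree, clustering coefficient, and average shortest path length on a toy undirected network in plain Python. It is not the Cytoscape implementation; the helper names and the four-node network are hypothetical, and closeness is taken here as the reciprocal of the average shortest path length, which is one common definition.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from (protein_a, protein_b) pairs."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph, node):
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of possible links among a node's neighbours that exist."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def average_shortest_path_length(graph, node):
    """Mean breadth-first-search distance to every reachable node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in graph[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    reachable = len(dist) - 1
    return sum(dist.values()) / reachable if reachable else 0.0


# Toy network: a triangle A-B-C with a pendant node D attached to C.
ppi = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
aspl_d = average_shortest_path_length(ppi, "D")
closeness_d = 1.0 / aspl_d
```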
2.4. Gene Ontology (GO) Features
All GO identifiers (IDs) for the 1618 bacterial- and 3916 viral-targeted human proteins were downloaded from UniProt. We found a total of 23,737 GO IDs for the bacterial-targeted human proteins and 67,035 for the viral-targeted human proteins. The occurrence of each GO ID was counted separately for the two groups, and the IDs were sorted by occurrence. The top 100 GO IDs for the bacterial-targeted and the top 280 for the viral-targeted human proteins were extracted as candidate GO features. Because only 282 of these 380 GO IDs were unique (Supplementary Table S3), we used the unique IDs as GO features (Supplementary Figure S1). For each human protein, the presence or absence of each of these GO IDs was encoded as 1 or 0, respectively.
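The counting, ranking, and binary encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy annotations, and the cutoff of one ID per group are hypothetical (the study used the top 100 and 280 IDs), and ties in frequency are broken alphabetically only to keep the output deterministic.

```python
from collections import Counter


def top_go_ids(annotations, k):
    """Return the k most frequent GO IDs across a group of proteins."""
    counts = Counter(go for gos in annotations.values() for go in set(gos))
    ranked = sorted(counts, key=lambda go: (-counts[go], go))
    return ranked[:k]


def go_feature_vector(protein_gos, vocabulary):
    """1 if the protein carries the GO ID, else 0, for each ID in order."""
    present = set(protein_gos)
    return [1 if go in present else 0 for go in vocabulary]


# Toy annotations; the protein names and GO assignments are illustrative only.
bacterial = {"P1": ["GO:0005515", "GO:0006955"], "P2": ["GO:0005515"]}
viral = {"P3": ["GO:0005515", "GO:0006351"], "P4": ["GO:0006351"]}

# Union of the top IDs of both groups, keeping each ID once.
vocabulary = sorted(set(top_go_ids(bacterial, 1)) | set(top_go_ids(viral, 1)))

features_p1 = go_feature_vector(bacterial["P1"], vocabulary)
features_p4 = go_feature_vector(viral["P4"], vocabulary)
```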
2.5. Classification
The distinction between the bacterial- and viral-targeted human proteins may be viewed as a binary (two-class) classification problem. To differentiate between the two classes of proteins, we used three well-known classifiers: support vector machines (SVM), random forest (RF), and deep neural networks (DNN).
2.5.1. Support Vector Machines (SVM)
The SVM classifier maps the data into a vector space and finds a decision surface that maximizes the margin between the data points of the two classes. For the SVM classifier, we used the scikit-learn Python package [30]. To find the best performance of the SVM classifier, we tested different combinations of the cost and gamma parameters of the radial basis function (RBF) kernel.
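A search over cost and gamma values can be sketched with scikit-learn as shown below. This is an illustration rather than the configuration used in the study: the synthetic data, the parameter grid, the AUC scoring, and the stratified shuffling are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data standing in for the protein features.
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.3, 0.7], random_state=0)

# Hypothetical grid of cost (C) and gamma values for the RBF kernel.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    scoring="roc_auc",  # assumed selection criterion
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_  # best (C, gamma) pair on this toy data
best_auc = search.best_score_      # mean cross-validated AUC for that pair
```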
2.5.2. Random Forest (RF)
In RF, many decision trees (DTs) are grown, each using a random subset of the features. Each tree classifies a new object and thereby “votes” for a class, and the forest assigns the class that receives the majority of the votes. We also used the scikit-learn Python package for the RF classifier. The parameters were optimized to obtain the best performance.
2.5.3. Deep Neural Networks (DNN)
The DNN method has been shown to perform well on diverse problems. DNNs can be more robust than other methods for complex classification problems and are becoming popular in modern computational biology. We used TensorFlow DNN, a widely used deep learning package for classification, to discriminate between the bacterial- and viral-targeted human proteins [31].
2.6. 10-Fold Cross-Validation
To reduce bias in the performance estimates of the prediction methods, we used 10-fold cross-validation. In 10-fold cross-validation, the whole dataset is divided into 10 sets (folds) of equal or nearly equal size. Training and testing are repeated 10 times so that, each time, a different fold is held out for testing while the remaining 9 folds are used for training. The performance measures averaged over the 10 folds are taken as the overall performance of the model.
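The fold construction can be sketched in a few lines of plain Python. The function name and the fixed seed are hypothetical, and the split shown here is a simple shuffled partition without stratification by class, which the text does not specify.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of equal or nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1618 bacterial- plus 3916 viral-targeted proteins give 5534 samples.
splits = list(k_fold_indices(1618 + 3916, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 5534 samples, this yields six folds of 553 and four folds of 554 proteins, and every protein is used for testing exactly once.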
2.7. Feature Selection
We used several feature selection methods: univariate feature selection (UFS), recursive feature elimination (RFE), feature selection using SelectFromModel (SFM), and tree-based feature selection (TBFS). In UFS, the K best features are selected based on univariate statistical tests; we used all the univariate statistical tests available in scikit-learn for classification. In RFE, the least important features are excluded at each recursive step until the desired number of features is reached. In SFM, features are selected according to the importance that a fitted model assigns to them. In TBFS, a tree-based estimator computes the importance of the features, and irrelevant features are discarded.
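The four strategies map onto scikit-learn classes roughly as sketched below. The synthetic data, the choice of base estimators, the F-test scoring function, and the number of retained features are assumptions made for illustration; they are not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data: 30 features, of which 5 are informative.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# UFS: keep the K best features according to a univariate test (ANOVA F here).
ufs = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# RFE: repeatedly drop the least important feature until 10 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# SFM: keep the features to which a sparse linear model gives non-zero weight.
sfm = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False)).fit(X, y)

# TBFS: keep the features whose tree-based importance is above the mean.
tbfs = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0)).fit(X, y)

n_selected = {name: int(selector.get_support().sum())
              for name, selector in
              [("UFS", ufs), ("RFE", rfe), ("SFM", sfm), ("TBFS", tbfs)]}
```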
2.8. Performance Measures
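The performance measures of the classification problem, namely sensitivity, specificity, accuracy, positive predictive value (PPV or precision), Matthews correlation coefficient (MCC), and F1-score, were calculated using the following equations:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{PPV} \times \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}$$
where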
True Positive (TP): Bacterial-targeted human proteins are correctly identified as bacterial-targeted human proteins.
False Positive (FP): Viral-targeted human proteins are incorrectly identified as bacterial-targeted human proteins.
True Negative (TN): Viral-targeted human proteins are correctly identified as viral-targeted human proteins.
False Negative (FN): Bacterial-targeted human proteins are incorrectly identified as viral-targeted human proteins.
The area under the receiver operating characteristic curve (AUC), for all the cases, was also computed.
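The measures listed above can be computed from the four confusion-matrix counts using their standard definitions, as in the sketch below. The function names and the example counts are hypothetical, and returning 0 when a denominator is zero is a convention chosen for this example.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics; bacterial-targeted is the positive class."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    ppv = _ratio(tp, tp + fp)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "ppv": ppv, "f1": f1, "mcc": mcc}


# Made-up counts for a reasonably good classifier.
example = classification_metrics(tp=80, fp=10, tn=90, fn=20)

# A trivial classifier that labels all 5534 proteins as viral-targeted.
all_viral = classification_metrics(tp=0, fp=0, tn=3916, fn=1618)
```

The second call shows why accuracy alone can mislead on imbalanced data: labeling every protein as viral-targeted gives an accuracy of about 0.71 with 1618 bacterial- and 3916 viral-targeted proteins, while the MCC and F1-score are both 0.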
2.9. GO Enrichment Analysis
The top 100 bacterial-targeted and the same number of viral-targeted human proteins predicted by our method were considered for GO enrichment analysis. To this end, we used Enrichr, a comprehensive gene set enrichment analysis web server, 2016 update [32]. We considered only the biological process terms with p-values < 0.05 for the GO enrichment analysis.
2.10. Pathway Enrichment Analysis
The above-mentioned 200 human proteins (100 each of the bacterial- and viral-targeted proteins) were also considered for pathway enrichment analysis. We used the Reactome Pathway Knowledgebase for this purpose [33]. Pathways with p-value < 0.05 were treated as enriched pathways.
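Enrichment p-values of this kind are typically based on an over-representation test. The sketch below computes a one-sided hypergeometric p-value in plain Python to illustrate the idea only; it does not reproduce the exact statistics or corrections applied by Enrichr or Reactome, and the function name, background size, and term size are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, list_size, term_size, background_size):
    """P(X >= hits) when list_size genes are drawn from the background
    without replacement and term_size background genes carry the term."""
    max_hits = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(comb(term_size, k)
               * comb(background_size - term_size, list_size - k)
               for k in range(hits, max_hits + 1))
    return tail / total


# Hypothetical case: 5 of 100 predicted targets fall in a term annotated
# to 200 of 20,000 background genes (expected overlap: 1 gene).
p_enriched = enrichment_p_value(hits=5, list_size=100,
                                term_size=200, background_size=20000)

# With no overlap required, the tail covers every outcome.
p_trivial = enrichment_p_value(hits=0, list_size=100,
                               term_size=200, background_size=20000)
```

A term would be called enriched at the threshold used here when its p-value falls below 0.05.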
4. Discussion
Rapid, safe, cost-effective, and accurate tools for the etiological diagnosis of suspected infections are of paramount importance for individual and public health. It is particularly important to discriminate between the bacterial and viral causes of infectious diseases, given the alarming rise of antibiotic resistance driven by the indiscriminate and unnecessary use of antibiotics. An estimated 30–50% of the antibiotics prescribed to hospitalized patients in the United States are given for wrong indications, most commonly viral infections (https://www.cdc.gov/antibiotic-use/stewardship-report/outpatient.html, accessed on 21 October 2021) [34]. Traditional culture methods for bacterial infections are low throughput, time consuming, and labor intensive; in addition, sample collection from some infected tissues is challenging, and culture techniques are not widely available for many pathogen species. On the other hand, the diagnosis of viral infections by serology may lack specificity, while nucleic acid detection methods require sophisticated equipment and technical expertise. Consequently, no reliable methods or markers are currently available for the rapid diagnosis of bacterial and viral etiologies of infectious diseases.
Attempts have been made to develop complementary diagnostics for infectious diseases by focusing on specific host responses. In addition to being capable of discriminating between colonization and infection, this approach is not limited by the availability of infected tissue samples. Moreover, host response-based categorization of infections provides additional insights into the disease pathogenesis and immune response and may help to identify new targets for therapeutic intervention.
Multiple attempts have been made to diagnose infectious diseases based on host-specific biomarkers. Widely used parameters, such as white blood cell (WBC) counts and C-reactive protein (CRP), may help to differentiate between bacterial and viral infections but lack sensitivity and specificity, leading to frequent misdiagnosis. Newer bacterial infection markers, such as presepsin, procalcitonin, and CD64, are used for severe sepsis, while proADM may predict prognosis of the disease [35,36]. In contrast, cytokines such as IL-2, IL-8, and IL-10 have been suggested as early biomarkers for viral infection [37]. Several research groups have reported that the antiviral host protein MxA is a clinically useful marker for acute viral infection and, combined with CRP and/or procalcitonin, may distinguish between bacterial and viral infections [38]. A double-blind, multicenter study found that a strategy integrating CRP, tumor necrosis factor-related apoptosis-inducing ligand (TRAIL), and interferon γ-induced protein-10 (IP-10) performed significantly better than the individual markers in identifying acute viral infection in pediatric patients [39]. However, the authors did not validate their tool against reference diagnostic methods, limiting its utility. Other studies have also suggested that a combination of markers may perform better than a single biomarker [40]. In a different study, however, combining CRP with other markers did not improve its ability to differentiate between bacterial and viral lower respiratory tract infections [41].
High throughput genomic and proteomic studies have been employed to identify infection-specific host gene sets. Although they were useful for novel biomarker discovery, the gene sets often contained a large number of candidates, making them difficult to apply clinically [42,43,44]. Through multi-cohort analysis of these large datasets, smaller gene sets optimized for the diagnosis of bacterial and viral infections were identified later on [45].
Machine learning techniques have been extensively used for disease biomarker discovery, including for infectious diseases. However, they have mostly been applied to individual microbial species or groups of pathogens. The increasing availability of bacteria–human and virus–human PPIs now permits researchers to compare bacterial- and viral-specific infection strategies and to identify host proteins that are differentially targeted by these two classes of pathogens. We applied well-known machine learning methods, namely SVM, RF, and DNN, to the available PPI datasets to distinguish between bacterial- and viral-targeted human proteins.
We considered the updated and comprehensive sets of experimentally validated bacteria–human and virus–human PPIs from PHISTO. We found 1780 human proteins that are common targets of bacteria and viruses. During bacterial and viral infections, these common proteins might underlie several commonalities, such as immune response patterns, acute onset, and response to antimicrobial agents in humans. Because the primary goal of the current study was to differentiate between bacterial- and viral-targeted human proteins, we excluded these 1780 proteins from our analysis. The proposed method used 1618 bacterial- and 3916 viral-targeted human proteins. To use as large a dataset as possible for the two classes, we built the model on the complete dataset. For imbalanced datasets, performance measures such as the AUC, MCC, and F1-score are more informative than sensitivity, specificity, and accuracy. Therefore, we compared the AUC, MCC, and F1-score for all cases.
We found that sequence and gene ontology features performed far better than network features. The network properties of human proteins were unable to distinguish between bacterial- and viral-targeted human proteins (Table 1), suggesting indistinguishable network feature patterns for the two groups. The majority of the frequent GO IDs are common to bacterial- and viral-targeted human proteins (Supplementary Figure S1), which explains why gene ontology features did not perform better than the sequence features. Among the sequence features, DC achieved better performance than the others, and a combination of AAC, DC, and PAAC features (445 features) achieved the best performance overall (Table 1). The feature sets selected by the different feature selection techniques also performed worse than this combination. We therefore report the combination of AAC, DC, and PAAC (445 features) as the best feature set for discriminating between bacterial- and viral-targeted human proteins. If the two classes are distinct for true biological reasons, conventional machine learning techniques such as SVM and RF can also be expected to perform well, as they did here (Table 1, Figure 2, and Figure 3). The DNN performed well owing to the large amount of data and the large number of features.
Furthermore, we identified the top 100 human proteins targeted by bacteria and the top 100 human proteins targeted by viruses. GO enrichment analysis of these 200 proteins showed a greater number of enriched biological processes for viral-targeted than for bacterial-targeted human proteins (Figure 4). Similarly, we observed a greater number of enriched pathways for viral-targeted than for bacterial-targeted human proteins. These results suggest that viruses influence more biological processes and pathways than bacteria do, which is consistent with the fact that viruses depend entirely on the host and therefore exploit more host machinery than bacteria. In addition, the majority of the enriched biological processes and pathways differed between bacterial- and viral-targeted human proteins. These functional annotations further support the validity of our method for discriminating between bacterial- and viral-targeted human proteins.