Article

Identification of Clinically Relevant HIV Vif Protein Motif Mutations through Machine Learning and Undersampling

by José Salomón Altamirano-Flores 1, Luis Ángel Alvarado-Hernández 1, Juan Carlos Cuevas-Tello 1,*, Peter Tino 2, Sandra E. Guerra-Palomares 3 and Christian A. Garcia-Sepulveda 3

1 Engineering Faculty, UASLP, San Luis Potosí 78290, Mexico
2 School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
3 Viral and Human Genomics Laboratory, Faculty of Medicine, UASLP, San Luis Potosí 78210, Mexico
* Author to whom correspondence should be addressed.
Cells 2023, 12(5), 772; https://doi.org/10.3390/cells12050772
Submission received: 28 October 2022 / Revised: 8 February 2023 / Accepted: 21 February 2023 / Published: 28 February 2023

Abstract

Human Immunodeficiency Virus (HIV) and its clinical entity, the Acquired Immunodeficiency Syndrome (AIDS), continue to represent an important health burden worldwide. Although great advances have been made towards determining the way viral genetic diversity affects clinical outcome, genetic association studies have been hindered by the complexity of their interactions with the human host. This study provides an innovative approach for the identification and analysis of epidemiological associations between HIV Viral Infectivity Factor (Vif) protein mutations and four clinical endpoints (viral load and CD4 T cell numbers, both at the time of clinical debut and on historical follow-up of patients). Furthermore, this study highlights an alternative approach to the analysis of imbalanced datasets, where patients without specific mutations outnumber those with mutations. Imbalanced datasets are still a challenge hindering the development of classification algorithms through machine learning. This research deals with Decision Trees, Naïve Bayes (NB), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). This paper proposes a new methodology considering an undersampling approach to deal with imbalanced datasets and introduces two novel and differing approaches (MAREV-1 and MAREV-2). As these approaches do not involve human pre-determined and hypothesis-driven combinations of motifs having functional or clinical relevance, they provide a unique opportunity to discover novel complex motif combinations of interest. Moreover, the motif combinations found can be analyzed through traditional statistical approaches, avoiding statistical corrections for multiple tests.

1. Introduction

Human immunodeficiency virus (HIV) and its clinical entity, the Acquired Immunodeficiency Syndrome (AIDS), continue to represent an important health burden worldwide. Since the first reports of HIV more than 35 years ago, 78 million people have been infected with HIV and 35 million have died from AIDS-related illnesses. In 2021, approximately 1.5 million people contracted HIV and 650,000 people died from HIV-related diseases (UNAIDS, https://www.unaids.org/en, accessed on 28 October 2022). Although the overall number of new infections has declined since 2010, the resource-limited countries of Latin America, Asia, and Africa have shown a steady increase in new infections and excess deaths due to HIV [1]. Different strategies have been employed in the fight against HIV and AIDS, mostly focused on either preventative measures or the development of novel anti-retroviral drugs targeting the main viral enzymes involved in HIV replication [2]. On the other hand, current HIV research efforts continue to focus on increasing our understanding of viral-host interactions at the molecular level, with the aim of discovering those worth exploiting to interfere with viral tropism, fusion, replication, integration, and transmission.
Our understanding of the function of some viral proteins such as the protease, reverse transcriptase, and integrase enzymes has allowed for the development of potent preventative and therapeutic strategies [3]. However, for some accessory and non-structural viral proteins, little is known with regards to the function and their potential as candidate targets for antiviral drug development. While the use of molecular biology techniques allows for an estimation of functional or clinical relevance of these proteins, complex genetic and clinical variable comparisons decrease the statistical power of such studies.
The HIV genome has 9719 base pairs (HXB2 reference strain) and a total of 3 open reading frames encoded in a prototypical lentivirinae genome organization comprised of gag, pol, and env genes, long terminal repeat regions (LTRs) and accessory-protein-encoding regions (Vif, vpr, tat, rev, vpu, and nef). The gag gene encodes the matrix, capsid, nucleocapsid, and p6 proteins; pol encodes the enzymes protease, reverse transcriptase, and integrase; and env encodes the glycoproteins GP41 and GP120. The aforementioned accessory proteins facilitate or promote HIV replication and viral fitness. The best studied accessory proteins include tat (which acts as viral transcriptional transactivator), rev (which regulates RNA trafficking), and nef, which promotes viral maturation and release from the host cell [4,5]. Vif is a 192-amino acid HIV accessory protein essential for replication. The Vif protein counteracts human antiviral proteins of the APOlipoprotein B messenger RNA Editing enzyme, Catalytic polypeptide-like (APOBEC3) family. APOBEC3 proteins are zinc-dependent deaminases which mutate viral cytidine (dC) to uridine (dU) in both viral DNA and RNA molecules, thus interfering with the fidelity of the viral genome. APOBEC3 is a host innate mechanism that protects human cells from exogenous viruses and endogenous mobile retroelements. The Vif protein allows HIV to evade such innate mechanisms. This viral protein has recently become a candidate target for both therapeutic and preventive interventions in HIV/AIDS. Nevertheless, little is known about the clinical relevance of the Vif accessory protein, particularly among HIV-infected patients of developing countries and Latin America [6].
Members of the human APOBEC family of proteins include APOBEC1, APOBEC2, APOBEC3, and the poorly expressed APOBEC4. The APOBEC3 subfamily has seven known members, including APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3DE, APOBEC3F, APOBEC3G, and APOBEC3H. Among all APOBEC3 subfamily members, APOBEC3G is notable for exerting the strongest antiviral effect [7]. APOBEC3G is incorporated into the HIV-1 virions as they emerge from an infected cell when HIV-1 lacks the capacity to encode for Vif protein. During the second round of viral replication, after infecting a second cell, APOBEC3G would normally cause extensive dC to dU mutations of the single-stranded viral DNA during reverse transcription [8]. HIV’s Vif protein inhibits and interferes with APOBEC3G activity and thus renders the virus immune to this important innate immunity. However, HIV-1 evolution and quasi-species diversification within a single human being might lead to the accumulation of mutations in the Vif region, which might affect protein function and have clinical significance by either decreasing viral replication or affecting integration and transmission.
The use of machine learning approaches has been extensively applied to the search of statistical associations between genetic and clinical variables during the last years given their known capacity at tackling high dimensional data [9,10]. Previously, some research groups have applied combined algorithm based approaches, such as ANN coupled to genetic algorithms, grammatical evolution, and genetic programming, to the discovery of genetic associations and classification [11,12,13,14,15]. Other combined-algorithm approaches have been SVMs with genetic algorithms [16] and ANNs coupled to Rule Association Mining (Apriori algorithm) [17]. Although combining different machine learning approaches does not guarantee better performance, there is ample evidence supporting the statistical benefits and capabilities at discovering novel genetic associations in the context of infectious diseases [18,19].
One important factor in assessing the importance of different genetic variables mentioned in previously published studies is their combined effect on classification performance. We previously applied this approach to the study of HIV’s Vif gene mutations by using four different machine learning approaches for the discovery of clinical endpoint associations [20]. A major caveat to our previous effort was the use of an imbalanced dataset, arising from the difficulty in collecting large cohort samples and extensive genetic data. Data imbalance is a fundamental and challenging problem in machine learning that limits the power of small clinical datasets. This limitation has also been shown to be present in other non-medical applications such as fraud detection, finance, ecology, and biology [21,22]. As such, in this study we set out to evaluate the performance of state-of-the-art machine learning approaches (Decision Trees, NB, SVMs, and ANNs) enhanced with an undersampling process for dealing with the data imbalance in the dataset. Furthermore, we present a probabilistic method capable of suggesting the most clinically relevant variable combinations associated with clinical outcomes.
The paper is organized as follows: Section 2 describes the dataset and the undersampling approach. The methods are presented in Section 3, followed by the results and conclusions sections.

2. Dataset

For the purpose of this study we relied on a previously consolidated dataset including Vif protein amino acid physicochemical changes and clinical outcome variables (CD4 T cell numbers and HIV viral load at both initial diagnosis and on follow-up) [23]. From the original 192 amino-acid sites conforming the Vif protein, those pertaining to 17 protein motifs were encoded into binary data as either conserved or mutated, as described previously [20]. Eight of the 17 variables representing Vif protein domains are known to interact with APOBEC3 proteins (herein designated as APOBEC-1 to APOBEC-8). Other motifs considered in this study include the Nuclear Localisation Inhibitory Signal (NLIS), two (CBF β -1 and -2) interaction sites as well as three Cullin-5 binding sites (Cul5-1, Cul5-2, and Cul5-3). When the different Vif motif sequences implied a non-conservative change in physicochemical properties, the genetic variable for that motif was encoded as a “1”, and when the site was conserved it was encoded as “0”.
The values for the clinical endpoints (outcome class) were encoded based on thresholds recommended by the World Health Organization and the U.S. Centers for Disease Control and Prevention. The CD4Ini and CD4Hist clinical endpoints reflect the number of CD4+ T cells (cells per microliter) at the time of first diagnosis (CD4Ini) and the median number of CD4+ cells from quarterly assessments during two years of patient follow-up (CD4Hist). For both CD4Ini and CD4Hist, ≥500 CD4+ T cells/μL corresponds to a value of “0”, as CD4+ T cell numbers at or above this threshold are not indicative of poor clinical prognosis. Conversely, the clinical endpoint is encoded as “1” when the count is <500 CD4+ T cells/μL, that is, when the cell numbers are below normal and reflect immunodeficiency. Similarly, the VLIni and VLHist outputs reflect another clinical aspect used to assess HIV prognosis, where high viral loads are associated with worsening clinical progression. VLIni and VLHist reflect HIV viral titres (copies per milliliter) at the time of initial diagnosis and as the median of quarterly follow-up assessments of viral load, respectively. For both VLIni and VLHist, ≥10,000 copies/mL corresponds to a value of “1”, as viral loads at or above 10,000 copies/mL are suggestive of intense viral replication and worsening clinical prognosis. Conversely, this value is encoded as “0” when the viral load is below 10,000 copies/mL and stable [24].
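The binary encoding described in this section can be illustrated with a minimal sketch. The function and constant names are hypothetical, and the sketch assumes that a value exactly at a threshold is treated as "at or above" it; the original encoding pipeline is not reproduced here.

```python
# Thresholds taken from the text; all names below are hypothetical.
CD4_THRESHOLD = 500      # CD4+ T cells per microliter
VL_THRESHOLD = 10_000    # viral copies per milliliter


def encode_motif(non_conservative_change: bool) -> int:
    """Vif motif: 1 if it carries a non-conservative change, 0 if conserved."""
    return 1 if non_conservative_change else 0


def encode_cd4(cells_per_ul: float) -> int:
    """CD4Ini / CD4Hist: 0 at or above the threshold, 1 below it."""
    return 0 if cells_per_ul >= CD4_THRESHOLD else 1


def encode_viral_load(copies_per_ml: float) -> int:
    """VLIni / VLHist: 1 at or above the threshold, 0 below it."""
    return 1 if copies_per_ml >= VL_THRESHOLD else 0
```

With this convention, a value of 1 always marks the clinically unfavourable class (low CD4 count or high viral load).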

Undersampling

In the case of binary classification, class imbalance is defined as the over-representation of one class (the majority class) over another class (the minority class). Over-representation affects the learning process of the algorithms, as most of them are designed to construct the most general and simplest hypothesis from the data [25]. Class imbalance can therefore lead to a bias towards the over-represented class during the learning process.
Different approaches have been used to resolve the problem of class imbalance, which range from applying data balancing strategies (either undersampling or oversampling) and modifying the machine learning process to address data imbalance, to data penalization to enhance minority class attribute detection [26]. Undersampling balancing strategies are the most popular approach as they are based on the original dataset, whereas oversampling requires the generation of artificial data, derived from the original dataset but not necessarily true in content [27].
As the use of oversampling involves the generation of artificial data, in this work we decided to use an undersampling approach to better preserve the biological distribution of genetic variables and clinical endpoints of our dataset.
Figure 1 describes the undersampling process. The original dataset contains m + n examples, where n is the size of the minority class and m is the size of the majority class. The algorithm identifies the least represented class (i.e., the one with n examples) and then creates a new balanced dataset by removing majority-class elements until that subset is similar in size to the minority-class subset. These undersampled balanced sets are generated 100 times (1, 2, …, p), and each one is used for machine learning training.
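The undersampling procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed random seed, and sampling without replacement are assumptions.

```python
import random


def undersample(examples, labels, p=100, seed=0):
    """Build p balanced datasets from a two-class dataset.

    Each dataset keeps every minority-class example and adds an equally
    sized random sample (without replacement) of the majority class.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    idx_by_class = {}
    for i, y in enumerate(labels):
        idx_by_class.setdefault(y, []).append(i)
    if len(idx_by_class) != 2:
        raise ValueError("expected exactly two classes")
    minority, majority = sorted(idx_by_class.values(), key=len)

    datasets = []
    for _ in range(p):
        chosen = minority + rng.sample(majority, len(minority))
        rng.shuffle(chosen)
        datasets.append(([examples[i] for i in chosen],
                         [labels[i] for i in chosen]))
    return datasets
```

Every balanced dataset therefore has 2n examples, and the p datasets differ only in which majority-class examples were drawn.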

3. Methods

This paper compared the classification performance of the well-known machine learning methods: Decision Trees, NB, SVMs, and Multi-Layer Perceptron (MLP).

3.1. Decision Trees

Decision trees represent the simplest and most widely used non-parametric supervised learning method. There are many algorithmic implementations to generate decision trees from data, including Iterative Dichotomiser 3 (ID3) [28], its successor C4.5, Classification And Regression Tree (CART), Chi-square Automatic Interaction Detection (CHAID), and Multivariate Adaptive Regression Splines (MARS). This paper focuses only on the CART implementation [29] available in Scikit-learn [30,31].
For CART, the Gini index and a maximum depth of five were used as predefined parameters, as they provided a similar performance to the C4.5 algorithm. Contrary to C4.5, CART helped to identify the most significant variables and to eliminate non-significant ones [32].
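To make the Gini criterion concrete, the sketch below computes the Gini impurity of a label set and the weighted impurity after splitting on a single binary variable, which is the quantity a CART-style tree seeks to minimize at each node. The function names are hypothetical and this is not the Scikit-learn implementation.

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(feature, labels):
    """Weighted Gini impurity after splitting on a binary (0/1) feature."""
    left = [y for v, y in zip(feature, labels) if v == 0]
    right = [y for v, y in zip(feature, labels) if v == 1]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
```

A variable that separates the two endpoint classes perfectly yields a split impurity of 0, while an uninformative variable leaves the impurity unchanged.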

3.2. Multinomial Naïve Bayes

NB classifiers include several highly-scalable and simple probabilistic classifiers that rely on Bayes theorem with strict independence assumptions between features. When coupled with kernel density estimation they can achieve elevated classification accuracy levels [26].
The NB classifier is defined as:
$$class_{nb} = \arg\max_{class_j \in C} \; p(class_j) \prod_i p(v_i \mid class_j),$$
where $p(v_1, v_2, \ldots, v_i, \ldots, v_{17} \mid class_j) = \prod_i p(v_i \mid class_j)$, because this classifier assumes that the variables $v_i$ are conditionally independent given the class, and $class_j \in C$ are the classes or labels [33]. NB usage relied on the calculation and estimation of the prior probabilities.
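The decision rule above can be illustrated with a small sketch for binary (0/1) variables. It is not the Scikit-learn Multinomial NB used in the study: the per-variable Bernoulli likelihood, the add-one smoothing (`alpha`), and all names are assumptions made for illustration. Log-probabilities are summed instead of multiplying probabilities, which gives the same arg max.

```python
import math


def fit_nb(X, y, alpha=1.0):
    """Estimate p(class) and p(v_i = 1 | class) from binary data.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        p1 = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
              for i in range(n_features)]
        model[c] = (prior, p1)
    return model


def predict_nb(model, x):
    """Return arg max over classes of log p(class) + sum_i log p(v_i | class)."""
    def log_posterior(c):
        prior, p1 = model[c]
        return math.log(prior) + sum(
            math.log(p if v == 1 else 1.0 - p) for v, p in zip(x, p1))
    return max(model, key=log_posterior)
```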

3.3. Multi Layer Perceptron (MLP)

MLP is based on classical ANN models, in particular the Perceptron introduced by F. Rosenblatt in 1957 [34]. The MLP architecture is a more complex ANN in which one or more hidden layers are included before the clinical endpoint variable layer [35]. MLPs are commonly trained with backpropagation [36,37,38,39], a generalization of the delta rule learning algorithm proposed by B. Widrow in 1962 [40]. MLPs are also referred to as feedforward neural networks. Figure 2 illustrates a general MLP architecture with $v_1, v_2, \ldots, v_{17}$ input variables (green), a hidden layer (blue) and a single clinical endpoint (red). There is a single MLP for each of the clinical endpoint variable classes: CD4Ini, CD4Hist, VLIni, and VLHist.
For MLP training, we used the logistic activation function, a hidden layer with 8 neurons, 2 outputs, and 10,000 epochs with the Limited-memory BFGS algorithm (the Broyden–Fletcher–Goldfarb–Shanno algorithm), which is a method for numerical optimization [41].

3.4. Support Vector Machine (SVM)

SVMs are state-of-the-art algorithms initially introduced by Cortes and Vapnik as support-vector networks [42,43]. SVMs were developed in an effort to provide artificial intelligence strategies for complex problems and have mostly been applied to classification or regression problems. For classification purposes, SVMs aim to produce a mathematical n-dimensional space function capable of non-linearly distinguishing between different classes from complex and multivariate (training and test) datasets.
Given a dataset
$$D = \{(x_1, y_1), \ldots, (x_l, y_l)\},$$
where $x \in \mathbb{R}^{17}$ (inputs), $y \in \{-1, +1\}$ (clinical endpoint), and $l$ is the size of the dataset.
The SVM classifier is defined as
$$f(x) = \mathrm{sgn}\left(\sum_{i \in SVs} \alpha_i K(x_i, x)\right),$$
which is a linear combination of kernels, $K(x_i, x)$, where the sign function ($\mathrm{sgn}$) gives the class [42]:
$$\mathrm{sgn}: \mathbb{R} \to \{-1, 0, 1\}, \quad x \mapsto y = \mathrm{sgn}(x),$$
with constraints $0 \le \alpha_i \le C$, $i = 1, \ldots, l$, and $\sum_{j=1}^{l} \alpha_j y_j = 0$. The parameter $C$ controls the softness of the margin, and the Support Vectors (SVs) will have non-zero Lagrange multipliers, $\alpha_i$; $K(x_i, x_j)$ is the kernel function performing the non-linear mapping into feature space $\phi$, known as the “kernel trick” [26,42,43].
There are many kernel functions available for use with SVMs including linear, Gaussian Radial Basis Function (RBF), sigmoid, and polynomial. Our approach made use of the RBF kernel, where the width of a kernel is given by the γ parameter.
Across this research, SVMs used RBF as kernel with the following values: C = 10 and γ = 1.0 .
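The following sketch shows how a kernel-based decision function of this kind is evaluated with an RBF kernel. It assumes the common parameterization $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ and, as in the usual dual formulation, folds the label into a signed coefficient per support vector and allows a bias term; these choices, the toy support vectors, and the function names are assumptions rather than details taken from the study.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma = 1.0 as in the text."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision(x, support_vectors, coefs, bias=0.0, gamma=1.0):
    """Sign of a kernel expansion over the support vectors.

    coefs[i] is the signed weight of support vector i (alpha_i * y_i in the
    usual dual form); bias defaults to 0. Both are assumptions of this sketch.
    """
    s = bias + sum(c * rbf_kernel(sv, x, gamma)
                   for sv, c in zip(support_vectors, coefs))
    return (s > 0) - (s < 0)  # -1, 0, or +1
```

In a trained model the support vectors and coefficients come from solving the constrained optimization problem; here they would be supplied by hand for illustration.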

3.5. Methods for Assessing the Relevance of Each Vif Variable

In order to assess the relevance that the different Vif variables (input) have on each of the included clinical endpoint variables (output), a series of steps were used, including:
  • Generating p balanced datasets through undersampling (see Section 2);
  • Constructing input variable combinations of less than 10 in size (k);
  • Identifying the variable combinations of each balanced datasets providing the best classification performance;
  • Calculating the relevance of each variable through a probabilistic approach, and;
  • Optimizing the selection of the most relevant variables by using a threshold value.
For the first step, balanced datasets are generated through undersampling by creating p partitions, which include all elements of the minority class (n) and an equal number of randomly selected elements of the majority class (i.e., n examples out of m), as shown in Figure 1. After producing balanced datasets, a second step addresses the construction of k size variable combinations by using each of them as input in different classification algorithms. For this, a five-fold cross-validation training process using weighted accuracy was used. The construction of the variable combinations relied on using greedy step-wise variable selection, as shown in Figure 3, in such a way as to identify the best variable capable of discriminating between the clinical endpoint classes. This process was repeated for a second variable in combination with the first identified and the process was repeated k-times so as to identify the k best variable combinations available.
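The greedy step-wise construction of variable combinations can be sketched as below. The `score` callable stands in for any performance estimate, for example a cross-validated balanced accuracy of a classifier restricted to the given variables; it and the function name are hypothetical placeholders, not the study's code.

```python
def greedy_forward_selection(variables, score, k):
    """Greedily grow an ordered variable combination up to size k.

    score(subset) -> float is a user-supplied performance estimate
    (hypothetical here). Returns a list of (combination, score) pairs,
    one per combination size.
    """
    selected = []
    remaining = list(variables)
    history = []
    while remaining and len(selected) < k:
        # Pick the variable that helps most when added to the current set.
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history
```

One possible way to pick a single combination per balanced dataset is `max(history, key=lambda h: h[1])`, which keeps the size with the highest score. The order in which variables were added is preserved, so position 1 holds the variable selected first.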
A third step involved discovering the best k combinations for each p balanced dataset. As the discovery of a global optimum is not guaranteed, a reasonably good local optimum (based on classification performance) was used, as shown in Figure 4. Global optimums are not realistically feasible as the search space exponentially explodes with k.
In a fourth step, variable relevance assessment is achieved using the p best combinations through a probabilistic approach. For this, the probability of each input Vif variable appearing at the $j$-th position of the variable combination matrix produced in the previous step is calculated using Equation (3).
$$p(v_i^j) = \frac{f(v_i^j)}{\sum_a f(v_a^j)} \tag{3}$$
where $p(v_i^j)$ indicates the probability that the $i$-th variable was selected at the $j$-th position of the generated combinations. The frequency of a variable at the $j$-th position is expressed as $f(v_i^j)$, and the denominator sums these frequencies over all variables $a$ at that position. This equation is applied for each one of the k positions ($j \le k$). These probabilities define the relevance score (r) for each variable by using Equation (4):
$$r_i = \sum_{j=1}^{k} (k + 1 - j) \times p(v_i^j) \tag{4}$$
where $r_i$ indicates the relevance score for the $i$-th variable, considering its probability of appearing at each of the k positions in the combination matrix. This process assigns greater weight to the variables that are found closest to the root (lower entropy) of the combination matrix and less weight to those that appear farther from the root (higher entropy).
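Equations (3) and (4) can be computed directly from the list of best combinations, as in the sketch below. The function name is hypothetical, and skipping combinations shorter than position j is an assumption about how combinations of unequal length are handled.

```python
from collections import Counter


def relevance_scores(combinations, k):
    """Relevance score r_i for every variable.

    combinations: ordered variable lists, one per balanced dataset, where
    index 0 is the variable selected first (closest to the root).
    """
    scores = {}
    for j in range(1, k + 1):
        # f(v_i^j): how often each variable occupies position j.
        freq = Counter(c[j - 1] for c in combinations if len(c) >= j)
        total = sum(freq.values())
        for var, f in freq.items():
            p = f / total                      # Equation (3)
            scores[var] = scores.get(var, 0.0) + (k + 1 - j) * p  # Equation (4)
    return scores
```

For example, with k = 2 and the combinations [A, B], [A, C], [B, A], [A, B], variable A appears first in three of four cases and second in one, giving r = 2 × 0.75 + 1 × 0.25 = 1.75.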
In a fifth step, the variables are sorted by the relevance scores obtained in the previous step, and a threshold value (the upper limit of a 99% confidence interval of the relevance scores) is used to determine the most relevant variables (those surpassing the threshold limit).
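The text does not spell out how the confidence interval is computed, so the sketch below makes an explicit assumption: a normal-approximation interval around the mean relevance score, with upper limit mean + z × s/√n. The function names are hypothetical, and a t-based interval would be an equally plausible reading.

```python
import math
import statistics


def relevance_threshold(scores, confidence=0.99):
    """Upper limit of a normal-approximation confidence interval of the
    mean relevance score (the interval type is an assumption)."""
    values = list(scores.values())
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean + z * sem


def most_relevant(scores, confidence=0.99):
    """Variables whose score exceeds the threshold, highest score first."""
    t = relevance_threshold(scores, confidence)
    return sorted((v for v in scores if scores[v] > t),
                  key=lambda v: -scores[v])
```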

3.5.1. MAREV-1

The first Method for Assessing the Relevance of Each Variable (hereafter called MAREV-1) considers the classification results produced by each algorithm (CART, Multinomial NB, SVMs, and MLP) on p = 100 balanced datasets. This yielded a total of 400 variable combinations having the highest classification performances, all of which were then tested further, including traditional statistical analysis, as mentioned below, see Section 3.5.3.

3.5.2. MAREV-2

The second method, MAREV-2, selects for each algorithm only the variable combination with the best classification performance (the third step described above; see Section 3.5). This yielded four input variable combinations, one per algorithm. As with MAREV-1, relevance scores were assessed for each variable, and the resulting variables were then tested through the traditional statistical analysis described in Section 3.5.3.

3.5.3. Hypothesis Evaluation on the MAREV-1 and MAREV-2 Approaches

Once the most relevant variables had been identified in the previous steps, subsequent analysis involved establishing the clinical importance of the different machine learning algorithm-suggested variable combinations and their status (Mut or Cons) through traditional statistical association methods. For this, the Vif protein conserved sites, synonymous amino acid substitutions, or those being non-synonymous but conserved in physicochemical properties were encoded as “0” (Cons in the following discussion, figures, and tables). Conversely, mutations leading to non-synonymous amino acid substitutions resulting in non-conserved physicochemical properties of the Vif protein (polar to non-polar changes, acidic to basic changes, gross molecular structure size changes, as well as changes in susceptibility to post-translational modifications such as phosphorylation, ubiquitination, SUMOylation, methylation, and glycosylation) were encoded as “1” (Mut). The definition of explicit variable-value combinations used the ID3 algorithm as implemented in the Waikato Environment for Knowledge Analysis (WEKA) workbench v3.6 [44]. ID3 was used for generating a decision tree for each clinical endpoint, relying on tree branches to incorporate variable status (Mut or Cons) combinations. The calculation of the statistical significance of variable frequency differences between clinical endpoint groups relied on the two-sided Fisher’s exact test using IBM SPSS Statistics (version 21, IBM Corporation, Armonk, NY, USA).

4. Results

The position of the Vif encoding region within the HIV-1 reference sequence HXB2, and the position and nomenclature of the Vif protein motifs and their putative ligands, is provided in Figure 1. The APOBEC-1 variable, corresponding to the N-terminal APOBEC3 binding site ($^{14}$DRMR$^{17}$), was excluded from the original dataset as it remained conserved.

4.1. Classification on the Balanced Datasets

The assessment of the relevance of each variable, as explained in Section 3.5, was based on the classification performance from four different classifiers (CART, MLP, SVMs, and Multinomial-NB) as implemented in the Scikit-learn package [30].
We identified the top 100 variable combinations associated with each clinical endpoint class by applying the proposed method to assess variable relevance. In total, we obtained 1600 top-performing genetic variable combinations across the four clinical endpoints (CD4Ini, CD4Hist, VLIni, and VLHist) using the four classification algorithms. The balanced accuracy was calculated with a five-fold cross-validation approach during each training process. Algorithm accuracy was defined as the correct identification of both true positive and true negative registry examples (patients) and encompasses true-positive and true-negative predictive rates.
Out of the four machine learning algorithms tested, MLP outperformed the three other machine learning algorithms during the analysis of each of the four clinical endpoints, accurately classifying 79.6%, 76%, 68.5%, and 66.3% of CD4Ini, CD4Hist, VLIni, and VLHist patient registries, respectively. The classification performance of each machine learning algorithm for each clinical endpoint is summarized in Table 1.
The best classification results achieved higher values than those previously reported elsewhere [20]; this can easily be explained by the use of balanced datasets and five-fold cross-validation settings in this report. The genetic variable combinations providing the best classification performance are summarized in Table 2.
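As an illustration of the metric, balanced accuracy is commonly computed as the mean of the true-positive rate and the true-negative rate, which is the definition assumed in this sketch. The function name and the choice of 1 as the positive label are assumptions.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of the true-positive rate and the true-negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

On a balanced dataset this value coincides with plain accuracy, but it remains informative when a cross-validation fold ends up slightly unbalanced.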
Considering the top scores per clinical endpoint shown in Table 2, the best discrimination was achieved for the CD4 T cells counts (CD4Ini and CD4Hist clinical endpoints).
On the other hand, low performance was observed on the VLIni clinical endpoint [71.5–80.2], and even lower for the VLHist [68.5–73.8].
Some variables were present in all “top combinations” identified for a given clinical endpoint. These were: [BCbox-3, BCbox-2, and APOBEC-2] for CD4Ini, [APOBEC-2, APOBEC-4, and BCbox-3] for CD4Hist, [APOBEC-2 and APOBEC-4] for VLIni, and [NLIS, APOBEC-2, and BCbox-1] for VLHist. The variable APOBEC-2 was present in 15 of the 16 best combinations; the only exception was the combination with the highest classification performance when using MLP with the CD4Ini clinical endpoint. On the other hand, BCbox-3 was present in all the best combinations related to the CD4 T cell count.

4.2. Results Using the MAREV-1

After defining the 100 best-combinations per clinical endpoint by each algorithm, an assessment on the relevance of each variable was then undertaken. This involved calculating the probabilities for each variable of being selected as the most informative (i.e., root variable) in each of the best combinations. The relevance scores (r) per algorithm and positions are shown in Appendix A, see Table A1, Table A2, Table A3 and Table A4. After evaluating all the variables for each clinical endpoint, a threshold was calculated per clinical endpoint and used for selecting the most relevant variables as mentioned previously; see Section 3.5. The calculated threshold values for the most relevant variables are summarized in Appendix A, see Table A6a. The variables indicated as most relevant for CD4Ini (ordered by their relevance scores) were: [BCbox-3, APOBEC-3, APOBEC-5, APOBEC-2]; for CD4Hist: [APOBEC-2, APOBEC-3, APOBEC-5]; for VLIni they were [APOBEC-2, BCbox-1, APOBEC-3] and, finally; for VLHist they were [NLIS, APOBEC-3, APOBEC-5]. Considering these most relevant variables, APOBEC-3 proved to be associated with all the clinical endpoints, while APOBEC-2 and APOBEC-5 were present in only three clinical endpoints. BCbox-1 was seen to be the most relevant for only VLIni. BCbox-3 was only relevant for CD4Ini, and NLIS was suggested as being the most relevant in only VLHist.
The most relevant variables identified were in agreement with the best variables identified in previous efforts using alternative approaches [20], as shown in Table A7b; see Appendix A. This was also the case for the second variables in the clinical endpoints CD4Hist and VLIni. Another difference was that the quantity of variables defined as the most relevant when using the MAREV-1 approach was much higher for the clinical endpoints CD4Ini and CD4Hist than reported previously.

4.3. Results Using the MAREV-2

In this approach, the variable assessment process was done considering only the combinations of variables having the best classification performance; see Table 2. As with MAREV-1, MAREV-2 also calculated the probability of each variable appearing at every available position. This was later used to determine the score per variable and clinical endpoint, as shown in Table A6b; see Appendix A. The variables discovered to be most relevant (ordered by their scores) were: [BCbox-3, BCbox-2] for CD4Ini; [APOBEC-2, APOBEC-4, BCbox-3] for CD4Hist; [APOBEC-2, APOBEC-4, BCbox-1] for VLIni; and [APOBEC-2, NLIS, BCbox-1] for VLHist. Unlike with MAREV-1, none of the variables were present in all clinical endpoints. However, APOBEC-2 was present in CD4Hist, VLIni and VLHist. On the other hand, APOBEC-2 and APOBEC-4 are related to CD4Hist and VLIni; BCbox-1 is relevant for VLIni and VLHist. Finally, BCbox-3 is relevant for CD4Ini and CD4Hist. BCbox-2 is only relevant for CD4Ini, while NLIS is relevant for VLHist. These variables are compared with the previous findings and those suggested by the 100-model analysis (see Table A7c in Appendix A).
The comparison among the variables identified as the most relevant by the previous approach, MAREV-1 and MAREV-2, shows a coincidence in some of the variables detected as most relevant. This is the case of BCbox-3 in CD4Ini and APOBEC-2 in both CD4Hist and VLIni. Although MAREV-1 and the previous approach agreed on assigning NLIS as the most relevant variable for VLHist, this motif was only suggested as the second most relevant for this clinical endpoint by MAREV-2.

4.4. Decision Trees and the Most Relevant Variable Combinations from MAREV-1 and MAREV-2

The decision trees defined with the variables determined by the MAREV-1 are shown in Figure 5, while those using the MAREV-2 are shown in Figure 6.
ID3 branch frequency was used to identify specific combinations of input variable status (Mut or Cons) as related to the clinical endpoints in Fisher’s exact test. Only branches having more than 1 variable were considered, yielding a total of 20 variable combinations for the MAREV-1 approach (6 for CD4Ini, 5 for CD4Hist, 6 for VLIni, and 3 for VLHist), whereas the MAREV-2 approach identified 22 different relevant variable combinations (4 for CD4Ini, 6 for CD4Hist, 6 for VLIni, and 6 for VLHist). The results of the statistical assessment for the MAREV-1 and MAREV-2 approaches are shown in Table 3.
Four of the 20 ID3-combinations defined from the MAREV-1 approach were detected as associated with clinical endpoints after further statistical testing. One was present for CD4Ini (p-value = 0.0011), two for CD4Hist (p-value = 0.0136, p-value = 0.0182), and one for VLIni (p-value = 0.0207). None of the associated combinations were present in VLHist. The combination for CD4Ini [BCbox-3$_{Mut}$, APOBEC-3$_{Cons}$] suggests protection from having lower numbers of CD4 T lymphocytes at the time of initial medical assessment, as it was present in only 6 patient samples having <500 CD4 T cells, compared to 53 patient samples not having said combination. In the case of CD4Hist, only one combination [APOBEC-2$_{Cons}$, APOBEC-3$_{Cons}$] suggested protection from having less than 500 T lymphocytes on medical follow-up, as was also found in our previously published work. A second combination [APOBEC-2$_{Mut}$, APOBEC-3$_{Cons}$, APOBEC-5$_{Cons}$] was found to be associated with the risk of progression to less than 500 CD4 T lymphocytes on medical follow-up. The absence of said combination was detected in 14 out of 15 sequences with ≥500 CD4 T cells. Finally, in the case of VLIni, the [APOBEC-2$_{Mut}$, BCbox-1$_{Cons}$, APOBEC-3$_{Cons}$] combination suggested a risk of having higher HIV viral loads on the first medical examination, as it was absent in 22 out of the 26 cases with less than 10,000 virus copies.
On the other hand, the 22 ID3-combinations generated using the variables defined by MAREV-2 yielded 5 clinical associations. Both of the associations found in CD4Ini involved variables BCbox-2 and BCbox-3, where the conservation of both protein regions was associated with a higher risk of having lower initial CD4 T lymphocytes on the first medical examination (p-value = 0.0068). This variable combination was present in 26 of the patient cases with <500 CD4 cells/μL, compared with a single occurrence in a patient having ≥500. A second variable combination, [BCbox-2$_{Mut}$ and BCbox-3$_{Mut}$], was associated with protection from low CD4 T lymphocyte counts, as it was observed to be more frequent in patients having a CD4 cell count ≥500 cells/μL (p-value = 0.0049). Regarding historic CD4 T cell counts, one variable combination [APOBEC-2$_{Mut}$, BCbox-3$_{Cons}$] was associated with the risk of having low CD4 T cell counts on medical follow-up, as it was present in 20 cases with a CD4 cell count below 500 and not in patients having ≥500 CD4 T cells/μL. Regarding initial viral load assessments, [APOBEC-2$_{Mut}$, APOBEC-4$_{Mut}$, BCbox-1$_{Cons}$] was associated with the risk of having high viral titres (≥10,000 viral copies) at the time of initial medical examination and was present in 11 patients having ≥10,000 viral copies, yet in only a single patient having lower viral loads. Finally, [NLIS$_{Mut}$, APOBEC-2$_{Mut}$, BCbox-1$_{Cons}$] was observed to be associated with a higher risk of high historical viral loads on patient follow-up, as it was seen only once in a patient having <10,000 copies but was present in 6 patients having more than 10,000 copies of the virus. In total, eight novel HIV associations were identified through this approach: three by MAREV-1 and five by MAREV-2.
MAREV-1 identified distinct Vif protein regions as highly relevant, mainly those involved in APOBEC3 interactions and Elongin B/C binding. Relevant APOBEC3 interaction motifs included APOBEC-3, which was found to be conserved in all cases, as well as APOBEC-2, which only failed to be relevant with regard to CD4Ini. Similarly, APOBEC-5 was found to be absent in CD4Hist while BCbox-1 was related to VLIni. MAREV-2 also identified APOBEC-3, APOBEC-2, and APOBEC-4, together with the Elongin B/C-box binding motifs BCbox-1, BCbox-2, and BCbox-3, as most relevant. The MAREV-2 results for VLHist agree with our previously published findings by suggesting a higher relevance of the NLIS segment.
These results help support the variables detected as more informative in our previous findings [20], namely: (i) [BCbox-3] for CD4Ini, (ii) [APOBEC-2] for CD4Hist, VLIni and VLHist, (iii) [BCbox-1] for VLIni and VLHist, and (iv) [NLIS] for VLHist. Additionally, the MAREV-1 approach assigns relevance to the variables [APOBEC-3 and APOBEC-5], while MAREV-2 assigns relevance to [APOBEC-4, BCbox-2, and BCbox-3]. On the other hand, the four associations determined with MAREV-1 and the five determined by MAREV-2 were fewer than the seven suggested by the previous methodology. Only one of said associations was present when using both approaches. Fewer associations were found when considering the viral load clinical status, both initial and historical. This was the case for VLHist, where no association was found when using the MAREV-1 approach. However, determining which set of associations has more biological significance requires further research.
Table 3 summarizes the most relevant associations of genetic variable combinations with each of the four clinical endpoint variables out of the 20 and 22 hypotheses tested by the MAREV-1 and MAREV-2 algorithms, respectively. On initial examination, the repeated appearance of APOBEC and Elongin B/C Box motifs stands out in the results generated by both algorithms, irrespective of site status (mutated or conserved). This reflects the importance of Vif protein function, which involves both binding of Elongin B/C and recognition of APOBEC molecules to provide HIV with the capacity to escape from APOBEC-mediated innate immunity. From within the eight different APOBEC binding sites included in the analysis, APOBEC-2 and APOBEC-3 stand out for the number of times they appear in the associations shown in Table A5. Interestingly, the APOBEC-2 and -3 sites bind APOBEC3G and APOBEC3F, the two most relevant members of the APOBEC3 family of antiviral proteins. Nevertheless, our results indicate that the APOBEC3G and APOBEC3F protein binding site (APOBEC-2) is perhaps the least important of all the genetic Vif variables assessed. This is based on the fact that both MAREV-1 and MAREV-2 results show higher viral titres and lower CD4 T cell numbers (suggesting ongoing viral robustness) even in the presence of APOBEC-2 mutations, as long as the other APOBEC-binding regions or Elongin B/C binding regions remain conserved. This was observed for historic CD4 T cell numbers, initial viral loads, and historic viral loads.
Similarly, the recurrent appearance of Elongin B/C box-1 and box-3 binding sites also highlights the relevance that the Elongin interactions have for the Vif protein-mediated ubiquitination of APOBEC3 anti-viral proteins. Overall, our results emphasize the clinical relevance of both APOBEC3G and Elongin B/C binding sites from among the remaining Vif protein domains assessed. Figure 7 illustrates the position of the Vif encoding region within a reference (HXB2) HIV-1 genome, the Vif protein domains and regions, as well as some of the putative or known ligands. Even greater detail is provided by our results regarding the weight of each of these genetic variables when individual clinical outcomes are considered. At least one previous report has identified that amino acid substitutions in Elongin B/C sites lead to a loss of infectivity in HIV [45].
The results of both MAREV-1 and -2 suggest that initial CD4 T cell numbers seem to depend more on Elongin B/C site status than on any other Vif protein attribute. When Elongin B/C box mutations are present, such as in [BCbox-3 Mut, APOBEC-3 Cons] (MAREV-1) and [BCbox-3 Mut, BCbox-2 Mut] (MAREV-2), a greater number of patients are seen in the ≥500 cells/μL class than in the ≤500 cells/μL class. This supports the notion that Elongin B/C binding box mutations are detrimental to viral fitness and thus prevent HIV from escaping APOBEC3 inhibition or interference.
An additional interesting finding relates to historic CD4 T cell numbers and viral loads. HIV patients are normally enrolled into anti-retroviral therapy protocols after being diagnosed, irrespective of CD4 T cell counts and viral load numbers. The clinical impact that viral mutations have at this stage, after initiating treatment, has largely been linked to protease, reverse-transcriptase, and integrase sites, those most subjected to selective pressures by anti-retroviral drugs. Our results indicate that the conservation of APOBEC binding motifs is essential to viral fitness (and worsening clinical progression), at least in the MAREV-1 results. As such, [APOBEC-2 Mut, APOBEC-3 Cons, APOBEC-5 Cons] and [APOBEC-2 Cons, APOBEC-3 Cons] were more common among patients having lower CD4 T cell numbers on follow-up. This was also true for BCbox-3 in MAREV-2 results, where [APOBEC-2 Mut, BCbox-3 Cons] was also more common among patients having ≤500 cells/μL. Previous reports have highlighted how the conservation of APOBEC binding sites is crucial for Vif-mediated viral fitness. Our results suggest that the mutation of certain APOBEC3 binding site motifs (i.e., APOBEC-2) is tolerated without a significant effect on viral fitness as long as other, perhaps more important, remaining motifs are conserved (i.e., APOBEC-3 and/or -5) [46].

5. Conclusions

This paper proposes a new methodology based on machine learning algorithms (CART, NB, SVMs, and MLP) combined with an undersampling approach to deal with an imbalanced HIV dataset. Additionally, we present evidence of the classification performance of two different approaches (MAREV-1 and MAREV-2) for the identification of associations of Vif protein motifs with clinical endpoints in HIV. These variables subsequently proved to play a crucial role when different combinations of them were linked to HIV outcome, a difficult task that is not possible to achieve in human terms without relying on statistical corrections that decrease the statistical power of the study. These findings are in agreement with the known properties and with the functional and clinical relevance of the different Vif protein motifs found to be relevant. Needless to say, further research employing cell biology and molecular epidemiology tools is warranted so as to provide further support for these claims. Efforts are currently underway in our group to test the clinical utility of the identified variable combinations in a novel, larger HIV cohort.
When comparing the different strategies described in this manuscript, MAREV-2 was able to identify many more clinical associations, at least one per clinical outcome. This might be interpreted to suggest that this approach might prove more useful in future analysis and in clinical settings.
Many techniques are currently available to deal with imbalanced datasets. Although we studied the capacity of an undersampling approach to resolve this limitation, future work will explore the performance of oversampling techniques. These results provide further evidence of the usefulness and potential of machine learning methods for analyzing complex datasets. Given the exponential growth of applications of artificial intelligence and classification strategies, this field is likely to benefit from the results presented herein.
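For readers unfamiliar with the idea, the sketch below shows plain random undersampling in Python: samples from the larger classes are discarded at random until every class is as small as the smallest one. This is an illustrative assumption only. The function name `random_undersample`, the toy data, and the single-draw scheme are hypothetical and are not meant to reproduce the undersampling procedure used in this study.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class matches the smallest class."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Size of the smallest class sets the number kept from every class.
    n_keep = min(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(group, n_keep))
    rng.shuffle(balanced)

    new_samples = [sample for sample, _ in balanced]
    new_labels = [label for _, label in balanced]
    return new_samples, new_labels


# Hypothetical toy data: 90 majority-class and 10 minority-class patients.
samples = list(range(100))
labels = [0] * 90 + [1] * 10
x_bal, y_bal = random_undersample(samples, labels)
print(Counter(y_bal))
```

An oversampling counterpart would instead draw minority-class samples with replacement until the classes are balanced, which keeps all majority-class information at the cost of duplicated records.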
Elongin B/C binding site mutations might prove to be the single most important Vif genetic feature determining CD4 T cell numbers at the time of clinical debut and at a time when viral replication has not been subjected to the influence of anti-retroviral drugs (as patients are treatment-naïve at this time). This opens the possibility that molecular approaches targeting HIV-1 Elongin B/C binding motifs or those inhibiting the interactions of Elongin B/C and Vif might provide innovative preventative strategies in the fight against HIV.
Overall, our results provide insight into the utility of both MAREV-1 and -2 algorithms in discriminating complex genetic variable combinations linked to clinical endpoints in HIV, the practical utility of screening for accessory protein encoding region mutations in HIV prognosis, and their potential for guiding the development of novel therapeutic interventions in HIV.

Author Contributions

Conceptualization, J.S.A.-F., J.C.C.-T., P.T., and C.A.G.-S.; methodology, J.S.A.-F. and P.T.; software, L.Á.A.-H.; validation, J.S.A.-F., J.C.C.-T. and C.A.G.-S.; formal analysis, J.S.A.-F. and C.A.G.-S.; investigation, J.S.A.-F. and L.Á.A.-H.; resources, S.E.G.-P.; data curation, S.E.G.-P.; writing—original draft preparation, J.S.A.-F.; writing—review and editing, J.S.A.-F., J.C.C.-T. and C.A.G.-S.; visualization, J.S.A.-F. and L.Á.A.-H. All authors have read and agreed to the published version of the manuscript.

Funding

The first author thanks the National Science and Technology Council (CONACYT) for funding through scholarship #436028 and for the support for a research stay. He also thanks the University of Birmingham in the UK for its generous support during the research visit. The authors also thank the wonderful people caring for HIV patients at Centro Ambulatorio para la Prevención y Atención del SIDA e Infecciones de Transmisión Sexual (CAPASITS, SLP) for their unconditional help and work.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this research was already published [23], and it is publicly available at http://www.genomica.uaslp.mx/Research/HIV.html, accessed on 28 October 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HIV      Human Immunodeficiency Virus
Vif      Viral Infectivity Factor
CD4      Cluster of Differentiation 4
APOBEC3  APOlipoprotein B messenger RNA Editing enzyme, Catalytic polypeptide-like
NLIS     Nuclear Localisation Inhibitory Signal
SVMs     Support Vector Machines
ANNs     Artificial Neural Networks
NB       Naïve Bayes
MLP      Multi-Layer Perceptron
RBF      Radial Basis Function

Appendix A. Variable Assessment

Following the proposed methodology for assessing the relevance of each variable (see Section 3.5), the results from the fourth step are in Table A1, Table A2, Table A3 and Table A4. There is one table per clinical endpoint: CD4Ini, CD4Hist, VLIni, and VLHist, respectively. Each table shows the results from each classification algorithm: CART, MLP, NB, and SVMs.
Table A1. Relevance scores (r) in a descending order per algorithm and variable considering the clinical endpoint CD4Ini using the MAREV-1 approach. The four variables with higher values are highlighted in bold.
(a) CART
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 BCbox-3 6 1.355 0.545 0 0.188 0 0 0.107 0.087 0 8.282
2 APOBEC-3 0 2.71 1.818 1.793 0.469 0.192 0 0 0 0 6.982
3 APOBEC-5 0 0 2 1.11 1.875 0.673 0.186 0.107 0.087 0 6.038
4 APOBEC-2 0 2.613 1.364 0.768 0.562 0.288 0.186 0 0 0 5.782
5 BCbox-2 2.4 1.161 0.273 0.939 0.094 0.096 0.093 0 0.348 0.1 5.504
6 APOBEC-6 0 0 0 0.939 0.469 1.346 0.744 0.321 0.435 0 4.254
7 APOBEC-4 1.6 0.097 0.455 0.427 0.656 0.096 0.093 0 0.087 0 3.511
8 Cul5-3 0 0 1.091 0.427 0.562 0.385 0.279 0 0.087 0.2 3.031
9 CBFb-1 0 0 0 0.085 0.281 0.673 0.465 0.964 0.348 0.1 2.917
10 APOBEC-7 0 0.097 0.091 0.341 0.375 0.288 0.837 0 0 0.3 2.33
11 APOBEC-8 0 0 0.091 0.171 0.094 0.288 0.279 0.536 0.174 0 1.633
12 NLIS 0 0.968 0.182 0 0 0.192 0 0.107 0 0 1.449
13 CBFb-2 0 0 0.091 0 0.188 0.096 0.279 0.214 0 0.1 0.968
14 Cul5-1 0 0 0 0 0.094 0.096 0.372 0.214 0.087 0.1 0.963
15 Cul5-2 0 0 0 0 0.094 0.096 0.093 0.321 0.087 0.1 0.791
16 BCbox-1 0 0 0 0 0 0.192 0.093 0.107 0.174 0 0.566
(b) MLP
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 BCbox-3 7.3 0.09 0.33 0.151 0 0.07 0.07 0 0 0.125 8.136
2 APOBEC-3 0 2.79 2.062 0.527 0.714 0.211 0.14 0.071 0 0 6.516
3 APOBEC-5 0 0 1.732 1.355 0.643 0.915 0.211 0.071 0.069 0.062 5.059
4 BCbox-2 1.2 1.44 0.495 0.527 0.357 0.211 0.211 0 0.069 0.062 4.572
5 APOBEC-2 0 2.07 0.825 0.527 0.357 0.282 0.14 0.143 0 0.062 4.406
6 APOBEC-4 1.3 0.09 0.577 0.903 0.286 0.282 0.211 0 0.069 0.062 3.78
7 Cul5-2 0 0 0.082 0.376 1.714 0.352 0.281 0.286 0.138 0.062 3.292
8 APOBEC-7 0.1 1.35 0.412 0.301 0.143 0.423 0.351 0.071 0.069 0.062 3.283
9 Cul5-3 0 0.09 0.66 0.602 0.357 0.282 0.421 0.357 0.345 0 3.114
10 APOBEC-6 0 0.09 0.33 0.828 0.357 0.563 0.281 0.286 0.138 0.188 3.06
11 APOBEC-8 0 0.09 0 0.226 0.5 0.352 0.421 0.143 0.207 0 1.939
12 CBFb-1 0 0 0 0.075 0.214 0.423 0.632 0.357 0 0.125 1.826
13 NLIS 0.1 0.72 0.165 0.151 0.071 0.211 0.14 0.071 0.069 0.062 1.761
14 CBFb-2 0 0.09 0.165 0.301 0 0.141 0.14 0.357 0.276 0.062 1.533
15 BCbox-1 0 0.09 0.165 0.075 0.214 0.211 0.211 0.286 0.138 0 1.39
16 Cul5-1 0 0 0 0.075 0.071 0.07 0.14 0.5 0.414 0.062 1.334
(c) NB
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 10 0 0 0 0 0 0 0 0 0 10
2 BCbox-3 0 6.48 0.08 0.298 0.468 0.317 0.186 0.077 0 0 7.906
3 APOBEC-3 0 0.27 1.92 0.968 1.091 0.952 0.372 0 0 0.1 5.673
4 APOBEC-4 0 0.09 2.08 0.968 0.545 0.397 0.186 0.308 0.143 0 4.717
5 BCbox-2 0 0.09 1.12 1.117 0.156 0.317 0.279 0.308 0.143 0 3.53
6 APOBEC-6 0 0 0.64 0.521 0.779 0.556 0.186 0.308 0.286 0 3.276
7 CBFb-2 0 0.63 0.72 0.298 0.312 0.159 0.186 0.231 0.143 0.1 2.778
8 APOBEC-5 0 0 0.08 0.223 0.701 0.476 0.651 0.231 0.286 0.1 2.749
9 APOBEC-7 0 1.08 0.08 0.447 0.312 0.238 0.093 0.308 0.143 0 2.7
10 Cul5-3 0 0 0.64 0.521 0.156 0.238 0.744 0.154 0.143 0 2.596
11 NLIS 0 0.27 0.08 0.521 0.234 0.397 0.279 0.154 0.143 0.2 2.278
12 BCbox-1 0 0 0.4 0.223 0.623 0.397 0.093 0.231 0.143 0.1 2.21
13 CBFb-1 0 0 0 0.223 0.39 0.238 0.279 0.385 0 0.2 1.715
14 APOBEC-8 0 0.09 0.16 0.372 0.156 0.159 0.279 0.154 0.143 0 1.513
15 Cul5-2 0 0 0 0.223 0.078 0 0.186 0.077 0 0.2 0.764
16 Cul5-1 0 0 0 0.074 0 0.159 0 0.077 0.286 0 0.596
(d) SVMs
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 BCbox-3 5.5 0.827 0.774 0.402 0.676 0.323 0 0.083 0 0 8.585
2 BCbox-2 1.5 2.02 1.29 0.805 0.423 0.081 0.226 0 0.286 0 6.631
3 APOBEC-4 2.6 0.276 1.118 0.885 0.761 0.323 0.226 0.167 0 0.1 6.455
4 APOBEC-2 0.1 1.745 1.376 0.563 0.423 0.726 0 0.417 0.19 0.1 5.64
5 APOBEC-3 0 0.276 1.032 2.011 0.676 1.048 0.302 0.167 0.095 0 5.607
6 APOBEC-7 0 2.663 0.688 0.483 0.507 0.242 0.226 0.083 0.19 0 5.083
7 NLIS 0.1 0.643 1.032 0.483 0.169 0.242 0.528 0.083 0.19 0 3.471
8 APOBEC-5 0 0.092 0.086 0.241 0.592 0.645 0.906 0.583 0.19 0 3.335
9 Cul5-3 0 0.092 0.258 0.483 0.676 0.565 0.528 0.083 0.19 0 2.875
10 APOBEC-6 0 0 0 0.161 0.423 0.242 0.679 0.583 0.19 0.2 2.478
11 CBFb-2 0.1 0.367 0.258 0.322 0 0.242 0.075 0 0.19 0 1.555
12 APOBEC-8 0.1 0 0.086 0.08 0.338 0.161 0.075 0.25 0.095 0.2 1.387
13 BCbox-1 0 0 0 0.08 0.169 0.081 0.075 0.167 0 0.1 0.672
14 Cul5-2 0 0 0 0 0.169 0.081 0.075 0.167 0.095 0 0.587
15 CBFb-1 0 0 0 0 0 0 0 0.167 0.095 0.3 0.562
16 Cul5-1 0 0 0 0 0 0 0.075 0 0 0 0.075
Table A2. Relevance scores (r) in descending order per algorithm and variable considering the clinical endpoint CD4Hist using the MAREV-1 approach. The three variables with higher values are highlighted in bold.
(a) CART
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 6.8 0.2 0.494 0.467 0.338 0.094 0 0 0 0 8.393
2 APOBEC-3 0 4.8 0.691 0.467 0.676 0 0.133 0 0 0 6.767
3 APOBEC-5 0 0.4 1.481 2.24 0.845 0.377 0.4 0.214 0 0 5.958
4 BCbox-3 2.1 1.3 0.889 0.653 0.169 0.189 0 0 0.167 0 5.467
5 APOBEC-4 0 0.4 2.765 0.28 0.761 0.472 0.133 0.214 0.333 0 5.359
6 Cul5-3 0.5 0.8 0.198 0.653 0.254 0.66 0.4 0.536 0.167 0 4.167
7 APOBEC-6 0 0.3 0.099 1.12 1.183 0.472 0.4 0.321 0 0.111 4.006
8 APOBEC-7 0.1 0 0.395 0.093 0.93 0.377 0.533 0.214 0 0.222 2.865
9 CBFb-1 0 0 0 0.093 0.254 1.038 0.933 0.214 0.167 0 2.699
10 BCbox-2 0.5 0.6 0.395 0.467 0.169 0.189 0 0.107 0 0.111 2.538
11 CBFb-2 0 0.1 0.296 0.28 0.085 0.094 0.667 0.321 0.333 0 2.177
12 APOBEC-8 0 0.1 0.099 0 0 0.66 0 0.536 0 0.111 1.506
13 Cul5-1 0 0 0 0 0 0.189 0.4 0 0.5 0.111 1.2
14 NLIS 0 0 0.099 0.187 0.338 0.189 0 0.107 0 0.222 1.142
15 Cul5-2 0 0 0.099 0 0 0 0 0.214 0 0.111 0.424
16 BCbox-1 0 0 0 0 0 0 0 0 0.333 0 0.333
(b) MLP
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 6.7 0.36 0.495 0 0.207 0.074 0 0.075 0 0.136 8.047
2 APOBEC-3 0 4.05 0.825 0.692 0.276 0.147 0.17 0 0.148 0 6.308
3 BCbox-3 2.1 0.54 0.247 0.385 0.345 0.147 0.255 0.375 0.222 0.045 4.662
4 APOBEC-5 0 0.63 1.567 1.385 0.345 0.588 0.085 0 0 0.045 4.645
5 APOBEC-6 0 0.81 1.155 0.769 0.828 0.294 0.255 0.225 0.074 0.045 4.455
6 APOBEC-4 0 0.36 1.897 0.231 0.621 0.294 0.34 0.225 0.148 0 4.116
7 Cul5-3 0.5 0.9 0.495 0.846 0 0.368 0 0.525 0 0.045 3.679
8 CBFb-1 0 0 0.33 0.385 0.552 0.735 0.596 0.15 0.148 0 2.895
9 APOBEC-8 0 0.27 0.082 0.846 0.276 0.368 0.511 0.15 0.074 0.045 2.622
10 Cul5-2 0 0.54 0.082 0.154 0.621 0.294 0.255 0 0.444 0.227 2.618
11 APOBEC-7 0.1 0 0.165 0.462 0.69 0.515 0.085 0.15 0.222 0.045 2.434
12 BCbox-2 0.5 0.27 0.165 0.385 0.276 0.147 0.255 0.225 0.074 0 2.297
13 NLIS 0 0.09 0.165 0.154 0.345 0.441 0.34 0.075 0 0.136 1.747
14 CBFb-2 0.1 0.18 0.165 0.077 0.138 0.221 0.426 0.075 0.148 0.136 1.665
15 Cul5-1 0 0 0.165 0.077 0.138 0.294 0.34 0.3 0 0.091 1.405
16 BCbox-1 0 0 0 0.154 0.345 0.074 0.085 0.45 0.296 0 1.404
(c) NB
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 10 0 0 0 0 0 0 0 0 0 10
2 APOBEC-4 0 1.8 0.851 0.612 1 0.317 0.651 0.143 0.065 0.111 5.551
3 APOBEC-3 0 0.09 2.043 1.4 0.923 0.397 0.186 0.071 0.065 0 5.174
4 APOBEC-6 0 0 0.596 0.875 0.538 0.794 0.837 0.143 0.323 0.111 4.217
5 APOBEC-8 0 1.89 0.426 0.525 0.231 0.238 0.279 0.286 0.129 0 4.003
6 BCbox-2 0 2.52 0.426 0.088 0.154 0 0.093 0.286 0.065 0.222 3.852
7 APOBEC-5 0 0 0.255 1.05 0.538 0.952 0.465 0.286 0.129 0.111 3.787
8 BCbox-3 0 0.54 0.17 0.7 0.846 0.873 0.093 0.286 0.129 0 3.637
9 Cul5-3 0 1.35 0.596 0.175 0.231 0.317 0.093 0.214 0.065 0 3.041
10 APOBEC-7 0 0 0.085 0.525 0.538 0.397 0.465 0.214 0.129 0.222 2.576
11 CBFb-2 0 0 0.681 0.438 0.385 0.159 0 0.143 0.387 0 2.192
12 NLIS 0 0.81 0.17 0.438 0.231 0.238 0 0.071 0.129 0 2.087
13 CBFb-1 0 0 0.17 0.175 0.231 0 0.372 0.357 0.323 0.111 1.739
14 Cul5-2 0 0 1.362 0 0.077 0.079 0 0.214 0 0 1.732
15 Cul5-1 0 0 0.085 0 0.077 0.159 0.372 0.143 0.065 0 0.9
16 BCbox-1 0 0 0.085 0 0 0.079 0.093 0.143 0 0.111 0.511
(d) SVMs
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 8.8 0.329 0.107 0 0 0 0 0 0 0 9.236
2 BCbox-3 0.1 2.854 0.96 1.242 0.941 0.645 0.19 0.4 0 0 7.332
3 APOBEC-3 0 2.854 0.853 1.242 0.471 0.484 0 0 0.25 0.167 6.32
4 APOBEC-5 0 0 0.427 1.016 1.529 0.806 0.571 0.6 0.25 0.167 5.367
5 APOBEC-4 0 0.659 2.133 0.565 0.824 0.323 0.571 0 0 0 5.074
6 APOBEC-7 0 0.659 1.067 0.452 0.118 0 0.762 0.4 0.25 0.167 3.873
7 APOBEC-6 0.1 0.11 0.64 0.339 0.706 0.484 0.19 0.4 0.5 0 3.469
8 BCbox-2 0.4 0.439 0.64 0.677 0.235 0.161 0.19 0.2 0 0.167 3.11
9 Cul5-3 0.4 0.439 0.64 0.226 0.471 0.323 0 0 0.5 0 2.998
10 APOBEC-8 0.1 0.22 0.32 0.339 0.118 0.645 0.571 0.2 0 0 2.512
11 NLIS 0.1 0.22 0.107 0.226 0.118 0.484 0.571 0.2 0 0 2.025
12 CBFb-2 0 0.11 0.107 0.452 0.353 0.323 0.19 0.2 0 0 1.734
13 CBFb-1 0 0 0 0.113 0.118 0.323 0.19 0.4 0.25 0.333 1.727
14 Cul5-2 0 0 0 0.113 0 0 0 0 0 0 0.113
15 BCbox-1 0 0.11 0 0 0 0 0 0 0 0 0.11
16 Cul5-1 0 0 0 0 0 0 0 0 0 0 0
Table A3. Relevance scores (r) in descending order per algorithm and variable considering the clinical endpoint VLIni using the MAREV-1 approach. The three variables with higher values are highlighted in bold.
(a) CART
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 6.3 0.276 0.556 0 0 0 0.444 0 0 0 7.576
2 BCbox-1 0 5.235 0.667 0.75 0.13 0 0 0 0 0.25 7.032
3 APOBEC-3 0 0.827 1.778 1.625 0.783 0.167 0.148 0.2 0 0 5.527
4 APOBEC-5 0 0.918 0.889 0.875 1.435 0.333 0.593 0 0 0 5.043
5 CBFb-2 1 0.551 1.333 0.75 0.13 0.167 0.296 0 0 0 4.228
6 APOBEC-4 0 0.643 1.556 0.5 0.522 0.667 0.296 0 0 0 4.183
7 APOBEC-7 0 0.276 0.667 0.375 0.652 0.5 0.296 0.2 0 0.25 3.216
8 APOBEC-6 0 0.092 0.222 0.375 0.783 0.833 0.296 0.4 0.2 0 3.201
9 BCbox-2 2.7 0 0.111 0.125 0 0 0.148 0 0 0 3.084
10 CBFb-1 0 0 0 0.125 0.522 1.167 0.296 0.2 0.2 0.25 2.76
11 Cul5-1 0 0 0 0 0.13 0.333 0.296 1 0.4 0 2.16
12 NLIS 0 0.092 0 0.375 0.261 0.5 0.148 0.6 0 0 1.976
13 Cul5-3 0 0.092 0.111 0.625 0.261 0 0.148 0.2 0.2 0 1.637
14 Cul5-2 0 0 0 0.125 0 0 0.296 0.2 0.6 0 1.221
15 APOBEC-8 0 0 0.111 0.125 0.261 0.167 0.148 0 0.4 0 1.212
16 BCbox-3 0 0 0 0.25 0.13 0.167 0.148 0 0 0.25 0.945
(b) MLP
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 6.3 0.273 0.4 0.092 0 0.357 0.118 0.13 0.154 0.125 7.949
2 BCbox-1 0 5.182 0.5 0.276 0.689 0.476 0.118 0 0.154 0 7.394
3 APOBEC-5 0 0.545 1 0.276 1.082 0.952 0.471 0.261 0.154 0 4.741
4 CBFb-2 1 0.818 1.7 0.368 0.098 0.357 0.118 0 0 0 4.46
5 APOBEC-3 0 1.182 0.8 1.105 0.59 0.238 0 0.13 0.308 0 4.353
6 CBFb-1 0 0 0.1 1.289 0.492 0.238 0.471 0.652 0.154 0.125 3.521
7 BCbox-2 2.7 0 0.3 0.092 0 0 0 0 0 0 3.092
8 Cul5-1 0 0 0.1 0.553 0.59 0.595 0.588 0.652 0 0 3.078
9 APOBEC-4 0 0.273 0.8 0.645 0.393 0 0.353 0.13 0.154 0 2.748
10 APOBEC-7 0 0.182 0.8 0.553 0.59 0.238 0.118 0.13 0 0 2.611
11 Cul5-2 0 0 0.1 0.092 0.295 0.833 0.471 0.261 0.308 0.125 2.485
12 NLIS 0 0 0.6 0.184 0.197 0.119 0.353 0.13 0.154 0.25 1.987
13 BCbox-3 0 0.091 0.1 0.737 0.197 0.238 0 0.13 0.308 0 1.801
14 APOBEC-8 0 0 0.3 0.092 0.492 0.119 0.353 0.13 0 0.25 1.736
15 APOBEC-6 0 0.182 0.3 0.368 0.098 0.119 0.235 0.13 0 0.125 1.558
16 Cul5-3 0 0.273 0.1 0.276 0.197 0.119 0.235 0.13 0.154 0 1.484
(c) NB
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 10 0 0 0 0 0 0 0 0 0 10
2 Cul5-3 0 5.22 1.495 0.236 0.651 0.069 0 0.073 0 0.143 7.887
3 APOBEC-4 0 3.24 0.703 0.393 0.651 0.347 0.082 0.146 0 0.143 5.705
4 CBFb-2 0 0 1.495 0.708 0.94 0.347 0.49 0.366 0.133 0 4.478
5 APOBEC-3 0 0 0.967 1.416 0.217 0.833 0.408 0.366 0.067 0 4.274
6 BCbox-1 0 0.36 1.319 0.787 0.578 0.208 0.327 0 0.067 0.143 3.788
7 BCbox-2 0 0 0.44 1.573 0.434 0.139 0.408 0.073 0.067 0.143 3.276
8 APOBEC-5 0 0 0.791 0.157 0.361 0.278 0.571 0.22 0.133 0.143 2.655
9 APOBEC-7 0 0 0.176 0.236 0.578 0.625 0.327 0.512 0.133 0 2.587
10 CBFb-1 0 0 0.088 0.157 0.578 0.833 0.163 0.293 0.4 0 2.513
11 NLIS 0 0.18 0.088 0.393 0.361 0.486 0.163 0.073 0.133 0.143 2.021
12 APOBEC-6 0 0 0.352 0.236 0.361 0.208 0.245 0.439 0 0 1.841
13 BCbox-3 0 0 0.088 0.157 0 0.347 0.245 0.293 0.2 0 1.33
14 APOBEC-8 0 0 0 0.236 0.072 0.069 0.408 0.073 0.333 0 1.192
15 Cul5-1 0 0 0 0.157 0.145 0.139 0.082 0.073 0.267 0 0.862
16 Cul5-2 0 0 0 0.157 0.072 0.069 0.082 0 0.067 0.143 0.59
(d) SVMs
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 5.8 1.299 1.129 0.339 0 0.179 0 0 0 0 8.746
2 BCbox-2 2.1 1.485 1.6 0.339 0.15 0.179 0 0.167 0 0 6.018
3 APOBEC-3 0.2 1.206 1.318 1.581 0.6 0.357 0.462 0 0 0 5.723
4 CBFb-2 0.3 2.876 1.035 0.565 0.15 0.179 0 0 0 0 5.105
5 APOBEC-5 0 0.093 0.188 0.903 1.05 0.357 0.615 0.667 0.364 0 4.237
6 APOBEC-7 0.4 0.186 0.565 0.226 1.8 0.179 0 0.333 0.364 0 4.052
7 BCbox-3 0.3 0.742 0.282 0.339 0 0.893 0.308 0 0.182 0.5 3.546
8 APOBEC-4 0.6 0.371 0.471 0.903 0.15 0.536 0.308 0 0.182 0 3.52
9 NLIS 0.1 0.186 0.376 0.113 0.45 0.536 0.923 0.5 0 0 3.184
10 Cul5-3 0.2 0.278 0.659 0.452 0.45 0.179 0.154 0.167 0.182 0 2.72
11 APOBEC-8 0 0.186 0 0.339 0.6 0.714 0.154 0.333 0.364 0 2.689
12 APOBEC-6 0 0 0.282 0.452 0.45 0.357 0.308 0 0.182 0 2.031
13 CBFb-1 0 0 0 0.113 0.15 0.357 0.154 0.5 0.182 0 1.456
14 BCbox-1 0 0.093 0.094 0.226 0 0 0.615 0.333 0 0 1.361
15 Cul5-1 0 0 0 0.113 0 0 0 0 0 0.5 0.613
16 Cul5-2 0 0 0 0 0 0 0 0 0 0 0
Table A4. Relevance scores (r) in descending order per algorithm and variable considering the clinical endpoint VLHist using the MAREV-1 approach. The three variables with higher values are highlighted in bold.
(a) CART
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 NLIS 9.9 0 0 0 0.158 0 0 0 0 0 10.058
2 APOBEC-3 0 5.091 0.427 0.28 0 0 0.2 0 0 0 5.998
3 APOBEC-5 0 0.909 4.16 0.28 0 0 0.2 0 0 0 5.549
4 APOBEC-2 0.1 2 0.747 0.7 0.474 0 0.4 0.5 0.4 0 5.32
5 CBFb-1 0 0 0.64 1.12 1.263 0.833 0.4 0 0 0 4.256
6 BCbox-1 0 0.636 0.853 0.42 0.632 0.333 0.4 0.5 0.4 0 4.175
7 APOBEC-8 0 0 0.213 1.82 0.632 0.5 0.8 0 0 0 3.965
8 APOBEC-7 0 0.182 0.107 0.42 0.632 0.5 0.8 0.25 0.4 0 3.29
9 Cul5-1 0 0 0.213 0.42 0.474 0.833 0.4 0.75 0 0 3.09
10 APOBEC-6 0 0 0.107 1.12 0.789 0 0 0.5 0 0 2.516
11 APOBEC-4 0 0.091 0.107 0.28 0 0.333 0.2 0.25 0 0.5 1.761
12 BCbox-3 0 0.091 0.32 0 0.158 1 0 0 0 0 1.569
13 CBFb-2 0 0 0.107 0.14 0.158 0.333 0 0 0 0.5 1.238
14 Cul5-2 0 0 0 0 0.158 0.167 0.2 0.25 0.4 0 1.175
15 BCbox-2 0 0 0 0 0.158 0.167 0 0 0.4 0 0.725
16 Cul5-3 0 0 0 0 0.316 0 0 0 0 0 0.316
(b) MLP
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 NLIS 9.9 0 0 0 0 0 0 0 0 0 9.9
2 BCbox-1 0 3.33 0.791 1.28 0.312 0.137 0.123 0.45 0.16 0.053 6.636
3 CBFb-1 0 0 1.67 0.768 1.169 0.753 0.308 0.225 0 0 4.894
4 APOBEC-5 0 0.81 1.319 1.11 0.39 0.479 0.185 0.075 0.4 0.053 4.82
5 APOBEC-3 0 2.43 0.44 0.683 0.234 0 0.246 0.075 0.16 0.053 4.32
6 Cul5-1 0 0 0.264 0.427 1.169 1.096 0.8 0.375 0.08 0.105 4.316
7 APOBEC-2 0.1 1.62 0.615 0.341 0.156 0.205 0.431 0.15 0.16 0.158 3.937
8 Cul5-2 0 0 0 0.256 1.325 0.753 0.862 0.375 0.16 0.105 3.836
9 APOBEC-8 0 0 0.44 0.854 0.39 0.137 0.492 0.45 0.32 0.053 3.135
10 APOBEC-7 0 0.09 1.319 0.427 0.078 0.205 0.185 0.225 0.16 0.053 2.741
11 APOBEC-4 0 0.36 0.527 0 0 0.274 0 0 0 0.158 1.319
12 APOBEC-6 0 0 0.352 0.085 0.312 0.205 0.062 0.15 0.08 0.053 1.298
13 BCbox-3 0 0.36 0 0.171 0.078 0.274 0.123 0.15 0.08 0.053 1.288
14 CBFb-2 0 0 0.264 0.171 0.234 0.137 0.062 0.15 0.08 0.053 1.149
15 Cul5-3 0 0 0 0.256 0.156 0.137 0.123 0.075 0.08 0.053 0.88
16 BCbox-2 0 0 0 0.171 0 0.205 0 0.075 0.08 0 0.531
(c) NB
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 APOBEC-2 10 0 0 0 0 0 0 0 0 0 10
2 Cul5-3 0 1.71 3.2 0.794 0.174 0.574 0.073 0.077 0 0 6.601
3 NLIS 0 3.78 1.04 1.01 0.087 0.082 0.145 0.154 0.125 0 6.424
4 BCbox-1 0 0.54 1.12 1.155 1.13 0.656 0.218 0.231 0.125 0 5.175
5 APOBEC-8 0 2.16 0.32 0.361 0.435 0.41 0.582 0.231 0 0 4.498
6 APOBEC-3 0 0.09 0.24 1.227 0.783 0.41 0.727 0.308 0.125 0.222 4.131
7 APOBEC-5 0 0 0.32 0.289 0.87 0.656 0.291 0.077 0.375 0 2.877
8 CBFb-1 0 0 0.16 0.433 0.435 0.656 0.364 0.308 0.125 0.111 2.591
9 APOBEC-7 0 0.09 0.08 0.505 0.261 0.41 0.218 0.385 0.25 0.222 2.421
10 BCbox-3 0 0 0.24 0.433 0.696 0.164 0.291 0 0.25 0 2.073
11 CBFb-2 0 0 0.56 0.289 0.174 0.41 0.145 0.231 0.125 0.111 2.045
12 BCbox-2 0 0.45 0.16 0.361 0.087 0.246 0.218 0.231 0 0 1.753
13 APOBEC-4 0 0.18 0.32 0 0.174 0.246 0.436 0 0.125 0 1.481
14 APOBEC-6 0 0 0.16 0.144 0.348 0 0.145 0.308 0.125 0.111 1.341
15 Cul5-1 0 0 0.08 0 0 0 0.145 0.231 0.25 0.111 0.817
16 Cul5-2 0 0 0 0 0.348 0.082 0 0.231 0 0.111 0.772
(d) SVMs
Rank Variable pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 Total
1 NLIS 9.2 0.45 0.188 0.106 0 0 0 0 0 0 9.944
2 BCbox-3 0.1 3.51 0.941 0.636 0.8 0.568 0.2 0.097 0.2 0 7.052
3 APOBEC-2 0.3 1.8 1.976 0.955 0.533 0.227 0.1 0 0 0 5.892
4 APOBEC-4 0 0.9 1.6 1.379 0.4 0.568 0.1 0.194 0.2 0 5.341
5 APOBEC-3 0 0.27 1.035 1.273 0.533 0.455 0.6 0.484 0 0 4.65
6 BCbox-1 0.2 0.27 0.376 0.742 0.133 1.023 0.7 0.387 0.6 0 4.432
7 APOBEC-8 0 0.09 0.471 0.424 1.333 0.227 0.3 0.194 0 0.556 3.595
8 APOBEC-5 0 0 0.282 0.636 0.667 0.455 0.5 0.29 0.2 0.222 3.252
9 BCbox-2 0 0.9 0.094 0.424 0.267 0.341 0.4 0.194 0 0 2.619
10 APOBEC-6 0 0 0.094 0.106 0.267 0.455 0.3 0.387 0.4 0 2.008
11 CBFb-2 0.2 0.63 0.376 0.106 0.133 0 0.1 0.097 0.2 0 1.843
12 Cul5-3 0 0 0.565 0.106 0.133 0.114 0.4 0.29 0 0 1.608
13 APOBEC-7 0 0.18 0 0.106 0.533 0.227 0.2 0.097 0 0 1.343
14 CBFb-1 0 0 0 0 0.267 0.114 0.1 0.097 0.2 0.111 0.888
15 Cul5-2 0 0 0 0 0 0.114 0 0.194 0 0 0.307
16 Cul5-1 0 0 0 0 0 0.114 0 0 0 0.111 0.225
Table A5 shows the results from the fourth step of the proposed methodology, see Section 3.5.
Table A5. Assessment on the variables considering the clinical endpoints using the MAREV-1 approach. The variables with values surpassing the calculated threshold are highlighted in bold.
CD4Ini
Rank Variable CART MLP NB SVMs Total
1 BCbox-3 8.282 8.136 7.906 8.585 32.909
2 APOBEC-3 5.782 4.406 10 5.64 25.828
3 APOBEC-5 6.982 6.516 5.673 5.607 24.778
4 APOBEC-2 5.504 4.572 3.53 6.631 20.237
5 BCbox-2 3.511 3.78 4.717 6.455 18.463
6 APOBEC-6 6.038 5.059 2.749 3.335 17.181
7 APOBEC-4 2.33 3.283 2.7 5.083 13.396
8 Cul5-3 4.254 3.06 3.276 2.478 13.068
9 CBFb-1 3.031 3.114 2.596 2.875 11.616
10 APOBEC-7 1.449 1.761 2.278 3.471 8.959
11 APOBEC-8 2.917 1.826 1.715 0.562 7.02
12 NLIS 0.968 1.533 2.778 1.555 6.834
13 CBFb-2 1.633 1.939 1.513 1.387 6.472
14 Cul5-1 0.791 3.292 0.764 0.587 5.434
15 Cul5-2 0.566 1.39 2.21 0.672 4.838
16 BCbox-1 0.963 1.334 0.596 0.075 2.968
CD4Hist
Rank Variable CART MLP NB SVMs Total
1 APOBEC-2 8.393 8.047 10 9.236 35.676
2 APOBEC-3 6.767 6.308 5.174 6.32 24.569
3 APOBEC-5 5.467 4.662 3.637 7.332 21.098
4 BCbox-3 5.359 4.116 5.551 5.074 20.1
5 APOBEC-4 5.958 4.645 3.787 5.367 19.757
6 Cul5-3 4.006 4.455 4.217 3.469 16.147
7 APOBEC-6 4.167 3.679 3.041 2.998 13.885
8 APOBEC-7 2.538 2.297 3.852 3.11 11.797
9 CBFb-1 2.865 2.434 2.576 3.873 11.748
10 BCbox-2 1.506 2.622 4.003 2.512 10.643
11 CBFb-2 2.699 2.895 1.739 1.727 9.06
12 APOBEC-8 2.177 1.665 2.192 1.734 7.768
13 Cul5-1 1.142 1.747 2.087 2.025 7.001
14 NLIS 0.424 2.618 1.732 0.113 4.887
15 Cul5-2 1.2 1.405 0.9 0 3.505
16 BCbox-1 0.333 1.404 0.511 0.11 2.358
VLIni
Rank Variable CART MLP NB SVMs Total
1 APOBEC-2 7.576 7.949 10 8.746 34.271
2 BCbox-1 5.527 4.353 4.274 5.723 19.877
3 APOBEC-3 7.032 7.394 3.788 1.361 19.575
4 APOBEC-5 4.228 4.46 4.478 5.105 18.271
5 CBFb-2 5.043 4.741 2.655 4.237 16.676
6 APOBEC-4 4.183 2.748 5.705 3.52 16.156
7 APOBEC-7 3.084 3.092 3.276 6.018 15.47
8 APOBEC-6 1.637 1.484 7.887 2.72 13.728
9 BCbox-2 3.216 2.611 2.587 4.052 12.466
10 CBFb-1 2.76 3.521 2.513 1.456 10.25
11 Cul5-1 1.976 1.987 2.021 3.184 9.168
12 NLIS 3.201 1.558 1.841 2.031 8.631
13 Cul5-3 0.945 1.801 1.33 3.546 7.622
14 Cul5-2 1.212 1.736 1.192 2.689 6.829
15 APOBEC-8 2.16 3.078 0.862 0.613 6.713
16 BCbox-3 1.221 2.485 0.59 0 4.296
VLHist
Rank Variable CART MLP NB SVMs Total
1 NLIS 10.058 9.9 6.424 9.944 36.326
2 APOBEC-3 5.32 3.937 10 5.892 25.149
3 APOBEC-5 4.175 6.636 5.175 4.432 20.418
4 APOBEC-2 5.998 4.32 4.131 4.65 19.099
5 CBFb-1 5.549 4.82 2.877 3.252 16.498
6 BCbox-1 3.965 3.135 4.498 3.595 15.193
7 APOBEC-8 4.256 4.894 2.591 0.888 12.629
8 APOBEC-7 1.569 1.288 2.073 7.052 11.982
9 Cul5-1 1.761 1.319 1.481 5.341 9.902
10 APOBEC-6 3.29 2.741 2.421 1.343 9.795
11 APOBEC-4 0.316 0.88 6.601 1.608 9.405
12 BCbox-3 3.09 4.316 0.817 0.225 8.448
13 CBFb-2 2.516 1.298 1.341 2.008 7.163
14 Cul5-2 1.238 1.149 2.045 1.843 6.275
15 BCbox-2 1.175 3.836 0.772 0.307 6.09
16 Cul5-3 0.725 0.531 1.753 2.619 5.628
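The Total column of Table A5 is consistent with a plain sum of the four per-algorithm relevance scores in each row. The short Python sketch below, offered only as an illustration, recomputes the totals for the first three CD4Ini rows as printed in the table and ranks the variables by that total. The dictionary layout and the helper name `total_relevance` are hypothetical and not part of the published software.

```python
# Per-algorithm relevance scores for the first three CD4Ini rows of Table A5,
# copied as printed (CART, MLP, NB, SVMs).
cd4ini_rows = {
    "BCbox-3": {"CART": 8.282, "MLP": 8.136, "NB": 7.906, "SVMs": 8.585},
    "APOBEC-3": {"CART": 5.782, "MLP": 4.406, "NB": 10.0, "SVMs": 5.64},
    "APOBEC-5": {"CART": 6.982, "MLP": 6.516, "NB": 5.673, "SVMs": 5.607},
}


def total_relevance(per_algorithm):
    """Sum the relevance scores over all classification algorithms."""
    return round(sum(per_algorithm.values()), 3)


totals = {name: total_relevance(scores) for name, scores in cd4ini_rows.items()}

# Rank variables from the highest to the lowest total relevance.
ranking = sorted(totals, key=totals.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, totals[name])
```

Summing across algorithms means that a variable must score well under several classifiers, rather than only one, to reach the top of the ranking.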
Table A6 shows the results from the fifth step of the proposed methodology, see Section 3.5.
Table A6. The most informative variables per clinical endpoint considering those surpassing a calculated threshold (relevance scores in boldface). a, Scores when considering the classifications results from all the combinations; b, Scores calculated using only the best classification performance per clinical endpoint and algorithm (see Table 2).
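| Variable | (a) MAREV-1 CD4Ini | (a) MAREV-1 CD4Hist | (a) MAREV-1 VLIni | (a) MAREV-1 VLHist | (b) MAREV-2 CD4Ini | (b) MAREV-2 CD4Hist | (b) MAREV-2 VLIni | (b) MAREV-2 VLHist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| APOBEC-2 | 20.237 | 35.676 | 34.271 | 19.099 | 6.5 | 10.0 | 10.0 | 8.75 |
| APOBEC-3 | 25.828 | 24.569 | 19.575 | 25.149 | 7.75 | 2.25 | 5.083 | 3.75 |
| APOBEC-4 | 13.396 | 19.757 | 16.156 | 9.405 | 1.75 | 8.0 | 8.0 | 2.25 |
| APOBEC-5 | 24.778 | 21.098 | 18.271 | 20.418 | 2.5 | 3.25 | 5.167 | 1.5 |
| APOBEC-6 | 17.181 | 13.885 | 13.728 | 9.795 | 1.25 | 1.5 | 1.667 | 3.667 |
| APOBEC-7 | 8.959 | 11.797 | 15.47 | 11.982 | 0 | 5.167 | 2.25 | 1.0 |
| APOBEC-8 | 7.02 | 7.768 | 6.713 | 12.629 | 0 | 1.667 | 3.333 | 4.833 |
| BCbox-1 | 2.968 | 2.358 | 19.877 | 15.193 | 3.0 | 1.5 | 6.5 | 7.167 |
| BCbox-2 | 18.463 | 10.643 | 12.466 | 6.09 | 8.5 | 3.667 | 5.0 | 1.75 |
| BCbox-3 | 32.909 | 20.1 | 4.296 | 8.448 | 8.5 | 7.0 | 0 | 3.583 |
| CBFb-1 | 11.616 | 11.748 | 10.25 | 16.498 | 0 | 0 | 2.0 | 1.667 |
| CBFb-2 | 6.472 | 9.06 | 16.676 | 7.163 | 0 | 1.75 | 0 | 1.5 |
| Cul5-1 | 5.434 | 7.001 | 9.168 | 9.902 | 0 | 0 | 0 | 1.333 |
| Cul5-2 | 4.838 | 3.505 | 6.829 | 6.275 | 0 | 0 | 1.333 | 1.5 |
| Cul5-3 | 13.068 | 16.147 | 7.622 | 5.628 | 5.25 | 4.25 | 1.667 | 2.25 |
| NLIS | 6.834 | 4.887 | 8.631 | 36.326 | 0 | 2.0 | 2.0 | 8.5 |
| Threshold | 20.2 | 20.25 | 19.18 | 19.85 | 8.187 | 6.328 | 6.454 | 5.326 |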
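As a hedged illustration of how the Threshold row of Table A6 is applied, the sketch below keeps the variables whose MAREV-1 CD4Ini score exceeds the listed threshold of 20.2. The scores and the threshold are copied from the table. How the threshold itself is calculated is not shown here, and treating "surpassing" as a strict inequality is an assumption, as is the helper name `select_above_threshold`.

```python
# MAREV-1 relevance scores for CD4Ini, copied from Table A6 (column a, CD4Ini).
marev1_cd4ini = {
    "APOBEC-2": 20.237, "APOBEC-3": 25.828, "APOBEC-4": 13.396, "APOBEC-5": 24.778,
    "APOBEC-6": 17.181, "APOBEC-7": 8.959, "APOBEC-8": 7.02, "BCbox-1": 2.968,
    "BCbox-2": 18.463, "BCbox-3": 32.909, "CBFb-1": 11.616, "CBFb-2": 6.472,
    "Cul5-1": 5.434, "Cul5-2": 4.838, "Cul5-3": 13.068, "NLIS": 6.834,
}
threshold = 20.2  # value taken from the Threshold row; its derivation is not reproduced here


def select_above_threshold(scores, threshold):
    """Return the variables scoring strictly above the threshold, best first.

    Hypothetical helper; the strict comparison is an assumption.
    """
    kept = [(variable, score) for variable, score in scores.items() if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [variable for variable, _ in kept]


selected = select_above_threshold(marev1_cd4ini, threshold)
print(selected)
```

Under these assumptions, the selection contains the same four variables that Table A7 lists for CD4Ini under the MAREV-1 approach.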
Table A7 compares the findings of MAREV-1 and MAREV-2 with the previous results.
Table A7. Variables with the highest scores per clinical endpoint. a, Previous results [20]; b, Considering the MAREV-1 approach; c, Considering the MAREV-2 approach.
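| Clinical Endpoint | (a) Previous Results: Variable | (a) Previous Results: Rank | (b) MAREV-1: Rank | (b) MAREV-1: Variable | (c) MAREV-2: Rank | (c) MAREV-2: Variable |
| --- | --- | --- | --- | --- | --- | --- |
| CD4Ini | BCbox-3 | 1 | =1 | BCbox-3 | =1 | BCbox-3 |
| | APOBEC-4 | 2 | - | APOBEC-3 | - | BCbox-2 |
| | Cul-5 | 3 | - | APOBEC-5 | | |
| | | | - | APOBEC-2 | | |
| CD4Hist | APOBEC-2 | 1 | =1 | APOBEC-2 | =1 | APOBEC-2 |
| | APOBEC-3 | 2 | =2 | APOBEC-3 | - | APOBEC-4 |
| | | | - | APOBEC-5 | - | BCbox-3 |
| VLIni | APOBEC-2 | 1 | =1 | APOBEC-2 | =1 | APOBEC-2 |
| | BCbox-1 | 2 | =2 | BCbox-1 | - | APOBEC-4 |
| | BCbox-2 | 3 | - | APOBEC-3 | - | BCbox-1 |
| VLHist | NLIS | 1 | =1 | NLIS | - | APOBEC-2 |
| | BCbox-1 | 2 | - | APOBEC-3 | - | NLIS |
| | APOBEC-2 | 3 | - | APOBEC-5 | - | BCbox-1 |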

References

  1. UNAIDS. Data 2020. 2020. Available online: https://www.unaids.org/en/resources/documents/2020/unaids-data (accessed on 28 May 2020).
  2. De Clercq, E. Emerging anti-HIV drugs. Expert Opin. Emerg. Drugs 2005, 10, 241–274.
  3. Greene, W.C.; Debyser, Z.; Ikeda, Y.; Freed, E.O.; Stephens, E.; Yonemoto, W.; Buckheit, R.W.; Esté, J.A.; Cihlar, T. Novel targets for HIV therapy. Antivir. Res. 2008, 80, 251–265.
  4. Eberle, J.; Gürtler, L.G. HIV Types, Groups, Subtypes and Recombinant Forms: Errors in Replication, Selection Pressure and Quasispecies. Intervirology 2012, 55, 79–83.
  5. Scarlata, S.; Carter, C. Role of HIV-1 Gag domains in viral assembly. Biochim. Biophys. Acta (BBA) Biomembr. 2003, 1614, 62–72.
  6. Coloccini, R.S.; Dilernia, D.; Ghiglione, Y.; Turk, G.; Laufer, N.; Rubio, A.; Socías, M.E.; Figueroa, M.I.; Sued, O.; Cahn, P.; et al. Host Genetic Factors Associated with Symptomatic Primary HIV Infection and Disease Progression among Argentinean Seroconverters. PLoS ONE 2014, 9, e113146.
  7. Goila-Gaur, R.; Strebel, K. HIV-1 Vif, APOBEC, and Intrinsic Immunity. Retrovirology 2008, 5, 1–16.
  8. Romani, B.; Engelbrecht, S.; Glashoff, R.H. Antiviral roles of APOBEC proteins against HIV-1 and suppression by Vif. Arch. Virol. 2009, 154, 1579–1588.
  9. Beam, A.L.; Motsinger-Reif, A.; Doyle, J. Bayesian neural networks for detecting epistasis in genetic association studies. BMC Bioinform. 2014, 15, 368.
  10. Jiang, R.; Tang, W.; Wu, X.; Fu, W. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinform. 2009, 10, S65.
  11. Ritchie, M.D.; White, B.C.; Parker, J.S.; Hahn, L.W.; Moore, J.H. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinform. 2003, 4, 28.
  12. Motsinger-Reif, A.A.; Lee, S.L.; Mellick, G.; Ritchie, M.D. GPNN: Power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease. BMC Bioinform. 2006, 7, 39.
  13. Motsinger, A.; Dudek, S.; Hahn, L.; Ritchie, M.D. Comparison of Neural Network Optimization Approaches for Studies of Human Genetics. Appl. Evol. Comput. 2006, 3907, 103–114.
  14. Motsinger-Reif, A.A.; Ritchie, M.D. Neural networks for genetic epidemiology: Past, present, and future. BioData Min. 2008, 1, 3.
  15. Tong, D.L.; Schierz, A.C. Hybrid genetic algorithm-neural network: Feature extraction for unpreprocessed microarray data. Artif. Intell. Med. 2011, 53, 47–56.
  16. Cuevas-Tello, J.C.; Hernández-Ramírez, D.; García-Sepúlveda, C.A. Support vector machine algorithms in the search of KIR gene associations with disease. Comput. Biol. Med. 2013, 43, 2053–2062.
  17. Boutorh, A.; Guessoum, A. Complex diseases SNP selection and classification by hybrid Association Rule Mining and Artificial Neural Network-based Evolutionary Algorithms. Eng. Appl. Artif. Intell. 2016, 51, 58–70.
  18. Oriol, J.D.V.; Vallejo, E.E.; Estrada, K.; Peña, J.G.T.; The Alzheimer’s Disease Neuroimaging Initiative. Benchmarking machine learning models for late-onset Alzheimer’s disease prediction from genomic data. BMC Bioinform. 2019, 20, 709.
  19. Hardin, J.; Waddell, M.; Page, C.D.; Zhan, F.; Barlogie, B.; Shaughnessy, J.; Crowley, J.J. Evaluation of Multiple Models to Distinguish Closely Related Forms of Disease Using DNA Microarray Data: An Application to Multiple Myeloma. Stat. Appl. Genet. Mol. Biol. 2004, 3, 1–21.
  20. Altamirano-Flores, J.S.; Guerra-Palomares, S.E.; Hernandez-Sanchez, P.G.; Ramirez-Garcialuna, J.L.; Arguello-Astorga, J.R.; Noyola, D.E.; Cuevas-Tello, J.C.; Garcia-Sepulveda, C.A. Identification of HIV-1 Vif Protein Attributes Associated With CD4 T Cell Numbers and Viral Loads Using Artificial Intelligence Algorithms. IEEE Access 2020, 8, 87214–87227.
  21. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141.
  22. Zieba, M.; Tomczak, J.M. Boosted SVM with active learning strategy for imbalanced data. Soft Comput. 2014, 19, 3357–3368.
  23. Guerra-Palomares, S.E.; Hernandez-Sanchez, P.G.; Esparza-Pérez, M.A.; Arguello, J.R.; Noyola, D.E.; García-Sepúlveda, C.A. Molecular Characterization of Mexican HIV-1 Vif Sequences. AIDS Res. Hum. Retroviruses 2015, 31, 290–295.
  24. Govender, S.; Otwombe, K.; Essien, T.; Panchia, R.; de Bruyn, G.; Mohapi, L.; Gray, G.; Martinson, N. CD4 counts and viral loads of newly diagnosed HIV-infected individuals: Implications for treatment as prevention. PLoS ONE 2014, 9, e90754.
  25. Lane, P.C.; Clarke, D.; Hender, P. On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data. Decis. Support Syst. 2012, 53, 712–718.
  26. Hastie, T.; Friedman, J.; Tibshirani, R. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2017; pp. 210–211.
  27. Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239.
  28. Ignizio, J. An Introduction to Expert Systems; McGraw-Hill: New York, NY, USA, 1991.
  29. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Brooks/Cole Advanced Books & Software: Monterey, CA, USA, 1984.
  30. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  31. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. In Proceedings of the ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, Czech Republic, 23–27 September 2013; pp. 108–122.
  32. Singh, S.; Gupta, P. Comparative study ID3, CART and C4.5 decision tree algorithm: A survey. Int. J. Adv. Inf. Sci. Technol. 2014, 27, 97–103.
  33. Mitchell, T. Machine Learning; McGraw-Hill: New York, NY, USA, 1997.
  34. Rosenblatt, F. The Perceptron, a Perceiving and Recognizing Automaton; Technical Report 85-460; Cornell Aeronautical Laboratory: Buffalo, NY, USA, 1957.
  35. Hinton, G.E. Connectionist learning procedures. Artif. Intell. 1989, 40, 185–234.
  36. Rumelhart, D.E.; Hinton, G.E.; Williams, R. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
  37. Bishop, C.M.; Hinton, G.E. Neural Networks for Pattern Recognition; Clarendon Press: Oxford, UK, 1995.
  38. Rojas, R. Neural Networks: A Systematic Introduction; Springer: Berlin/Heidelberg, Germany, 1996.
  39. Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall: Hoboken, NJ, USA, 1999.
  40. Widrow, B.; Hoff, M. Associative Storage and Retrieval of Digital Information in Networks of Adaptive ‘Neurons’. Biol. Prototypes Synth. Syst. 1962, 1, 160.
  41. Byrd, R.; Peihuang, L.; Nocedal, J. A Limited-Memory Algorithm for Bound-Constrained Optimization; Technical Report; U.S. Department of Energy: Washington, DC, USA, 1996.
  42. Gunn, S. Support Vector Machines for Classification and Regression; Technical Report; University of Southampton: Southampton, UK, 1998.
  43. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004.
  44. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA Data Mining Software: An Update. SIGKDD Explor. 2009, 11, 10–18.
  45. Simon, J.H.M.; Sheehy, A.M.; Carpenter, E.A.; Fouchier, R.A.M.; Malim, M.H. Mutational Analysis of the Human Immunodeficiency Virus Type 1 Vif Protein. J. Virol. 1999, 73, 2675–2681.
  46. Chen, G.; He, Z.; Wang, T.; Xu, R.; Yu, X.F. A Patch of Positively Charged Amino Acids Surrounding the Human Immunodeficiency Virus Type 1 Vif SLVx4Yx9Y Motif Influences Its Interaction with APOBEC3G. J. Virol. 2009, 83, 8674–8682.
Figure 1. Producing multiple (p) balanced datasets through undersampling of an imbalanced dataset composed of a majority class (m elements) and a less represented minority class (n elements), by randomly removing majority-class elements until m = n.
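The undersampling procedure described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `balanced_undersamples`, the toy patient records, and the fixed seed are hypothetical, and sampling without replacement within each balanced dataset is one reasonable reading of "randomly removing majority class elements until m = n".

```python
import random


def balanced_undersamples(majority, minority, p, seed=0):
    """Build p balanced datasets by randomly undersampling the majority class.

    Hypothetical helper: each dataset keeps every minority element and a
    random subset of the majority class of the same size (m = n).
    """
    if len(majority) < len(minority):
        raise ValueError("the majority class must be at least as large as the minority class")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = len(minority)
    datasets = []
    for _ in range(p):
        kept_majority = rng.sample(majority, n)  # sampling without replacement
        datasets.append(kept_majority + list(minority))
    return datasets


# Toy records (identifier, class label); the sizes are made up for illustration.
majority = [("patient_%d" % i, 0) for i in range(60)]
minority = [("patient_%d" % i, 1) for i in range(60, 75)]

datasets = balanced_undersamples(majority, minority, p=100)
print(len(datasets), len(datasets[0]))
```

Because each of the p datasets draws a different random subset of the majority class, most majority elements end up being used in at least one dataset, which is the motivation for generating several balanced datasets rather than one.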
Figure 2. MLP architecture. There is one MLP per clinical endpoint; the example shown is for CD4Ini.
Figure 3. The process for defining the most relevant variables involves searching, on each balanced dataset, for the best combinations of variables containing at most k elements. This search explores the interactions among the variables and their impact on the classification performance.
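The search summarized in the Figure 3 caption can be illustrated with a short Python sketch that enumerates every combination of at most k variables and keeps the one with the best score. The value of k, the helper names, and in particular `toy_score` are assumptions made for illustration; in the study the score would be the classification performance of an algorithm trained on a balanced dataset restricted to that combination.

```python
from itertools import combinations


def combinations_up_to_k(variables, k):
    """Yield every non-empty combination of the variables with at most k elements."""
    for size in range(1, k + 1):
        yield from combinations(variables, size)


def best_combination(variables, k, evaluate):
    """Return the combination with the highest score under the given evaluate function."""
    return max(combinations_up_to_k(variables, k), key=evaluate)


# Hypothetical stand-in for a classifier's accuracy: made-up weights per variable,
# with a small penalty per variable so that larger sets are not trivially better.
toy_weights = {"APOBEC-2": 0.3, "APOBEC-3": 0.2, "BCbox-1": 0.05, "BCbox-3": 0.4}


def toy_score(combo):
    return sum(toy_weights[v] for v in combo) - 0.1 * len(combo)


variables = list(toy_weights)
all_combos = list(combinations_up_to_k(variables, 2))
best = best_combination(variables, 2, toy_score)
print(len(all_combos), best)
```

The number of candidate combinations grows quickly with k, which is one reason to bound the combination size when the evaluation step involves training a classifier.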
Figure 4. The selection of the overall-best combinations for each of the p balanced datasets, based on their classification performance.
Figure 5. ID3-induced trees built from the most relevant variables per output, as selected by the MAREV-1 approach. (a) The tree for CD4Ini; (b) The tree for CD4Hist; (c) The tree for VLIni; (d) The tree for VLHist.
Figure 6. ID3-induced trees built from the most relevant variables per output, as suggested by the MAREV-2 approach. (a) The tree for CD4Ini; (b) The tree for CD4Hist; (c) The tree for VLIni; (d) The tree for VLHist.
Figure 7. Position of the Vif encoding region within a reference (HXB2) HIV-1 genome.
Table 1. Summary of the performance of the algorithms over 100 runs on the balanced datasets for each clinical endpoint, in descending order of their mean value. MLP produced the best mean classification performance for every clinical endpoint.

| Clinical Endpoint | Algorithm | Mean ± S.D. | Range |
| --- | --- | --- | --- |
| CD4Ini | MLP | 79.6 ± 5.7 | 68.6–93.8 |
| | CART | 77.8 ± 6.0 | 65.7–91.0 |
| | SVMs | 76.2 ± 5.6 | 61.0–88.1 |
| | NB | 74.9 ± 5.9 | 59.5–90.5 |
| CD4Hist | MLP | 76.0 ± 5.4 | 63.3–91.0 |
| | CART | 74.0 ± 6.2 | 62.9–88.1 |
| | NB | 72.6 ± 5.8 | 60.0–87.6 |
| | SVMs | 66.7 ± 6.4 | 53.8–81.9 |
| VLIni | MLP | 68.5 ± 3.2 | 61.1–75.2 |
| | CART | 68.0 ± 3.7 | 59.1–80.2 |
| | NB | 66.5 ± 3.4 | 57.4–75.0 |
| | SVMs | 62.0 ± 4.0 | 51.7–71.5 |
| VLHist | MLP | 66.3 ± 2.7 | 60.9–73.8 |
| | CART | 64.2 ± 2.5 | 59.1–71.4 |
| | NB | 64.1 ± 2.9 | 57.5–71.1 |
| | SVMs | 63.2 ± 3.0 | 51.3–68.5 |
Table 2. Best classification performance achieved by each algorithm, considering 100 balanced datasets for each clinical endpoint. These combinations were used for calculating the variable scores with the MAREV-2 approach.

| Clinical Endpoint | Algorithm | Combination | Accuracy |
| --- | --- | --- | --- |
| CD4Ini | MLP | BCbox-3, APOBEC-3, BCbox-2, Cul5-3, BCbox-1, APOBEC-5 | 93.8 |
| | CART | BCbox-3, BCbox-2, Cul5-3, APOBEC-2, APOBEC-3, APOBEC-5 | 91.0 |
| | NB | APOBEC-2, BCbox-3, APOBEC-3, BCbox-2, Cul5-3, APOBEC-6 | 90.5 |
| | SVMs | BCbox-2, APOBEC-2, APOBEC-3, APOBEC-4, BCbox-1, BCbox-3 | 88.1 |
| CD4Hist | MLP | APOBEC-2, Cul5-3, APOBEC-4, BCbox-3, APOBEC-7, BCbox-2, NLIS, BCbox-1 | 91.0 |
| | CART | APOBEC-2, BCbox-3, BCbox-2, APOBEC-4, APOBEC-5 | 88.1 |
| | NB | APOBEC-2, APOBEC-4, Cul5-3, CBFb-2, BCbox-3, APOBEC-7 | 87.6 |
| | SVMs | APOBEC-2, APOBEC-3, APOBEC-4, APOBEC-5, APOBEC-6, APOBEC-8, APOBEC-7, BCbox-3 | 81.9 |
| VLIni | CART | APOBEC-2, BCbox-1, APOBEC-4, BCbox-2 | 80.2 |
| | MLP | APOBEC-2, BCbox-1, APOBEC-8, APOBEC-3, APOBEC-4, APOBEC-5, Cul5-2 | 75.2 |
| | NB | APOBEC-2, APOBEC-4, BCbox-1, BCbox-2, NLIS, Cul5-3, APOBEC-3, APOBEC-5, CBFb-1 | 75.0 |
| | SVMs | APOBEC-2, APOBEC-7, APOBEC-3, APOBEC-4, APOBEC-5, APOBEC-6, APOBEC-8, BCbox-2 | 71.5 |
| VLHist | MLP | NLIS, APOBEC-3, APOBEC-2, APOBEC-8, BCbox-1, CBFb-1, Cul5-1, Cul5-2 | 73.8 |
| | CART | APOBEC-2, BCbox-3, BCbox-1, APOBEC-8, NLIS total | 71.4 |
| | NB | APOBEC-2, Cul5-3, NLIS, BCbox-2, APOBEC-3, BCbox-1, APOBEC-8, CBFb-2, APOBEC-6, APOBEC-7 | 71.1 |
| | SVMs | NLIS, APOBEC-4, BCbox-1, APOBEC-2, APOBEC-5, APOBEC-6, BCbox-3 | 68.5 |
Table 3. The most relevant Vif protein variable combinations associated with the clinical endpoints. (a) Significant associations after testing the 20 hypotheses suggested by the MAREV-1 approach; (b) Significant associations after testing the 22 hypotheses suggested by the MAREV-2 approach. Vif protein regions can be either conserved (Cons) or mutated (Mut), and a combination is associated with either protection from (prot) or risk of (risk) <500 cells/μL CD4 T cells or ≥10,000 cp/mL of viral load. The two count columns form the contingency table for the presence or absence of each combination; CD4 outputs use the cells/μL cutoff and VL outputs the cp/mL cutoff.

| Approach | Output | Vif Variable Combination | Status | ≥500 cells/μL (CD4) or <10,000 cp/mL (VL) | <500 cells/μL (CD4) or ≥10,000 cp/mL (VL) | Accuracy | Error | p-Value | Effect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) MAREV-1 | Initial CD4 | BCbox-3 (Mut), APOBEC-3 (Cons) | absent | 8 | 53 | 81.3% (61/75) | 18.7% (14/75) | 0.0011 | prot |
| | | | present | 8 | 6 | | | | |
| | Historic CD4 | APOBEC-2 (Mut), APOBEC-3 (Cons), APOBEC-5 (Cons) | absent | 14 | 35 | 52.0% (39/75) | 48.0% (36/75) | 0.0136 | risk |
| | | | present | 1 | 25 | | | | |
| | | APOBEC-2 (Cons), APOBEC-3 (Cons) | absent | 2 | 29 | 56.0% (42/75) | 44.0% (33/75) | 0.0182 | prot |
| | | | present | 13 | 31 | | | | |
| | Initial VL | APOBEC-2 (Mut), BCbox-1 (Cons), APOBEC-3 (Cons) | absent | 22 | 28 | 57.3% (43/75) | 42.7% (32/75) | 0.0207 | risk |
| | | | present | 4 | 21 | | | | |
| | Historic VL | - | - | - | - | - | - | - | - |
| (b) MAREV-2 | Initial CD4 | BCbox-3 (Cons), BCbox-2 (Cons) | absent | 15 | 33 | 54.7% (41/75) | 45.3% (34/75) | 0.0068 | risk |
| | | | present | 1 | 26 | | | | |
| | | BCbox-3 (Mut), BCbox-2 (Mut) | absent | 10 | 55 | 81.3% (61/75) | 18.7% (14/75) | 0.0049 | prot |
| | | | present | 6 | 4 | | | | |
| | Historic CD4 | APOBEC-2 (Mut), BCbox-3 (Cons) | absent | 15 | 40 | 53.3% (40/75) | 46.7% (35/75) | 0.0077 | risk |
| | | | present | 0 | 20 | | | | |
| | Initial VL | APOBEC-2 (Mut), BCbox-1 (Cons), APOBEC-4 (Mut) | absent | 25 | 38 | 52.0% (39/75) | 48.0% (36/75) | 0.0477 | risk |
| | | | present | 1 | 11 | | | | |
| | Historic VL | NLIS (Mut), BCbox-1 (Cons), APOBEC-2 (Mut) | absent | 41 | 27 | 62.7% (47/75) | 37.3% (28/75) | 0.0392 | risk |
| | | | present | 1 | 6 | | | | |
