Predicting HIV Status among Men Who Have Sex with Men in Bulawayo & Harare, Zimbabwe Using Bio-Behavioural Data, Recurrent Neural Networks, and Machine Learning Techniques

Chingombe, Innocent; Dzinamarira, Tafadzwa; Cuadros, Diego; Mapingure, Munyaradzi Paul; Mbunge, Elliot; Chaputsira, Simbarashe; Madziva, Roda; Chiurunge, Panashe; Samba, Chesterfield; Herrera, Helena; Murewanhema, Grant; Mugurungi, Owen; Musuka, Godfrey

doi:10.3390/tropicalmed7090231


by Innocent Chingombe ¹,², Tafadzwa Dzinamarira ²,³,*, Diego Cuadros ⁴, Munyaradzi Paul Mapingure ², Elliot Mbunge ⁵, Simbarashe Chaputsira ², Roda Madziva ⁶, Panashe Chiurunge ¹, Chesterfield Samba ⁷, Helena Herrera ⁸, Grant Murewanhema ⁹, Owen Mugurungi ¹⁰ and Godfrey Musuka ²

¹ Graduate Business School, Chinhoyi University of Technology, Chinhoyi, Zimbabwe
² ICAP, Columbia University, Harare, Zimbabwe
³ School of Health Systems & Public Health, University of Pretoria, Pretoria 0002, South Africa
⁴ Department of Geography and Geographic Information Science, University of Cincinnati, Cincinnati, OH 45221, USA
⁵ Department of Information Technology, Faculty of Accounting and Informatics, Durban University of Technology, Durban 4000, South Africa
⁶ School of Sociology and Social Policy, University of Nottingham, Nottingham NG7 2RD, UK
⁷ GALZ, Harare, Zimbabwe
⁸ School of Pharmacy and Biomedical Sciences, University of Portsmouth, Portsmouth PO1 2UP, UK
⁹ Unit of Obstetrics and Gynaecology, Department of Primary Health Care Sciences, Faculty of Medicine and Health Sciences, University of Zimbabwe, Harare, Zimbabwe
¹⁰ Ministry of Health and Child Care, AIDS and TB Programme, Harare, Zimbabwe
* Author to whom correspondence should be addressed.

Trop. Med. Infect. Dis. 2022, 7(9), 231; https://doi.org/10.3390/tropicalmed7090231

Submission received: 11 August 2022 / Revised: 31 August 2022 / Accepted: 2 September 2022 / Published: 5 September 2022

(This article belongs to the Special Issue HIV Testing, Prevention and Care Cascade)


Abstract:

HIV and AIDS continue to be major public health concerns globally. Despite significant progress in addressing their impact on the general population and achieving epidemic control, there is a need to improve HIV testing, particularly among men who have sex with men (MSM). This study applied deep and machine learning algorithms such as recurrent neural networks (RNNs), the bagging classifier, gradient boosting classifier, support vector machines, and Naïve Bayes classifier to predict HIV status among MSM using the dataset from the Zimbabwe Ministry of Health and Child Care. RNNs performed better than the bagging classifier, gradient boosting classifier, support vector machines, and Gaussian Naïve Bayes classifier in predicting HIV status. RNNs recorded a high prediction accuracy of 0.98 as compared to the Gaussian Naïve Bayes classifier (0.84), bagging classifier (0.91), support vector machine (0.91), and gradient boosting classifier (0.91). In addition, RNNs achieved a high precision of 0.98 for predicting both HIV-positive and -negative cases, a recall of 1.00 for HIV-negative cases and 0.94 for HIV-positive cases, and an F1-score of 0.99 for HIV-negative cases and 0.96 for positive cases. HIV status prediction models can significantly improve early HIV screening and assist healthcare professionals in effectively providing healthcare services to the MSM community. The results show that integrating HIV status prediction models into clinical software systems can complement indicator condition-guided HIV testing strategies and identify individuals that may require healthcare services, particularly for hard-to-reach vulnerable populations like MSM. Future studies are necessary to optimize machine learning models further to integrate them into primary care. The significance of this manuscript is that it presents results from a study population where very little information is available in Zimbabwe due to the criminalization of MSM activities in the country. 
For this reason, MSM tend to be a hidden sector of the population, frequently harassed and arrested. In almost all communities in Zimbabwe, MSM issues have remained taboo, and stigma exists in all sectors of society.

Keywords:

HIV/AIDS; status; MSM; deep learning; machine learning; prediction models

1. Introduction

According to UNAIDS, in 2019, there were 36.2 million [30.2 million–42.5 million] adults and 1.8 million [1.3 million–2.2 million] children (0–14 years) living with HIV globally [1], up from 34 million in 2010 [2]. Although there has been remarkable progress in diagnosis and access to antiretroviral therapy (ART) [3], HIV prevention measures in sub-Saharan Africa still fall short of the UNAIDS 90-90-90 fast-track targets set in 2014. Zimbabwe is one of the few countries in Africa to have made significant progress toward achieving HIV epidemic control, with findings from a recent national survey revealing that 86.8% of people living with HIV were aware of their status; among them, 97.0% were on ART, and 90.3% of these were virally suppressed (ZIMPHIA Summary Sheet). As Zimbabwe reached the UNAIDS 90-90-90 targets by 2020, focusing on men who have sex with men (MSM) will be crucial in ensuring the country can achieve and sustain the 95-95-95 UNAIDS targets before 2030 [4].

Notwithstanding the significant progress in HIV testing and treatment, there remains a need to develop innovative approaches to reach hidden population groups such as MSM. The challenges faced by MSM include criminalization, which severely hinders their access to HIV and other essential health care services in Zimbabwe. Despite substantial advances in the testing, prevention, and treatment of HIV/AIDS, the overall trend of HIV incidence among MSM has been consistently upwards. A recent study conducted by Zimbabwe’s Ministry of Health and Child Care (MoHCC) found that HIV prevalence among this hard-to-reach vulnerable community was higher than that of the general population (17.1% vs. 12.9%) [5]. The same study found that progress toward the UNAIDS HIV testing target remains low among MSM, at about 44%. This poses a considerable threat to attaining HIV epidemic control for the country and the population. Furthermore, MSM are at higher risk of psychological distress than the general population due to health issues arising from multiple factors, including stigmatization, discrimination, and isolation in the community [6].

Higher rates of HIV/STIs, low HIV testing [7], and lack of engagement with innovative interventions such as digital interventions are common challenges among MSM [8]. Several interventions such as intelligent mobile applications [3], electronic mail, short message services [9], voice messages [6], social media platforms [10], and online virtual simulation intervention (socially optimized learning in virtual environments) have been utilised to alleviate impediments found in this hard-to-reach population. Some of these interventions can be implemented at a large scale with low costs to enhance health promotion, change risky behaviours [9], strengthen self-efficacy, and create awareness [11], and may help the MSM community move toward the goal of early HIV screening. In the Western European Region, early universal HIV screening, low stigmatization, increased access to HIV testing, and sustained antiretroviral therapy are considered significant indicators of low HIV prevalence among MSM [12]. However, universal HIV screening is relatively expensive, especially in developing countries [13], as it involves securing testing kits. Therefore, there is a need to incorporate computational techniques that consider primary health data to identify when individuals should be prioritized for HIV screening [12]. Using their existing health data for this purpose may reduce the inequalities to which MSM are subject. This may also potentially contribute to identifying individuals at increased risk of acquiring HIV, including those who might be pre-exposure prophylaxis (PrEP) candidates. There is a need to devise predictive models for identifying individuals within MSM communities who are likely to test HIV positive, so that they can be linked to health facilities for clinical diagnosis. Therefore, this study applied machine learning techniques to predict HIV status among MSM.
These models can process large amounts of health data and infer useful patterns that healthcare professionals can utilise to improve healthcare service provision in the MSM community. To our knowledge, this study is among the few that have applied deep learning and machine learning models to predicting HIV status among MSM, specifically in developing countries in sub-Saharan Africa. The study sought to achieve the following objectives:

Apply RNNs and machine learning models to predict HIV status among MSM.
Compare the performance of RNNs and machine learning models in predicting HIV status among MSM.
Propose recommendations for future research directions on applying deep learning and machine learning models in predicting HIV status among MSM.

The paper is organized as follows: Section 2 presents the study methodology, Section 3 presents the study findings and their discussion, and Section 4 presents the conclusions drawn from these findings.

2. Methodology

2.1. Data Sources and Ethical Considerations

The study used secondary data collected as part of a prevalence study conducted by ICAP in 2018, targeting 1538 MSM from two prominent cities in Zimbabwe: Bulawayo and Harare. The protocol and tools used in this study were approved by the Columbia University Irving Medical Center Institutional Review Board (#IRB-AAAR8950), CDC ADS (#2018-444), and the Medical Research Council of Zimbabwe (#MRCZ/A/2156). The protocol was also reviewed per the US Centers for Disease Control and Prevention (CDC) human research protection procedures. A bio-behavioural survey (BBS) collected demographic, behavioural, and bio-marker data on sexually transmitted infections (STIs), including Hepatitis B, syphilis, HIV status, and HIV Recency status. Interviews and health tests were conducted in private and secure spaces. Responses were captured using tablets programmed with SurveyCTO Collect. Participation in the study was voluntary. Informed consent, subjects’ privacy, and confidentiality of their data were observed throughout the work. The study used written informed consent. The survey had many components, and participants consented to each component separately, for example, consent to be interviewed, consent for blood draw, consent to be tested for STIs, or consent to have blood stored for future studies. The participant signed and dated the consent form. No minors were included in this study.

HIV Recency data were available for all individuals who tested HIV positive and consented to be tested for HIV Recency. Tests for HIV status were used to determine if individuals were either positive or negative for HIV, whilst those for HIV Recency were conducted to determine if the HIV infection was recent (acquired less than 12 months from the date of the Recency test) or long-term (acquired more than 12 months before the Recency test).

The dataset included 863 features from the 1538 participants. These included all components extracted from the survey questionnaire: demographic, behavioural, and bio-marker data on STIs, including Hepatitis B, syphilis, HIV status, and HIV Recency status. The dataset had no duplicates, meaning that the 1538 rows all represented different individuals. However, not all features from the dataset were relevant to predicting HIV status among MSM. We reviewed prior studies in the same domain to determine significant predictors before applying feature selection models. Several studies indicated that income, awareness, and knowledge about HIV are essential HIV status predictors. Other predictors include a history of substance use, STIs, multiple male sex partners, specific sexual behaviours, and frequency of condom use [14]. In addition, demographic characteristics and health status indicators can predict HIV transmission risk among HIV-infected MSM [15].

A trial conducted by [16] considered predictors such as relationship status, self-reported history of the number of HIV/STI screens, STIs, and post-exposure prophylaxis (PEP) in the previous 12 months, and sexual behaviour in the last 90 days, as essential variables. Another study [17] applied machine learning models to predict the diagnosis of HIV and STIs using demographic, clinical, behavioural, and laboratory data from the clinic records of MSM. That study posits that past syphilis infection, STI symptoms, residential rurality, and frequency of condom use with casual male sexual partners during receptive anal sex in the past 12 months were the most critical predictors of HIV diagnosis [17]. A study conducted by [12] used demographic data, socio-economic characteristics, and STI history to predict HIV status. The present study included a range of predictors to predict HIV status among MSM. The selected predictors (features) and their respective descriptions are shown in Table 1.

2.2. Data Preprocessing

The dependent variable (hivresult) had three possible values: “negative”, “positive”, and “nan”. The “nan” values denoted missing values. The dataset had 27 “nan” values in the HIV result column, representing 1.76% of all the values in that column. The null values in the HIV result column could indicate individuals who declined to consent to an HIV test. The rows with null values were removed. Feature selection was performed computationally and objectively on the dataset with 863 features to select the relevant ones. Feature importance was computed using the Extra Trees classifier, also known as the Extremely Randomized Trees classifier, a variant of a random forest. The Extra Trees classifier is an ensemble learning technique that creates a group of unpruned decision trees following the traditional top-down method to output classification results [18]. It randomizes both attribute and cut-point selection while splitting a tree node. It differs from other tree-based ensemble methods in two ways: it splits nodes by picking cut-points at random, and it uses the whole training sample (instead of a bootstrap replica) to grow the trees [19]. Highly correlated independent features were removed from the selected features. Only 28 relevant features were retained, and the heatmap correlation matrix of the selected features is shown in Figure 1.
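The selection procedure described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' code: the feature names, the number of trees, the correlation threshold of 0.8, and the order of the two steps (rank by importance, then filter correlated features) are all assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the survey data (hypothetical; the real data are not public).
X_arr, y = make_classification(
    n_samples=500, n_features=40, n_informative=8, n_redundant=6, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

# Rank features by impurity-based importance from an Extra Trees ensemble.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
ranked = importance.sort_values(ascending=False)


def drop_correlated(frame, ordered_names, threshold=0.8):
    """Walk features in importance order and keep a feature only if its
    absolute correlation with every already-kept feature is <= threshold."""
    corr = frame.corr().abs()
    kept = []
    for name in ordered_names:
        if all(corr.loc[name, k] <= threshold for k in kept):
            kept.append(name)
    return kept


TOP_K = 28  # the paper reports 28 retained features
selected = drop_correlated(X, list(ranked.index), threshold=0.8)[:TOP_K]
print(selected)
```

Filtering in importance order means that, within any highly correlated pair, the feature the ensemble found more useful is the one that survives.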

2.3. HIV Status Prediction Models

The proposed HIV status prediction model consists of input variables from the HIV dataset of MSM, data preprocessing, deep learning and machine learning models, and finally, the performance evaluation standard in three different phases. These phases are shown in Figure 2.

2.3.1. Gaussian Naïve Bayes

The Gaussian Naïve Bayes classifier is a classification algorithm based on Bayes’ Theorem [20]. Naïve Bayes is not a single algorithm, but a family of algorithms that share a common principle, i.e., the features used for classification are assumed to be independent of each other. It predicts the membership probabilities for each class [12], and the class with the highest probability is considered the most likely [21]. Before the rise of deep learning, the Naïve Bayes classifier was a commonly used classification algorithm. Apart from being simple, the Naïve Bayes classifier performs well in many applications such as forecasting, classification, recognition, and prediction.
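As a small illustration of the "highest membership probability wins" rule, the sketch below fits scikit-learn's `GaussianNB` on synthetic data. The data, class balance, and train/test split are assumptions for the example and do not reproduce the study's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary data (hypothetical stand-in for the survey features).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = model.predict_proba(X_test)
# The predicted label is the class with the highest membership probability.
pred = model.predict(X_test)
print(proba[:3], pred[:3])
```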

2.3.2. Support Vector Machines

The Support Vector Machine (SVM) is a supervised machine learning algorithm typically used to solve either classification or regression problems [20]. The SVM is a binary classification algorithm that separates the two classes by constructing an optimal separating hyperplane [22]. The support vectors are the data points closest to the hyperplane, while the hyperplane is the decision boundary that separates objects of different classes [23]. All parameters for SVM in sklearn were left on default. We tried various combinations of parameters such as changing the C value, gamma value, and kernel type, but the defaults always produced the best results. The final model used gamma = auto, C = 1, and an RBF kernel.
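The reported configuration maps onto scikit-learn's `SVC` roughly as shown below. The three hyperparameter values come from the text; the synthetic data and the train/test split are assumptions, and whether `gamma="auto"` coincides with the library default depends on the scikit-learn version, so it is passed explicitly here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced binary data (hypothetical stand-in for the survey features).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters as reported in the text: C = 1, RBF kernel, gamma = auto.
svm = SVC(C=1.0, kernel="rbf", gamma="auto")
svm.fit(X_train, y_train)

# The support vectors are the training points that define the decision boundary.
print("support vectors per class:", svm.n_support_)
print("test accuracy:", svm.score(X_test, y_test))
```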

2.3.3. Bagging Classifier

A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their predictions (either by voting or averaging) to form a final prediction [24]. Such a meta-estimator can typically reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it [25]. We set the Boolean parameter bootstrap to True to allow sampling with replacement. We also set n_jobs to 3, since we used a computer with three processors, to reduce execution time.
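A minimal sketch of this configuration with scikit-learn's `BaggingClassifier` follows. Only `bootstrap=True` and `n_jobs=3` are taken from the text; the synthetic data, the number of estimators, and the use of the library's default decision-tree base learner are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (hypothetical stand-in for the survey features).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

bag = BaggingClassifier(
    n_estimators=10,   # assumption: not stated in the text
    bootstrap=True,    # sample training rows with replacement (from the text)
    n_jobs=3,          # fit base learners in parallel (from the text)
    random_state=0,
)
bag.fit(X_train, y_train)

# Class probabilities averaged over the base decision trees.
avg_proba = bag.predict_proba(X_test)
# The final label is the class with the highest averaged probability.
pred = bag.predict(X_test)
print(avg_proba[:3], pred[:3])
```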

2.3.4. Gradient Boosting Classifier

Gradient boosting is a gradient-based approach to learning a boosting classifier incrementally [26]. Gradient boosting classifiers are a group of machine learning algorithms that combine many weak learning models to create a robust predictive model [27]. Decision trees are usually used as the weak learners. The principal idea of the gradient boosting classifier is to construct the new base-learners to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble [28]. The loss function can be arbitrary, but to give a better intuition, the learning procedure results in consecutive error-fitting if the error function is the classic squared-error loss [29]. To overcome overfitting, we reduced the learning rate from a default of 0.5 to 0.1, and tree depth to 6. Several hyperparameters were tuned to reduce overfitting and reduce loss. The number of estimators was set to 10, a value chosen experimentally. We also put our learning rate at 0.5 to balance learning and prediction.
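The "consecutive error-fitting" intuition for the squared-error case can be made concrete with a hand-rolled boosting loop. This is an illustrative regression sketch, not the classifier used in the study: the toy data, the shallow tree depth, the learning rate of 0.1, and the 10 rounds are assumed values chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy one-dimensional regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # illustrative value
n_rounds = 10         # illustrative value

# Start from a constant model, then repeatedly fit a tree to the residuals.
prediction = np.full_like(y, y.mean())
losses = [np.mean((y - prediction) ** 2)]
trees = []
for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(np.mean((y - prediction) ** 2))

print([round(v, 4) for v in losses])
```

Each round shrinks the training error a little; a smaller learning rate makes each step more conservative, which is the usual lever against overfitting.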

2.3.5. Recurrent Neural Network

Recurrent neural networks are deep neural networks that process sequential data (data in which order is important), such as audio or a series of words in a sentence [30,31]. This means RNNs can be used for natural language processing (NLP). RNNs allow previous outputs to be used as inputs while maintaining hidden states [32]. RNNs apply the same operation to every element of a sequence, hence the word recurrent. RNNs can be used, for instance, to predict the next word in a sentence [30]. Figure 3 shows the architectural design of RNNs.

The output y^{<t>} and activation a^{<t>} at timestep t are expressed as follows:

a^{<t>} = g_{1}(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_{a})

y^{<t>} = g_{2}(W_{ya} a^{<t>} + b_{y})

where W_{aa}, W_{ax}, W_{ya}, b_{a}, and b_{y} are coefficients shared across timesteps, and g_{1} and g_{2} are activation functions. RNNs have been widely used in clinical prediction tasks due to their strong modelling capacity for sequential data [34]; for instance, Wang et al. [33] adopted RNNs to forecast HIV incidence. In this study, RNNs consumed more memory during training and took more time than the other algorithms used. We monitored the loss as the error measure to avoid overfitting.
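The recurrence can be written out directly in NumPy. The sketch below is a forward pass only, with tanh assumed for g_1, a sigmoid assumed for g_2, a zero initial activation, and arbitrary small dimensions; none of these choices are stated in the paper.

```python
import numpy as np


def rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a simple RNN over x_seq of shape (T, n_x).

    Returns the activations, shape (T, n_a), and the outputs, shape (T, n_y).
    The same weights are reused at every timestep.
    """
    a_prev = np.zeros(W_aa.shape[0])  # a^{<0>} assumed to be zero
    activations, outputs = [], []
    for x_t in x_seq:
        # a^{<t>} = g1(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a), with g1 = tanh
        a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
        # y^{<t>} = g2(W_ya a^{<t>} + b_y), with g2 = sigmoid
        y_t = 1.0 / (1.0 + np.exp(-(W_ya @ a_t + b_y)))
        activations.append(a_t)
        outputs.append(y_t)
        a_prev = a_t
    return np.array(activations), np.array(outputs)


# Hypothetical sizes: 5 timesteps, 4 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 5, 4, 3, 1
x_seq = rng.normal(size=(T, n_x))
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))
W_ya = rng.normal(scale=0.5, size=(n_y, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(n_y)

acts, outs = rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y)
print(acts.shape, outs.shape)
```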

2.4. Performance Evaluation Standards

The performance of the machine learning models was evaluated using precision, recall, accuracy, F1-score, and the receiver operating characteristic (ROC) curve with its Area Under the Curve (AUC). The values of the evaluation metrics are calculated from the confusion matrix (CM). The matrix is composed of four categories. Firstly, true positives (TP) are positive examples correctly labelled as positive (MSM who tested positive and were classified as positive). Secondly, false positives (FP) are negative examples incorrectly labelled as positive (MSM who tested negative but were classified as positive). Thirdly, true negatives (TN) are negative examples correctly labelled as negative (MSM who tested negative and were classified as negative). Lastly, false negatives (FN) are positive examples incorrectly labelled as negative (MSM who tested positive but were classified as negative). We can use these categories to determine each model’s precision, recall, accuracy, F1-score, and AUC. Precision is the number of true positives divided by the sum of true positives and false positives [35]. It is calculated as follows:

Precision = \frac{TP}{TP + FP}

Recall is the ratio of true positives to all actual positive instances. It is calculated as follows:

Recall = \frac{TP}{(TP + FN)}

Accuracy is the percentage of correctly classified positive and negative examples [36]. It is calculated as follows:

Accuracy = \frac{TP + TN}{(TP + TN + FP + FN)}

The F1-score is the harmonic mean of precision and recall [37]. It is calculated as follows:

F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}
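The four formulas above translate directly into code. The confusion-matrix counts in the example are made-up numbers chosen for easy arithmetic; they are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for the HIV-positive class.
metrics = classification_metrics(tp=40, fp=10, tn=140, fn=10)
print(metrics)
```

With these counts, precision, recall, and F1 are each 0.8 while accuracy is 0.9, which shows how accuracy can look better than the positive-class metrics when negatives dominate.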

A receiver operating characteristic (ROC) curve is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR is a synonym for recall [38]. The ROC curve is created by plotting recall against the false positive rate (1 − specificity) at different threshold values. The Area Under the ROC Curve (AUC) is another helpful performance measure [39]. It is a discrimination measure that tells us how well our predictor can classify MSM into two groups: those with HIV and those without HIV. The AUC measures the entire two-dimensional area underneath the ROC curve [40] and thus provides an aggregate measure of performance across all possible classification thresholds.
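A tiny worked example may help connect the two readings of the AUC: the area under the ROC curve, and the probability that a randomly chosen positive is scored above a randomly chosen negative. The four labels and scores below are illustrative values, not study data.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four hypothetical individuals: true status and model score.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points (FPR, TPR) across thresholds, then trapezoidal area.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_trapezoid = auc(fpr, tpr)

# Same quantity computed in one call.
auc_direct = roc_auc_score(y_true, scores)

# Ranking view: fraction of (positive, negative) pairs ordered correctly,
# counting ties as half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_rank = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(auc_trapezoid, auc_direct, auc_rank)
```

Three of the four positive-negative pairs are ordered correctly, so all three computations give 0.75.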

3. Results

We used 10-fold cross-validation, and the results are presented in Table 2. The results in Table 2 show that the Gaussian Naïve Bayes algorithm had a precision of 0.92 for predicting HIV-negative results and 0.61 for predicting HIV-positive results. The model had a better recall for HIV-negative instances, at 0.87, than for HIV-positive instances, which stood at 0.71. In terms of the F1-score, the Gaussian Naïve Bayes algorithm scored 0.90 for the HIV-negative instances, compared to 0.65 for the HIV-positive instances.

Overall, the results from the Gaussian Naïve Bayes algorithm reveal that the model had an accuracy of 0.84, meaning that it correctly predicted true HIV positives as positive and true negatives as negative for 84% of the instances in the test dataset. Regarding the model’s performance as measured by the Area Under the Curve (AUC) of the ROC curve, the Gaussian Naïve Bayes algorithm scored 0.87, as shown in Figure 4.
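The evaluation protocol behind these per-class figures can be sketched as follows. The example uses synthetic data with 28 features and Gaussian Naïve Bayes; the stratified, shuffled fold assignment and the pooling of out-of-fold predictions are assumptions, since the paper does not say how the folds were built or how the scores were aggregated.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary data (hypothetical stand-in for the 28 features).
X, y = make_classification(
    n_samples=600, n_features=28, n_informative=10, weights=[0.8, 0.2],
    random_state=0,
)

# 10 folds; every sample is predicted exactly once by a model that never saw it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = GaussianNB()
pred = cross_val_predict(model, X, y, cv=cv)
proba_pos = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

# Per-class precision, recall, and F1, plus overall accuracy and AUC.
print(classification_report(y, pred, target_names=["HIV-negative", "HIV-positive"]))
accuracy = accuracy_score(y, pred)
roc_auc = roc_auc_score(y, proba_pos)
print("accuracy:", round(accuracy, 2), "AUC:", round(roc_auc, 2))
```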

The SVM evaluation metrics revealed that the model had a higher precision for predicting HIV-positive instances, at 0.97, than the other machine learning models, whereas its precision for HIV-negative instances stood at 0.90. In terms of recall, which measures the model’s sensitivity, the model performed better for the HIV-negative instances, at 0.99, than for the HIV-positive instances, at 0.59. The SVM model’s F1-score was higher for the HIV-negative instances, at 0.94, than for the HIV-positive ones, which stood at 0.73. The overall accuracy of the SVM model was 0.91. Regarding the model’s performance as measured by the AUC of its ROC curve, the model had a score of 0.82, as shown in Figure 5.

The bagging classifier recorded a higher precision for predicting HIV-negative instances, at 0.91, than for HIV-positive instances, which stood at 0.89. Regarding recall, the model scored better for the HIV-negative instances, at 0.98, than for the HIV-positive instances, at 0.64. The model’s F1-score was also higher for the HIV-negative instances (0.94) than for the HIV-positive instances. The model’s overall accuracy stood at 0.91. As assessed by the AUC of its ROC curve, its performance was 0.87, as shown in Figure 6.

The gradient boosting algorithm’s results show that its precision was higher for the HIV-negative instances, at 0.91, than for the HIV-positive instances, which stood at 0.89. Likewise, the model’s recall was higher for the HIV-negative instances, at 0.94, compared to 0.64 for the HIV-positive instances. Its F1-score was higher for the HIV-negative instances, at 0.94, than for the HIV-positive instances, at 0.74. The model’s overall accuracy was 0.91. In terms of the model’s performance as measured by the AUC of the ROC curve, the algorithm scored 0.91, as shown in Figure 7.

The RNN model’s results show that its precision was higher for the HIV-negative instances, at 0.99, than for the HIV-positive instances, which stood at 0.92. Likewise, the model’s recall was higher for the HIV-negative instances, at 1.00, compared to 0.98 for the HIV-positive instances. Its F1-score was higher for the HIV-negative instances, at 0.99, than for the HIV-positive instances, at 0.96. The model’s overall accuracy was 0.99. The algorithm reached optimum accuracy after 12 epochs using the Early Stopping method to avoid overfitting, as shown in Figure 8.
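The early-stopping rule can be illustrated without a deep learning framework. The helper below is a hypothetical, framework-free sketch: the monitored quantity (validation loss), the patience of 3 epochs, and the loss values are all assumptions, since the paper does not report these settings.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never
    happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation-loss curve: improves, then starts to overfit.
history = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.34]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)
```

Here the loss bottoms out at epoch 4 and training halts at epoch 7, so the later value of 0.34 is never reached; in practice the weights from the best epoch are usually restored.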

3.1. Policy Recommendations on the Application of Deep Learning and Machine Learning in Predicting HIV among MSM

Before key policy recommendations can be drawn from this work, the model needs to be validated by enrolling members of the MSM community, classifying their HIV status with the model, and then giving the same individuals an actual HIV test so that the two results can be compared for accuracy. Once the desired level of classification performance is reached, the tool can be presented to the Zimbabwe Ministry of Health and Child Care for consideration as a pre-test HIV screening tool for members of the MSM community, or for modification to include data for the general population. Tools such as a prediction model will assist HIV programmes in Zimbabwe and southern Africa to better address the needs of MSM. After that, it is essential to include this model and its attributes in health ministry policies and guidelines to ensure that the model is fully utilised.

3.2. Limitations

This study provides insights into some factors predicting HIV status among men who have sex with men using recurrent neural networks and machine learning techniques. Its findings need to be considered in public health policy, strategy, and practice. However, the cross-sectional design, the sampling technique, and the resultant distribution of demographic characteristics may have prevented all relevant factors associated with predicting HIV status among men who have sex with men from being identified. Cross-sectional studies do not show a temporal relationship between exposures and outcomes as longitudinal studies would. In addition, the study may have been underpowered to detect differences in some variables. Therefore, more extensive studies with a bigger sample size, and hence higher statistical power, may be required to fully explore all potential variables of interest.

All questionnaire data (condom use, history of HIV testing, awareness of status, and ART use) were self-reported. Although interviewers were trained in techniques to put participants at ease and support the accurate reporting of dates and events, self-reported data remain susceptible to recall bias and to social desirability bias concerning sensitive topics, especially considering that same-sex sexual behaviours are illegal and highly stigmatized in Zimbabwe. Those who were recruited and agreed to participate may be a self-selected group of individuals more comfortable disclosing their sexual behaviour. The survey was limited to Harare and Bulawayo and does not reflect MSM and transgender women/genderqueer (TGW/GQ) activity throughout all of Zimbabwe.

4. Conclusions

Recurrent neural networks, bagging, gradient boosting, the support vector machine, and Gaussian Naïve Bayes classifier were applied to predict HIV status among MSM. In circumstances like this, where stigma and fear of prosecution prevent access to services, methods such as recurrent neural networks, the bagging classifier, gradient boosting classifier, support vector machine, and Gaussian Naïve Bayes classifier can successfully provide essential information on MSM, and potentially other hard-to-reach groups. HIV prediction models applied in this study used the known HIV risk factors of individuals exposed to different risks and at varying degrees of exposure. This study revealed that machine learning classifiers could significantly improve HIV testing capacity among MSM, and their use should be advocated. Recurrent neural networks achieved a high prediction accuracy of 0.98 as compared to other machine learning models. In addition, RNNs achieved a high precision of 0.98 for both HIV-positive and -negative cases, a recall of 1.00 for HIV-negative cases and 0.94 for HIV-positive cases, and an F1-score of 0.99 for HIV-negative cases and 0.96 for positive cases. Integrating HIV status prediction models into clinical software systems may complement existing strategies, such as indicator condition-guided HIV testing, and help identify individuals that may require healthcare services among the MSM community. With continued HIV research, including research targeting MSM in Zimbabwe, further knowledge on HIV among this group is likely to become more readily available, thereby increasing the amount of data to be used in the training of HIV predictive models and ultimately resulting in the improvement of the model performance beyond the current study’s best accuracy of 0.93. Establishing HIV status predictive models is essential for intensifying HIV testing and providing healthcare services, especially to the MSM community. 
However, future studies are required to optimize deep learning and machine learning models further and integrate them into primary care settings by incorporating HIV risk factors, clinical data, and socio-behavioural-driven data. This research helped identify the key questions that help in predicting HIV status among the MSM community.

Additionally, methods such as those used in this manuscript are increasingly used in studies addressing various issues in the health sector [41,42]. Future research will explore how these mentioned studies and others can provide additional ideas to enhance our studies. Our work in the future will use recurrent neural networks and machine learning techniques to understand factors associated with improved adherence to antiretroviral treatment and pre-exposure prophylaxis by MSM in the country.

Author Contributions

Conceptualization—I.C., G.M. (Godfrey Musuka) and P.C.; methodology—I.C., S.C., G.M. (Godfrey Musuka) and P.C.; software—I.C., S.C. and P.C.; validation—G.M. (Godfrey Musuka) and T.D.; formal analysis—I.C., S.C. and P.C.; data curation—I.C., S.C. and P.C.; writing—original draft preparation, I.C.; writing—review and editing—T.D., D.C., M.P.M., E.M., S.C., R.M., P.C., C.S., H.H., G.M. (Grant Murewanhema), O.M. and G.M. (Godfrey Musuka); visualization—I.C. and S.C.; supervision—P.C. and G.M. (Godfrey Musuka); project administration—I.C. and P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This secondary data analysis received no external funding. However, the HIV and STI Biobehavioral Survey among Men Who Have Sex with Men, Transgender Women, and Genderqueer Individuals in Zimbabwe was funded by the U.S. President’s Emergency Plan for AIDS Relief (PEPFAR) through the CDC under Cooperative Agreement #NU2GGH001939. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the current manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

The study used written informed consent.

Data Availability Statement

The data used in this study are available upon reasonable request from the Ministry of Health and Child Care, Zimbabwe.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. UNAIDS. FACT SHEET. 2019. Available online: https://www.unaids.org/sites/default/files/media_asset/2019-UNAIDS-data_en.pdf (accessed on 19 May 2022).
2. WHO. Key Facts on Global HIV Epidemic and Progress in 2010; WHO: Geneva, Switzerland, 2011. Available online: https://www.who.int/teams/global-hiv-hepatitis-and-stis-programmes/hiv/strategic-information/hiv-data-and-statistics (accessed on 19 May 2022).
3. Nguyen, L.H.; Tran, B.X.; Rocha, L.E.; Nguyen, H.L.T.; Yang, C.; Latkin, C.A.; Thorson, A.; Strömdahl, S. A Systematic Review of eHealth Interventions Addressing HIV/STI Prevention among Men Who Have Sex with Men. AIDS Behav. 2019, 23, 2253–2272. Available online: https://link.springer.com/article/10.1007/s10461-019-02626-1 (accessed on 10 November 2021).
4. Musuka, G.; Dzinamarira, T. Targeting those left behind in Zimbabwe’s HIV response: A call for decriminalisation of key populations to achieve 95-95-95 targets rapidly. S. Afr. Med. J. 2021, 111, 385. Available online: http://www.samj.org.za/index.php/samj/article/view/13287 (accessed on 21 November 2021).
5. ICAP at Columbia University. HIV and STI Biobehavioral Survey among Men Who Have Sex with Men, Transgender Women, and Genderqueer Individuals in Zimbabwe—Final Report; ICAP at Columbia University: New York, NY, USA, 2020. Available online: https://icap.columbia.edu/wp-content/uploads/Zimbabwe-IBBS-Report_Final_17Aug20.pdf (accessed on 19 May 2022).
6. Hess, K.L.; Crepaz, N.; Rose, C.; Purcell, D.; Paz-Bailey, G. Trends in Sexual Behavior among Men Who have Sex with Men (MSM) in High-Income Countries, 1990–2013: A Systematic Review. AIDS Behav. 2017, 21, 2811–2834. Available online: https://link.springer.com/article/10.1007/s10461-017-1799-1 (accessed on 10 November 2021).
7. Dzinamarira, T.; Mulindabigwi, A.; Mashamba-Thompson, T.P. Co-creation of a health education program for improving the uptake of HIV self-testing among men in Rwanda: Nominal group technique. Heliyon 2020, 6, e05378.
8. Dzinamarira, T.; Muvunyi, C.M.; Kamanzi, C.; Mashamba-Thompson, T.P. HIV self-testing in Rwanda: Awareness and acceptability among male clinic attendees in Kigali, Rwanda: A cross-sectional survey. Heliyon 2020, 6, e03515.
9. Schnall, R.; Travers, J.; Rojas, M.; Carballo-Diéguez, A. eHealth Interventions for HIV Prevention in High-Risk Men Who Have Sex with Men: A Systematic Review. J. Med. Internet Res. 2014, 16, e3393. Available online: https://www.jmir.org/2014/5/e134 (accessed on 10 November 2021).
10. Hirshfield, S.; Downing, M.J., Jr.; Parsons, J.T.; Grov, C.; Gordon, R.J.; Houang, S.T.; Scheinmann, R.; Sullivan, P.S.; Yoon, I.S.; Anderson, I.; et al. Developing a Video-Based eHealth Intervention for HIV-Positive Gay, Bisexual, and Other Men Who Have Sex with Men: Study Protocol for a Randomized Controlled Trial. JMIR Res. Protoc. 2016, 5, e5554. Available online: https://www.researchprotocols.org/2016/2/e125 (accessed on 10 November 2021).
11. Olatosi, B.; Sun, X.; Chen, S.; Zhang, J.; Liang, C.; Weissman, S.; Li, X. Application of machine-learning techniques in classification of HIV medical care status for people living with HIV in South Carolina. AIDS 2021, 35, S19–S28. Available online: https://journals.lww.com/aidsonline/Fulltext/2021/05011/Application_of_machine_learning_techniques_in.3.aspx (accessed on 10 November 2021).
12. Ahlström, M.G.; Ronit, A.; Omland, L.H.; Vedel, S.; Obel, N. Algorithmic prediction of HIV status using nation-wide electronic registry data. EClinicalMedicine 2019, 17, 100203.
13. Mutai, C.K.; McSharry, P.E.; Ngaruye, I.; Musabanganji, E. Use of machine learning techniques to identify HIV predictors for screening in sub-Saharan Africa. BMC Med. Res. Methodol. 2021, 21, 1–11. Available online: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-021-01346-2 (accessed on 10 November 2021).
14. Menza, T.W.; Hughes, J.P.; Celum, C.L.; Golden, M.R. Prediction of HIV Acquisition among Men Who Have Sex with Men. Sex. Transm. Dis. 2009, 36, 547. Available online: https://pubmed.ncbi.nlm.nih.gov/19707108/ (accessed on 1 December 2021).
15. Morin, S.E.; Steward, W.T.; Charlebois, E.D.; Remien, R.H.; Pinkerton, S.D.; Johnson, M.O.; Rotheram-Borus, M.J.; Lightfoot, M.; Goldstein, R.B.; Kittel, L.; et al. Predicting HIV transmission risk among HIV-infected men who have sex with men: Findings from the healthy living project. J. Acquir. Immune. Defic. Syndr. 2005, 40, 226–235. Available online: https://journals.lww.com/jaids/Fulltext/2005/10010/Predicting_HIV_Transmission_Risk_Among.16.aspx (accessed on 13 November 2021).
16. White, E.; Dunn, D.T.; Desai, M.; Gafos, M.; Kirwan, P.; Sullivan, A.K.; McCormack, S. Predictive factors for HIV infection among men who have sex with men and who are seeking PrEP: A secondary analysis of the PROUD trial. Sex. Transm. Infect. 2019, 95, 449–454. Available online: https://sti.bmj.com/content/95/6/449 (accessed on 13 November 2021).
17. Bao, Y.; Medland, N.A.; Fairley, C.K.; Wu, J.; Shang, X.; Chow, E.P.F.; Xu, X.; Ge, Z.; Zhuang, X.; Zhang, L. Predicting the diagnosis of HIV and sexually transmitted infections among men who have sex with men using machine learning approaches. J. Infect. 2021, 82, 48–59.
18. Sharaff, A.; Gupta, H. Extra-Tree Classifier with Metaheuristics Approach for Email Classification. Adv. Intell. Syst. Comput. 2019, 924, 189–197. Available online: https://link.springer.com/chapter/10.1007/978-981-13-6861-5_17 (accessed on 17 November 2021).
19. Ampomah, E.K.; Qin, Z.; Nyame, G. Evaluation of tree-based ensemble machine learning models in predicting stock price direction of movement. Information 2020, 11, 332.
20. Balzer, L.B.; Havlir, D.V.; Kamya, M.R.; Chamie, G.; Charlebois, E.D.; Clark, T.D.; Koss, C.A.; Kwarisiima, D.; Ayieko, J.; Sang, N.; et al. Machine Learning to Identify Persons at High-Risk of Human Immunodeficiency Virus Acquisition in Rural Kenya and Uganda. Clin. Infect. Dis. 2020, 71, 2326–2333. Available online: https://academic.oup.com/cid/article/71/9/2326/5614347 (accessed on 13 November 2021).
21. Xie, W.; Ji, M.; Huang, R.; Hao, T.; Chow, C.Y. Predicting Risks of Machine Translations of Public Health Resources by Developing Interpretable Machine Learning Classifiers. Int. J. Environ. Res. Public Health 2021, 18, 8789. Available online: https://www.mdpi.com/1660-4601/18/16/8789/htm (accessed on 13 November 2021).
22. Mbunge, E.; Simelane, S.; Fashoto, S.G.; Akinnuwesi, B.; Metfula, A.S. Application of deep learning and machine learning models to detect COVID-19 face masks—A review. Sustain. Oper. Comput. 2021, 2, 235–245.
23. Akinnuwesi, B.A.; Fashoto, S.G.; Mbunge, E.; Odumabo, A.; Metfula, A.S.; Mashwama, P.; Uzoka, F.-M.; Owolabi, O.; Okpeku, M.; Amusa, O.O. Application of intelligence-based computational techniques for classification and early differential diagnosis of COVID-19 disease. Data Sci. Manag. 2021, 4, 10–18.
24. Zareapoor, M.; Shamsolmoali, P. Application of Credit Card Fraud Detection: Based on Bagging Ensemble Classifier. Procedia Comput. Sci. 2015, 48, 679–685.
25. Sreng, S.; Maneerat, N.; Hamamoto, K.; Panjaphongse, R. Automated Diabetic Retinopathy Screening System Using Hybrid Simulated Annealing and Ensemble Bagging Classifier. Appl. Sci. 2018, 8, 1198. Available online: https://www.mdpi.com/2076-3417/8/7/1198/htm (accessed on 13 November 2021).
26. Son, J.; Jung, I.; Park, K.; Han, B. Tracking-by-Segmentation with Online Gradient Boosting Decision Tree. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3056–3064.
27. Chakrabarty, N.; Kundu, T.; Dandapat, S.; Sarkar, A.; Kole, D.K. Flight Arrival Delay Prediction Using Gradient Boosting Classifier. Adv. Intell. Syst. Comput. 2019, 813, 651–659. Available online: https://link.springer.com/chapter/10.1007/978-981-13-1498-8_57 (accessed on 13 November 2021).
28. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21.
29. Taha, A.A.; Malebary, S.J. An Intelligent Approach to Credit Card Fraud Detection Using an Optimized Light Gradient Boosting Machine. IEEE Access 2020, 8, 25579–25587.
30. Michelucci, U. Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks; Mirashi, A., Moodie, M., John, C.S., Eds.; Apress: Dübendorf, Switzerland, 2018; 425p.
31. Clark, K.; Luong, M.-T.; Manning, C.D.; Le, Q.V. Semi-Supervised Sequence Modeling with Cross-View Training. arXiv 2018, arXiv:1809.08370.
32. Yuan, J.; Wang, H.; Lin, C.; Liu, D.; Yu, D. A Novel GRU-RNN Network Model for Dynamic Path Planning of Mobile Robot. IEEE Access 2019, 7, 15140–15151.
33. Wang, G.; Wei, W.; Jiang, J.; Ning, C.; Chen, H.; Huang, J.; Liang, B.; Zang, N.; Liao, Y.; Chen, R.; et al. Application of a long short-term memory neural network: A burgeoning method of deep learning in forecasting HIV incidence in Guangxi, China. Epidemiol. Infect. 2019, 147, e194. Available online: https://www.cambridge.org/core/journals/epidemiology-and-infection/article/application-of-a-long-shortterm-memory-neural-network-a-burgeoning-method-of-deep-learning-in-forecasting-hiv-incidence-in-guangxi-china/B1A6C408C6106A6B133E208FB2CD2EF8 (accessed on 12 November 2021).
34. Xiang, Y.; Du, J.; Fujimoto, K.; Li, F.; Schneider, J.; Tao, C. Application of artificial intelligence and machine learning for HIV prevention interventions. Lancet HIV 2021, 9, e54–e62. Available online: http://www.thelancet.com/article/S2352301821002472/fulltext (accessed on 12 November 2021).
35. Fashoto, S.G.; Mbunge, E.; Ogunleye, G.; den Burg, J.V. Implementation of machine learning for predicting maize crop yields using multiple linear regression and backward elimination. Malays. J. Comput. 2021, 6, 679–697. Available online: https://mjoc.uitm.edu.my (accessed on 5 October 2021).
36. Yadav, S.S.; Kadam, V.J.; Jadhav, S.M.; Jagtap, S.; Pathak, P.R. Machine learning based malaria prediction using clinical findings. In Proceedings of the 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5–7 March 2021; pp. 216–222.
37. Uddin, S.; Khan, A.; Hossain, M.E.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 1–16. Available online: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-1004-8 (accessed on 12 November 2021).
38. Elujide, I.; Fashoto, S.G.; Fashoto, B.; Mbunge, E.; Folorunso, S.O.; Olamijuwon, J.O. Application of deep and machine learning techniques for multi-label classification performance on psychotic disorder diseases. Inform. Med. Unlocked 2021, 23, 100545.
39. Jeni, L.A.; Cohn, J.F.; De La Torre, F. Facing imbalanced data—Recommendations for the use of performance metrics. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), Geneva, Switzerland, 2–5 September 2013; pp. 245–251.
40. Mbunge, E.; Fashoto, S.G.; Bimha, H. Prediction of box-office success: A review of trends and machine learning computational models. Int. J. Bus. Intell. Data Min. 2022, 20, 192. Available online: http://www.inderscience.com/link.php?id=120825 (accessed on 19 February 2022).
41. Vulli, A.; Srinivasu, P.N.; Sashank, M.S.K.; Shafi, J.; Choi, J.; Ijaz, M.F. Fine-Tuned DenseNet-169 for Breast Cancer Metastasis Prediction Using FastAI and 1-Cycle Policy. Sensors 2022, 22, 2988.
42. Ali, F.; Khan, P.; Riaz, K.; Kwak, D.; Abuhmed, T.; Park, D.; Kwak, K.S. A fuzzy ontology and SVM–based Web content classification system. IEEE Access 2017, 5, 25781–25797.

Figure 1. Correlation matrix of the selected features.

Figure 2. HIV status prediction models.

Figure 3. RNN Architecture.

Figure 4. Naïve Bayes algorithm ROC curve.

Figure 5. Support Vector Machine’s ROC Curve.

Figure 6. Bagging classifier ROC Curve.

Figure 7. Gradient boosting algorithm ROC curve.

Figure 8. RNN’s training and testing loss.

Table 1. Description of features.

Feature Name	Feature Description
PPRKNOW	Knowledge of pre-exposure prophylaxis (PrEP)
PPEKNOW	Knowledge of post-exposure prophylaxis (PEP)
HKHVPRSK	Self-perceived chances of becoming HIV infected in the next 12 months
DEAGENUM	Age in completed years
HIVNOTES	HIV Services where one was ever referred to
DEMARSTA	Marital status
INLRNWHT_9	Desire to learn more about HIV treatment
SYPHTRE	Syphilis test result
INLRNWHT	HIV-related topics to learn more about
PPRTAKE	Ever taken PrEP
STCIRCM	Circumcision status
DEOUTWHO_2	Disclosed sexual identity to family members
DEOUTWHO_6	Disclosed sexual identity to a Health care provider
LUFREE	Ever been given “packets” of lubricant for free? For example, through an outreach service, drop-in centre, or health clinic, in the last six months
INLRNWHT_2	Desire to learn more about how to prevent HIV
LUTYPE_c	Use of water-based lubricant (Durex, etc.) during anal sex in the last 6 months
RCFEMNA	Type of sex (anal, oral, both), during last sex with main female partner
INLRNWHT_1	Desire to learn more about HIV prevention
DEATTRA	Sex/gender most sexually attracted to
LUNEVUSE	The main reason for not using a lubricant during anal sex in the past six months
DELIVESX	Currently living with a sexual partner or not
DEINCOME	Last monthly income
COANO_a	Condom use during anal sex when drunk
RCMAMNFQ	Frequency of condom use with the male partner one has sex with the most, in the last 6 months
DEREADWR	Ability to read and write
COLIKELY	Whether one is more likely to use a condom when a man inserts his penis into one’s anus, when one is inserting one’s penis into someone’s anus, or equally likely in both cases
LU12LUTG	Frequency of use of lubricants during anal sex with a man or transgender woman, in the last six months

Table 2. Performance of HIV status prediction models.

Prediction Model	Precision (Negative)	Precision (Positive)	Recall (Negative)	Recall (Positive)	F1-Score (Negative)	F1-Score (Positive)	Accuracy	AUC
RNN	0.98	0.98	1.00	0.94	0.99	0.96	0.98	0.94
Gaussian Naïve Bayes	0.89	0.65	0.88	0.68	0.88	0.66	0.83	0.87
Bagging Classifier	0.89	0.90	0.96	0.62	0.92	0.73	0.90	0.85
SVM	0.89	0.96	0.93	0.62	0.91	0.75	0.91	0.81
Gradient Boosting Classifier	0.91	0.89	0.97	0.65	0.94	0.75	0.91	0.89
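
The F1-scores in Table 2 can be cross-checked against the reported precision and recall, since the F1-score is the harmonic mean of the two: F1 = 2PR / (P + R). The following is a minimal sketch of that check, not the authors' evaluation code. The names `TABLE_2`, `f1_score`, and `check_table` are hypothetical, and the inputs are the two-decimal values from the table, so a small tolerance is assumed to absorb rounding.

```python
# Per-class values copied from Table 2 (rounded to two decimals there).
# Each entry maps class label -> (precision, recall, reported F1-score).
TABLE_2 = {
    "RNN": {"negative": (0.98, 1.00, 0.99), "positive": (0.98, 0.94, 0.96)},
    "Gaussian Naive Bayes": {"negative": (0.89, 0.88, 0.88), "positive": (0.65, 0.68, 0.66)},
    "Bagging Classifier": {"negative": (0.89, 0.96, 0.92), "positive": (0.90, 0.62, 0.73)},
    "SVM": {"negative": (0.89, 0.93, 0.91), "positive": (0.96, 0.62, 0.75)},
    "Gradient Boosting Classifier": {"negative": (0.91, 0.97, 0.94), "positive": (0.89, 0.65, 0.75)},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_table(table, tolerance=0.01):
    """Return rows whose recomputed F1 differs from the reported F1 by more than tolerance.

    The tolerance is an assumption: the inputs are already rounded,
    so the recomputed value can drift slightly from the published one.
    """
    mismatches = []
    for model, classes in table.items():
        for label, (precision, recall, reported) in classes.items():
            recomputed = f1_score(precision, recall)
            if abs(recomputed - reported) > tolerance:
                mismatches.append((model, label, recomputed, reported))
    return mismatches


if __name__ == "__main__":
    for model, classes in TABLE_2.items():
        for label, (precision, recall, reported) in classes.items():
            recomputed = f1_score(precision, recall)
            print(f"{model:30s} {label:8s} recomputed={recomputed:.3f} reported={reported:.2f}")
    print("mismatches:", check_table(TABLE_2))
```

A check of this kind only confirms that the reported columns are internally consistent. It says nothing about how the models would perform on data outside the survey sample.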

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
