Article

Multi-Label Classification for Predicting Antimicrobial Resistance on E. coli

by Prince Delator Gidiglo 1, Soualihou Ngnamsie Njimbouom 1, Gelany Aly Abdelkader 1, Soophia Mosalla 2 and Jeong-Dong Kim 1,2,3,*
1 Department of Computer Science and Electronics Engineering, Sun Moon University, Asan 31460, Republic of Korea
2 Division of Computer Science and Engineering, Sun Moon University, Asan 31460, Republic of Korea
3 Genome-based BioIT Convergence Institute, Sun Moon University, Asan 31460, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8225; https://doi.org/10.3390/app14188225
Submission received: 28 July 2024 / Revised: 26 August 2024 / Accepted: 10 September 2024 / Published: 12 September 2024
(This article belongs to the Special Issue Syntheses and Applications in Medicinal Chemistry)

Abstract

Antimicrobial resistance (AMR) represents a pressing global health challenge with implications for developmental progress, as it increasingly manifests within pathogenic bacterial populations. This phenomenon poses a substantial public health hazard, given its capacity to undermine the efficacy of medical interventions and thereby jeopardize patient welfare. In recent years, a growing number of machine learning methods have been employed to predict antimicrobial resistance. However, these methods still face challenges in single-drug resistance prediction. This study proposes an effective model for predicting antimicrobial resistance in E. coli using the eXtreme Gradient Boosting (XGBoost) model, evaluated against ten other machine learning methods. The experimental results demonstrate that XGBoost outperforms the other machine learning classification methods, particularly in terms of precision and hamming loss, with scores of 0.891 and 0.110, respectively. Our study explores existing machine learning models for predicting antimicrobial resistance (AMR), thereby supporting improved diagnosis and treatment of infections in clinical settings.

1. Introduction

According to the World Health Organization (WHO) and Centers for Disease Control and Prevention (CDC), at least 2 million cases of antibiotic-resistant infections are recorded annually worldwide [1,2]. By 2050, the annual death toll caused by bacteria resistant to antibiotics is estimated to reach 10 million, leading to a global economic burden in tackling the issue [3]. Antimicrobial resistance is defined as the capability of microorganisms, such as bacteria, fungi, viruses, and parasites, to resist the effects of antimicrobial drugs. The emergence of antimicrobial resistance is attributed mainly to the overuse and abuse of antimicrobial agents [4]. Consequently, microorganisms resistant to common and widely used antibiotics are becoming increasingly common [5]. Antimicrobial resistance in Escherichia coli (E. coli) is a significant concern due to its potential impact on both the veterinary and human health sectors. Understanding and predicting antimicrobial resistance to develop effective prevention and control strategies is of the utmost importance.
Escherichia coli is a type of bacteria that is often present in the intestines of both humans and animals. This bacterium is commonly used as a marker for monitoring antimicrobial resistance, making it essential for research on the development of resistance in the fields of veterinary and human health on a global scale [6]. There is a need for alternative strategies to fight antimicrobial resistance and prevent the spreading of drug-resistant E. coli in humans and animals.
Advances in the field of medicinal chemistry have facilitated the syntheses of antimicrobial compounds that aim to combat resistant E. coli strains. Evidently, utilizing synthesized compounds in high-throughput screening assays against various E. coli strains enables the identification of compounds that are effective against resistant strains. In addition, through the integration of the synthesized compounds, predictive models using machine learning (ML) algorithms are employed to analyze patterns and identify the factors contributing to AMR in E. coli, allowing for early identification and intervention [7]. Multi-label classification, a field in ML, aids in achieving this goal.
Multi-label classification is a supervised classification method in which each instance in a dataset can be assigned multiple labels simultaneously from a predefined set of possible labels [8]. Unlike traditional single-label classification, multi-label classification allows multiple overlapping labels to be assigned to instances based on their features [9]. The current methods used to predict AMR often rely on the manual interpretation of laboratory tests, which can be time-consuming and subject to human error. Moreover, with the increasing variety of antimicrobial agents available, it becomes challenging to predict the resistance patterns of individual microorganisms accurately [7].
ML algorithms have shown promise in predicting antimicrobial resistance in E. coli. Studies that have utilized ML algorithms and mass spectrometry have been successful in predicting antimicrobial susceptibility, differentiating closely related species, serotyping, and identifying clonal lineages of significant clinical pathogens such as methicillin-resistant Staphylococcus aureus [10]. Moreover, ML models have been utilized to predict peptides’ antimicrobial activity against gram-negative and gram-positive bacteria, highlighting their potential in combating AMR [11]. The researchers in Refs. [12,13] analyzed AMR by applying various ML algorithms, such as support vector machine (SVM), logistic regression (LR), and random forest (RF), to data from whole-genome sequencing, attaining a high accuracy in predicting AMR. ML algorithms have also been employed to predict linear antimicrobial peptides active against E. coli, allowing for the identification of potential treatments for drug-resistant strains [14]. Furthermore, the researchers in Ref. [15] used deep transfer learning (Deep TL) to predict antimicrobial resistance to novel antibiotics, obtaining a high accuracy of 91.5%. Also, an Ensemble Classifier Chain (ECC) was used for the multi-label classification of E. coli with the base classifiers RF, SVM, and LR; the authors achieved an accuracy of 0.915 in the ECC model using RF as the base classifier [16]. In cases of imbalanced labels, transfer learning is frequently utilized in the medical field [17,18]. For example, deep transfer learning was employed by Gao et al. [19] to lessen the healthcare disparities brought on by imbalanced biomedical data. To enhance the model’s performance, the researchers trained the model on the majority group and then transferred the learned knowledge to each minority group.
The goal of this study was to propose a computational approach using ML models to predict antimicrobial resistance in E. coli. By conducting extensive experiments with various ML methods, we strive to achieve state-of-the-art performance, thereby enhancing our ability to predict antimicrobial resistance effectively. Our approach seeks to prioritize drug candidates that are less likely to be resisted by E. coli bacteria, thus streamlining the development of more effective antimicrobial agents.
The organization of this paper is as follows: Section 2 provides an overview of the prediction of antimicrobial resistance, which includes the collection of the data, preprocessing of the data, and prediction models. Section 3 outlines the experimental procedures applied to the datasets, the resulting findings, and a discussion of the outcomes. Finally, Section 4 offers the conclusion of the paper.

2. Predicting Antimicrobial Resistance

The research was carried out in three main steps: data collection, data preprocessing, and prediction modeling. The conceptual view of the experimental approach is shown in Figure 1.

2.1. Data Collection

The data were collected from research carried out by researchers from the University of Giessen, together with a collection of data from environmental sources [20,21]. Four antibiotics were considered in this research. Ciprofloxacin (CIP) is a fluoroquinolone antibiotic used for the treatment of several bacterial infections, including skin, urinary tract, and respiratory tract infections; CIP is effective against gram-negative bacteria, including E. coli. Both cefotaxime (CTX) and ceftazidime (CTZ) are cephalosporin antibiotics used for the treatment of urinary, abdominal, and respiratory infections, as well as infections such as meningitis caused by the majority of gram-positive and gram-negative bacteria. Gentamicin (GEN) is an aminoglycoside antibiotic used to treat several infections caused by gram-negative bacteria, as well as severe bacterial infections.
The data include human and animal samples of E. coli, collected from a total of 987 whole-genome sequences of E. coli containing resistance information for CIP, CTX, CTZ, and GEN. The additional data from environmental sources include samples curated from environmental and ecological sources, notably the diagnostic laboratory at Bhuddhasothom University Hospital and the canals surrounding the hospital, the Cambridge University Hospitals NHS Foundation Trust, and the Bacteraemia Resistance Surveillance Programme (BSAC collection) sewage plant. The human isolates included urine and blood samples from infected patients.

2.2. Data Preprocessing

Data preprocessing is an essential stage in ML. The operations performed during this stage included data cleaning and data transformation. Data cleaning involves identifying and checking categorical and missing values in the data, while data transformation involves one-hot and label encoding. The VITEK 2 system was used for antimicrobial susceptibility testing, with interpretation performed according to the EUCAST guidelines. The isolates were then filtered to remove those with missing antibiotic resistance information. Variants were called using bcftools [22] version 1.12; Samtools [22] version 1.12 was used for sorting the aligned reads, and vcftools [23] version 0.1.13 was used for filtering the raw variants.
To prepare the SNP data for subsequent machine learning, label encoding and one-hot encoding were utilized. The nucleotide bases A, G, C, T, and N were encoded into the numerical values 1, 2, 3, 4, and 0, respectively, using label encoding. The output was an SNP matrix structured such that each row corresponds to a sample and each column represents a variant allele. Subsequently, all isolates were merged according to the reference allele positions, facilitating feature integration for the machine learning models. One-hot encoding was then applied to convert the DNA sequences (A, G, C, T) into a binary matrix composed of 0s and 1s, vectorizing the sequence data.
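To make the two encodings concrete, the sketch below label-encodes toy nucleotide sequences using the stated mapping (A = 1, G = 2, C = 3, T = 4, N = 0) and then one-hot encodes the resulting matrix with scikit-learn; the sequences and encoder settings are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch of the label encoding and one-hot encoding steps
# (toy sequences; not the actual preprocessing code).
import numpy as np
from sklearn.preprocessing import OneHotEncoder

BASE_TO_INT = {"N": 0, "A": 1, "G": 2, "C": 3, "T": 4}

def label_encode(seqs):
    """Map each nucleotide to its integer code; rows = samples, columns = variant positions."""
    return np.array([[BASE_TO_INT[base] for base in seq] for seq in seqs])

seqs = ["AGCT", "AGTT", "NGCT"]                      # toy alleles for three isolates
snp_matrix = label_encode(seqs)                      # shape (3, 4)

# One-hot encoding: each position expands into binary (0/1) indicator columns.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
binary_matrix = encoder.fit_transform(snp_matrix)    # vectorized binary features
```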

2.3. Prediction Models

A total of eleven ML models were used in this research. These included Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), Adaptive Boosting classifier (AdaBoost), Extra Trees Classifier (ETC), Gradient Boosted Classifier (GBC), k-Nearest Neighbors (kNN), Support Vector Machine (SVM), Logistic Regression (LR), and Multinomial Naïve Bayes (MNB).

2.3.1. Decision Tree

DT is structured as a tree-like model in which the internal nodes denote features, branches depict decision paths, and leaves correspond to outcomes. The measures used for selecting splits in a DT for multi-label classification include entropy and information gain computed over multiple labels, whereas the Gini index accounts for the impurity in the dataset. Entropy is calculated as shown in Equation (1), the Gini index as shown in Equation (2), and the information gain as shown in Equation (3).
$$\mathrm{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i \quad (1)$$
$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p_i^2 \quad (2)$$
$$\mathrm{Information\ Gain} = \mathrm{Entropy}(\mathrm{parent}) - \sum_{j=1}^{m} \frac{N_j}{N}\,\mathrm{Entropy}(\mathrm{child}_j) \quad (3)$$
where $p_i$ is the proportion of samples having the $i$-th label, $m$ is the number of child nodes after the split, $N$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$-th child node.
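The split criteria in Equations (1)–(3) can be implemented in a few lines; the sketch below computes them for a single label at a node (a didactic illustration, not the library code used in the experiments).

```python
# Didactic implementations of Equations (1)-(3) for a single label.
import numpy as np

def entropy(labels):
    """Equation (1): -sum(p_i * log2(p_i)) over the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Equation (2): 1 - sum(p_i^2) over the class proportions at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """Equation (3): parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A perfect split of 3 resistant (1) and 3 susceptible (0) isolates yields a gain of 1.0.
gain = information_gain([1, 1, 1, 0, 0, 0], [[1, 1, 1], [0, 0, 0]])
```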

2.3.2. Random Forest

RF is a classification algorithm that uses decision trees as the base learning model. Developed by Breiman [24], RF is a meta-estimator that enhances prediction accuracy and mitigates overfitting by averaging the predictions of multiple DTs trained on distinct subsamples of the dataset. The Gini index is used with RF on classification data to determine which branches are more likely to occur, as calculated in Equation (2).
Entropy uses the probability of outcomes to determine how nodes branch in a decision tree, as computed in Equation (1).

2.3.3. eXtreme Gradient Boosting

Introduced by Tianqi Chen and Carlos Guestrin, XGBoost is an optimized learning system built upon boosting tree models, which typically consider only the first derivative (the rate of change) of the loss function [25]. XGBoost instead employs a second-order Taylor expansion of the loss, incorporating both the first and second derivatives, which allows it to capture more complex patterns in the data. Because these per-sample derivatives can be computed independently, XGBoost supports efficient parallel processing on multiple CPU cores. Its architecture, which integrates multiple weak learners, inherently provides resistance to overfitting.
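The role of the two derivatives can be seen in xgboost's custom-objective interface, where each boosting round consumes exactly the per-sample gradient and Hessian of the loss. The logistic loss below is a minimal sketch under that assumption, not the paper's configuration.

```python
# Sketch: the gradient/Hessian pair that XGBoost's second-order Taylor
# expansion consumes each round, shown here for the logistic loss.
import numpy as np

def logistic_obj(preds, dtrain):
    """Return the per-sample first and second derivatives of the logistic loss."""
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))  # probability from the current raw margin
    grad = p - y                       # first derivative (rate of change)
    hess = p * (1.0 - p)               # second derivative (curvature)
    return grad, hess

# Hypothetical usage: xgboost.train(params, dtrain, obj=logistic_obj)
```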

2.3.4. Categorical Boosting

CatBoost is an ML algorithm that utilizes gradient boosting on DTs. CatBoost creates multiple DTs, known as an ensemble of trees, to predict a classification label and works effectively with the categorical features in datasets. Moreover, CatBoost reduces overfitting, and its faster convergence means that it trains quickly, thereby making it suitable for multi-label classification.

2.3.5. Adaptive Boosting

AdaBoost is an ML algorithm that belongs to a class of algorithms called boosting models [26]. The process involves boosting performance through the creation of a strong classifier, which is achieved by a linear aggregation of weak classifiers, each assigned specific weights. AdaBoost exemplifies a broader category of ML strategies known as ensemble methods, where these methods improve prediction by combining several classifiers into a unified predictive model.

2.3.6. Gradient Boosted Classifier

GBC is a type of supervised ML algorithm that constructs an additive model iteratively, proceeding in a forward stage-wise manner to create a robust predictive model that is essential for multilabel classification. Various constraints and regularization techniques are employed to enhance the algorithm’s performance and prevent overfitting. These notably include penalized learning, which imposes penalties on complex models, and tree constraints to limit tree complexity.

2.3.7. Extra Trees Classifier

The ETC is similar to RF but builds trees using random splits. ETC also combines randomness with an ensemble approach, thereby reducing overfitting. ETC uses the original sample of the dataset without subsampling the input with replacement, reducing bias. Also, it reduces variance when the highly randomized tree algorithm chooses each node’s split point.

2.3.8. k-Nearest Neighbors

k-NN is an ML algorithm used for classification and regression; in multi-label classification, each instance may be assigned multiple labels simultaneously. This is achieved by first choosing the number of neighbors, k, to be used for prediction. The k closest neighbors are then identified by measuring the distance from the query instance (isolate) to all other instances (isolates) in the training dataset. Finally, the neighbors' labels are aggregated, and labels are assigned to the isolate based on this aggregation.
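scikit-learn's KNeighborsClassifier accepts a multi-label indicator target directly, so the neighbor-voting procedure described above can be sketched as follows (toy data; k = 5 is an illustrative choice):

```python
# Sketch of multi-label k-NN on toy data (not the experimental dataset).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.random((100, 20))                  # toy feature vectors for isolates
Y = rng.integers(0, 2, size=(100, 4))      # four binary resistance labels each

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 nearest isolates
knn.fit(X, Y)
labels = knn.predict(X[:3])                # aggregated neighbor votes per label
```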

2.3.9. Support Vector Machine

SVM is a supervised learning algorithm that operates by constructing a hyperplane, or a series of hyperplanes, within a high-dimensional feature space, facilitating the effective classification of data points. SVM was introduced in Ref. [27] and was initially conceived as a binary classification tool, but it is also applicable to multi-label classification. The Lagrangian multiplier method is used in the dual formulation of the SVM optimization problem, combining the objective of minimizing the norm of the weight vector $w$ with constraints ensuring that the data points are correctly classified with a margin [28,29,30]. This is computed as shown in Equation (4):
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \quad (4)$$
where $w$ is the weight vector, $b$ is the bias term, $\alpha$ is the vector of Lagrange multipliers (one per data point), and $N$ is the number of data points.

2.3.10. Logistic Regression

LR is efficient and relatively simple to implement, though it might not capture complex relationships like tree-based models [31]. The standard logistic function, which is an S-shaped curve, is given by Equation (5):
$$f(x) = \frac{1}{1 + e^{-x}} \quad (5)$$
LR models are widely recognized in the medical and health domain, notably for their application in genetics [32], death risk prediction [33], and the assessment of risk events such as prostate cancer [34]. The versatility and effectiveness of LR make it valuable for understanding and predicting various health-related conditions.

2.3.11. Multinomial Naive Bayes

MNB is a probabilistic classifier founded on Bayes’ theorem, leveraging robust independence assumptions among features [35]. MNB is computed as shown in Equation (6):
$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)} \quad (6)$$
The variable $y$ is the class variable, and $X = (x_1, x_2, \ldots, x_n)$ represents the parameters/features. The Naïve Bayes classifier assumes that the attributes are independent of one another, as given in Equation (7). So,
$$P(y \mid X) = \frac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1)\, P(x_2) \cdots P(x_n)} \quad (7)$$
The MNB algorithm calculates the probability and gives the highest probability as the output in a relatively short time due to its simple probabilistic model; this is helpful in multi-label classification.
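A toy run of Equations (6) and (7) with scikit-learn's MultinomialNB shows the posterior computation in practice; the count features below are invented for illustration.

```python
# Toy illustration of the MNB posterior in Equations (6)-(7).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])  # invented count features
y = np.array([0, 0, 1, 1])                                   # two classes

mnb = MultinomialNB().fit(X, y)
posterior = mnb.predict_proba([[1, 2, 2]])   # P(y | X) for a new sample
prediction = mnb.predict([[1, 2, 2]])        # class with the highest posterior
```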

3. Experimentation

The experimentation section describes the datasets used to train and evaluate the models. It also details the experimental setup and the hyperparameter settings of the eleven ML models. Finally, the findings from the experimental iterations are reported.

3.1. Dataset Description

The datasets used comprise whole-genome sequences of E. coli strains with resistance information for four antibiotics (CIP, CTX, CTZ, and GEN). After filtering out the isolates with missing antibiotic resistance information, the dataset contained 809 strains of E. coli, each characterized as either susceptible or resistant. Susceptibility is characterized by the inability of the microorganism to grow in the presence of the drug, whereas resistance means that the bacteria can grow even in the presence of the drug.
The CIP, CTX, CTZ, and GEN resistance percentages for the isolates were 45%, 44%, 34%, and 23%, respectively. The data were split into training and testing datasets at a ratio of 80:20. The first column of the dataset, containing the strain names, was removed, leaving the values related to the susceptibility or resistance of the E. coli strains to be used as input features for training and further experimentation.
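A minimal sketch of this preparation step is shown below; the synthetic DataFrame, column layout, strain names, and random seed are assumptions made for illustration only.

```python
# Sketch of dropping the strain-name column and making the 80:20 split
# (synthetic stand-in data; not the published dataset).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_strains, n_features = 809, 20                         # 809 strains, toy feature count
df = pd.DataFrame(rng.integers(0, 5, size=(n_strains, n_features)))
df.insert(0, "strain", [f"EC{i:04d}" for i in range(n_strains)])  # hypothetical names
labels = pd.DataFrame(rng.integers(0, 2, size=(n_strains, 4)),
                      columns=["CIP", "CTX", "CTZ", "GEN"])       # toy binary labels

X = df.drop(columns="strain").values                    # remove the name column
Y = labels.values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20,
                                                    random_state=42)
```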

3.2. Experimental Setup

The experiments were performed on a Windows operating system, with hardware consisting of an Intel(R) Core(TM) i7-8700 CPU (Intel, Santa Clara, CA, USA) running at 3.20 GHz and 24 GB of RAM. Python version 3.10.12, run in Jupyter Notebook with the pandas, numpy, and scikit-learn libraries, was used for model development and analysis.

3.3. Hyperparameter Setting

The experimentation involved a hyperparameter optimization process. Table 1 shows the hyperparameter settings used in the experiment for the eleven machine learning models used in this study.
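As an example of how the Table 1 settings for the best-performing model could be wired up, the sketch below configures XGBoost with those values and wraps it for the four-label target; the one-booster-per-label wrapper is an assumption, not the authors' released code.

```python
# Sketch: XGBoost configured with the Table 1 hyperparameters, wrapped so
# that one booster is fitted per antibiotic label (an assumed setup).
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

base = XGBClassifier(
    learning_rate=0.1,   # step size shrinkage (Table 1)
    max_depth=6,         # maximum depth of a tree (Table 1)
    n_estimators=100,    # number of trees (Table 1)
    eval_metric="logloss",
)
model = MultiOutputClassifier(base)
# model.fit(X_train, Y_train) would train four boosters: CIP, CTX, CTZ, GEN.
```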

3.4. Evaluation Metrics

The performance of the models was evaluated using classification metrics: accuracy (ACC), precision (PRE), recall (REC), F1-score, and hamming loss (HAM).
Accuracy measures how accurately the model distinguishes between resistant and susceptible strains, determining the correct predictions among all the samples. It is calculated as shown in Equation (8) [36].
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (8)$$
$TP$ means that resistant strains are correctly detected as resistant, and $TN$ means that strains detected as susceptible are indeed susceptible. $FP$ means that non-resistant strains were wrongly detected as resistant, and $FN$ means that resistant strains were wrongly detected as susceptible.
Precision compares the correct, true positives to the predicted ones, determining how accurately the models perform. This indicates how many of the strains predicted by the model to be resistant are resistant and is computed using Equation (9) [37].
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (9)$$
Recall shows the proportion of resistant strains that the models properly detect. Thus, the number of resistant strains the model can successfully predict from actual resistant strains is calculated using Equation (10) [37].
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (10)$$
F1-score is the weighted average of precision and recall. This shows the balance between precision and recall with a single value. F1-score is calculated using Equation (11) [37].
$$F1\text{-}\mathrm{score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (11)$$
Hamming loss measures the fraction of labels that are incorrectly predicted across all classes for each sample. The result is a value between 0 and 1, where 0 denotes correct labeling with no errors and 1 denotes that all labels were incorrectly predicted. It is computed as shown in Equation (12) [38].
$$\mathrm{HAM} = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \,\Delta\, Z_i|}{|L|} \quad (12)$$
where $m$ is the total number of samples, $Y_i$ is the true label set of the $i$-th sample, $Z_i$ is the predicted label set of the $i$-th sample, $\Delta$ denotes the symmetric difference between the two label sets, and $|L|$ corresponds to the total number of labels.
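All five metrics are available in scikit-learn; the sketch below evaluates toy predictions, where micro averaging for precision, recall, and F1 is an assumption about the aggregation used.

```python
# Computing Equations (8)-(12) on toy multi-label predictions.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss

Y_test = np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]])  # toy true labels
Y_pred = np.array([[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0]])  # toy predictions

ham = hamming_loss(Y_test, Y_pred)                        # Equation (12)
pre = precision_score(Y_test, Y_pred, average="micro")    # Equation (9)
rec = recall_score(Y_test, Y_pred, average="micro")       # Equation (10)
f1 = f1_score(Y_test, Y_pred, average="micro")            # Equation (11)
# For binary labels, per-label accuracy (Equation (8)) equals 1 - hamming loss;
# sklearn's accuracy_score would instead report the stricter subset accuracy.
acc = 1.0 - ham
```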

3.5. Results and Discussion

The performance of each of the eleven ML models was evaluated and is shown in Table 2 below. The XGBoost model leads in accuracy, precision, recall, F1-score, and hamming loss, demonstrating a high performance across all primary metrics with a high predictive accuracy of 0.890, and a corresponding F1-score of 0.890. XGBoost also features the lowest hamming loss at 0.110, suggesting fewer misclassified instances relative to the total number of instances, underscoring its robustness in handling the diverse dataset.
ETC and RF also showed a strong performance, particularly in terms of accuracy and consistency across the evaluation metrics; however, their performance was slightly lower than that of XGBoost. The RF model achieves an accuracy of 0.885, with a minimal hamming loss of 0.115, indicating that it is a viable alternative with less complexity than boosting methods. The DT, despite being a simpler model, shows commendable recall and F1-scores, both at 0.849, highlighting its effectiveness in capturing a high number of positive instances, though with a relatively higher hamming loss of 0.151. MNB significantly underperforms across all metrics, with the highest hamming loss of 0.413.
These findings illustrate the diverse capabilities and trade-offs of the different ML models used in classification. The choice of the model thus depends on the specific requirements of precision, recall, computational efficiency, and the ability to minimize incorrect classifications, as indicated by the hamming loss.
The experimental results demonstrate the effectiveness of ML models in predicting antimicrobial resistance in E. coli. After 10-fold cross-validation on the test set, the XGBoost model exhibited the highest predictive accuracy, precision, recall, and F1-score and the lowest hamming loss among the models, with average values of 0.890, 0.891, 0.890, 0.890, and 0.110, respectively. Table 3 compares the performance of the best model from the proposed approach with previous works, further emphasizing the proposed approach’s superior performance. The XGBoost model exhibits the lowest hamming loss (0.110), indicating its efficacy in accurately predicting labels, and it achieves the highest precision, highlighting its ability to minimize false positives. The comprehensive evaluation metrics, including accuracy, precision, recall, F1-score, and hamming loss, effectively assess the model’s performance. Moreover, the findings from this research support the significance of utilizing ML algorithms to predict antimicrobial resistance, offering valuable insights into addressing the global challenge of increasing antibiotic resistance.
In this research, we experimented with eleven ML models to predict antimicrobial resistance in E. coli. These models included tree-based models (DT, RF, XGBoost, CatBoost, AdaBoost, GBC, and ETC), a linear model (LR), a non-parametric model (k-NN), a kernel-based model (SVM), and a probabilistic model (MNB). The performance evaluation revealed that XGBoost consistently outperformed the other models across all five metrics, effectively predicting drug resistance.
XGBoost, unlike other ML models, effectively manages model complexity through a sparsity-aware split-finding algorithm and regularization techniques (L1 and L2). Its gradient-boosting framework and use of multiple trees allow XGBoost to capture intricate data patterns while mitigating potential overfitting, resulting in better generalization and accuracy on unseen data. Moreover, XGBoost builds trees sequentially while parallelizing branch creation within each tree, exploiting modern CPUs’ multi-threading capabilities, which makes it a practical choice for real-world applications.
Compared to models like DT and k-NN, XGBoost is less susceptible to overfitting. While models like CatBoost and GBC are similar to XGBoost, XGBoost performs better due to its optimization and regularization techniques. However, CatBoost may handle categorical features more effectively. Though k-NN and SVM are faster and simpler to train, their performance was lower than that of XGBoost because they do not capture non-linear relationships as effectively.
Despite its advantages, XGBoost requires more computational power and memory than models like MNB or k-NN. Additionally, XGBoost can be sensitive to noisy data without proper regularization or early stopping, potentially leading to overfitting.
XGBoost is widely used for predicting the likelihood of diseases, including chronic conditions and infectious diseases, and for analyzing patient data and biomarkers. Furthermore, XGBoost can be used to analyze the genetic variations that contribute to diseases. For example, the researchers in Ref. [39] evaluated the performance of ML models in predicting AMR in E. coli using datasets from England and several African countries. XGBoost’s versatility makes it suitable for various machine learning tasks, including classification, regression, and ranking, across multiple domains [40,41,42].

4. Conclusions

This research utilizes ML models to predict the antimicrobial resistance of E. coli against four antibiotics. Extensive experimentation with various ML methods demonstrated commendable performance, effectively predicting antimicrobial resistance and outperforming related state-of-the-art models. The approach proposed in this study could be instrumental in developing a tool for the virtual screening of compound molecules, prioritizing drug candidates that are less likely to be resisted by E. coli, thereby streamlining the development of more effective antimicrobial agents.
In future research, additional antibiotics that depict AMR could be explored to enhance the use of ML algorithms for predicting AMR. These include incorporating more antimicrobial resistance knowledge into the selection of ML classifiers, and exploring novel approaches for dealing with missing data, further improving the interpretability of the ML model. Future work can also focus on employing advances in medicinal chemistry to synthesize new compounds, thereby generating richer datasets to enable more robust and predictive models. Also, integrating additional data sources, such as genomic data or patient metadata, could further enhance the predictive power of ML prediction models. Additionally, future work could explore the implementation of advanced optimization techniques for XGBoost, as well as its integration into deep learning-based models, specifically for single-drug resistance prediction. Such advancements could enhance its predictive performance and achieve state-of-the-art benchmarks.

Author Contributions

Conceptualization, P.D.G. and S.N.N.; methodology, P.D.G., S.N.N. and G.A.A.; software, P.D.G. and G.A.A.; validation, J.-D.K.; formal analysis, P.D.G. and G.A.A.; investigation, P.D.G. and S.M.; resources, P.D.G. and S.M.; data curation, P.D.G.; writing—original draft preparation, P.D.G.; writing—review and editing, P.D.G., S.N.N., S.M. and J.-D.K.; visualization, P.D.G.; supervision, J.-D.K.; project administration, J.-D.K.; funding acquisition, J.-D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW, supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) in 2024 (2024-0-00023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used are publicly available at https://doi.org/10.1371/journal.pcbi.1006258.s010 and https://github.com/YunxiaoRen/deep_transfer_learning_AMR/ (accessed on 15 November 2023).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. World Health Organization. WHO Outlines 40 Research Priorities on Antimicrobial Resistance. Available online: https://www.who.int/news/item/22-06-2023-who-outlines-40-research-priorities-on-antimicrobial-resistance (accessed on 25 March 2024).
  2. Centers for Disease Control and Prevention. Antimicrobial Resistance. Available online: https://www.cdc.gov/antimicrobial-resistance/ (accessed on 25 March 2024).
  3. Murray, C.J.; Ikuta, K.S.; Sharara, F.; Swetschinski, L.; Aguilar, G.R.; Gray, A.; Han, C.; Bisignano, C.; Rao, P.; Wool, E.; et al. Global burden of bacterial antimicrobial resistance in 2019: A systematic analysis. Lancet 2022, 399, 629–655. [Google Scholar] [CrossRef] [PubMed]
  4. Aalinezhad, M.; Rezaei, M.H.; Alikhani, F.; Akbari, P.; Soleimani, S.; Hakamifard, A. Relationship between CT Severity Score and Capillary Blood Oxygen Saturation in Patients with COVID-19 Infection. Indian J. Crit. Care Med. 2021, 25, 279–283. [Google Scholar] [CrossRef] [PubMed]
  5. Nwobodo, D.C.; Ugwu, M.C.; Anie, C.O.; Al-Ouqaili, M.T.S.; Ikem, J.C.; Chigozie, U.V.; Saki, M. Antibiotic resistance: The challenges and some emerging strategies for tackling a global menace. J. Clin. Lab. Anal 2022, 36, e24655. [Google Scholar] [CrossRef]
  6. Azabo, R.; Dulle, F.; Mshana, S.E.; Matee, M.; Kimera, S. Antimicrobial use in cattle and poultry production on occurrence of multidrug resistant Escherichia coli. A systematic review with focus on sub-Saharan Africa. Front. Vet. Sci. 2022, 9, 1000457. [Google Scholar]
  7. Ali, T.; Ahmed, S.; Aslam, M. Artificial Intelligence for Antimicrobial Resistance Prediction: Challenges and Opportunities towards Practical Implementation. Antibiotics 2023, 12, 523. [Google Scholar] [CrossRef]
  8. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. 2007, 3, 1–13. [Google Scholar] [CrossRef]
  9. Tarekegn, A.N.; Giacobini, M.; Michalak, K. A review of methods for imbalanced multi-label classification. Pattern Recognit. 2021, 118, 107965. [Google Scholar] [CrossRef]
  10. Feucherolles, M.; Nennig, M.; Becker, S.L.; Martiny, D.; Losch, S.; Penny, C.; Cauchie, H.M.; Ragimbeau, C. Investigation of MALDI-TOF Mass Spectrometry for Assessing the Molecular Diversity of Campylobacter jejuni and Comparison with MLST and cgMLST: A Luxembourg One-Health Study. Diagnostics 2021, 11, 1949. [Google Scholar] [CrossRef]
  11. Wang, G.; Vaisman, I.I.; van Hoek, M.L. Machine Learning Prediction of Antimicrobial Peptides. In Computational Peptide Science: Methods and Protocols; Springer: New York, NY, USA, 2022; Volume 2405, pp. 1–37. [Google Scholar]
  12. Yang, Y.; Niehaus, K.E.; Walker, T.M.; Iqbal, Z.; Walker, A.S.; Wilson, D.J.; Peto, T.E.; Crook, D.W.; Smith, E.G.; Zhu, T.; et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics 2018, 34, 1666–1671. [Google Scholar] [CrossRef]
  13. Kouchaki, S.; Yang, Y.; Walker, T.M.; Walker, A.S.; Wilson, D.J.; Peto, T.E.A.; Crook, D.W.; CRyPTIC Consortium; Clifton, D.A. Application of machine learning techniques to tuberculosis drug resistance analysis. Bioinformatics 2019, 35, 2276–2282. [Google Scholar] [CrossRef]
  14. Söylemez, Ü.G.; Yousef, M.; Kesmen, Z.; Büyükkiraz, M.E.; Bakir-Gungor, B. Prediction of Linear Cationic Antimicrobial Peptides Active against Gram-Negative and Gram-Positive Bacteria Based on Machine Learning Models. Appl. Sci. 2022, 12, 3631. [Google Scholar] [CrossRef]
  15. Ren, Y.; Chakraborty, T.; Doijad, S.; Falgenhauer, L.; Falgenhauer, J.; Goesmann, A.; Schwengers, O.; Heider, D. Deep Transfer Learning Enables Robust Prediction of Antimicrobial Resistance for Novel Antibiotics. Antibiotics 2022, 11, 1611. [Google Scholar] [CrossRef] [PubMed]
  16. Ren, Y.; Chakraborty, T.; Doijad, S.; Falgenhauer, L.; Falgenhauer, J.; Goesmann, A.; Schwengers, O.; Heider, D. Multi-label classification for multi-drug resistance prediction of Escherichia coli. Comput. Struct. Biotechnol. J. 2022, 20, 1264–1270. [Google Scholar] [CrossRef] [PubMed]
  17. Okerinde, A.; Shamir, L.; Hsu, W.; Theis, T.; Nafi, N. eGAN: Unsupervised approach to class imbalance using transfer learning. In Computer Analysis of Images and Patterns, Proceedings of the 19th International Conference, CAIP 2021, Virtual Event, 28–30 September 2021; Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M., Eds.; Springer International Publishing: Cham, Switzerland, 2021. [Google Scholar]
  18. Minvielle, L.; Atiq, M.; Peignier, S.; Mougeot, M. Transfer Learning on Decision Tree with Class Imbalance. In Proceedings of the IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 1003–1010. [Google Scholar]
  19. Gao, Y.; Cui, Y. Author Correction: Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nat. Commun. 2020, 11, 6444. [Google Scholar] [CrossRef]
  20. Moradigaravand, D.; Palm, M.; Farewell, A.; Mustonen, V.; Warringer, J.; Parts, L. Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. PLOS Comput. Biol. 2018, 14, e1006258. [Google Scholar] [CrossRef]
  21. Ren, Y.; Chakraborty, T.; Doijad, S.; Falgenhauer, L.; Falgenhauer, J.; Goesmann, A.; Hauschild, A.-C.; Schwengers, O.; Heider, D. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics 2021, 38, 325–334. [Google Scholar] [CrossRef]
  22. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve years of SAMtools and BCFtools. GigaScience 2021, 10, giab008. [Google Scholar] [CrossRef]
  23. Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T.; et al. The variant call format and VCFtools. Bioinformatics 2011, 27, 2156–2158. [Google Scholar] [CrossRef]
  24. Breiman, L.; Friedman, J.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Abingdon, UK, 2017. [Google Scholar]
  25. Collins, M.; Schapire, R.E.; Singer, Y. Logistic Regression, AdaBoost and Bregman Distances. Mach. Learn. 2002, 48, 253–285. [Google Scholar] [CrossRef]
  26. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar] [CrossRef]
  27. Burges, C.J.C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
  28. Mangasarian, O.L.; David, R.M. Lagrangian support vector machines. J. Mach. Learn. Res. 2001, 1, 161–177. [Google Scholar]
  29. Arana-Daniel, N.; Gallegos, A.A.; López-Franco, C.; Alanís, A.Y.; Morales, J.; López-Franco, A. Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures. Evol. Bioinform. 2016, 12, EBO.S40912–302. [Google Scholar] [CrossRef] [PubMed]
  30. Xu, J. Multi-label Lagrangian support vector machine with random block coordinate descent method. Inf. Sci. 2016, 329, 184–205. [Google Scholar] [CrossRef]
  31. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
  32. Lewis, C.M.; Knight, J. Introduction to Genetic Association Studies. Cold Spring Harb. Protoc. 2012, 2012, 297–306. [Google Scholar] [CrossRef] [PubMed]
  33. Lowrie, E.G.; Lew, N.L. Death Risk in Hemodialysis Patients: The Predictive Value of Commonly Measured Variables and an Evaluation of Death Rate Differences Between Facilities. Am. J. Kidney Dis. 1990, 15, 458–482. [Google Scholar] [CrossRef]
  34. Langer, D.L.; Van der Kwast, T.H.; Evans, A.J.; Trachtenberg, J.; Wilson, B.C.; Haider, M.A. Prostate cancer detection with multi-parametric MRI: Logistic regression analysis of quantitative T2, diffusion-weighted imaging, and dynamic contrast-enhanced MRI. J. Magn. Reson. Imaging Off. J. Int. Soc. Magn. Reson. Med. 2009, 30, 327–334. [Google Scholar] [CrossRef]
  35. Rennie, J.D.; Shih, L.; Teevan, J.; Karger, D.R. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Los Angeles, CA, USA, 23–24 June 2003. [Google Scholar]
  36. Maxwell, A.; Li, R.; Yang, B.; Weng, H.; Ou, A.; Hong, H.; Zhou, Z.; Gong, P.; Zhang, C. Deep learning architectures for multi-label classification of intelligent health risk prediction. BMC Bioinform. 2017, 18, 523. [Google Scholar] [CrossRef]
  37. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  38. Schapire, R.E.; Singer, Y. BoosTexter: A Boosting-based System for Text Categorization. Mach. Learn. 2000, 39, 135–168. [Google Scholar] [CrossRef]
  39. Nsubuga, M.; Galiwango, R.; Jjingo, D.; Mboowa, G. Generalizability of machine learning in predicting antimicrobial resistance in E. coli: A multi-country case study in Africa. BMC Genom. 2024, 25, 287. [Google Scholar] [CrossRef]
  40. Amjad, M.; Ahmad, I.; Ahmad, M.; Wróblewski, P.; Kamiński, P.; Amjad, U. Prediction of Pile Bearing Capacity Using XGBoost Algorithm: Modeling and Performance Evaluation. Appl. Sci 2022, 12, 2126. [Google Scholar] [CrossRef]
  41. Can, R.; Kocaman, S.; Gokceoglu, C.A. Comprehensive Assessment of XGBoost Algorithm for Landslide Susceptibility Mapping in the Upper Basin of Ataturk Dam, Turkey. Appl. Sci 2021, 11, 4993. [Google Scholar] [CrossRef]
  42. Asselman, A.; Khaldi, M.; Aammou, S. Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interact. Learn. Environ. 2021, 31, 3360–3379. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the AMR prediction approach. The architecture constitutes three main stages in the prediction of AMR. The collection stage involves gathering the whole-genome sequence and drug phenotypes of E. coli from various environmental and biological sources and databases. The next stage involves data preprocessing, which consists of cleaning the datasets, encoding the sequences, and normalizing the data using bcftools. In the prediction stage, various machine learning algorithms are used to model the data to learn and make predictions from the features. The performance of the models is then evaluated using accuracy, F1-score, recall, precision, and hamming loss. Consequently, the output determines the antimicrobial resistance or susceptibility of the four antibiotics (GEN, CTX, CIP, and CTZ).
Table 1. Hyperparameter settings used in the experiment.
| Models | Parameters | Description | Value |
|---|---|---|---|
| DT | Max_depth | Maximum depth of the tree | 0.1 |
| | Min_samples_split | Minimum number of samples required to split a node | 6 |
| | Min_samples_leaf | Minimum number of samples required to be at a leaf node | 100 |
| RF | N_estimators | Number of trees in the forest | 100 |
| | Max_depth | Maximum depth of the tree | None |
| XGBoost | Learning rate | Step size shrinkage used in updating the weights | 0.1 |
| | Depth | Maximum depth of a tree | 6 |
| | N_estimators | Number of trees in the forest | 100 |
| CatBoost | Iterations | Number of boosting rounds | 100 |
| | Learning rate | Step size shrinkage used in updating the weights | 0.1 |
| | Depth | Maximum depth of a tree | 6 |
| AdaBoost | Learning rate | Step size shrinkage used in updating the weights | 0.1 |
| GBC | Learning rate | Step size shrinkage used in updating the weights | 0.1 |
| ETC | N_estimators | Number of trees in the forest | 100 |
| | Max_depth | Maximum depth of the tree | None |
| | Min_samples_split | Minimum number of samples required to split a node | 2 |
| | Min_samples_leaf | Minimum number of samples required to be at a leaf node | Auto |
| k-NN | Solver | Optimization algorithm used for training | Adam |
| SVM | C | Regularization parameter | 1 |
| | kernel | Specifies the kernel type used in the algorithm | rbf |
| | gamma | Kernel coefficient for rbf to determine influence range of a single data point | scale |
| LR | Max_iter | Maximum number of iterations for solvers to converge | 1000 |
| MNB | Learning rate | Step size shrinkage used in updating the weights | 0.1 |
Table 2. Performance metrics for each ML model based on 10-fold cross-validation on the test set.
| Models | Accuracy | Precision | Recall | F1-score | HAM |
|---|---|---|---|---|---|
| DT | 0.849 | 0.850 | 0.849 | 0.849 | 0.151 |
| RF | 0.885 | 0.887 | 0.885 | 0.885 | 0.115 |
| XGBoost | 0.890 | 0.891 | 0.890 | 0.890 | 0.110 |
| CatBoost | 0.878 | 0.882 | 0.878 | 0.878 | 0.122 |
| AdaBoost | 0.871 | 0.872 | 0.871 | 0.871 | 0.129 |
| GBC | 0.883 | 0.884 | 0.883 | 0.883 | 0.117 |
| ETC | 0.886 | 0.888 | 0.886 | 0.886 | 0.114 |
| k-NN | 0.884 | 0.845 | 0.844 | 0.845 | 0.156 |
| SVM | 0.842 | 0.846 | 0.842 | 0.842 | 0.158 |
| LR | 0.875 | 0.875 | 0.875 | 0.874 | 0.125 |
| MNB | 0.587 | 0.629 | 0.587 | 0.565 | 0.413 |
Table 3. Performance quantitative comparison with previous related research.
| Authors | Models | Accuracy | Precision | Recall | F1-score | HAM |
|---|---|---|---|---|---|---|
| Yunxiao et al. [15] | Deep TL | 0.915 | 0.890 | 0.930 | 0.830 | - |
| Yunxiao et al. [16] | ECC (RF) | 0.72 | 0.890 | 0.94 | 0.98 | 0.11 |
| Yunxiao et al. [21] | CNN | 0.870 | 0.710 | 0.89 | - | - |
| Proposed Approach | XGBoost | 0.890 | 0.891 | 0.890 | 0.890 | 0.110 |

