Improving the Diagnosis of Systemic Lupus Erythematosus with Machine Learning Algorithms Based on Real-World Data

Park, Meeyoung

doi:10.3390/math12182849

Department of Computer Engineering, Kyungnam University, 7 Gyeongnamdaehak-ro, Masanhappo-gu, Changwon-si 51767, Republic of Korea

Mathematics 2024, 12(18), 2849; https://doi.org/10.3390/math12182849

Submission received: 5 August 2024 / Revised: 9 September 2024 / Accepted: 12 September 2024 / Published: 13 September 2024

(This article belongs to the Special Issue Statistical Methods in Bioinformatics and Health Informatics)

Abstract

This study addresses the diagnostic challenges of Systemic Lupus Erythematosus (SLE), an autoimmune disease with a complex etiology and varied symptoms. The antinuclear antibody (ANA) test, currently the primary diagnostic tool for SLE, exhibits high sensitivity but low specificity, often leading to inaccurate diagnoses. To enhance diagnostic precision, we propose integrating machine learning algorithms with existing clinical classification guidelines, potentially reducing diagnostic errors and healthcare costs. We analyzed real-world data from a cohort of 24,990 patients over a 10-year period at two hospitals, excluding those previously diagnosed with SLE. Patients were categorized into three groups: negative ANA, positive ANA with non-SLE, and positive ANA with SLE. Feature selection was conducted to identify key factors influencing SLE diagnosis, and machine learning algorithms were employed to develop a clinical decision support system (CDSS). Performance analysis of three machine learning algorithms (decision tree, random forest, and gradient boosting) based on feature sets of 10, 20, and all available features revealed accuracy rates of 85%, 88%, and 87%, respectively, for the 20-feature set. The proposed system, utilizing real-world medical data, demonstrated modest performance in SLE diagnosis, highlighting the potential of machine learning-based CDSS in real clinical settings.

Keywords:

Systemic Lupus Erythematosus; antinuclear antibody (ANA) test; clinical decision support system; machine learning; common data model

MSC:

68U35

1. Introduction

Systemic Lupus Erythematosus (SLE) represents an enigmatic autoimmune condition with a poorly understood cause [1]. The disease is characterized by a loss of immunological self-tolerance and dysfunctional immune reactions, which are driven by intricate interactions between genetic predispositions and environmental triggers. This culminates in the generation of harmful autoantibodies, the accumulation of immune complexes, and the ensuing inflammatory responses, ultimately causing damage across multiple organ systems. These features constitute the core pathogenesis of SLE. Given the disease’s erratic and frequently irreversible clinical trajectory, as well as its potential for considerable morbidity and mortality, timely and accurate diagnosis is essential.

Nevertheless, diagnosing SLE can be particularly challenging, due to its varied and often elusive clinical symptoms, especially during the initial stages of the condition [2,3]. Furthermore, no single test is sufficient for a definitive diagnosis of SLE. Various classification guidelines have been developed over the years, including the 1997 criteria by the American College of Rheumatology (ACR) [4], the 2012 criteria by the Systemic Lupus International Collaborating Clinics (SLICC) [5], and the 2019 criteria co-authored by the European League Against Rheumatism (EULAR) and ACR [6]. Each of these sets of criteria relies on a combination of clinical observations and laboratory tests. For instance, to meet the specifications of the 2012 SLICC and 2019 EULAR/ACR criteria, specific immunological tests like autoantibodies and complement assays are indispensable. Although these criteria were not initially designed for diagnosing individual SLE cases, they serve the vital purpose of facilitating more uniform patient selection in epidemiological research [7].

The ANA test detects a broad spectrum of autoantibodies directed against various intracellular antigens, most notably within the cell nucleus. This test is extensively used for the diagnosis of multiple autoimmune disorders, such as SLE, Sjögren’s Syndrome, Systemic Sclerosis, and Autoimmune Hepatitis. Remarkably, ANA is positive in over 95% of SLE patients, establishing it as the foremost screening tool for this condition. However, the test suffers from low specificity, often yielding positive results in patients afflicted with conditions unrelated to autoimmune diseases, including malignancies and infectious diseases. Additionally, up to 30% of ostensibly healthy individuals—particularly among the elderly—can exhibit false-positive ANA results [8]. In alignment with this, the 2013 ‘Choosing Wisely’ campaign by the American College of Rheumatology recommended against ANA and its sub-serologies testing in patients presenting with non-specific symptoms, such as fatigue or myalgia, unless there is clinical suspicion of an ANA-positive autoimmune disease [9]. Nevertheless, clinical practice often witnesses the imprudent use of ANA tests, leading to avoidable healthcare expenditure and the associated socioeconomic impacts [10,11].

SLE diagnosis draws on a variety of clinical manifestations and laboratory tests. While clinical guidelines provide a structured approach to diagnosing SLE, machine learning (ML) can complement this process by learning from large volumes of data, with the potential to improve accuracy, support earlier detection, and inform personalized treatment. These advantages make ML a promising tool in the diagnosis and management of SLE, complementing traditional methods and helping to address some of the challenges associated with this complex autoimmune disease. Recently, Zhou et al. conducted a systematic review of 10 studies on machine learning algorithms for diagnosing SLE [12]. Among these, Murray et al. applied logistic regression to large, registered datasets, to classify patients into categories of “definite SLE”, “probable SLE”, “possible SLE”, and “not SLE” [13].

Our study aims to develop the initial phase of a Clinical Decision Support System (CDSS) for SLE diagnosis by leveraging key diagnostic features and applying powerful machine learning algorithms to improve decision-making. Furthermore, we employed the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) as the standardized database for data analysis. Recognized globally, the OMOP-CDM framework aids in the transformation of heterogeneous healthcare data from the real world into a standardized schema [14]. Research utilizing the OMOP-CDM promotes collaboration among researchers across different institutions, while ensuring consistency in clinical terminology and database architecture. We believe that this preliminary study has the potential to lead to the development of a CDSS that can seamlessly integrate data from multiple healthcare institutions, leveraging real-world clinical data to enhance diagnostic accuracy. Additionally, it has the potential to assist clinicians in making more efficient SLE diagnostic decisions, thereby saving time and reducing costs in clinical practice.

2. Materials and Methods

2.1. Study Population

In this multi-center, observational study, we examined a cohort of 25,290 patients who underwent their first ANA test during hospital visits between March 2011 and December 2020 at Pusan National University Hospital (PNUH, 179 Gudeok-ro, Seo-gu, Busan, 49241, South Korea) and Pusan National University Yangsan Hospital (PNUYH, 20 Geumo-ro, Yangsan-si, Gyeongsangnam-do, 50612, South Korea) and had not been previously diagnosed with SLE. The data for this study were directly extracted from the OMOP-CDM, which was derived from Electronic Health Records (EHR) at these tertiary referral centers in Southern Korea.

In a recent development, PNUH transitioned the real-world data (RWD) for patients—including EHR—to the OMOP-CDM. For this study, retrospective patient data were extracted from specific OMOP-CDM tables: “Person” for demographic information, “Measurement” for laboratory results, and “Condition” for diseases diagnosed during hospital visits, all of which were mapped to their respective local EHR codes and terminologies. To maintain data privacy, identifying details like social security numbers and hospital visit IDs were replaced with randomized identifiers during the conversion to the OMOP-CDM standard database, followed by a thorough validation process.

The study was conducted in accordance with the Declaration of Helsinki, reported according to the Strengthening the Reporting of Observational Studies in Epidemiology statement, and approved by the Institutional Review Boards of PNUH and PNUYH (IRB No. H-1909-020-083 for PNUH, 2019/09/23, and 05-2022-092 for PNUYH, 2022/04/14). Patient consent was waived because the data for this study were de-identified and based on longitudinal observational health data.

2.2. Study Designs and Data Analysis

For this study, we designated the index date as the day on which each patient received their inaugural ANA test, following the start of their hospital visit. Demographic data such as age and gender were captured at the time of this first ANA test. Additionally, initial laboratory test outcomes—including various ANA sub-serologies and inflammatory markers—were collected within a one-week window around the index date. For the ANA test, serum levels were quantified using indirect immunofluorescence in HEp-2 cells, and we classified ANA titers into predefined categories, considering titers ≥1:80 as positive. Sub-serologies were determined through enzyme-linked immunosorbent assays. A level of IgG anti-dsDNA antibody ≥40 IU/mL was considered positive, based on the lab reference ranges at PNUH and PNUYH.

In addition to its role in diagnosing SLE, ANA testing is also crucial for identifying a range of other autoimmune diseases, such as autoimmune hepatitis, idiopathic inflammatory myopathies, juvenile rheumatoid arthritis, and Sjögren’s syndrome. Therefore, these conditions were considered in our analysis. Relevant data on SLE and other autoimmune diseases were extracted from the “Condition” table in the OMOP-CDM database, which is mapped to ICD-10 and SNOMED-CT codes. Diagnosing SLE and assessing inflammation involve a combination of blood tests, such as C-Reactive Protein (CRP) and the complete blood count (CBC), including white blood cells (WBC), neutrophils, lymphocytes, and platelets. Because neutrophils and platelets are critical for evaluating the degree of inflammation, the Neutrophil-to-Lymphocyte Ratio (NLR) and the Platelet-to-Lymphocyte Ratio (PLR) were also evaluated as inflammation markers.
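The two derived markers are simple ratios of CBC counts. The following is a minimal sketch of that arithmetic, not the study's own code; the function name and the example counts are hypothetical and serve only to illustrate the definitions of NLR and PLR.

```python
def inflammation_ratios(neutrophils: float, lymphocytes: float, platelets: float) -> dict:
    """Return NLR and PLR from absolute counts expressed in the same unit."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return {
        "NLR": neutrophils / lymphocytes,  # neutrophil-to-lymphocyte ratio
        "PLR": platelets / lymphocytes,    # platelet-to-lymphocyte ratio
    }


# Hypothetical counts (10^3 cells per microliter), chosen only to show the arithmetic.
ratios = inflammation_ratios(neutrophils=4.0, lymphocytes=2.0, platelets=250.0)
print(ratios)
```

Because both ratios share the lymphocyte count as denominator, a low lymphocyte count raises NLR and PLR together.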

The Observational Medical Outcomes Partnership (OMOP) CDM is a medical data standard adopted by the Observational Health Data Sciences and Informatics consortium and used to analyze data systematically in South Korea as well as in North America, Europe, and the rest of Asia [14]. The CDM standardizes diverse hospital data into a unified database format, enabling collaborative research across medical institutions worldwide. We gathered structured data from the OMOP CDM, such as demographic information and SLE-related laboratory results.

For subsequent data analyses, cohorts from both healthcare centers were combined and divided into three separate groups according to their ANA test results and SLE diagnosis status. As illustrated in Figure 1, the final cohort comprised 24,931 patients: Group 1 (ANA-negative, n = 17,263), Group 2 (ANA-positive without SLE, n = 7392), and Group 3 (ANA-positive and diagnosed with SLE, n = 276).
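The grouping rule can be sketched as a small function. This is an illustrative sketch rather than the study's extraction code: the function name and the encoding of a titer as its dilution denominator are assumptions, and placing every ANA-negative patient in Group 1 is one reading of the group definitions. The positivity threshold (≥1:80) comes from Section 2.2 and the group sizes from Table 1.

```python
ANA_POSITIVE_DILUTION = 80  # titers >= 1:80 are treated as positive (Section 2.2)


def assign_group(ana_dilution: int, has_sle: bool) -> int:
    """Map a patient to Group 1, 2, or 3.

    ana_dilution is the titer denominator (e.g. 160 for 1:160); this encoding
    is an assumption made for the sketch.
    """
    if ana_dilution < ANA_POSITIVE_DILUTION:
        return 1  # ANA-negative
    return 3 if has_sle else 2  # ANA-positive with / without an SLE diagnosis


# Group sizes as reported in Table 1; they should add up to the cohort total.
group_sizes = {1: 17263, 2: 7392, 3: 276}
total = sum(group_sizes.values())
print(assign_group(40, False), assign_group(80, False), assign_group(320, True), total)
```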

2.3. Statistical Analysis

The baseline characteristics, including demographic information, ANA titers, ANA serological tests, and inflammatory markers, were characterized in the cross-sectional analysis. Categorical variables were converted into nominal values, while continuous variables were maintained, excluding any erroneous data points. Missing data were handled using the Fully Conditional Specification (FCS) method, facilitated by the MICE package in R, following the guidelines of Van Buuren and Groothuis-Oudshoorn [15,16]. Chi-square tests were used to compare categorical variables. Given the large size disparity between Group 3 and the other groups (Group 1: 17,263; Group 2: 7392; Group 3: 276), the Kruskal–Wallis test was chosen for comparing continuous data across these groups. The Kruskal–Wallis test does not assume a normal distribution of the data, and is less affected by unequal sample sizes. Statistical significance was set at a p-value of less than 0.05. All statistical procedures were conducted using R (version 4.0.5) and RStudio (version 1.4.1).
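To make the rank-based comparison concrete, the sketch below computes the Kruskal–Wallis H statistic for three groups. It is written in Python purely for illustration (the study itself used R), it omits the tie correction that library routines such as `scipy.stats.kruskal` apply, and the sample values are hypothetical. For exactly three groups the statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(−H/2).

```python
import math
from itertools import chain


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)


# Hypothetical marker values for three groups.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p_value = math.exp(-h / 2)  # chi-square upper tail, valid for 2 degrees of freedom only
print(round(h, 3), round(p_value, 4))
```

Because the test works on ranks rather than raw values, one small group (such as Group 3) and skewed laboratory distributions do not violate its assumptions.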

2.4. Machine Learning Analysis

To develop our diagnostic decision support system, three algorithms, Decision Tree (DT), Random Forest (RF), and Gradient Boosting Machine (GBM), were chosen for their complementary strengths in building a robust system. The Decision Tree algorithm generates interpretable classification rules by automatically learning from data, identifying specific conditions, and constructing a binary tree for prediction [17]. It was selected for its ability to produce clear, interpretable classification rules, making it accessible to medical professionals and patients who may not be familiar with complex machine learning models. Random Forest was included for its ensemble approach, which enhances predictive accuracy by aggregating multiple decision trees generated through bagging [18]. Finally, GBM was chosen for its effectiveness in handling imbalanced data by gradually improving predictions through a boosting technique, making it well-suited for medical datasets [19,20,21]. To validate the stability of the machine learning models, nested cross-validation with stratified N-fold cross-validation was used. Nested cross-validation is an approach that mitigates overfitting and reduces bias in performance evaluation [22]. Hyperparameters were optimized using the GridSearchCV() function from Scikit-Learn in Python [23]. Finally, to measure the performance of the ML algorithms for each feature set, four evaluation indicators (precision, recall, F1-score, and accuracy) were used.
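The four evaluation indicators can be written out from confusion-matrix counts. This is a minimal sketch for a single positive class (for example, SLE versus non-SLE); the function name and the counts are hypothetical and do not come from the study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1-score, and accuracy for one positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts chosen to illustrate the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print(metrics)
```

With a rare positive class, accuracy can stay high even when recall is poor, which is why precision, recall, and F1-score are reported alongside it.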

3. Results

3.1. Analysis Results of Baseline Characteristics

The baseline demographic, clinical, and serological characteristics of the study participants are listed in Table 1. Out of the 24,931 subjects who underwent ANA testing, 7668 (equivalent to 30.8% and encompassing Groups 2 and 3) returned positive results. Only 276 participants, representing 1.1% of those tested and categorized within Group 3, received a diagnosis of SLE. Additionally, 1008 (4%) participants had at least one autoimmune disease, including SLE (n = 276), autoimmune hepatitis (n = 459), idiopathic inflammatory myopathies (n = 65), juvenile rheumatoid arthritis (n = 45), overlap syndrome (n = 41), Raynaud’s disease (n = 244), Sjögren’s syndrome (n = 295), and systemic sclerosis (n = 167). A statistical comparison of the frequency of autoantibodies is depicted in Figure 2.

Significant variations were observed across Groups 1–3 in terms of age distribution and gender ratios. Participants within Group 3, those diagnosed with SLE, were predominantly younger and more often female, compared to the other groups. Furthermore, a statistically significant disparity was noted in the occurrence rates of various autoimmune conditions among the groups, as detailed in Table 1. Autoimmune hepatitis was most prevalent in Group 2, while Group 3 exhibited the highest incidence rates for idiopathic inflammatory myopathies, overlap syndrome, Sjögren’s syndrome, and systemic sclerosis.

3.2. Analysis Results of Lab Tests and SLE

Next, we analyzed the association of the ANA test with SLE and other autoimmune diseases, restricting the analysis to ANA-positive patients (Groups 2 and 3). Table 2 shows the prevalence of autoimmune diseases and SLE among these 7668 ANA-positive patients. Autoimmune diseases were present in 1008 (13.2%) of the 7668 individuals, and SLE was diagnosed in 276 (3.6%). The occurrence of both SLE and autoimmune diseases increased as ANA titers increased, and a trend analysis across the titer groups yielded a statistically significant p-value.

Inflammation is a significant component in the assessment and management of SLE. We therefore compared the inflammatory markers described in Section 2.2 across the groups: CRP, the CBC components (WBC, neutrophils, lymphocytes, and platelets), and the derived NLR and PLR.

Table 3 and Figure 3 show the comparative analysis of laboratory findings across Groups 1–3. Significant differences were observed in variables such as CRP levels, WBC counts, neutrophil counts, lymphocyte counts, platelet counts, NLR, and PLR, among the different groups. Notably, individuals in Group 3 exhibited elevated levels of CRP, NLR, and PLR, while showing reduced counts of WBC, neutrophils, lymphocytes, and platelets, compared to their counterparts in Groups 1 and 2.

3.3. Feature Selection

Although clinical guidelines exist, machine learning may improve diagnostic accuracy and support earlier diagnosis of SLE. Because real-world data (RWD) are often incomplete, we preprocessed the dataset so as to retain as much clinical information as possible. The dataset comprises 24,931 samples (rows) × 27 features (columns). Rather than discarding records with missing values, we imputed them using predictive mean matching from the Multivariate Imputation by Chained Equations (MICE) package in R. We then applied standard scaling to improve the performance of the ML algorithms. Feature selection was based on the feature importance metrics from each ML model, to identify key factors for SLE diagnosis. Restricting the number of features also reflects the fact that clinicians generally check fewer than 50 factors when making clinical decisions.

Therefore, we sought the most effective number of features by testing the algorithms with the top 10 features, the top 20 features, and all 27 features. The top ten features common to the algorithms are shown in Table 4.

Among the selected features, the top feature is the ANA test, which represents the most important clinical diagnosis factor for SLE diagnosis. Moreover, the second and third factors were PLR and Anti-dsDNA IgG, which represent the inflammation indicator and important clinical factor, respectively, in the clinical field. Figure 4 compares the importance of these features across the different algorithms.
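The two preprocessing steps described in this section, standard scaling and importance-based selection of the top k features, can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the helper names are hypothetical, and the importance values are invented. Only the ordering of the first three features (ANA, PLR, Anti-dsDNA IgG) follows the text.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]


def top_k_features(importances: dict, k: int) -> list:
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores; only the ordering of the top three follows the text.
importances = {"ANA": 0.40, "PLR": 0.15, "Anti-dsDNA IgG": 0.12, "CRP": 0.05, "Age": 0.03}
selected = top_k_features(importances, k=3)
scaled = standardize([1.0, 2.0, 3.0])
print(selected, [round(z, 3) for z in scaled])
```

In a model-based workflow the scores would come from each fitted model (for tree ensembles in Scikit-Learn, the `feature_importances_` attribute), and the selection would be repeated for k = 10 and k = 20.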

3.4. Hyperparameter Tuning

To validate the stability of our machine learning models, we applied nested cross-validation (10 outer iterations and 3 inner iterations) with stratified 10-fold cross-validation. The training process was therefore run 300 times (10 × 3 × 10) for each model. The optimal hyperparameters for the models were determined using the GridSearchCV function from the Scikit-Learn module in Python. As illustrated in Table 5, the tree depth (max_depth) and the minimum number of samples required to split a node (min_samples_split) were adjusted for both the Decision Tree and Random Forest algorithms, to reduce the risk of overfitting. For the Random Forest method, the number of trees (n_estimators) was also fine-tuned. In the case of the Gradient Boosting Machine (GBM), we optimized the tree depth (max_depth), the number of estimators (n_estimators), and the learning rate applied during training.
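One common way to set up nested cross-validation with Scikit-Learn is to wrap a `GridSearchCV` (inner loop, hyperparameter tuning) inside `cross_val_score` (outer loop, performance estimation). The sketch below follows that pattern on synthetic data. The parameter grid, the fold counts (10 outer, 3 inner), and the data are assumptions for illustration; they are not the grid from Table 5, and the loop structure may differ from the one the study used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the clinical dataset (hypothetical).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical search grid over the two Decision Tree parameters named in the text.
param_grid = {"max_depth": [3, 5], "min_samples_split": [2, 10]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # evaluation

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=inner_cv, scoring="accuracy"
)

# Each outer fold refits the whole grid search on its training part only,
# so the held-out part never influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean())
```

Stratified folds keep the class proportions roughly constant in every split, which matters when the positive class is as rare as SLE is in this cohort.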

3.5. Machine Learning Performance

Table 6 and Figure 5 summarize the performance of three machine learning algorithms across three different feature sets: 10, 20, and all 27 features. The evaluation metrics, including precision, recall, F1-score, and accuracy, were calculated on the test set using the optimal hyperparameters shown in Table 5.

Performance scores are reported with 95% confidence intervals (CIs), assuming a Gaussian distribution. In terms of accuracy, the score did not consistently improve with an increased number of features. For the feature sets with 10 or 20 features, the decision tree, random forest, and GBM algorithms achieved accuracy values of 85% (95% CI, 81–89%), 88% (95% CI, 84–92%), and 87% (95% CI, 83–91%), respectively. The evaluation results indicate that feature sets with 10 or 20 features performed better than the set with all 27 features.
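A Gaussian-based 95% confidence interval can be obtained from a set of cross-validation scores as the mean plus or minus a normal quantile times the standard error. The sketch below shows one such construction. The text does not specify how the intervals were computed, so the use of the standard error of fold scores is an assumption, and the scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def gaussian_ci(scores, level=0.95):
    """Normal-approximation confidence interval for the mean of the scores."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    center = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))  # z times the standard error
    return center - half_width, center + half_width


# Hypothetical accuracy scores from five evaluation folds.
fold_scores = [0.84, 0.86, 0.88, 0.90, 0.87]
low, high = gaussian_ci(fold_scores)
print(round(mean(fold_scores), 3), round(low, 4), round(high, 4))
```

With few folds the normal quantile understates the uncertainty slightly; a t-quantile would give a wider interval.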

4. Discussion

Diagnosing SLE can be challenging, due to its diverse and often elusive clinical symptoms, and no single test is sufficient for a definitive diagnosis. In this preliminary study, our goal is to develop a machine learning-based Clinical Decision Support System (CDSS) for SLE diagnosis. This system will utilize structured clinical big data and ensemble machine learning algorithms to enhance patient-specific evaluations and diagnoses, with the potential to reduce unnecessary tests.

We collected clinical big data from the standardized OMOP-CDM database across hospitals, which enables multi-center studies, despite variations in data formats. In the future, we plan to include data from additional hospitals.

Because of its low positive predictive value (PPV), the probability that an individual who tests positive actually has the condition, the ANA test is most useful when there is a strong clinical suspicion of SLE or other autoimmune conditions. It is not recommended for general screening of non-specific symptoms because of its high false-positive rate [8,11]. In our study, real-world data revealed that among 24,931 patients tested for ANA, approximately 30.8% tested positive, yet only a small percentage were diagnosed with SLE or other autoimmune diseases.
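The PPV argument can be checked directly from the counts reported in Section 3. The sketch below uses the 24,931 tested patients, the 7668 ANA-positive patients, and the 276 SLE diagnoses among them; the function name is hypothetical, and the calculation treats the recorded SLE diagnosis as the reference standard.

```python
def positive_predictive_value(true_positives: int, test_positives: int) -> float:
    """Share of test-positive individuals who actually have the condition."""
    return true_positives / test_positives


tested = 24931        # patients who underwent ANA testing
ana_positive = 7668   # Groups 2 and 3
sle_diagnosed = 276   # Group 3

positive_rate = ana_positive / tested
ppv_sle = positive_predictive_value(sle_diagnosed, ana_positive)
print(f"ANA-positive rate: {positive_rate:.1%}, PPV for SLE: {ppv_sle:.1%}")
```

Roughly 3 in 10 tested patients were ANA-positive, but fewer than 4 in 100 of those positives carried an SLE diagnosis.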

Additionally, 23% of ANA-negative individuals underwent further sub-serology tests, indicating frequent unnecessary testing in clinical settings. Our study supports previous research showing that ANA tests are often used indiscriminately [10,24,25]. For instance, Kang et al. reported a 0.7% prevalence of ANA-associated rheumatic disease, despite a 14.4% positive ANA rate in a large sample from Korean hospitals [10]. Similarly, a study from a Turkish pediatric clinic found a 27.6% positive ANA rate [24]. These findings, including our study’s low positive predictive value for diagnosing autoimmune diseases, underscore the need for clinicians to use ANA tests more selectively.

According to our RWD, the likelihood of diagnosing SLE or other autoimmune diseases increased with higher ANA titers. SLE patients were more likely to have key antibodies such as anti-dsDNA, anti-SS A, anti-SS B, and anti-Smith, and exhibited elevated levels of systemic inflammation biomarkers like CRP, NLR, and PLR, while having lower WBC, neutrophil, lymphocyte, and platelet counts. Given the low positive predictive value of ANA testing, it should be reserved for patients with a high pre-test probability of autoimmune diseases. Our findings, consistent with previous research, emphasize the need for selective use of the ANA test due to its high false-positive rate. ANA sub-serologies should be used only for patients with a positive ANA result and strong clinical suspicion of autoimmune diseases, to avoid unnecessary testing. Our study also confirmed that elevated NLR and PLR levels may indicate a higher risk of SLE, warranting further investigation.

Using OMOP-CDM for multi-center studies facilitates scalable research on medical big data, though it comes with limitations due to the dependence on disease codes and de-identified data. ICD-10 and SNOMED are widely used general disease codes, while OMOP-CDM utilizes its own set of codes. As a result, users must map each code to integrate clinical information effectively. This mapping process is time-consuming, so an efficient tool is needed.

Our machine learning-based feature selection indicates that while ANA remains a crucial factor, both PLR and anti-dsDNA should also be considered for further testing. With an accuracy of up to 88%, our CDSS shows potential as a valuable tool for diagnosing SLE. Incorporating more features did not consistently improve accuracy: feature sets of 10 or 20 features performed at least as well as the full set of 27. However, this study is limited by its data scope, which includes only two hospitals, and by the de-identified nature of OMOP-CDM data, which restricts detailed individual patient evaluations. To validate and refine the machine learning algorithm, future research should integrate data from a broader range of healthcare institutions.

Funding

This research was funded by Kyungnam University Foundation Grant 2021.

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available due to the hospital policy but are available from the corresponding author on reasonable request.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Shin, J.M.; Kim, D.; Kwon, Y.C.; Ahn, G.Y.; Lee, J.; Park, Y.; Lee, Y.K.; Lee, T.H.; Park, D.J.; Song, Y.J.; et al. Clinical and Genetic Risk Factors Associated with the Presence of Lupus Nephritis. J. Rheum. Dis. 2021, 28, 150–158.
2. Nashi, R.A.; Shmerling, R.H. Antinuclear Antibody Testing for the Diagnosis of Systemic Lupus Erythematosus. Med. Clin. N. Am. 2021, 105, 387–396.
3. Olsen, N.J.; Karp, D.R. Finding Lupus in the ANA Haystack. Lupus Sci. Med. 2020, 7, e000384.
4. Hochberg, M.C. Updating the American College of Rheumatology Revised Criteria for The Classification of Systemic Lupus Erythematosus. Arthritis Rheum. 1997, 40, 1725.
5. Petri, M.; Orbai, A.M.; Alarcon, G.S.; Gordon, C.; Merrill, J.T.; Fortin, P.R.; Bruce, I.N.; Isenberg, D.; Wallace, D.J.; Nived, O.; et al. Derivation and Validation of The Systemic Lupus International Collaborating Clinics Classification Criteria for Systemic Lupus Erythematosus. Arthritis Rheum. 2012, 64, 2677–2686.
6. Aringer, M.; Costenbader, K.; Daikh, D.; Brinks, R.; Mosca, M.; Ramsey-Goldman, R.; Smolen, J.S.; Wofsy, D.; Boumpas, D.T.; Kamen, D.L.; et al. 2019 European League Against Rheumatism/American College of Rheumatology Classification Criteria for Systemic Lupus Erythematosus. Arthritis Rheumatol. 2019, 71, 1400–1412.
7. Andrade, L.E.C.; Damoiseaux, J.; Vergani, D.; Fritzler, M.J. Antinuclear Antibodies (ANA) as a Criterion for Classification and Diagnosis of Systemic Autoimmune Diseases. J. Transl. Autoimmun. 2022, 5, 100145.
8. Waits, J.B. Rational use of laboratory testing in the initial evaluation of soft tissue and joint complaints. Prim. Care 2010, 37, 673–689.
9. Yazdany, J.; Schmajuk, G.; Robbins, M.; Daikh, D.; Beall, A.; Yelin, E.; Barton, J.; Carlson, A.; Margaretten, M.; Zell, J.; et al. Choosing Wisely: The American College of Rheumatology’s Top 5 List of Things Physicians and Patients Should Question. Arthritis Care Res. 2013, 65, 329–339.
10. Kang, S.H.; Seo, Y.I.; Lee, M.H.; Kim, H.A. Diagnostic Value of Anti-Nuclear Antibodies: Results from Korean University-Affiliated Hospitals. J. Korean Med. Sci. 2022, 37, e159.
11. Qaseem, A.; Alguire, P.; Dallas, P.; Feinberg, L.E.; Fitzgerald, F.T.; Horwitch, C.; Humphrey, L.; LeBlond, R.; Moyer, D.; Wiese, J.G.; et al. Appropriate Use of Screening and Diagnostic Tests to Foster High-Value, Cost-Conscious Care. Ann. Intern. Med. 2012, 156, 147–149.
12. Zhou, Y.; Wang, M.; Zhao, S.; Yan, Y. Machine Learning for Diagnosis of Systemic Lupus Erythematosus: A Systematic Review and Meta-Analysis. Comput. Intell. Neurosci. 2022, 1, 7167066.
13. Murray, S.G.; Avati, A.; Schmajuk, G.; Yazdany, J. Automated and flexible identification of complex disease: Building a model for systemic lupus erythematosus using noisy labeling. J. Am. Med. Inform. Assoc. 2019, 26, 61–65.
14. Observational Health Data Sciences and Informatics. Data Standardization. Available online: https://www.ohdsi.org/data-standardization (accessed on 12 September 2024).
15. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67.
16. Van Buuren, S. Flexible Imputation of Missing Data, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2018.
17. Song, Y.-Y.; Lu, Y. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130–135.
18. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
19. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21.
20. Bakas, S.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M.; et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv 2018, arXiv:1811.02629.
21. Song, X.; Waitman, L.R.; Hu, Y.; Yu, A.S.; Robins, D.; Liu, M. Robust clinical marker identification for diabetic kidney disease with ensemble feature selection. J. Am. Med. Inform. Assoc. 2019, 26, 242–253.
22. Wainer, J.; Cawley, G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst. Appl. 2021, 182, 115222.
23. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
24. Aygun, E.; Kelesoglu, F.M.; Dogdu, G.; Ersoy, A.; Basbug, D.; Akca, D.; Cam, O.N.; Akyuz, B.; Gunsay, T.; Kapici, A.H.; et al. Antinuclear Antibody Testing in a Turkish Pediatrics Clinic: Is it Always Necessary? Pan Afr. Med. J. 2019, 32, 181.
25. Abeles, A.M.; Abeles, M. The Clinical Utility of a Positive Antinuclear Antibody Test Result. Am. J. Med. 2013, 126, 342–348.

Figure 1. Cohort Composition. A total of 24,990 patients who underwent ANA testing were examined. The study cohort was categorized based on the presence or absence of an SLE diagnosis. The final groups for comparison included Group 1 (ANA-negative), Group 2 (ANA-positive), and Group 3 (ANA-positive with SLE diagnosis). Subsequent data analyses were conducted by combining the cohorts for each group. ANA, Antinuclear Antibody; SLE, Systemic Lupus Erythematosus.

Figure 2. Distribution of ANA sub-serologies. The figure displays the proportion of positive to negative cases for each serological test. Chi-square tests were utilized for group comparisons. Sub-serologies include (a) Anti-dsDNA, (b) Anti-Ro, (c) Anti-La, (d) Anti-Smith Ab, (e) Anti-RNP Ab, (f) Anti-SCL-70, (g) Anti-Jo-1 Ab, and (h) Anti-Centromere. Significance levels are denoted as follows: ** is used for p < 0.01 and *** stands for p < 0.001.

Figure 3. Inflammatory Marker Analysis. Laboratory results were evaluated using the Wilcoxon signed-rank test, following normality assessment with the Shapiro–Wilk test. Basic inflammatory markers, including CRP, WBC, neutrophil count, lymphocyte count, and platelet count, were examined. The figure shows (a) CRP levels, (b) WBC counts, (c) Neutrophil counts, (d) Lymphocyte counts, and (e) Platelet counts. Additionally, two derived inflammation metrics were calculated: (f) NLR (neutrophil-to-lymphocyte ratio) and (g) PLR (platelet-to-lymphocyte ratio). Significance levels: ** p < 0.01; *** p < 0.001.
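The two derived metrics in this caption are simple ratios, sketched below. The patient values are hypothetical and the function names are illustrative. The sketch also shows that the mean of per-patient ratios generally differs from the ratio of group means, which is one plausible reason the NLR and PLR means in Table 3 do not equal the quotient of the mean counts in the same table.

```python
from statistics import mean

def nlr(neutrophil_count, lymphocyte_count):
    """Neutrophil-to-lymphocyte ratio; both counts must use the same unit."""
    return neutrophil_count / lymphocyte_count

def plr(platelet_count, lymphocyte_count):
    """Platelet-to-lymphocyte ratio; both counts must use the same unit."""
    return platelet_count / lymphocyte_count

# Hypothetical patients: (neutrophils, lymphocytes, platelets), cells per microliter.
patients = [
    (4000, 2000, 250000),
    (6000, 1000, 200000),
    (3000, 1500, 300000),
]

# Per-patient ratios, then their average (the usual way a group mean is reported).
per_patient_nlr = [nlr(n, l) for n, l, _ in patients]
per_patient_plr = [plr(p, l) for _, l, p in patients]
mean_of_ratios = mean(per_patient_nlr)

# Ratio of the group means, which is a different quantity.
mean_neutrophils = mean([n for n, _, _ in patients])
mean_lymphocytes = mean([l for _, l, _ in patients])
ratio_of_means = nlr(mean_neutrophils, mean_lymphocytes)
```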

Figure 4. Feature Importance Score by Machine Learning algorithms. The selected features are compared in terms of their importance across the different algorithms. Among the 27 features evaluated, ANA is identified as the most important feature for all three machine learning algorithms.

Figure 5. Performance Comparison of Machine Learning Algorithms Using Different Feature Sets: Precision, Recall, F1-Score, and Accuracy. The evaluation results indicate that feature sets with 10 or 20 features performed better than the set with all 27 features.

Table 1. Baseline characteristics of participants (SD: standard deviation).

Characteristic	Group 1 n = 17,263	Group 2 n = 7392	Group 3 n = 276	p-Value
Age
0–15, n (%)	1871 (10.8)	457 (6.2)	15 (5.4)	<0.001
16–29, n (%)	2174 (12.6)	695 (9.4)	68 (24.6)	<0.001
30–39, n (%)	2096 (12.1)	628 (8.5)	53 (19.2)	<0.001
40–49, n (%)	2708 (15.7)	1039 (14.1)	52 (18.8)	0.001
50–59, n (%)	3719 (21.5)	1610 (21.8)	38 (13.8)	0.006
60–69, n (%)	2690 (15.6)	1577 (21.3)	29 (10.5)	<0.001
≥70, n (%)	2005 (11.6)	1386 (18.8)	21 (7.6)	<0.001
Overall age mean ± SD	45.5 ± 20.6	51.9 ± 19.6	41.5 ± 18	<0.001
Sex
Male, n (%)	7327 (42.4)	2692 (36.4)	45 (16.3)	<0.001
Female, n (%)	9936 (57.6)	4700 (63.6)	231 (83.7)	<0.001
Autoimmune diseases
Autoimmune hepatitis, n (%)	145 (0.8)	310 (4.2)	4 (1.5)	<0.001
Idiopathic inflammatory myopathies, n (%)	29 (0.2)	33 (0.5)	3 (1.1)	<0.001
Juvenile rheumatoid arthritis, n (%)	35 (0.2)	10 (0.1)	0 (0)	0.59
Overlap syndrome, n (%)	5 (0)	31 (0.4)	5 (1.8)	<0.001
Raynaud’s disease, n (%)	134 (0.8)	106 (1.4)	4 (1.5)	<0.001
Sjögren’s syndrome, n (%)	95 (0.6)	187 (2.5)	13 (4.7)	<0.001
Systemic sclerosis, n (%)	29 (0.2)	131 (1.8)	7 (2.5)	<0.001
SLE, n (%)	0 (0)	0 (0)	276 (100)	<0.001
ANA titer
<1:80, n (%)	17,263 (100)	0 (0)	0 (0)	<0.001
1:80, n (%)	0 (0)	5367 (72.6)	67 (24.3)	<0.001
1:160, n (%)	0 (0)	1023 (13.8)	62 (22.5)	<0.001
1:320, n (%)	0 (0)	401 (5.4)	30 (10.9)	<0.001
1:640, n (%)	0 (0)	225 (3)	37 (13.4)	<0.001
≥1:1280, n (%)	0 (0)	376 (5.1)	80 (29)	<0.001
Autoantibodies
Anti-dsDNA IgG, n (%)	464/2789 (16.6)	699/2163 (32.3)	130/266 (48.9)	<0.001
Anti-SS A (Anti-Ro), n (%)	114/1893 (6)	379/1540 (24.6)	107/198 (54)	<0.001
Anti-SS B (Anti-La), n (%)	29/1775 (1.6)	153/1378 (11.1)	39/197 (19.8)	<0.001
Anti-Smith Ab, n (%)	17/1312 (1.3)	24/1387 (1.7)	46/229 (20.1)	<0.001
Anti-RNP Ab, n (%)	29/919 (3.2)	60/968 (6.2)	27/130 (20.8)	<0.001
Anti-SCL-70, n (%)	12/730 (1.6)	48/910 (5.3)	2/69 (2.9)	<0.001
Anti-Jo-1 Ab, n (%)	16/800 (2)	15/551 (2.7)	0/47 (0)	0.076
Anti-Centromere Ab, n (%)	10/615 (1.6)	94/852 (11)	5/64 (7.8)	<0.001

Table 2. Prevalence of autoimmune diseases and SLE in ANA-positive patients (Group 2 and 3).

ANA Titer	No. of Patients with Autoimmune Diseases (%)	p-Value for Trend	No. of Patients with SLE (%)	p-Value for Trend
1:80	297/5434 (5.5)	<0.001	67/5434 (1.2)	<0.001
1:160	195/1085 (18)		62/1085 (5.7)
1:320	124/431 (28.8)		30/431 (7)
1:640	131/262 (50)		37/262 (14.1)
≥1:1280	261/456 (57.2)		80/456 (17.5)
Total	1008/7668 (13.2)		276/7668 (3.6)
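A p-value for trend across ordered titer categories is commonly obtained with a Cochran–Armitage test. The sketch below applies it to the SLE counts in Table 2. The choice of this test and the equally spaced scores 0 to 4 for the five titer levels are assumptions; the table does not state which trend test was used.

```python
import math

def cochran_armitage_z(events, totals, scores=None):
    """Z statistic of the Cochran-Armitage test for trend in proportions."""
    if scores is None:
        scores = list(range(len(totals)))  # assumed equally spaced scores
    n_total = sum(totals)
    p_bar = sum(events) / n_total  # pooled event proportion
    # Score-weighted deviation of observed events from their expected counts.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))
    s1 = sum(n * s for s, n in zip(scores, totals))
    s2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 ** 2 / n_total)
    return t_stat / math.sqrt(variance)

# SLE cases and ANA-positive patients per titer (1:80 up to >=1:1280), from Table 2.
sle_events = [67, 62, 30, 37, 80]
totals = [5434, 1085, 431, 262, 456]

z = cochran_armitage_z(sle_events, totals)
# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
```

A positive `z` indicates that the SLE proportion rises with increasing titer, in line with the percentages listed in the table.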

Table 3. Comparison of laboratory test results (mean ± SD).

Inflammation Markers	Group 1 n = 17,263	Group 2 n = 7392	Group 3 n = 276	p-Value
CRP	1.54 ± 4.01	1.35 ± 3.52	1.88 ± 4.43	<0.001
WBC	7563 ± 4169	7223 ± 4957	3898 ± 3553	<0.001
Neutrophil count	4680 ± 3379	4438 ± 3336	3880 ± 3205	<0.001
Lymphocyte count	2069 ± 1156	1986 ± 1805	1432 ± 859	<0.001
Platelet count	245,510 ± 97,094	241,005 ± 96,262	199,322 ± 107,689	<0.001
NLR	3.21 ± 5.19	3.02 ± 4.26	3.82 ± 5.19	0.002
PLR	109.57 ± 117.42	80.99 ± 129.53	163.75 ± 169.30	<0.001

Table 4. Top 10 feature importance for each ML model.

Rank	Decision Tree Feature	Decision Tree Importance	Random Forest Feature	Random Forest Importance	GBM Feature	GBM Importance
1	Antinuclear Ab (ANA)	0.546	Antinuclear Ab (ANA)	0.456	Antinuclear Ab (ANA)	0.506
2	PLR	0.093	PLR	0.083	Anti-dsDNA IgG	0.072
3	Lymphocyte count	0.084	Anti-dsDNA IgG	0.006	PLR	0.066
4	Anti-dsDNA IgG	0.071	Age	0.055	Lymphocyte count	0.051
5	Anti-SS-A	0.058	Lymphocyte count	0.051	Age	0.050
6	Age	0.033	WBC	0.042	Platelet count	0.036
7	WBC	0.032	Platelet count	0.038	Anti-SS-A	0.036
8	NLR	0.020	Anti-SS-A	0.037	WBC	0.035
9	Platelet count	0.019	CRP	0.034	CRP	0.029
10	CRP	0.010	Neutrophil count	0.033	Neutrophil count	0.025
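Ranked importances such as those in Table 4 can be turned into a reduced feature set by keeping the k highest-scoring features. The sketch below does this for the Decision Tree column. The helper names are illustrative, and this is only one plausible way to build a top-k feature set, not necessarily the procedure used in the study.

```python
# Decision Tree importances transcribed from Table 4.
decision_tree_importance = {
    "Antinuclear Ab (ANA)": 0.546,
    "PLR": 0.093,
    "Lymphocyte count": 0.084,
    "Anti-dsDNA IgG": 0.071,
    "Anti-SS-A": 0.058,
    "Age": 0.033,
    "WBC": 0.032,
    "NLR": 0.020,
    "Platelet count": 0.019,
    "CRP": 0.010,
}

def top_k_features(importance, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def cumulative_importance(importance, k):
    """Return the summed importance of the k highest-scoring features."""
    return sum(sorted(importance.values(), reverse=True)[:k])

top_three = top_k_features(decision_tree_importance, 3)
share_top_ten = cumulative_importance(decision_tree_importance, 10)
```

For the Decision Tree, the ten listed features together account for about 0.966 of the total importance, so the features outside the top 10 contribute little to that model.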

Table 5. Best-tuned hyperparameters for each machine learning model by different feature sets.

Model	Parameter	10 Features	20 Features	All Features
Decision Tree	max_depth	5	6	10
Decision Tree	min_samples_split	10	15	20
Random Forest	n_estimators	100	100	300
Random Forest	max_depth	8	10	10
Random Forest	min_samples_split	2	3	3
GBM	n_estimators	300	300	300
GBM	max_depth	5	5	6
GBM	learning_rate	10	15	10
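For reuse in code, the tuned values in Table 5 can be held in a nested dictionary keyed by model and feature set, as sketched below. The values are transcribed exactly as printed in the table. The name `BEST_PARAMS` and the helper `params_for` are illustrative and not part of the original study.

```python
# Tuned hyperparameters transcribed as printed in Table 5.
BEST_PARAMS = {
    "Decision Tree": {
        "10": {"max_depth": 5, "min_samples_split": 10},
        "20": {"max_depth": 6, "min_samples_split": 15},
        "All": {"max_depth": 10, "min_samples_split": 20},
    },
    "Random Forest": {
        "10": {"n_estimators": 100, "max_depth": 8, "min_samples_split": 2},
        "20": {"n_estimators": 100, "max_depth": 10, "min_samples_split": 3},
        "All": {"n_estimators": 300, "max_depth": 10, "min_samples_split": 3},
    },
    "GBM": {
        "10": {"n_estimators": 300, "max_depth": 5, "learning_rate": 10},
        "20": {"n_estimators": 300, "max_depth": 5, "learning_rate": 15},
        "All": {"n_estimators": 300, "max_depth": 6, "learning_rate": 10},
    },
}

def params_for(model, feature_set):
    """Return a copy of the tuned parameters for one model and feature set."""
    return dict(BEST_PARAMS[model][feature_set])

random_forest_20 = params_for("Random Forest", "20")
```

The keys match scikit-learn parameter names, so a dictionary such as `random_forest_20` could be unpacked into an estimator like `RandomForestClassifier(**random_forest_20)`, assuming the scikit-learn estimator classes were the ones used.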

Table 6. Comparison of machine learning performance with different feature sets.

Number of Features	Decision Tree				Random Forest				GBM
Number of Features	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score	Accuracy
10	0.87 ± 0.03	0.85 ± 0.04	0.85 ± 0.04	0.85 ± 0.04	0.85 ± 0.04	0.84 ± 0.04	0.84 ± 0.04	0.85 ± 0.04	0.86 ± 0.04	0.86 ± 0.04	0.86 ± 0.04	0.87 ± 0.04
20	0.73 ± 0.05	0.70 ± 0.05	0.70 ± 0.05	0.70 ± 0.05	0.89 ± 0.03	0.88 ± 0.03	0.88 ± 0.03	0.88 ± 0.03	0.87 ± 0.04	0.86 ± 0.04	0.86 ± 0.04	0.87 ± 0.04
All	0.81 ± 0.04	0.81 ± 0.04	0.81 ± 0.04	0.81 ± 0.04	0.88 ± 0.03	0.88 ± 0.03	0.88 ± 0.03	0.88 ± 0.03	0.86 ± 0.04	0.85 ± 0.04	0.85 ± 0.04	0.85 ± 0.04
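The four metrics in Table 6 can all be derived from a confusion matrix over the three patient groups. The sketch below uses a hypothetical confusion matrix and macro averaging across classes. The averaging scheme used in the study is not stated in the table, so macro averaging is an assumption, and the function names are illustrative.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    results = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

def macro_average(results):
    """Unweighted mean of each metric across classes."""
    return tuple(sum(column) / len(results) for column in zip(*results))

def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Hypothetical counts; rows are true Groups 1-3, columns are predicted Groups 1-3.
confusion = [
    [50, 5, 0],
    [10, 30, 5],
    [0, 5, 20],
]

metrics = per_class_metrics(confusion)
macro_precision, macro_recall, macro_f1 = macro_average(metrics)
overall_accuracy = accuracy(confusion)
```

With imbalanced groups such as those in this cohort, macro and weighted averages can differ noticeably, so the averaging scheme matters when comparing reported scores.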

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
