1. Introduction
Systemic lupus erythematosus (SLE) is a chronic autoimmune inflammatory disease with multi-organ involvement that preferentially affects women of childbearing age [
1]. Pregnancy outcomes of SLE patients have improved owing to advances in medicine; however, lupus pregnancies are still associated with more maternal and fetal complications than those of healthy women. The frequency of lupus flares during pregnancy ranges from 12.7% to 69%; lupus does not spare pregnancy and increases the rates of fetal loss, preterm birth, and small-for-gestational-age (SGA) neonates [
2]. Both rheumatologic and obstetric teams need to be alert to adverse pregnancy outcomes.
Early prediction is necessary to improve maternal and neonatal outcomes. The traditional statistical approach to predicting categorical disease outcomes involves the use of logistic regression (LR) models. The sample size required for such prediction models depends on the number of variables, and a ratio of research subjects to variables of at least 10 to 1 is widely used. This minimum sample size criterion has generally been accepted as a methodological quality item in appraising prediction modeling studies; small sample sizes have frequently been associated with poor predictive performance upon validation [
3]. However, considering the very low incidence rate of lupus [
4] and the even lower rate of childbearing patients with detailed medical records, a small sample size is inevitable and may result in amended or abandoned research [
5].
Machine learning (ML) and traditional statistics originate in two different communities but share many similarities, and the former can be considered a generalization of the latter. Meanwhile, ML offers its own advantages for data analysis. Firstly, it imposes no strict assumptions about the distributions of variables, which would otherwise require extensive data preprocessing. Secondly, although less noise is always preferred, ML can handle noisy data and large variances within a dataset comparatively well. Thirdly, specialized types of ML can be trained on small datasets, especially when the number of features considerably exceeds the number of observations. Finally, complex ML models can identify complicated, multi-faceted, and non-linear patterns in data efficiently [
6]. In recent years, significant progress has been made in applying ML for disease prediction.
In this study, the primary objective was to develop various ML models to predict adverse pregnancy outcomes utilizing a small dataset with nearly three hundred variables collected before, during, and after gestation, and to evaluate the discriminative ability of these models. The second objective was to evaluate the real-time predictive performance of these models by developing them, in chronological order, with variables from pre-pregnancy care alone or combined with prenatal care in different trimesters, to assess their real-time discriminative ability (the flow chart of this study can be seen in
Figure 1).
2. Materials and Methods
2.1. Study Population
A single-center, retrospective study was conducted. Pregnancy-relevant medical records were reviewed, and eligible women who were diagnosed with SLE before pregnancy and had singleton pregnancies were enrolled at the Second Affiliated Hospital of Dalian Medical University from January 2013 to December 2021. All the selected women belonged to the Chinese Han population. The exclusion criteria were as follows: (1) miscarriage or elective abortion; (2) unknown pregnancy outcomes owing to planned discharge or required transfer; (3) a missing data rate of more than 60% for the analyzed variables [
7].
The reason we excluded patients whose pregnancies ended before 14 weeks was the difficulty of identifying the real cause of miscarriage or elective abortion, as a high proportion of miscarriages are attributable to chromosomal errors or endometrial defects [
8] rather than SLE, which was impossible to distinguish in our study; additionally, the real causes of elective abortion are untraceable, for instance an unwanted pregnancy due to drug exposure or to the severity of disease progression.
Upon admission, both clinical and laboratory records were collected, and the records were assigned to six different periods, which also varied according to the actual situation: (1) pre-pregnancy: within six months before pregnancy; (2) first trimester: ≤13 weeks 6 days of gestation; (3) second trimester: ≥14 weeks and ≤27 weeks 6 days of gestation; (4) third trimester: ≥28 weeks of gestation; (5) before delivery: within 24 h after admission for delivery; (6) after delivery: within three months after delivery. All specimens were tested at the clinical laboratory of this tertiary care hospital.
Though the new 2019 European League Against Rheumatism (EULAR)/American College of Rheumatology (ACR) SLE classification criteria perform well [
9], in this retrospective study no women were diagnosed with SLE after the new classification criteria were published, and reviewing medical records generated before the criteria became available at our hospital, or records from other hospitals, was not feasible because the shared electronic medical record system network was not authorized across time periods or institutions. Therefore, SLE was still diagnosed by rheumatologists based on the 1997 ACR criteria for the classification of SLE [
10].
Gestational ages were confirmed by ultrasonic examinations before 14 gestational weeks.
2.2. Grouping
In order to evaluate the predictive performance of ML models for adverse pregnancy outcomes, we grouped the women as follows: (1) Adverse Group (n = 22): individuals with adverse pregnancy outcomes; (2) Positive Group (n = 29): individuals with satisfactory pregnancy outcomes (no adverse outcomes).
Adverse pregnancy outcomes included one or more of the following: (1) fetal death after 13 weeks’ gestation, excluding chromosomal abnormalities, anatomical malformation, or congenital infection [
11]; (2) early neonatal death (death before 8 days of age) due to complications of prematurity and/or placental insufficiency [
12]; (3) preterm delivery at less than 37 weeks due to gestational hypertension, preeclampsia, HELLP syndrome, placental insufficiency, placental abruption, or premature rupture of membranes [
12]; (4) SGA neonate (<10th percentile) [
12]; (5) fetal distress, as confirmed by a pathological pattern on cardiotocography [
13]; (6) an SLE pregnancy disease activity index (SLEPDAI) of more than 4 [
14].
2.3. Predictive Variables
Predictive variables included medical history and clinical and laboratory examinations collected before, during, and after pregnancy. The medical records of deliveries and neonates were also collected and assessed. The 288 ultimately enrolled variables were divided into six domains: clinical domain (66 variables), hematologic domain (57 variables), renal domain (56 variables), hepatic domain (30 variables), immunologic domain (75 variables), and thyroid domain (4 variables), listed in
Table S1.
Random missing data were an inevitable reality in our retrospective study and may threaten the validity of the results. Therefore, a pre-processing stage is usually required to deal with missing values before any subsequent analysis. The k-nearest neighbor intelligent imputation technique can exploit the relationships between attributes and predict both numerical and categorical missing data, making it an appropriate choice when there is no prior knowledge about the distribution of the data. This method is based on the principle that a missing value can be approximated by the values of the “k” samples that are closest to it [
7,
15]. After data imputation, a complete dataset was obtained, and the missing data rate was calculated (see
Table S1).
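The imputation step described above can be sketched as follows. This is a minimal Python example using scikit-learn's KNNImputer on a toy numerical matrix (the values are hypothetical); note that KNNImputer handles numerical data only, so categorical variables would need to be encoded first.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: rows are patients, columns are variables; np.nan marks
# missing values (all values hypothetical, for illustration only).
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 7.0],
])

# Each missing entry is replaced by the mean of that column over the k
# nearest rows, with distances computed on the observed columns.
imputer = KNNImputer(n_neighbors=2)
X_complete = imputer.fit_transform(X)
```

Here the missing value in the first row is filled from its two nearest neighbors (the second and third rows), yielding a complete matrix for subsequent analysis.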
2.4. Statistical Analysis
Descriptive statistics were performed for all variables. Continuous data were presented as medians and interquartile ranges; categorical data were reported as frequencies and percentages. Statistical analyses between the two groups were performed with the Mann–Whitney U test for continuous or ordinal data and with the chi-squared test or Fisher’s exact test for categorical data to determine the differences. A p value < 0.05 was considered statistically significant.
2.5. Correlation Clustering
Correlations among variables were assessed by Spearman’s rank correlation, a nonparametric method. The correlation coefficient ranges from −1 to 1; the closer it is to −1 or 1, the stronger the correlation between the two variables. To examine the relationships among all variables, “heat maps” can be constructed. Heat maps display all pairwise relationships simultaneously and are an illustrative way to assess the presence of dependence [
16]. The independent variables screened by correlation analysis were retained for subsequent exploration.
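The screening step described above can be sketched as follows. This is a minimal Python example on toy data; the variable names ("alt", "ggt", "platelet") and the 0.8 pruning cutoff are illustrative assumptions, not values taken from this study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy dataset: "alt" and "ggt" are strongly related, "platelet" is
# independent (hypothetical variable names for illustration only).
alt = rng.normal(30, 10, 100)
df = pd.DataFrame({
    "alt": alt,
    "ggt": alt * 0.8 + rng.normal(0, 2, 100),
    "platelet": rng.normal(200, 50, 100),
})

# Pairwise Spearman rank correlation matrix; coefficients lie in [-1, 1].
corr = df.corr(method="spearman")

# A heat map of `corr` (e.g. with seaborn.heatmap) visualizes all pairs at
# once; pairs with |rho| above a chosen cutoff are flagged as redundant so
# that only comparatively independent variables move on to feature selection.
redundant = [
    (a, b) for a in corr.columns for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.8
]
```

In practice, one variable from each highly correlated pair would be dropped before the feature selection stage.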
2.6. Feature Selection
Feature selection is an important data preprocessing step before ML methods are applied, increasing prediction accuracy and decreasing computation time. To identify how each variable contributes to the classification, the Decision Tree (DT), an ML method introduced in detail below, was adopted as the feature selection algorithm. A DT model was trained on each variable individually, and the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) was applied to evaluate the predictive accuracy of each trained DT; the AUC values were calculated and ranked to reflect the importance of each variable to the prediction task [
17,
18]. Variables with AUC values greater than 0.5 were retained for subsequent modeling.
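The per-variable ranking procedure can be sketched as follows. This is a minimal Python example on a public dataset (scikit-learn's breast cancer data); the tree depth and five-fold scoring are illustrative assumptions rather than the exact configuration used in this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Public binary-classification dataset standing in for the clinical data.
data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

# Train a shallow decision tree on each variable alone and score it by
# cross-validated ROC-AUC; the ranking reflects each variable's individual
# contribution to the prediction task.
scores = {}
for j, name in enumerate(names):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    auc = cross_val_score(tree, X[:, [j]], y, cv=5, scoring="roc_auc").mean()
    scores[name] = auc

# Keep variables whose single-variable AUC exceeds 0.5 (chance level).
selected = [name for name, auc in scores.items() if auc > 0.5]
ranked = sorted(scores, key=scores.get, reverse=True)
```

Variables in `selected` would then be carried forward into model development, with `ranked` giving their relative importance.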
2.7. Model Development
As the purpose of this study was to develop predictive models based on ML algorithms, both the overall models and the real-time models were constructed.
Overall models refer to models constructed with all the variables collected before, during, and after pregnancy, whose predictive ability for adverse outcomes was then evaluated. Considering the bias that may be introduced by the data imputation process, the development of overall models was split into two parts: (1) all 288 collected variables were used to construct the overall models regardless of the missing data rate; (2) the 170 variables with a missing rate of less than 30% were used to develop the overall models. The predictive abilities of the different ML strategies in the two parts were then compared to identify the superior modeling algorithm and the influence of imputation on modeling.
In addition, real-time models were developed to describe how far in advance the algorithmic models can achieve satisfactory discriminative performance for adverse outcomes. We created the real-time predictive models as follows: (1) pre-pregnancy models: as the outcomes of the 51 participants were known, the variables collected in the pre-pregnancy period were used to construct the first real-time models, and the predictive performance of these early-period models was quantified by AUC values; (2) pre-pregnancy + first trimester models: the modeling variables were collected from the pre-pregnancy period through the first trimester, forming the second real-time models, whose predictive ability was likewise evaluated; (3) pre-pregnancy + first trimester + second trimester models: variables acquired over the timespan from pre-pregnancy to the second trimester were used to construct the third real-time models; (4) pre-pregnancy + first trimester + second trimester + third trimester models: variables collected over the timespan mentioned above were used to develop the fourth real-time models.
Data standardization was essential to weaken or even eliminate the disturbance caused by variables with different scales and to ensure comparability between variables, improving the accuracy of prediction. All the original data were normalized to the same order of magnitude and scaled to the range 0 to 1 [
19]. The description of different ML models is listed below.
The Support Vector Machine (SVM), which maximizes the separating margin, is a binary linear classifier for classification or regression analysis, creating a decision boundary between two classes that enables prediction from one or more feature vectors. The model transforms the training data into a high-dimensional feature space and finds the decision boundary, known as the hyperplane, that maximizes the smallest distance from the training points to the boundary, yielding the largest margin between the classes and a linear optimal solution [
20,
21,
22].
K-Nearest Neighbor (KNN) is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression. The core of this classifier is measuring the distance or similarity between the test examples and the training examples. The algorithm is nonparametric, meaning that it has no fixed number of parameters irrespective of data size and makes no assumptions about the underlying data distribution. This model can be the best choice for a classification study involving little or no prior knowledge about the distribution of the data [
23].
The Decision Tree (DT) classifier is a single base classifier consisting of nodes and edges. The building process starts from the root node, also known as the first split point. This split divides the entire dataset on the basis of a calculated criterion, and the process continues from top to bottom until further partitioning is no longer required. The leaves at the end of the decision tree represent the final partitions. This method applies to various classification and regression tasks [
24].
To overcome the drawbacks of a single base prediction model, researchers proposed the ensemble learning method Random Forest (RF) to achieve higher accuracy. The ensemble is composed of multiple decision trees, each corresponding to a different sub-dataset drawn from the same dataset. Each tree is trained with a different subset of features rather than selecting the best feature in the whole dataset, and this randomness leads to good accuracy. Random forests perform well even when the dataset is very small [
24].
The Multi-Layer Perceptron (MLP) is a type of feedforward artificial neural network with a high degree of connectivity determined by the synaptic weights of the network, consisting of three layers: input, hidden, and output. In the hidden layer, each artificial neuron contains a nonlinear activation function. Employing the backpropagation algorithm, the training process can be divided into two phases: in the forward phase, the synaptic weights are fixed while the signal propagates; in the backward phase, the error signal propagates backward and the synaptic weights are adjusted [
25].
Linear Discriminant Analysis (LDA) is a multivariate classification technique. By maximizing the ratio of the between-group sum of squares to the within-group sum of squares, this model seeks a linear combination of multiple measures that discriminates between two groups [
26]. The decision boundary learned from the training sample plays a crucial role in correct recognition, and a linear transformation projects the data from a higher-dimensional space to a lower-dimensional space, where the final decision is made [
27].
For all ML models, a ten-fold cross-validation technique was used to select the best bias-corrected discriminant model. In this process, the data are divided into ten equal parts. In each iteration, nine parts are used for training and one part for testing. Ten iterations are performed so that each part serves as the testing data in rotation, and the final performance of each model is calculated as the average over all iterations [
28].
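The development workflow of Section 2.7, combining min-max scaling with ten-fold cross-validated AUC for the six classifiers, can be sketched as follows. This is a minimal Python example on a public dataset; all hyperparameters are scikit-learn defaults and serve as illustrative assumptions rather than the exact settings used in this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Public binary-classification dataset standing in for the clinical data.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "SVM": SVC(random_state=0),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

# Min-max scaling to [0, 1] inside the pipeline is fitted on the training
# folds only, avoiding leakage of test-fold statistics; each model's AUC is
# averaged over the ten folds.
results = {
    name: cross_val_score(
        make_pipeline(MinMaxScaler(), model), X, y, cv=10, scoring="roc_auc"
    ).mean()
    for name, model in models.items()
}
```

Ranking `results` by mean AUC then identifies the best-performing algorithm, mirroring the model comparison performed in this study.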
2.8. Model Testing
As mentioned above, discrimination performance is often visualized using a ROC curve. The AUC was assessed to illustrate the classification performance of the models, along with the sensitivity, specificity, and positive and negative predictive values [
29].
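These metrics can be computed from a confusion matrix as sketched below. This is a minimal Python example with hypothetical labels and predicted probabilities; the 0.5 classification threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical test-fold labels (1 = adverse outcome) and predicted
# probabilities, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # illustrative 0.5 threshold

# Threshold-free discrimination measure.
auc = roc_auc_score(y_true, y_prob)

# Threshold-dependent measures from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
```

Reporting AUC alongside these four threshold-dependent measures gives a fuller picture of classifier behavior than AUC alone.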
All analyses were performed with Python version 3.6.9, SPSS version 26 (IBM Corp., Armonk, NY, USA), and GraphPad Prism 6.01 (GraphPad Software, San Diego, CA, USA).
2.9. Ethics Statement
The requirement for informed consent was waived for this retrospective and observational study. The study protocol was approved by the Ethics Committee of the Second Affiliated Hospital of Dalian Medical University (2022-068) and Dalian University of Technology (DUTSCE220416_01). All procedures performed in this study adhered to the ethical standards of the principles of the Declaration of Helsinki. Personal information was de-identified before any analysis.
4. Discussion
Even if conception occurs after a period of quiescence, the risk of SLE flare and pregnancy complications can only be minimized, not eliminated. Satisfactory pregnancy management includes the maintenance of low disease activity by rheumatologists, as well as maternal and fetal monitoring by obstetricians throughout the pregnancy–childbirth–puerperium period. Improving the accuracy of risk prediction will certainly improve the quality of this cooperative clinical practice and achieve patient-centered benefits.
Nevertheless, the reality is that an attainable SLE dataset, including complete tracking records of clinical and laboratory variables over the whole gestation process, usually suffers from insufficient sample size, meaning researchers are dealing with a “wide dataset”, in which the number of variables exceeds the number of individuals, in contrast to a “long dataset”, in which the number of individuals exceeds the number of variables. Classical statistical modeling was designed for the “long dataset”; in the situation of a “wide dataset”, classical statistical inferences become less precise [
30]. However, ML prediction models perform data-driven classification, with algorithms whose performance depends on the patterns of the dataset [
6]. After applying six different ML techniques to the current dataset, the first main finding of our study is that the RF algorithm proved to be a superior model for both overall and real-time adverse outcome prediction, as confirmed by ROC-AUC values, a well-established measure of the discriminative ability of a prediction model. This technique benefits from its splitting strategy. In the process of creating each decision tree, random variable selection is applied, which makes each decision tree likely to differ from the others, improves the diversity of the constructed RF, and safeguards prediction accuracy [
31]. With the advantage of ensemble power, RF is applicable even to datasets with highly correlated variables and can stably achieve good performance on this structured medical dataset.
It should be noted that the fundamental purpose of our study is not a competitive comparison between conventional logistic regression analysis and machine learning algorithms for this binary classification task [
32,
33]; instead, we want to provide ML models as an alternative approach when confronting a dataset in which the variables significantly outnumber the samples, such as with rare diseases or genomics data. Clinical practitioners may be more familiar with statistical inference and the continuous outcome scores predicted by regression models, but ML can help address the “wide data” problem by finding generalizable classification patterns automatically.
The second main finding is that the feature selection procedure can identify informative variables that may be neglected by traditional statistical analysis. Based on the calculated statistical significance, eighteen indicators acquired from different stages of gestation demonstrated statistical differences between the two groups (
Table 1); among the variables chosen by the feature selection process, as many as forty-one highly influential variables had AUC values greater than 0.65 (
Table 3). In contrast to statistical significance, which is based on the assumption that samples are independently and identically distributed, feature selection draws on the exact distributions of the variables. More and more evidence [
34,
35,
36] has accumulated that significant variables may not lead to good prediction of outcomes, as more feature selection strategies are applied to variable filtering. Similar to the idea that ML methods can serve as alternative approaches for prediction, if prediction is the ultimate goal, we can employ feature selection strategies as alternative approaches for exploring predictive variables rather than using significance as the only selection criterion. Moreover, ALT collected in the second trimester, along with GGT, ANA titer, TT, and platelet count acquired in different periods of gestation, are predictive variables identified by both statistical significance and feature selection, indicating the contribution of hepatic function, autoimmune status, and coagulation function to adverse pregnancy outcomes.
The third and last main finding is that risk assessment for adverse pregnancy outcomes should neither be limited to the pre-pregnancy period nor be delayed until the third trimester; thorough evaluations should be conducted by the end of the second trimester. The reason for this emphasis is that previous rheumatologic studies mainly focused on the importance of disease remission before conception, while the prior experience of obstetricians may prompt greater focus on the third trimester and delivery, which are highly correlated with adverse outcomes. As for the models with AUC values equal to 1, these values do not indicate perfect predictive ability but rather over-fitting of the models at the current small sample size. Considering this unreliability, the real-time model covering the timespan from pre-pregnancy to the second trimester may be the most accurate and preferential option for predicting adverse outcomes. By accumulating sufficient but not redundant information to support clinical decisions, this finding may benefit clinical practice, but it still needs more evidence from similarly designed studies.
There are also two limitations in this study. The first is that the missing data rate is relatively high for a retrospective clinical study, for two main reasons. Firstly, to reflect what happens in clinical practice, we split the dataset into six different periods instead of taking the whole gestation as the only observation period and designed the four real-time predictive models; hence, the missing data rate in each period inevitably increased. Secondly, to date, no study has provided an evidence-based set of protocols for the frequency of monitoring pregnancies involving SLE [
37]. An international or regional consensus on routine maternal and fetal surveillance with practical uniformity and clinical effect is still lacking. Though we employed the KNN imputation method for the missing data, which was shown to be the most efficient method in our previous study [
38], the comparison of the overall models in the two parts showed that removing variables on the basis of their missing rate did not substantially affect the ranking of the predictive models. Nevertheless, no missing data imputation method is ideal, and the development of standard management instructions would benefit the medical team. To undertake this task, a unified study design covering the different trimesters of gestation, with data sharing among multiple centers, is essential.
In addition, another limitation relates to the choice of feature selection strategy. Feature selection is a data-fitting pre-processing procedure for ML modeling that aims to select a subset of variables from the original dataset according to a certain criterion in order to develop an efficient classifier with reduced computational cost. As a diversity of feature selection strategies has been established, different strategies relying on different algorithms and criteria can generate different subsets of variables; therefore, the sets of predictors selected by different strategies for a given predictive model may not overlap completely and may introduce uncertainty into model development. Considering the main objectives of this study, we applied DT as the feature selection method; in a subsequent study focusing on the performance of different feature selection methods, the results indicated that the main contributing variables for prediction can be filtered by different selection strategies simultaneously, while the selected subsets can only be interpreted through the algorithms themselves, not through clinical knowledge or judgment.
The utilization of ML techniques demonstrated promising potential for extracting information from “wide data”, where traditional statistics are not applicable. For long-term tracking medical datasets with small sample sizes and numerous variables, ML can be applied to classification tasks, such as disease diagnosis, evaluation of complication involvement, assessment of adverse outcomes, and prediction of prognosis and late sequelae, automatically and efficiently. If so, real-time classifiers can be embedded in electronic medical record systems, and given alert thresholds will flag the target events in time, triggering instant surveillance or interventions. However, challenges in matching the actual clinical situation, evaluating the actual benefits, and solving the actual problems still require attention. Multidisciplinary cooperation within a panel including machine learning experts, traditional statisticians, rheumatologists, and obstetricians will be essential in this regard.