1. Introduction
Seasonal variations in the prevalence of infectious diseases, which are commonly known as seasonality, have been widely documented and studied [1,2,3,4]. Although these variations are usually attributed to seasonal changes in humidity, temperature, or even different human behaviors throughout the year, the detailed causes of seasonality remain a recurrent research topic [1,2,5,6].
Seasonality is usually considered in epidemiological time series studies, in which the objectives are to estimate the number of future cases or assess the factors correlated with the spread of infections [2,7]. However, seasonal variations are rarely considered in clinical prediction models, which have objectives related to prognostic and health service research [8].
In this paper, we explore the main techniques used to address the challenge of seasonality in prediction models for infectious diseases. We focus on classification problems, and use the approach of simply ignoring the season in the models as a gold standard. Two common approaches used to deal with seasonality [2,7,8,9,10,11,12,13], namely, adding the season as an additional feature and generating different models for different seasons, are considered in this work. Furthermore, we propose two algorithms based on common methods from the field of datastream mining research, namely, sliding windows and ensembles of models trained on different time periods. The effects of these approaches are studied with regard to both synthetic datasets and data related to infectious diseases extracted from the MIMIC-III database [14]. This work greatly extends our preliminary proposal presented in [15]. The main contributions of this paper are:
New approaches for dealing with seasonality based on sliding windows and ensembles;
An extensive study of the effects of seasonality on clinical prediction models in the presence of high dimensionality and imbalanced data;
Experimental settings based on freely available interpretable techniques and open data. With the aim of ensuring the reproducibility and usability of these results in future research, we have made the code developed for this work freely available at:
https://github.com/berncase/seasonality-rProject (accessed on 25 May 2023).
The remainder of this paper is structured as follows. Section 2 presents a comprehensive analysis of the issue of seasonality in clinical data and discusses relevant research in the field. In Section 3, we explain the approaches employed to deal with seasonality that are considered in this work, including the two new proposed algorithms. In addition, we provide a detailed description of the two synthetic datasets with seasonal variations considered in this work, along with the two clinical datasets extracted from the MIMIC-III database. Section 4 provides insights into the conducted experiments and their results, which are further discussed in Section 5. In Section 6, we outline the limitations of this work and highlight future research directions. Finally, Section 7 presents the conclusions drawn from this study.
2. Related Work
It is widely accepted that seasonal variations are a common trait of many infectious diseases [1]. As stated in [8], several methods are employed in epidemiological studies to examine the effect of seasonality, including the statistical comparison of two different time periods, geometrical models assuming sinusoidal cyclic patterns, and generalized linear models in which seasonality is included as an extra term. These approaches make it possible to assess the effect of seasonality on the number of cases and identify factors that might be associated with these variations [2,7].
When the outcome of the model is to predict a condition regarding a particular patient, a common strategy is to include the season among the possible features to be explored. For example, the season can be included as an additional feature with four possible values (spring, summer, autumn, and winter) in northern and southern countries [9] or two values (dry and wet) in models for countries with subtropical or tropical climates [10]. Another strategy is to build separate models for each season [11], or at least different models for summer and winter [12,13].
From a different perspective, we can consider seasonality in clinical data as a particular case of concept drift in a datastream. Datastream mining is a recent research field that is focused on the development of models over huge amounts of online data obtained from sources such as sensors, bank transactions, or social networks [16,17]. When dealing with datastreams, certain particularities must be considered. For example, it is assumed that the whole stream can neither be stored in memory nor accessed repeatedly, signifying that the algorithms working with it can manage only a limited number of data items at the same time, or even that the model must be trained by observing each item only once [18]. Another common assumption is that the datastream is non-stationary, which means that the distribution of the features and/or the target outcomes varies over time [16]. This aspect of data is commonly known as concept drift [19,20,21], and is an open and fast-growing research topic [22]. This effect often occurs in clinical research, in which the study of clinical models capable of evolving over time, known as dynamic models, is a recent and active topic [23]. A more comprehensive study of the state of the art in these research lines can be found in [16,22,23].
However, problems arise when attempting to apply these approaches to open clinical data in order to validate and share results. First, de-identification policies force the removal of most of the elements from real timestamps when sharing clinical datasets in order to ensure patient privacy [24]. In the case of the MIMIC-III database [14], which is the public data source used in this work, the date of each patient’s admission is randomized, although both the day of the week and the season shown in the original source are maintained. Consequently, approaches that rely on strict temporal ordering of the data cannot be readily applied in such cases. Another challenge is that many of these algorithms are not designed to work with a relatively small number of samples, which is common when working with certain diseases. Furthermore, the need for interpretable models and the effects of high dimensionality and class imbalance are not usually considered in the design and validation of these frameworks.
In this paper, we propose and evaluate two approaches based on sound strategies from datastream mining that we have adapted to the particular problem of seasonality in open clinical data, as described in the following sections.
3. Materials and Methods
In this section, we describe the data mining techniques and the datastream mining frameworks considered in this work, along with the modifications proposed to deal with the problem of seasonality.
3.1. White-Box Models
The aim of a clinical model is to provide support when making clinical decisions, not to replace the experts who make them. Therefore, the interpretability of the model and the ability to understand the rationale behind its predictions are crucial for its acceptance, even if this results in a slight decrease in its accuracy. Models that are easily understood and applied by users are commonly referred to as white-box models or interpretable models. Examples of such models include logistic regressions and decision trees, which are widely used in clinical settings. On the other hand, certain Artificial Intelligence approaches, such as deep learning and bagging (when using a large number of members [25,26]), generate complex models, often referred to as black-box models, which are more difficult to interpret [27]. These models have historically been less well accepted despite producing very accurate predictions [28].
In this work, we experiment with both logistic regression and decision trees. Logistic regression is one of the most common techniques employed to build clinical prediction models [29]. These models have the following structure:

P(Y = 1 | X1, …, Xn) = 1 / (1 + e^(−z)), where z = β0 + β1·X1 + … + βn·Xn,

in which Y is an indicator variable denoting the occurrence of the event of interest and X1, …, Xn are the features from the dataset used in the model. The βi components are usually estimated by means of maximum likelihood, although they are sometimes modified in order to avoid overfitting [30].
Decision tree models consist of a series of nested conditions that partition the feature space into homogeneous groups that can be assumed to belong to a particular class within an acceptable margin of error [31]. These models can be represented as a set of if–then rules or, more traditionally, as an inverted-tree graph in which the split conditions are the nodes and the predicted classes are the leaves. As a result, these models are highly interpretable [31].
3.2. High Dimensionality
A common problem related to clinical datasets is that of having a high number of possible features and a modest number of observations. This may make the training process difficult and lead to less accurate models.
This effect is usually alleviated by using filters. One common approach consists of performing univariate analyses between each candidate predictor and the outcome class [32]. In these cases, experts recommend the use of the chi-squared test or Fisher’s exact test for nominal variables and a univariate logistic regression model or two-sample Student’s t-test for continuous variables. Those predictors that have a low p value, commonly p < 0.05, are then considered interesting for the training process.
Other more complex filters are available as well; for example, the Fast Correlation-Based Filter (FCBF) [33] estimates the relevance of each feature with regard to the target outcome along with whether it is redundant with respect to any other relevant feature. Only those features considered highly relevant and non-redundant are used to train the model.
Furthermore, it is common practice to use the Least Absolute Shrinkage and Selection Operator (LASSO) approach when working with logistic regression. LASSO seeks a balance between model complexity and accuracy by imposing a constraint on the sum of the absolute values of the regression coefficients [34]. This allows certain coefficients to reach zero during the search for the optimal solution, meaning that they can be removed from the model, which results in less overfitted models.
The modern algorithms used to create decision trees, such as C5.0, include the option of winnowing, or removing the predictors that are considered to be unimportant before creating the model [31].
In this work, we used LASSO in the experiments with logistic regression and activated the winnowing option for the decision trees. We additionally tested the use of only those features with p < 0.05 after a Fisher test or t-test, and the use of FCBF to discard highly redundant features, for both logistic regression and decision tree models.
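As a rough illustration of the univariate p-value filter described above, the following sketch keeps only the predictors whose test against a binary outcome yields p below the cut-off; the data frame d and the column name class are illustrative assumptions rather than part of the published code.

```r
# A minimal sketch of the univariate p-value filter: Fisher's exact test for
# nominal predictors, a two-sample t-test for continuous ones. The data frame
# `d` and the outcome column name are placeholders.
pvalue_filter <- function(d, outcome = "class", cutoff = 0.05) {
  predictors <- setdiff(names(d), outcome)
  keep <- vapply(predictors, function(v) {
    p <- if (is.numeric(d[[v]])) {
      # two-sample t-test for continuous predictors
      t.test(d[[v]] ~ d[[outcome]])$p.value
    } else {
      # Fisher's exact test for nominal predictors
      fisher.test(table(d[[v]], d[[outcome]]))$p.value
    }
    p < cutoff
  }, logical(1))
  predictors[keep]   # names of the predictors retained for model training
}
```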
3.3. Class Imbalance
Another common problem when developing clinical prediction models is that the datasets are skewed towards one of the values of the target outcome. For example, it is typical for fewer patients to be infected by a bacterial species than to not be infected or be infected by a different one. In these situations, models tend to be biased towards the most frequent value, and minority cases, which usually have high relevance, may be ignored [35].
There is a wealth of approaches for dealing with the problem of class imbalance. In this work, we focus on those that have little or no impact on the interpretability of the resulting models. In particular, we experiment with undersampling and oversampling. When using undersampling, all the observations of the minority class are considered for the training dataset, while only a random sample of the majority class is used, until the selected ratio between the majority and minority classes is attained. Conversely, in the oversampling strategy, all the observations of the majority class are considered and the samples from the minority class are randomly repeated until the selected ratio with respect to the majority class is attained.
Undersampling and oversampling do not alter the values of the data samples, nor do they generate synthetic samples; consequently, they have no impact on model interpretability, which is why they are used in this work rather than other more complex approaches.
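The following sketch illustrates how such undersampling and oversampling could be implemented; the data frame d, the class column, the "positive" minority label, and the ratio convention (majority:minority) are assumptions made for illustration only.

```r
# Random undersampling: keep all minority rows and sample the majority class
# down to the chosen majority:minority ratio.
undersample <- function(d, ratio = 1) {
  minority <- d[d$class == "positive", ]
  majority <- d[d$class != "positive", ]
  n_major  <- min(nrow(majority), round(ratio * nrow(minority)))
  rbind(minority, majority[sample(nrow(majority), n_major), ])
}

# Random oversampling: keep all majority rows and repeat minority rows (with
# replacement) until the chosen ratio is reached.
oversample <- function(d, ratio = 1) {
  minority <- d[d$class == "positive", ]
  majority <- d[d$class != "positive", ]
  n_minor  <- round(nrow(majority) / ratio)
  rbind(majority, minority[sample(nrow(minority), n_minor, replace = TRUE), ])
}
```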
3.4. Proposed Adaptations to Deal with Seasonality
In this work, we follow a datastream mining approach; therefore, we consider seasonality as a particular case of concept drift. Furthermore, we assume the following:
Our observations do not follow a strict temporal order (owing to, e.g., de-identification processes); therefore, we cannot apply common techniques used to address concept drift.
It is possible to determine or estimate the month in which each observation was made; additionally, we consider the case in which only the season of the observation is known.
We have sufficient observations with which to build models based on data from a limited number of months, or seasons, if we cannot attain that level of detail.
These assumptions are utilized in order to adapt several well known datastream mining frameworks, as described below.
3.4.1. Sliding Windows
One of the earliest techniques proposed to address concept drift was the use of sliding windows [21]. The underlying hypothesis of this approach is that the newest datapoints are more useful than the older ones when attempting to predict the target outcome, to the point that the oldest ones can be discarded when training a prediction model. Therefore, only those data points within a determined time interval (i.e., the time window) are used to build the prediction model. This framework has been the starting point for many other methods, and is commonly used as a baseline in the evaluation of new algorithms [36].
However, the traditional sliding window approach and its subsequent improvements require an ordered datastream. If it is not possible to estimate the real timestamp of the data, or if they are not ordered by occurrence, then these approaches cannot be applied.
We propose an adaptation of the sliding window approach focused on dealing with the problem of seasonality in those cases in which only the month or the season of the data is known. The proposed method is formally described by the functions WindowTraining (Algorithm 1) and WindowPredict (Algorithm 2). Let us assume that it is necessary to make a prediction for a datapoint x belonging to month m, and that in our training dataset we know the month in which each observation was made. We first partition our training dataset by the month of its observations, after which we create a window of a predefined size w around the data obtained in m. This window contains data from w months; therefore, it includes the training data for month m, the previous (w − 1)/2 months, and the subsequent (w − 1)/2 months, where w is assumed to be an odd number.
Algorithm 1 WindowTraining (monthly/seasonal)
Input: w: size of the window, expressed in months for the monthly version or in seasons for the seasonal version
Input: D: training dataset
Output: M: a set with a model for each month/season
1: Aggregate the data in D by month/season
2: for each month/season m do
3:   D_m ← data of w months/seasons from D, gathering data from month/season m − (w − 1)/2 to month/season m + (w − 1)/2
4:   f_m ← model trained using D_m
5:   Include f_m in M
6: return M
Algorithm 2 WindowPredict (monthly/seasonal)
Input: x: observation in month/season m for prediction
Input: M: set of trained models, one for each month/season
Output: ŷ: prediction for the observation
1: m ← extract month/season from x
2: f_m ← select from M the model trained using data from the window centred on m
3: ŷ ← prediction of f_m for the observation x
4: return ŷ
We illustrate our proposal with the example depicted in Figure 1a,b. In this example, we assume a window of three months, i.e., w = 3. First, we centre the sliding window on January (Figure 1a). Our first training dataset is composed of datapoints from December, January, and February, and the model trained with it is used to predict observations from January. By sliding the window, we can create up to twelve different models (one per month). Figure 1b shows an example of a prediction for a datapoint x belonging to February. According to the month of the observation to be predicted, the particular model trained when the window was centred on that month (February, in this example) is used to make the prediction.
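A minimal sketch of this monthly sliding-window scheme is shown below. It assumes that each row of the training data carries a month column (1–12) and that train_model and predict_model wrap an arbitrary base learner; these names are placeholders and do not come from the released code.

```r
# One model per month, each trained on a window of w months centred on that month.
window_training <- function(d, w, train_model) {
  half <- (w - 1) / 2                                # w is assumed to be odd
  lapply(1:12, function(m) {
    months <- ((m - 1 + (-half:half)) %% 12) + 1     # window centred on m, wrapping over the year
    train_model(d[d$month %in% months, ])
  })
}

# Prediction uses the model whose window was centred on the observation's month.
window_predict <- function(x, models, predict_model) {
  m <- x$month
  predict_model(models[[m]], x)
}
```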
3.4.2. Ensembles
The ensemble approach consists of training models with data from different time intervals, then combining their outputs when predicting a new observation [16,37,38]. In this case, models created with older data can be restored and used again when the correlations between the outcome class and the features return to a previous state in the stream (i.e., recurring concepts). The literature contains many variations of this framework depending on the strategy used to maintain older models, update them, or merge their results into the final output [16].
However, this generic framework and its subsequent refinements require a dataset with a strict order between its observations; therefore, it cannot be applied to de-identified clinical data.
We propose an adaptation of the ensemble framework to deal with the problem of seasonality in open clinical data. The method is formalized by two functions, EnsembleTraining (Algorithm 3) and EnsemblePredict (Algorithm 4). Let D = {D_1, …, D_12} be the dataset available for training, where D_i corresponds to the data for the i-th month and 1 ≤ i ≤ 12. We build a prediction model for each month using the data from that specific month (EnsembleTraining, lines 1–6). The model trained using D_i is denoted as f_i.
It is then necessary to calculate the weights used to combine the outputs of the models in the ensemble (EnsembleTraining, lines 7–12). This is done by again iterating over the training dataset: each model f_i is tested on the training data D_j of every month j, where 1 ≤ j ≤ 12. The Root Mean Squared Error (RMSE) is then estimated for each pair (i, j).
Algorithm 3 EnsembleTraining (monthly/seasonal)
Input: n: number of models in the ensemble (n = 12 for the monthly ensemble, n = 4 for the seasonal ensemble)
Input: D: training dataset
Output: M: set of models, one for each month/season
Output: W: n × n weights matrix
1: Aggregate the data in D by month/season
2: Initialize an empty n × n matrix RMSE to store the root mean squared errors of the models
3: for each month/season i do
4:   D_i ← subset of D gathering the data from month/season i
5:   f_i ← model trained using D_i
6:   Include f_i in M
7:   for each month/season j do
8:     D_j ← subset of D gathering the data from month/season j
9:     RMSE_{i,j} ← root mean squared error obtained when applying f_i to D_j ▹ if RMSE_{i,j} = 0, a small positive value is assumed
10: for each month/season i do
11:   for each month/season j do
12:     W_{i,j} ← weight derived from RMSE_{i,j}, as described in the text
13: return M, W
Algorithm 4 EnsemblePredict (monthly/seasonal)
Input: x: observation in month/season m for prediction
Input: M: set of trained models, one for each month/season
Input: W: weight matrix
Input: n: number of models in the ensemble (n = 12 for the monthly ensemble, n = 4 for the seasonal ensemble)
Output: ŷ: prediction for the observation
1: m ← extract month/season from x
2: ŷ ← Σ_{i=1..n} W_{i,m} · f_i(x)
3: return ŷ
It is worth mentioning that (1) the lowest error is usually obtained when i = j, i.e., when the model f_i is tested using the same data employed to build it (D_i); and (2) the highest error usually occurs when the model is tested with data in which the seasonal effect has the greatest impact on the correlations when compared to D_i, say, data from month z. Therefore, the weights must be calculated such that, when a new observation from month i is predicted, the output from model f_i has the highest weight, with the contrary being the case when the observation is from month z. This is done as follows: after the RMSE has been calculated for all the pairs (i, j), we compute a weight matrix W containing the weight W_{i,j} of each model f_i when used to predict data from month j. Each weight is derived from RMSE_{i,j}, the RMSE estimated for the model f_i when applied to the training data from month j, in such a way that models with a lower error on the data from month j receive a higher weight. If RMSE_{i,j} = 0 for any combination of i and j, a small positive value is assumed so that the corresponding weight remains well defined.
A table with the weights is stored along with the ensemble. When it is necessary to predict the outcome of a new observation x (EnsemblePredict function), the final prediction provided by the ensemble is

ŷ = Σ_{i=1..n} W_{i,m} · ŷ_i,

where ŷ_i is the prediction of model f_i for the observation x and m is the month to which x belongs. If only the season of each observation is known, then the ensemble is composed of four models (one per season), while the rest of the algorithm is similar.
Figure 2a shows a graphical explanation of the steps of the EnsembleTraining function when training a monthly ensemble, while Figure 2b shows an example of the execution of the EnsemblePredict function when used to predict an observation belonging to February.
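The following sketch illustrates the monthly ensemble. The inverse-RMSE normalisation used to fill the weight matrix is one plausible reading of the scheme described above rather than a verbatim reproduction, and the names month, y, train_model, and predict_prob are illustrative assumptions.

```r
# One model per month plus an RMSE-derived weight matrix (Algorithms 3 and 4).
ensemble_training <- function(d, train_model, predict_prob) {
  models <- lapply(1:12, function(i) train_model(d[d$month == i, ]))
  rmse <- outer(1:12, 1:12, Vectorize(function(i, j) {
    dj <- d[d$month == j, ]
    sqrt(mean((predict_prob(models[[i]], dj) - dj$y)^2))   # y assumed coded as 0/1
  }))
  w <- 1 / pmax(rmse, 1e-6)              # guard against an RMSE of zero
  w <- sweep(w, 2, colSums(w), "/")      # each column (target month) sums to one
  list(models = models, W = w)
}

# Weighted combination of the monthly predictions for the observation's month.
ensemble_predict <- function(x, ensemble, predict_prob) {
  m <- x$month
  preds <- sapply(ensemble$models, function(f) predict_prob(f, x))
  sum(ensemble$W[, m] * preds)
}
```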
3.5. Dataset Description
We considered two synthetic datasets and two real-world datasets extracted from clinical open data, specifically from the MIMIC-III database [14].
3.5.1. Synthetic Datasets
We created two synthetic datasets in order to simulate two different seasonal variations. Let us assume the model defined by the equation y = a·x1 + b·x2, where x1 and x2 are random variables taking values within a fixed range. The values of a and b represent the unknown factors of the model that need to be estimated. Because our focus is on binary classification models, we introduce a categorical column named class as the binary outcome. This column takes two possible values: non-negative when y reaches a given threshold, and negative otherwise.
Our aim was to simulate a concept drift during winter. This was achieved by varying the value of b between 0 and 1 throughout the year and calculating the value of a as a = 1 − b. If b = 0, then a = 1, and the outcome y depends only on the value of x1; the contrary occurs when a reaches 0. Consequently, each observation includes a timestamp attribute that is used to vary the values of these factors according to the season. Here, this attribute ranges from 1 January 2100 to 31 December 2199, similar to the dates in the MIMIC-III database, although our methods consider only the month or season of each particular date.
Furthermore, we added extra variables that are not directly related to the model in order to simulate the problem of high dimensionality. We included ten random variables, r1 to r10, that follow a uniform distribution. In addition, we included another ten variables, c1_1 to c1_10, correlated with x1, and ten more variables, c2_1 to c2_10, correlated with x2. These variables were calculated by adding a small random error to the value of x1 or x2, respectively. As such, these variables have values similar to those that really affect the outcome of the model, with an additional small random error such that they are not perfectly correlated.
The synthetic datasets eventually contain a total of 33 columns: one nominal column that indicates the class of the observation (non-negative or negative), two columns (x1 and x2) that really affect the class, ten purely random columns (r1 to r10), and twenty columns (c1_1 to c1_10 and c2_1 to c2_10) correlated with x1 or x2.
A class imbalance of ten to one was then simulated in order to increase the complexity of the dataset. After each dataset had been generated, we assumed that the non-negative class was the minority one and randomly removed samples until there were ten negative samples for each non-negative sample. A total of 5500 rows were generated per dataset (5000 negative samples and 500 non-negative samples).
We made two different assumptions about how seasonality affects the data, which were simulated by varying a and b over time:

In the condensed dataset, we assumed that x2 does not affect y (i.e., a = 1 and b = 0) except in winter, when y gradually becomes affected by x2 following a Gaussian curve with its maximum centered in the middle of the season (i.e., a = 0 and b = 1 exactly in the middle of winter). In this case, the effects of seasonality are strictly present only during winter.

In the sinusoidal dataset, we assumed that b varies following a sinusoidal function that reaches its maximum in the middle of winter and decreases gradually until reaching its minimum in the middle of summer. The use of sine curves to represent seasonal variation is quite common in epidemiological studies regarding the seasonal occurrence of infectious diseases [8]. In this case, the effects of seasonality are present throughout the year, with the main differences being between winter and summer. A sketch of how such coefficient profiles can be simulated is shown below.
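The sketch below shows one way the seasonal coefficient b (and a = 1 − b) could be generated for the two variants; the placement of mid-winter and the width of the Gaussian bump are assumptions made for illustration, not the exact curves used to build the datasets.

```r
# Sinusoidal variant: b is 1 at mid-winter and 0 at mid-summer.
b_sinusoidal <- function(day_of_year) {
  (1 + cos(2 * pi * (day_of_year - 15) / 365)) / 2
}

# Condensed variant: b is essentially 0 outside winter and rises as a Gaussian
# bump centred on mid-winter (circular distance handles the year boundary).
b_condensed <- function(day_of_year) {
  dist <- pmin(abs(day_of_year - 15), 365 - abs(day_of_year - 15))
  exp(-dist^2 / (2 * 20^2))
}

days <- 1:365
profiles <- data.frame(day = days,
                       sinusoidal = b_sinusoidal(days),
                       condensed  = b_condensed(days))   # a = 1 - b in both cases
```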
Figure 3 shows a sample of the values of x1 and x2 along with the outcome class and the changes in a and b for the condensed and sinusoidal datasets. Furthermore, Figure 4 shows a graph of the correlations between the numeric attributes of both datasets. As shown in these plots, the c1_ features are highly correlated with x1; the same occurs with the c2_ features and x2, while r1 to r10 are not correlated with either x1 or x2, as intended.
3.5.2. Clinical Datasets
We extracted two datasets from the MIMIC-III database in order to test the performance of these techniques with real hospital data. MIMIC-III is a freely available database containing data regarding hospital admissions to the critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts [14]. It includes a wide variety of data, including demographics, microbiology cultures, laboratory tests, and bedside monitoring.
Data Extraction
It should be noted that the aim of this work is not to develop a precise clinical model; rather, it is to test the performance of the different approaches employed to deal with seasonality in this context. Therefore, we designed a query for MIMIC-III in order to extract a dataset containing generic data related to infections that might be suitable for our study.
The query was specifically designed to retrieve the first positive microbiology test for each microorganism and sample type in every admission. It collected various demographic data (age, gender, insurance, marital status, ethnicity), data related to microbiology tests (microorganism found, type of sample, date of the test), and hospital stay data (admission type and location, previous ICU stays, current service at the time of test ordering), as well as the mean, maximum, and minimum values of the white blood cell count and lactate within a 24 h time window on the day the sample was obtained. The code of the query is provided in Appendix A.
We generated two datasets from the results of the query, each of which was focused on a different species of bacteria. The target outcome was to predict whether the microorganism isolated in each test belonged to the species of bacteria being studied.
In the Acinetobacter dataset, we focused on bacteria belonging to the genus Acinetobacter. These are responsible for many healthcare-associated infections (HCAIs), and multiple studies suggest the existence of clear seasonal variations in these infections [3]. In this dataset, we assumed that those microbiology tests that were positive for Acinetobacter sp., Acinetobacter baumannii, or Acinetobacter baumannii complex belonged to the positive class and that the others were negative.
Another bacterial species with clear seasonal variations is Streptococcus pneumoniae, whose infections are known to occur more frequently in cold seasons [39]. Using a similar strategy, we generated an S. pneumoniae dataset in which those microbiology tests in which S. pneumoniae was detected were considered positive and the others were considered negative.
The time when the microbiology sample was obtained was considered as the temporal reference in these datasets. The data in the MIMIC-III database have been de-identified in order to protect the patients’ confidentiality, which implies that the available date is not the real one. The de-identification procedure randomly shifts the real date into the future, sometime between the years 2100 and 2200. However, the season is preserved (i.e., an observation made during winter will be shifted to a winter month in the future), making these data appropriate for our work.
Data Preprocessing
We carried out further transformations in both datasets in order to adapt them to the techniques used in this work. The patients’ ages were stratified as adult (between 16 and 65) or elderly (65 and over). Only the microbiology tests concerning sputum, bronchoalveolar lavage, blood culture, and swab were considered, as these are the samples related to systemic and respiratory infections caused by the studied bacteria types. Only the most frequent values for admission location, marital status, ethnicity, and current service were considered, while the rest were labeled as other.
Next, we discarded those cases in which any of the selected attributes were missing in order to obtain a consistent dataset. To facilitate model development, we converted each multilevel categorical attribute into multiple Boolean attributes. However, we decided not to normalize the continuous attributes, as doing this could potentially impact the interpretability of the models.
Appendix B provides a summary of the attributes available in both the Acinetobacter and S. pneumoniae datasets, along with their representation in each class. Both datasets have a noticeable class imbalance (6301 negative vs. 61 positive cases in the Acinetobacter dataset and 6280 negative vs. 82 positive cases in the S. pneumoniae dataset) and high dimensionality (a total of 52 features after the aforementioned transformations).
4. Experiments and Results
In this section, we analyze the impact of the aforementioned methods and their combinations when creating models for the high-dimensional imbalanced datasets proposed above.
4.1. Experimental Settings
We experimented with five different approaches used to deal with seasonality in data:
None: we built a single model that ignores the season; the results of this experiment were used as a gold standard for comparison.
Season as a feature: we built a single model that includes the season as an additional feature for prediction.
Model per season: we built isolated models, one for each season; for a given observation to be predicted, the model corresponding with the relevant season is used.
Monthly/seasonal window (3,5,7,9): as explained earlier, a sliding window was adapted to account for seasonality; we experimented using windows with lengths of 3, 5, 7, and 9 months and with a window containing both the season of the observation to be predicted and the adjacent ones (i.e., a seasonal window).
Monthly/seasonal ensemble: we used an ensemble model that aggregates the output of different models for each prediction, as explained previously; we experimented with an ensemble of twelve models (i.e., one model per month) and four models (i.e., one model per season).
Figure 5 provides an overview of the workflow employed in the experimental settings. We adopted a training–validation–testing strategy [40]. First, the dataset was split into training/validation (80% of the data) and testing (20% of the data) subsets. The random sampling was performed within each class group in order to preserve the overall class distribution of the data. We then generated 100 datasets from the original training/validation subset using a sampling-with-replacement strategy.
Each particular combination of techniques was then applied to each resampling of the training/validation dataset. The first step consisted of applying the technique related to seasonality. Several of these approaches might imply the creation of various models over different subsets of data (e.g., a model based on data for winter only). After the training/validation resample had been partitioned (if necessary), we then applied the balancing approach and the feature selection algorithm, and the model was eventually built using either logistic regression or the C5.0 algorithm for tree models. In addition to the tested approaches for feature selection, we applied the LASSO technique to the logistic regression models and winnowing to the decision trees in all our experiments. These techniques are common approaches which are used together with the aforementioned modeling techniques in the presence of high dimensionality, and as such are suitable for inclusion in this scenario.
We consequently trained 100 models per combination of approaches and dataset (three options for feature selection, five options for class balancing, ten options for seasonality, and two techniques for interpretable models), that is, a total of 30,000 models per dataset.
Each model we created was then used to predict the test datasets, and the resulting area under the receiver operating characteristic curve (AUC) was stored. This made it possible to obtain 100 AUC results per combination of techniques. A t-test was then performed in order to calculate the mean and 95% confidence intervals per combination used in the remainder of the analysis. The number of predictors included in these models was studied as an approximation with which to evaluate the differences in complexity between models.
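A simplified sketch of this evaluation workflow is given below; d, fit_fun, and auc_fun are placeholders for the dataset and for the fitting and AUC routines used in a particular combination of techniques.

```r
# Stratified 80/20 split, 100 resamples (with replacement) of the
# training/validation part, and a t-test over the 100 test-set AUCs to obtain
# the mean AUC and its 95% confidence interval.
evaluate_combination <- function(d, fit_fun, auc_fun, n_resamples = 100) {
  test_idx <- unlist(lapply(split(seq_len(nrow(d)), d$class),
                            function(i) sample(i, round(0.2 * length(i)))))
  train_val <- d[-test_idx, ]            # 80% of each class
  test      <- d[test_idx, ]             # 20% of each class
  aucs <- replicate(n_resamples, {
    resample <- train_val[sample(nrow(train_val), replace = TRUE), ]
    auc_fun(fit_fun(resample), test)     # AUC of the fitted model on the test set
  })
  t.test(aucs)                           # mean AUC and 95% confidence interval
}
```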
All the experiments were performed using the R platform version 4.0.2 and RStudio version 1.3.1093. The LASSO models were fitted using the glmnet R package [41,42]. The alpha parameter, which controls the elastic net behavior, was set so as to obtain a LASSO effect while ensuring numerical stability [42]. The lambda value we eventually used was lambda.1se, that is, the value for which the model’s error is within one standard error of the minimum when performing ten-fold cross-validation on the training/validation dataset, as suggested in [42]. The decision tree models were created using the C50 package [43], with winnowing active and the remaining parameters set to their default values (trials = 1, rules = false, subset = true, bands = 0, noGlobalPruning = false, CF = 0.25, minCases = 2, fuzzyThreshold = false, sample = 0, earlyStopping = true). The Biocomb package [44] was used for the experiments with FCBF; in this case, the threshold parameter was set to 0 as an initial safe approach, as suggested in [45]. The implementations of Fisher’s exact test and Student’s t-test used in the p-value filter were those included in the stats package of the R platform.
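For reference, the sketch below shows how the fitting calls described above could look; the alpha value is an assumption (the text only specifies that it was chosen to obtain a LASSO effect while preserving numerical stability), while the lambda.1se rule and the winnowing option follow the settings stated above.

```r
library(glmnet)
library(C50)

# LASSO-penalised logistic regression; alpha = 0.95 is a hypothetical value
# chosen close to 1, and lambda follows the one-standard-error rule from
# ten-fold cross-validation.
fit_lasso_lr <- function(x, y) {
  cv <- cv.glmnet(x, y, family = "binomial", alpha = 0.95, nfolds = 10)
  glmnet(x, y, family = "binomial", alpha = 0.95, lambda = cv$lambda.1se)
}

# C5.0 decision tree with winnowing enabled; remaining parameters left at their defaults.
fit_c50_tree <- function(x, y) {
  C5.0(x, y, control = C5.0Control(winnow = TRUE))
}
```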
4.2. Seasonality in Data
In order to assess whether seasonality has an effect on the relationships between the predictors and the target outcome, we performed a univariate test between each feature and the class using data from different seasons.
Figure 6a–d show the changes in the p values of the features when partitioning the datasets by the season of the observed data. These figures include the traditional cut-off of p = 0.05, shown as a dashed line. In the condensed and sinusoidal datasets, the effect is clear: the feature x2 has its maximum relevance (lowest p value) in winter, as expected, to the point that its p value rises above 0.05 during the rest of the year in the condensed dataset (Figure 6a) and during summer in the sinusoidal dataset (Figure 6b).
With regard to the clinical datasets, the two features with the lowest p values in each season were selected in order to study their variation during the rest of the year. Again, clear variations are present among seasons. For example, the min_lactate feature in the Acinetobacter dataset has a low p value (high relevance) in spring and winter, yet its p value would lead it to be discarded as a relevant feature in summer. In the S. pneumoniae dataset, the fact that the specimen is of the sputum type is more relevant in summer than in winter, while the contrary occurs for the feature indicating whether the patient had previously received ICU care.
4.3. Separate Analysis of the Effects of Different Approaches
We initially analyzed the effects of using each set of techniques described in this work separately, i.e., feature selection to reduce high dimensionality, sampling to compensate for class imbalance, and approaches for dealing with seasonality in the data. Figure 7a compares the results of each feature selection approach in terms of mean AUC, while Figure 7b compares the number of features included in the models, with the aim of illustrating the variations in model complexity.
The use of FCBF without combining it with other techniques tends to reduce the model AUC, with the exception of the condensed dataset when used with logistic regression models. The p-value filter has less of an impact on the model AUC, and leads to a smaller reduction in model complexity.
The approaches based on decision trees obtained poor AUC results on all of the MIMIC-III datasets. In these cases, the high class imbalance led to tree models with only one node (zero features per model), resulting in all the observations being classified as belonging to the majority class regardless of the feature selection approach used.
The results when varying only the method employed to compensate for the class imbalance are shown in Figure 8. When combined with logistic regression models, the balancing approaches achieve a similar or slightly lower AUC than the models without a balancing strategy. However, the performance improves when using decision trees, which is particularly relevant on the MIMIC-III datasets, as the models are no longer empty. With regard to model complexity, undersampling leads to simpler models than oversampling.
Figure 9 shows the results obtained when using the seasonality approaches without any other preprocessing techniques. There are differences between the synthetic and clinical datasets. The use of ensembles, one model per season, or a three-month window improves the AUC results in the synthetic datasets, while leading to worse results on the MIMIC-III datasets with logistic regression. In these experiments, wider windows, the inclusion of the season as a feature, and even using no seasonal approach at all led to better performance. The results according to model complexity were more homogeneous. The use of the monthly ensemble and three-month window clearly reduced model complexity. Again, the high class imbalance in the MIMIC-III datasets led to one-node trees, resulting in models with a poor AUC.
These results indicate that the use of a seasonal ensemble leads to the same results as creating one model per season. Although they are different algorithms, their outputs are sufficiently similar that both the AUC and the number of features in the resulting models are identical.
4.4. Analysis of the Effects of Different Approaches in Combination
Next, the different preprocessing and seasonal drift approaches were combined with the aim of improving the results obtained with them separately. We focused on each dataset and type of model in order to analyze them.
As the seasonal ensemble obtained the same results as the model-per-season approach, it was not included in this analysis in order to avoid repetition.
Note that it is not possible to build a model for certain datasets when no feature is able to attain the threshold set for the p-value filter (p < 0.05). Specifically, this occurred when combining the p-value filter with monthly ensembles and a few small sliding windows for the Acinetobacter and S. pneumoniae datasets.
4.4.1. Synthetic Dataset—Condensed
The combinations that achieved the best AUC on the condensed dataset are ranked in Table 1, which shows the combinations with logistic regression models, and in Table 2, which shows the best results obtained by combinations with decision trees.
The approach that obtained the best mean AUC in logistic regression models was the use of one model per season, while including the season as a feature worked better for decision trees. The proposed sliding window approach using three-month windows showed promising results in both models. With regard to the filtering and balancing techniques, the combination of FCBF with 2:1 oversampling obtained the best mark with logistic regression, while oversampling with both ratios and no filter had the best results for decision trees. FCBF clearly led to simpler logistic regression models, with a mean of 1.08 features per model obtained in the best combination for logistic regression. However, FCBF did not achieve good AUC results in decision tree models. In those experiments, the simpler models were those based on a three-month window and one model per season. In all cases, the use of seasonal techniques obtained much better results than simply creating the model by ignoring seasonality.
4.4.2. Synthetic Dataset—Sinusoidal
Table 3 and Table 4 show the top ten combinations of techniques according to their mean AUC when applied to the sinusoidal dataset.
Our proposed sliding window approach obtains the best results in both the logistic regression and decision tree models with windows of shorter length (three and five months). Unlike the previous synthetic dataset, the use of any balancing strategy slightly worsens the results in logistic regression models, though it remains decisive for decision tree models. The filter based on the p value reduces the complexity of models with only a slight reduction or no reduction in model performance. Again, combinations of the studied techniques obtain the best results when compared to the model created without considering any of the techniques.
4.4.3. MIMIC-III Acinetobacter Dataset
Table 5 and Table 6 show the top ten combinations of techniques according to their mean AUC on the Acinetobacter dataset.
In this case, the proposed seven-month window without any feature selection or balancing technique attains the best results with logistic regression, while the monthly ensembles attain the best results for decision trees when combined with oversampling techniques. The slight reduction in dimensionality obtained when using the p-value filter is noteworthy, though in this case it additionally implies a slight reduction in AUC.
4.4.4. MIMIC-III S. Pneumoniae Dataset
Table 7 and Table 8 show the top ten combinations of techniques according to their mean AUC on the S. pneumoniae dataset.
The inclusion of the season as a feature led to the best results with the logistic regression models and to good results with decision trees. With regard to the latter, the seasonal window obtained the best results when combined with 1:1 undersampling and a p-value filter. The combination of these techniques did not drastically improve the results for logistic regression models in this dataset, to the point that the models trained with no more than LASSO and logistic regression are among the top ten results. In the case of decision trees, the use of any balancing strategy is again decisive with regard to obtaining a valid model.
4.5. Analysis of the Impact of the Different Approaches on Interpretability
To the best of our knowledge, there is no widely accepted metric for the interpretability of a model. The number of features, which has been studied in the previous sections, could be a good approximation, as it is related to the complexity of the model. However, the use of approaches that involve a combination of multiple submodels may impact interpretability as well, even when interpretable submodels are used.
We provide an example using the models generated for the sinusoidal synthetic dataset. As a reminder, to generate this dataset we used a sinusoidal function to vary the effect of the two main coefficients, a and b, on the outcome variable throughout the year. The coefficient b (the weight of x2) was the most relevant factor in winter, while it had no effect on the outcome in summer; the feature x1 showed the opposite behavior, being the most relevant in summer. The models selected here were trained with one of the samplings from the training/validation dataset without applying any filter or balancing strategy. By examining these models, we aim to highlight the differences in interpretability resulting from the different approaches to handling seasonality. The results shown here extend to both tree-based models and models generated with logistic regression; however, only the latter are shown in order to avoid duplication.
Figure 10 shows the values of the coefficients of the logistic regression model generated without applying any seasonal drift approach (Figure 10a) and when including the season as an additional feature (Figure 10b). These are common logistic regression models that are easily interpretable; in this case, x1 and x2 are much more relevant than the rest of the features. Among the other features selected, those starting with c1_ and c2_ refer to the variables added to complicate the dataset, which are highly correlated with x1 and x2, respectively. When the season is included as a feature (Figure 10b), it appears in the model, but it has little relevance compared to x1 and x2.
All these findings are consistent with the known behavior of the underlying model that generated the dataset; the yearly variations in the relevance of x1 and x2 are smooth, and as such the model must consider both in order to generate a prediction when the dataset is treated as a whole.
Figure 11 shows the models that were generated when using the strategy of creating a different model for each season. This approach may appear more complicated to understand; however, because each model is used under a particular condition (i.e., to predict an observation from a particular season) and the models can be interpreted separately, on the whole we consider them easy to interpret. For example, it can be observed that while the relevance of x1 is high in the summer model, it does not even appear in the winter model, in which x2 is by far the most relevant feature. Therefore, we can interpret this model and even extract interesting information about the underlying impact of seasonality on the data.
Figure 12 illustrates the models generated when using the seasonal window strategy. Although the models appear similar to those of the model-per-season approach, the amount of data used to build each model is larger; therefore, the impact of seasonality on the models cannot be appreciated as clearly. Nevertheless, the increased relevance of x1 in summer and of x2 in winter can still be observed, though not as clearly as with the previous approach. It is important to consider how these models were generated when analyzing their structure; with this context, they are interpretable and the reasoning behind each prediction can be easily traced.
The results obtained when using one model per month with the monthly window approach are shown in Figure 13. The models were built using a three-month sliding window; therefore, the effects of seasonality are not as diluted as in the previous example. Despite the increase in complexity, the changes in the relevance of x1 and x2 across the models are noticeable, and the behavior of the approach as a whole can be easily analyzed.
The use of ensembles has a noticeable impact on interpretability, as can be appreciated in Figure 14. In this approach, we have to combine the outputs of the different models using a weight matrix, such as the one in Figure 14b. Even though the models can be analyzed independently (indeed, they are the same as those from the model-per-season approach), the combination matrix can be easily understood. For example, the matrix in Figure 14b suggests that when predicting data in summer or winter the output relies mainly on the models generated using data from those seasons, while in spring and autumn the output is mainly a mixture of the models for the other seasons. Although the whole model is more complicated than the previous approaches, it can still be understood and analyzed.
The monthly ensemble, illustrated in Figure 15a,b, represents the most complex combination of all the studied approaches. Assessing the exact relevance of each feature for a specific output becomes challenging when using this approach. Nonetheless, the variations in the coefficients of x1 and x2 can be observed across the monthly models. Additionally, from the weight matrix we can discern that the models adjacent to the month of the observation to be predicted have a greater impact on the output of the ensemble.
5. Discussion
According to the obtained results, no one specific approach or combination clearly outperforms the rest when seasonality, high dimensionality, and class imbalance are all present. However, the results provide useful information with which to discuss the advantages and disadvantages of each combination.
Despite the fact that all the experiments used LASSO or winnowing, in most cases the use of a feature selection technique reduced model complexity even more. In particular, FCBF drastically reduced the number of features; the effect of the filter based on the p value, while not as significant, was noticeable in most cases. Therefore, extra filtering techniques appear to be advisable even in the presence of seasonality when reducing the complexity of the model is a critical requisite.
The use of feature selection techniques to reduce model complexity had differing effects on model performance. In the synthetic condensed dataset they clearly improved the AUC when combined with logistic regression, yet in most of the experiments they tended to slightly decrease model performance. It may be possible that when the underlying model is simple, as for the synthetic datasets, the reduction in complexity leads to the final models approximating the real ones. In the case of more complicated interactions and dependencies, as certainly occurs in clinical datasets, relevant features may be discarded, and the resulting models may lose accuracy. Therefore, the common trade-off between model simplicity and performance is present in these kinds of datasets.
In certain experiments combining techniques that severely reduced the number of observations used to train the models, the p-value filter was unable to select any features at all. For example, the combination of the p-value filter, monthly ensemble, and undersampling was unable to create valid models for most of the training/validation sampled datasets. As is well known, the p value is affected by the number of observations; thus, if no strong correlation is clear in the dataset, no feature is able to reach the cut-off value. Therefore, these combinations should be used with caution when the available training data are scarce.

In the Acinetobacter dataset, the seven-month and five-month window approaches obtained the best results for logistic regression, while the monthly ensemble obtained the best results for decision trees. The fact that month-based approaches outperformed season-based ones on the MIMIC-III datasets may seem surprising, as the months in the timestamps of the MIMIC-III database are randomized to ensure patient confidentiality. However, because the season was maintained, the month of the randomized data was close to the real month; this may explain the good performance of these methods. Moreover, the drift in data might not occur precisely within an astronomical season, and may be delayed with respect to its boundaries or even occur multiple times throughout the year. Our proposed monthly window and monthly ensemble may be good options in these cases, provided that there is at least an approximate estimation of the month.
The best sizes for the sliding window approach changed depending on the dataset. While the best results for the Acinetobacter dataset were obtained with a seven-month window, a three-month window was the best option on the condensed and sinusoidal datasets. Therefore, it is important to test different window sizes when using this approach for seasonality, as is also the case with datastreams.
The particularities of the base model may have an impact on the performance of different seasonality strategies. For example, in decision trees it is possible that the splitting algorithm can obtain similar or even better results compared to the model-per-season or ensemble approaches if it is able to effectively utilize the season as a partitioning criterion. However, when the correlation between the season and the outcome variable is unclear or subject to drift the use of these techniques may help to obtain better results, as happened in some of our experiments. The effectiveness of feature selection and balancing strategies depends on the base model as well; for example, class balance techniques were fundamental to obtaining the best results with decision trees in all of our experiments, while they did not appear to be as essential when using logistic regression.
In all cases, the use of seasonal approaches combined with other techniques improved the resulting models with regard to both AUC and simplicity when compared to the direct application of logistic regression or decision trees. The proposed approaches for seasonality (sliding windows and ensembles) attained the best performance in five of the eight combinations of dataset and modeling technique, while other traditional approaches performed best in the remaining three. This supports the idea that multiple approaches should be considered when seasonality, high dimensionality, and class imbalance are all present in a clinical dataset.
6. Limitations and Future Work
Although we used interpretable models and techniques, the complexity of a final model can complicate its interpretation. For example, the best logistic regression models obtained for the S. pneumoniae dataset had a mean of 35.31 features in our experiments, which might be overwhelming for an expert to understand and apply. Moreover, as discussed in Section 4.5, the use of multiple models in windows and ensembles impacts the interpretability of the overall model. However, we consider this trade-off acceptable owing to the reduced number of models and the fact that the way they are applied is easy to interpret, i.e., a different model is used for each month/season.
Our discussion regarding the differences between each approach relies on the graphical representations of the results and the rankings of the combinations with the best performance. While performing statistical tests among all possible combinations could provide additional insights and detect statistically significant differences, it is challenging due to the large number of combinations and the complexity of the data being compared. In light of these limitations, we opted for a clearer and more manageable approach to analyzing and discussing the results.
In our experiments, logistic regression models with LASSO usually obtained better results than those based on C5.0. However, it is important to note that decision trees offer a wide range of tuning possibilities, such as pruning heuristics and boosting, which were not extensively explored in this study. While further research should be carried out in order to determine the best interpretable model for a particular clinical problem, we believe that our experiments can provide valuable insights into the effects of seasonality techniques in both logistic regression and decision tree models.
In our future work, we intend to study further variations of the approaches presented here. One straightforward extension would be to use different sizes of sliding windows depending on the month for which the model is being created. This would make it possible to use wider windows in months with a low number of samples, allowing the creation of more robust models, and narrower ones in months with abundant data, allowing the creation of more precise models.
Furthermore, we intend to experiment with the adaptation of clinical datasets similar to those considered here for use with new algorithms developed for datastream mining, rather than adapting the algorithms to the datasets, which was the approach followed in this work.
7. Conclusions
In this work we have studied the problem of seasonality in clinical datasets, particularly when high dimensionality and class imbalance are present. We tested the combination of multiple techniques, including two new algorithms based on datastream mining research.
Regardless of the modeling technique used, our approaches clearly obtained the best results with two datasets, and with a third when combined with decision trees. The traditional approaches of model-per-season and season-as-feature obtained the best results in several of our experiments. The top techniques employed to deal with high dimensionality and class imbalance varied, leading us to conclude that the best approach for dealing with seasonality is highly dependent on the dataset and modeling technique; therefore, in future studies several techniques should be tested in order to obtain better clinical prediction models.
In spite of the differences in our results regarding the best approach, the use of any technique to deal with seasonality improved the resulting models in all of our experiments. Although traditional approaches achieved acceptable results, our experiments indicate that the use of the proposed techniques when developing clinical prediction models can lead to increased model performance in the presence of seasonality.