3.1. Dataset
Data from six T1D patients who wore the Paradigm Veo system with the second-generation Enlite CGM sensor (Medtronic Minimed, Northridge, CA, USA) for approximately eight weeks were analyzed in this work. Data were collected during a clinical trial that aimed to assess an artificial pancreas system during aerobic and anaerobic physical activity [22]. Demographic characteristics are shown in Table 1. All patients provided written informed consent for research participation.
Patients used a CGM system which recorded BG measurements every five minutes. Periods of different duration were selected for analysis at different times of the day. Glucose time-series of 24-h and 6-h duration starting at different times (00:00, 06:00, 12:00, and 18:00) were analyzed considering the standardized clinical levels of hypo- and hyperglycemia presented in Table 2.
The four levels of hypo- and hyperglycemia presented in Table 2 and the range related to normoglycemia (70 mg/dL ≤ BG ≤ 180 mg/dL) defined the five glucose ranges considered for the analysis of the time-series, from which the composition was obtained:
Both 6-h and 24-h periods of glucose data were split into time spent in the five preceding glucose ranges. However, both 6-h and 24-h periods sometimes contained fewer samples than expected because of missing values caused by CGM malfunction. For a 24-h period to be considered valid, each of its four 6-h periods had to contain at least 70% of valid data. For profiles with missing CGM samples, the amount of time corresponding to the missing samples was assumed to be evenly distributed among the ranges present in the period under analysis. The times spent in the different ranges during a finite period of 24-h or 6-h are relative contributions to the glucose profile and are codependent; therefore, they should be analyzed as CoDa.
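As an illustration of the preceding paragraph, a minimal Python sketch of how the time-in-range composition and the validity criterion could be computed is given here. It is not the implementation used in this work. The function names are hypothetical, missing samples are assumed to be encoded as None, the boundary handling at 54 and 250 mg/dL is an assumption (only the normoglycemic bounds are stated in the text), and "evenly distributed" is read as an equal share for each range observed in the period.

```python
from typing import List, Optional, Sequence

SAMPLE_MINUTES = 5  # CGM sampling interval stated in the text


def glucose_range(bg: float) -> int:
    """Range index: 0 (<54), 1 (54-70), 2 (70-180), 3 (180-250), 4 (>250) mg/dL.

    Handling of the exact boundaries 54 and 250 is an assumption.
    """
    if bg < 54:
        return 0
    if bg < 70:
        return 1
    if bg <= 180:
        return 2
    if bg <= 250:
        return 3
    return 4


def time_in_ranges(samples: Sequence[Optional[float]]) -> List[float]:
    """Minutes spent in each of the five ranges; None marks a missing sample."""
    minutes = [0.0] * 5
    missing = 0
    for bg in samples:
        if bg is None:
            missing += 1
        else:
            minutes[glucose_range(bg)] += SAMPLE_MINUTES
    # Assumed reading of "evenly distributed": the missing time is split
    # equally among the ranges that were actually observed in the period.
    observed = [i for i, m in enumerate(minutes) if m > 0]
    if missing and observed:
        share = missing * SAMPLE_MINUTES / len(observed)
        for i in observed:
            minutes[i] += share
    return minutes


def is_valid_block(samples: Sequence[Optional[float]], min_fraction: float = 0.70) -> bool:
    """A block is valid if at least min_fraction of its samples are present."""
    valid = sum(bg is not None for bg in samples)
    return len(samples) > 0 and valid / len(samples) >= min_fraction


def is_valid_day(samples: Sequence[Optional[float]]) -> bool:
    """A 24-h period (288 samples) is valid if each 6-h block (72 samples) is valid."""
    block = 72
    return len(samples) == 288 and all(
        is_valid_block(samples[i:i + block]) for i in range(0, 288, block)
    )
```

For a complete or partially missing 24-h period, the returned minutes sum to 1440, so the vector can be used directly as a composition closed to the length of the day.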
Initially, 24-h periods of glucose data, starting at 00:00 and ending at 24:00, were analyzed following an adaptation of the methodologies presented in [13,14].
Table 3 reports the average amounts of time spent in each glucose range per patient in minutes/period, considering the profiles from 00:00 to 24:00. These values were obtained by adjusting the geometric means of the components to 1440 min (24-h period). Because the central tendency of the compositions of days from 00:00 to 24:00 differed between patients, as shown in Table 3, an individualized analysis of each patient’s profiles was performed.
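The adjustment of geometric means to 1440 min corresponds to the usual compositional center. A minimal sketch is shown here, assuming each composition is a list of five strictly positive times (zeros already handled); the function names are hypothetical.

```python
import math
from typing import List, Sequence


def closure(parts: Sequence[float], total: float = 1440.0) -> List[float]:
    """Rescale the parts so that they sum to `total` (minutes in a 24-h period)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def compositional_center(compositions: Sequence[Sequence[float]],
                         total: float = 1440.0) -> List[float]:
    """Part-wise geometric means, closed to `total`.

    Assumes every part is strictly positive, since logarithms are taken.
    """
    n = len(compositions)
    d = len(compositions[0])
    geo = [math.exp(sum(math.log(c[j]) for c in compositions) / n) for j in range(d)]
    return closure(geo, total)
```

For 6-h periods the same function could be called with total=360.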
After the analysis of 24-h periods from 00:00 to 24:00, 24-h glucose time-series starting at 06:00, 12:00, and 18:00 were also analyzed, applying the same missing-data exclusion criteria. Table 4 shows the number of 24-h periods followed by a 6-h period at different times for each patient.
3.2. Data Analysis
The steps for the analysis and categorization of both 24-h and 6-h periods are presented in Figure 1. The first part of the figure shows the development stage, in which preprocessing and selection of valid days for analysis occur. The second part consists of the zero analysis, followed by CoDa analysis and clustering. The third part consists of the presentation of the results of the CoDa analysis in terms of clinical outcomes, the discriminant model to categorize the previous 24-h period at different times of the day, and a transition model to predict the future 6-h period. At the bottom right, a potential application of the methodology is presented.
Once the compositions were obtained, zero analysis was performed. Since CoDa analysis is based on logarithms of ratios and both operations require non-zero elements in the data matrix, the log-ratio methodology must be preceded by proper handling of zero values, as has been extensively described in [25,26,27,28]. The zeros were replaced using the log-ratio Expectation-Maximization algorithm [29]. Given that the CGM records data every 5 min, the matrix of detection limits used in the zero imputation was obtained considering fractions of 5 min, depending on the position of the zero in the composition, according to [13].
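To show why zero handling is needed and what a replacement does to a composition, a deliberately simpler stand-in is sketched here: multiplicative replacement, not the log-ratio Expectation-Maximization algorithm used in this work. The single 5-min detection limit and the 0.65 fraction are assumptions made for illustration only; the function name is hypothetical.

```python
from typing import List, Sequence


def multiplicative_replacement(parts: Sequence[float],
                               detection_limit: float = 5.0,
                               fraction: float = 0.65) -> List[float]:
    """Replace zeros by a small value and shrink the other parts accordingly.

    Simple stand-in for lrEM: each zero becomes fraction * detection_limit
    (assumed values), and non-zero parts are scaled so that the total and
    the ratios between non-zero parts are both preserved.
    """
    total = sum(parts)
    delta = fraction * detection_limit
    n_zeros = sum(1 for p in parts if p == 0)
    shrink = 1.0 - n_zeros * delta / total
    return [delta if p == 0 else p * shrink for p in parts]
```

After replacement every part is strictly positive, so the logarithms required by the clr and ilr transformations are defined.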
Following the zero replacement, data were represented through coordinates, according to Equations (2) and (3). The orthonormal basis of the ilr-transformation was defined following a sequential binary partition (SBP) [30], presented in Table 5. The SBP was defined following the clinical interpretation of time spent in different glucose ranges, which represent different situations (occurrence of hypo- and hyperglycemic events), as detailed in [13].
The first coordinate (ilr1) is calculated following the expression:
The ilr1 can be interpreted as the balance, i.e., the log-ratio, between the geometric mean of the times spent in <54 and 54–70 mg/dL (with +1 in the first row of the sign matrix of Table 5) and the geometric mean of all the other times (with −1 in the sign matrix). It describes the relationship between the time spent in the hypoglycemic ranges and the time spent in the normo- and hyperglycemic ranges.
The second coordinate (ilr2, corresponding to the second row of Table 5):
The ilr2 is equal to the log-ratio of the times spent in <54 and between 54–70 mg/dL and can be interpreted as the balance between the time spent in level 2 and level 1 hypoglycemia.
The ilr3 is the balance, i.e., the log-ratio, between the geometric mean of the times spent between 180–250 and >250 mg/dL and the time spent between 70–180 mg/dL, that is, the balance between time spent in the hyperglycemic and normoglycemic ranges:
The ilr4 is equal to the log-ratio of the times spent in >250 and between 180–250 mg/dL. It is the balance between the time spent in level 2 and level 1 hyperglycemia:
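For reference, the four balances described above can be written out under two assumptions that are not confirmed by the text: the parts are ordered from the lowest to the highest glucose range, so that x_1, ..., x_5 are the times spent in <54, 54–70, 70–180, 180–250, and >250 mg/dL, and each balance carries the usual normalizing constant sqrt(rs/(r+s)) of an orthonormal SBP basis, where r and s are the numbers of parts with +1 and −1 signs. This is a sketch of the expected form, not a reproduction of the original equations.

```latex
\begin{align*}
\mathrm{ilr}_1 &= \sqrt{\frac{2\cdot 3}{2+3}}\,
                  \ln\frac{(x_1 x_2)^{1/2}}{(x_3 x_4 x_5)^{1/3}}, \\
\mathrm{ilr}_2 &= \sqrt{\frac{1}{2}}\,\ln\frac{x_1}{x_2}, \\
\mathrm{ilr}_3 &= \sqrt{\frac{2}{3}}\,
                  \ln\frac{(x_4 x_5)^{1/2}}{x_3}, \\
\mathrm{ilr}_4 &= \sqrt{\frac{1}{2}}\,\ln\frac{x_5}{x_4}.
\end{align*}
```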
Figure 2 shows a 24-h period of a glucose time-series, in which the glucose ranges related to normoglycemia and both levels of hypo- and hyperglycemia are highlighted.
The vectors corresponding to the composition obtained from this 24-h time-series, defined according to Equation (4), are expressed in three forms: first, as the distribution of the 288 samples; second, as its equivalent summing to 1440 min (24-h); and last, normalized to unity.
The clr and ilr transformations of these compositions were then computed according to Equations (2) and (3), with the ilr coordinates defined by Equations (5)–(8).
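A minimal sketch of the clr transformation and of the ilr coordinates derived from an SBP sign matrix is given here. The sign matrix follows the verbal description of the four balances; the ordering of the parts (lowest to highest glucose range) and the function names are assumptions, and Table 5 itself is not reproduced.

```python
import math
from typing import List, Sequence

# Sign matrix assumed from the description of ilr1..ilr4; columns are the
# times in <54, 54-70, 70-180, 180-250 and >250 mg/dL (assumed ordering).
SBP = [
    [+1, +1, -1, -1, -1],  # hypoglycemia vs. normo- and hyperglycemia
    [+1, -1, 0, 0, 0],     # level 2 vs. level 1 hypoglycemia
    [0, 0, -1, +1, +1],    # hyperglycemia vs. normoglycemia
    [0, 0, 0, -1, +1],     # level 2 vs. level 1 hyperglycemia
]


def clr(x: Sequence[float]) -> List[float]:
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def balance(x: Sequence[float], signs: Sequence[int]) -> float:
    """Normalized log-ratio of the geometric means of the +1 and -1 parts."""
    pos = [math.log(v) for v, s in zip(x, signs) if s > 0]
    neg = [math.log(v) for v, s in zip(x, signs) if s < 0]
    r, s = len(pos), len(neg)
    return math.sqrt(r * s / (r + s)) * (sum(pos) / r - sum(neg) / s)


def ilr(x: Sequence[float], sbp: Sequence[Sequence[int]] = SBP) -> List[float]:
    """One balance per row of the sign matrix."""
    return [balance(x, row) for row in sbp]
```

Both transformations are scale invariant, so they give the same result whether the composition is expressed in samples, in minutes, or normalized to unity.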
After the representation in coordinates, an exploratory analysis was performed. A very useful exploratory tool that allows the discovery of potential clusters of similar compositions and of significant statistical relationships between log-ratios of the parts is the clr-biplot [31]. The clr-biplot is an adaptation of the biplot [32] for compositional data.
The k-means algorithm [33] was applied to the coordinates of 6-h periods at all times to check for different patterns of periods. The algorithm was also applied to the 24-h periods starting at 00:00 and ending at 24:00 to check for different patterns of days. Since the data are now represented through coordinates, the distance between two periods can be calculated as the Euclidean distance between their coordinate vectors, meaning that k-means can be applied directly to either the ilr or the clr coordinates [34]. The algorithm was tested for several numbers of groups, each time with 25 random selections of the initial centers. Clustering results can be evaluated using either external or internal validation, in which external information is provided or only the information within the data set is used for clustering validation, respectively. Three indices used for internal validation are the Calinski–Harabasz index [35], the Dunn index [36], and the Silhouette index [37]; these indices have recently been used to validate CoDa clusters in [38,39]. In this work, the choice of k considered the interpretability of the clinical outcomes of the groups of both 6-h and 24-h periods and their distribution in the clr-biplot.
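A minimal, dependency-free sketch of k-means with 25 random initializations, applied to coordinate vectors, together with the Calinski–Harabasz index, is given here. It illustrates the general procedure only; the function names, the fixed random seed, and the rule of keeping the old center when a cluster becomes empty are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_point(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen initial centers."""
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumed rule: keep the old center if a cluster empties
                centers[c] = mean_point(members)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


def kmeans(points, k, n_init=25, seed=0) -> Tuple[list, list, float]:
    """Best of n_init random starts, by within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


def calinski_harabasz(points, labels, centers) -> float:
    """Between-cluster over within-cluster dispersion, each per degree of freedom."""
    n, k = len(points), len(centers)
    overall = mean_point(points)
    between = sum(labels.count(c) * sq_dist(centers[c], overall) for c in range(k))
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (between / (k - 1)) / (within / (n - k))
```

Running kmeans for several values of k and comparing the index values gives one internal criterion that can be weighed against clinical interpretability.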
Groups of periods with different lengths were analyzed regarding the maxima and minima of parts and ratios. Additional measures were also obtained to improve the interpretation of the 24-h periods: average blood glucose (BG), BG variation (BGV), Low and High Blood Glucose Indexes (LBGI and HBGI) [40], total basal insulin, total bolus insulin, time of pump suspension, and total carbohydrate (CHO) ingested.
As stated, the k-means algorithm was applied separately to periods of 6-h and 24-h; therefore, the categorization of periods of different duration was done independently. Once the categories of the 24-h periods were obtained, a discriminant analysis method was applied to each patient’s 24-h periods from 00:00 to 24:00. This discriminant analysis considered only 24-h periods from 00:00 to 24:00 and was used to find a discrimination rule to assign any individual 24-h composition (x) at different times (00:00, 06:00, 12:00, and 18:00) to a group. The main objective is to allow the individual to look at the previous 24-h composition at different times of the day and determine the group to which that 24-h period belongs.
We considered a linear discriminant (LD) model in which the discrimination rule is based on compositional linear functions of x [34]. Discriminant analysis was developed by [41] and is one of the most traditional methods for classification. The discriminant rule is based on probabilities: a composition x is assigned to the group with the largest probability, using the information provided by the training data set. The input features considered for the LD classifier were the ilr coordinates. The discrimination functions used to classify the data were calculated considering the information included in the composition using leave-one-out cross-validation (LOOCV). The accuracy of the method was measured by the percentage of compositions correctly classified in their respective groups.
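A minimal NumPy sketch of a linear discriminant classifier evaluated with LOOCV on ilr coordinates is given here. It assumes Gaussian groups with a shared covariance matrix and priors estimated from group frequencies, which is the textbook form of the method and not necessarily the exact configuration used in this work; the function names are hypothetical.

```python
import numpy as np


def lda_fit(X, y):
    """Estimate group means, priors and the pooled within-group covariance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(X) - len(classes))
    # Pseudo-inverse guards against a singular covariance in small samples.
    return classes, means, priors, np.linalg.pinv(cov)


def lda_predict(model, x):
    """Assign x to the group with the largest linear discriminant score,
    which is equivalent to the largest posterior probability under the
    shared-covariance Gaussian assumption."""
    classes, means, priors, cov_inv = model
    quad = np.einsum("ij,jk,ik->i", means, cov_inv, means)
    scores = means @ cov_inv @ x - 0.5 * quad + np.log(priors)
    return classes[np.argmax(scores)]


def loocv_accuracy(X, y):
    """Fraction of periods correctly classified when each is left out in turn."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        model = lda_fit(X[keep], y[keep])
        hits += int(lda_predict(model, X[i]) == y[i])
    return hits / len(X)
```

Here each row of X would hold the four ilr coordinates of one 24-h period and y the group assigned by the clustering step.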
The probability model of transition from the category of the previous 24-h period to the category of the subsequent 6-h period was obtained through a retrospective analysis of the data after the proper categorization of the periods. This analysis was performed at different times of the day: 00:00, 06:00, 12:00, and 18:00. The counts of moving from a given category of 24-h period to a category of 6-h period were expressed as probabilities of transition at different times of the day. In this way, a probabilistic model of transition from the category of the past 24-h of glucose to the category of the future 6-h period was obtained.
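The conversion of transition counts into probabilities can be sketched as follows. The function name and the category labels in the usage example are hypothetical; the input is assumed to be the list of (24-h category, subsequent 6-h category) pairs observed at one time of the day.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def transition_probabilities(
    pairs: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Row-normalized counts: P(6-h category | previous 24-h category)."""
    counts = defaultdict(Counter)
    for prev_cat, next_cat in pairs:
        counts[prev_cat][next_cat] += 1
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in counts.items()
    }


# Hypothetical example: pairs observed at 06:00 for one patient.
pairs_0600 = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "y")]
probs_0600 = transition_probabilities(pairs_0600)
```

Repeating the computation with the pairs observed at 00:00, 12:00, and 18:00 would give one transition table per time of the day.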