*Article* **A Deep Learning Approach to EMG-Based Classification of Gait Phases during Level Ground Walking**

### **Christian Morbidoni \*, Alessandro Cucchiarelli, Sandro Fioretti and Francesco Di Nardo \***

Department of Information Engineering, Università Politecnica delle Marche, 60100 Ancona, Italy

**\*** Correspondence: c.morbidoni@univpm.it (C.M.); f.dinardo@staff.univpm.it (F.D.N.);

Tel.: +39-071-220-4830 (C.M.); +39-071-220-4838 (F.D.N.)

Received: 30 July 2019; Accepted: 9 August 2019; Published: 14 August 2019

**Abstract:** Correctly identifying gait phases is a prerequisite to achieve a spatial/temporal characterization of muscular recruitment during walking. Recent approaches have addressed this issue by applying machine learning techniques to treadmill-walking data. We propose a deep learning approach for surface electromyographic (sEMG)-based classification of stance/swing phases and prediction of the foot–floor-contact signal in more natural walking conditions (similar to everyday walking ones), overcoming the constraints of a controlled environment, such as treadmill walking. To this aim, sEMG signals were acquired from eight lower-limb muscles in about 10,000 strides from 23 healthy adults during level ground walking, following an eight-shaped path including natural deceleration, reversing, curve, and acceleration. By means of an extensive evaluation, we show that using a multi-layer perceptron to learn hidden features provides state-of-the-art performance while avoiding feature engineering. Results, indeed, showed an average classification accuracy of 94.9% for learned subjects and 93.4% for unlearned ones, while the mean absolute difference (±*SD*) between predicted phase-transition timing and footswitch data was 21.6 ms and 38.1 ms for heel-strike and toe-off, respectively. The suitable performance achieved by the proposed method suggests that it could be successfully used to automatically classify gait phases and predict the foot–floor-contact signal from sEMG signals during level ground walking.

**Keywords:** sEMG; deep learning; neural networks; gait phase; classification; everyday walking

#### **1. Introduction**

Electromyography is a widely-accepted tool able to provide an essential and original contribution to the characterization of the neuromuscular system [1]. In particular, surface electromyography (sEMG) is acknowledged as a non-invasive approach, specifically suitable for monitoring muscle activity during dynamic tasks, such as walking [2–4]. In order to achieve a spatial/temporal characterization of muscular recruitment during walking, gait events, such as the instants of foot-floor contact and ground clearance, need to be assessed. This process starts from the identification of the two main gait phases, stance and swing. The stance phase designates the entire period during which the foot is on the ground, while the swing phase is the time the foot is in the air for limb advancement. The transition between a swing phase and the subsequent stance phase is commonly referred to as heel-strike (HS), while the transition between a stance phase and the subsequent swing phase is referred to as toe-off (TO). Stance and swing identify the functional subdivisions of total limb activity within the gait cycle [2]; thus, precisely identifying HS and TO events is important for analyzing gait activity. For this reason, sEMG signals are typically coupled with signals able to provide the synchronization of the gait cycle, such as signals from foot-switch sensors [5], pressure mats [6], stereo-photogrammetric systems [7], and inertial measurement units (accelerometers and gyroscopes) [8].

Stereo-photogrammetric systems and pressure mats are affected by several relevant issues: high instrumentation costs, a limited number of observed cycles, and/or an invasive experimental set-up. The use of wearable sensors seems to mitigate the impact of the costs and to allow the identification of gait events in a suitable number of cycles. Even so, the problems of encumbrance and a time-consuming experimental set-up are still relevant, especially for applications in pathology. Moreover, wearable sensors can require particular care for correct placement and specific calibration procedures, which are not consistent with the timing of clinical practice. Thus, the idea of overcoming all these limitations by developing novel techniques able to detect and classify gait events from the sEMG signal alone is indeed starting to catch on. This kind of approach may involve machine learning and deep learning techniques.

#### *1.1. Aim of the Study*

Classifying gait events is a typical task which could be addressed by machine learning and deep learning techniques. Many examples have been reported in the literature [9–12]. However, only a few reports have addressed the issue of gait-phase classification from the sEMG signal only [13–15]. These very recent approaches were based on hand-crafted features extracted from sEMG signals during treadmill walking and were evaluated on a relatively small number of subjects (up to eight). Walking on a treadmill is known to affect gait performance, resulting in an increased number of steps and cadence, a decreased preferred walking speed, stride and stance-phase length, a slightly decreased joint range of motion, and changes in EMG activation with respect to level ground walking [16–19]. Thus, the reliability of the above-mentioned EMG-based classifiers of stance and swing phases [13–15] is limited to treadmill data and has not been tested on ground-walking data. In addition, treadmill walking occurs in very controlled conditions, characterized by a high repeatability of spatial/temporal parameters (including stance and swing duration). In contrast, everyday walking is characterized by a wider variability of spatial/temporal parameters and sEMG signals, introduced by deceleration, reversing, curves and acceleration [16–19]. This variability is expected to affect the performance of a possible stance/swing classification and the consequent prediction of temporal parameters, such as heel-strike and toe-off timing.

Therefore, the aim of the present study is to propose an artificial neural network (ANN)-based approach to classify gait phases (proper stance and swing phases) and to predict the foot-floor-contact signal from sEMG signals, in conditions similar to everyday walking. By "conditions similar to everyday walking" we mean that, differently from previous studies where gait phases were classified in the very controlled conditions of treadmill walking [13–15], in the present study each subject walked on level ground following an eight-shaped path (Figure 1) which includes natural deceleration, reversing, curve, and acceleration.

**Figure 1.** Illustration of the eight-shaped path used in our experiments.

### *1.2. Contributions*

The main contributions of the present study are three:

- a reliable sEMG-based classification of stance and swing phases, together with the prediction of the foot-floor-contact signal and of heel-strike and toe-off timing, in walking conditions similar to everyday walking rather than on a treadmill;
- a pre-processing of the sEMG signal based on linear envelopes, which allows the network to learn relevant hidden features and avoids hand-crafted feature engineering;
- an extensive evaluation on 23 subjects and about 10,000 strides, covering both learned and unlearned subjects and including a comparison with a feature-based approach.

The remainder of this paper is organized as follows. Section 2 reports a brief review of the related works. Section 3 describes the dataset, the acquisition and pre-processing of the signals, and the gait-phase classification by deep learning. Section 4 reports and discusses the results. Section 5 concludes the present study.

#### **2. Related Works**

Machine learning techniques, such as ANNs, are typically used for analyzing and classifying large amounts of data and complex signals. Many applications of machine learning approaches in health care have been reported [20–22]. Similar approaches were proven to be reliable also in classifying EMG signals during different tasks. Wavelet neural networks and multi-layer perceptrons were used to handle EMG signals in order to identify neuromuscular disorders [23,24]. Learning vector quantization, support vector machines, and Levenberg–Marquardt-based networks were applied to EMG signals for classifying hand-motion patterns [25–27]. EMG-based unsupervised competitive learning techniques were employed for the identification of muscle activity during pregnancy [28]. Efforts have been made to adapt these techniques to walking-task characterization: different machine learning approaches were applied to gait analysis data by Joyseeree et al. for disease identification [9]. Kaczmarczyk et al. [11] applied ANNs for gait classification in post-stroke patients. Wang and Zielińska [12] designed an EMG-based method for detecting the variability in gait features depending on footwear, by applying vector quantization classifying networks and clustering competitive networks. Zou et al. [10] performed gait recognition analyzing inertial sensor data by means of deep convolutional neural network and deep recurrent neural network approaches.

In particular, the literature reports only a few attempts to provide a machine learning approach using the sEMG signal only for the classification of stance and swing phases [13–15]. In [15], a set of time-domain features is extracted from EMG-signal segments and hidden Markov models are used to identify stance and swing phases. Evaluation is performed on a single subject walking on a treadmill, reporting a maximum classification accuracy of 91.08%. In [14], a novel bilateral feature is extracted from the EMG signal and used to classify stance and swing phases by training a support vector classifier. The method is evaluated on two subjects walking on a treadmill at different speeds. The best reported accuracy (96%) corresponds to the case where a subset of the gait cycles from a subject is used to classify the entire walk of the same subject. In [13], a set of time-domain features is extracted from EMG signals and fed to a single-hidden-layer neural network to classify gait phases. This study is the most similar to ours, as it explicitly targets unlearned subjects, i.e., subjects whose data were not used during the training phase, and attempts to predict the timing of phase transitions. However, the data used are derived from eight subjects walking on a treadmill, and the evaluation of phase-transition detection is performed on only 5 s of the walking of a single unlearned subject, reporting a mean average error of 35 ± 25 ms for HS and 49 ± 15 ms for TO. All these recent approaches were based on hand-crafted features extracted from sEMG signals during treadmill walking and were evaluated on relatively small populations of subjects (up to eight).

#### **3. Materials and Methods**

#### *3.1. Dataset*

The dataset included signals recorded from 23 healthy adults (12 females and 11 males), acquired in the Movement Analysis Laboratory of Università Politecnica delle Marche, Ancona, Italy. Mean (±SD) characteristics were: age = 23.8 ± 1.9 years; height = 173 ± 10 cm; mass = 63.3 ± 12.4 kg; body mass index (BMI) = 20.8 ± 2.1 kg/m². None of the subjects presented any pathological condition or had undergone orthopedic surgery that might have affected lower-limb mechanics. Therefore, subjects with joint pain, neurological pathologies, orthopedic surgery, abnormal gait, or a BMI higher than 25 (overweight and obese) were not recruited. The research was undertaken in compliance with the ethical principles of the Helsinki Declaration and was approved by an institutional expert committee. Participants signed informed consent prior to the beginning of the test.

#### *3.2. Signal Acquisition*

The multichannel recording system Step 32 (Medical Technology, Italy; version PCI-32 ch2.0.1. DV; resolution: 12 bit; sampling rate: 2 kHz) was used to acquire surface electromyographic (sEMG) and basographic signals (i.e., the signals from footswitches). Each lower limb was instrumented with three foot-switches and four sEMG probes. Foot-switches (surface: 1.21 cm²; activation force: 3 N) were pasted beneath the heel and the first and fifth metatarsal heads of the foot. Single differential sEMG probes with fixed geometry (Ag/Ag-Cl disks; electrode diameter: 0.4 cm; inter-electrode distance: 0.8 cm; gain: 1000; high-pass filter: 10 Hz; input impedance: 1.5 GΩ; CMRR > 126 dB; input-referred noise: 1 µVrms) and with variable geometry (Ag/Ag-Cl disks; minimum inter-electrode distance: 12 mm; gain: 1000; high-pass filter: 10 Hz; input impedance: >1.5 GΩ; CMRR > 126 dB; input-referred noise: 200 nVrms) were placed on the muscle belly to detect the sEMG signals. Skin was shaved, cleansed with abrasive paste, and wet with a damp cloth. Probes were placed over the tibialis anterior, gastrocnemius lateralis, hamstrings, and vastus lateralis, following the recommendations provided by the European concerted action SENIAM (surface EMG for a non-invasive assessment of muscles) for electrode location with respect to tendons, motor points, and fiber orientation [29]. Each volunteer walked barefoot on the floor at her/his own chosen pace for about 5 min, following an eight-shaped path [30], which includes natural deceleration, reversing, curve and acceleration (Figure 1).

### *3.3. Pre-Processing*

Footswitch signals were converted and processed so as to identify the different gait cycles and phases (stance and swing), according to the approach discussed in detail in [31].

Electromyographic signals were processed by a high-pass, linear-phase FIR filter (cut-off frequency: 20 Hz) and by a low-pass, linear-phase FIR filter (cut-off frequency: 450 Hz); linear-phase filters were chosen in order to avoid phase-distortion effects. Then, sEMG signals were full-wave rectified and the envelope was extracted (second-order Butterworth low-pass filter, cut-off frequency: 5 Hz). Figures 2 and 3 show, respectively, the raw EMG signals recorded for the right leg and the envelope obtained as a result of the pre-processing step.
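As an illustration of the rectification-and-smoothing step, the following Python sketch extracts a crude envelope from a raw sEMG trace. It is not the authors' exact pipeline: the paper uses linear-phase FIR band-pass filtering plus a second-order 5 Hz Butterworth low-pass, while here a simple moving average stands in for the Butterworth stage to keep the example dependency-light (NumPy only); the function name and window length are ours.

```python
import numpy as np

FS = 2000  # sampling rate in Hz, as in the acquisition setup

def extract_envelope(raw_emg, smooth_ms=200):
    """Simplified envelope extraction: full-wave rectification followed by
    moving-average smoothing (a stand-in for the Butterworth low-pass)."""
    rectified = np.abs(raw_emg)              # full-wave rectification
    win = int(FS * smooth_ms / 1000)         # smoothing window in samples
    kernel = np.ones(win) / win
    return np.convolve(rectified, kernel, mode="same")

# toy signal: 1 s of noise-like sEMG
rng = np.random.default_rng(0)
emg = rng.normal(size=FS)
env = extract_envelope(emg)
```

Since the envelope is a smoothed absolute value, it is non-negative and tracks the signal's amplitude modulation rather than its oscillations.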

The sEMG and basographic data analyzed in the present study are going to be published in a public dataset, and are currently available for research purposes by contacting the authors.

**Figure 2.** Raw electromyographic (EMG) signals recorded from the four muscles of the right leg. Corresponding heel-strike (HS) and toe-off (TO) timing are highlighted.

**Figure 3.** The envelope resulting from the pre-processing of the raw EMG signals recorded from the four muscles of the right leg. Corresponding HS and TO timing are highlighted.

#### *3.4. Gait Phase Classification*

#### 3.4.1. Data Preparation

Before feeding data into the classifier, a min-max normalization of each muscle signal was performed within each subject, thus mapping the values into the [0, 1] interval. In order to train the classifier, we split the signals into windows of 20 samples (corresponding to 10 ms at the 2 kHz sampling rate) for both stance and swing phases.
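The per-subject min-max normalization can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def minmax_per_subject(signal):
    """Min-max normalization of one muscle's signal within one subject,
    mapping values into the [0, 1] interval."""
    lo, hi = signal.min(), signal.max()
    return (signal - lo) / (hi - lo)

x = np.array([2.0, 4.0, 6.0])
normalized = minmax_per_subject(x)
```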

We then aggregated the synchronized EMG-signal segments corresponding to the eight muscles (four for each leg) into a single vector of 160 elements. Each EMG vector was composed of 20 sequences of eight elements, each sequence representing the EMG-signal values of the eight muscles in a single time sample. Thus, the first eight elements of the vector were the EMG-signal values of the eight muscles in the first sample of the segment, the following eight elements were the EMG-signal values of the eight muscles in the second sample of the segment, and so on up to the twentieth sample. The structure of the input vector is illustrated in Figure 4. *L1i*, *L2i*, *L3i*, and *L4i* are the values of the EMG signals in sample *i* corresponding to the muscles of the left leg, respectively: tibialis anterior, gastrocnemius lateralis, hamstrings, vastus lateralis. *R1i*, *R2i*, *R3i*, and *R4i* represent the corresponding values for the right leg. Each input vector was assigned the label 0 if the corresponding signals belong to a stance phase, and the label 1 if they belong to a swing phase. After removing the segments before the first swing phase, to avoid considering muscle activation recorded in non-walking conditions, we obtained 522,936 labeled segments from the 23 subjects.
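The assembly of the 160-element, sample-major input vectors can be sketched as follows. This is an illustrative reconstruction, not the authors' code; in particular, labeling each window by its first sample is our assumption, since the paper does not state how a window straddling a transition is labeled.

```python
import numpy as np

N_MUSCLES = 8        # four muscles per leg
WIN = 20             # samples per window (10 ms at 2 kHz)

def build_input_vectors(signals, labels_per_sample):
    """Slice synchronized, normalized EMG signals (n_samples x 8 muscles)
    into 160-element vectors, sample-major as in Figure 4: the first 8
    values are the 8 muscles at sample 1, the next 8 at sample 2, etc."""
    n_windows = signals.shape[0] // WIN
    X, y = [], []
    for w in range(n_windows):
        seg = signals[w * WIN:(w + 1) * WIN]   # shape (20, 8)
        X.append(seg.reshape(-1))              # row-major flatten interleaves muscles
        y.append(labels_per_sample[w * WIN])   # 0 = stance, 1 = swing (assumption)
    return np.array(X), np.array(y)

rng = np.random.default_rng(1)
signals = rng.random((200, 8))                 # 100 ms of 8-channel data
X, y = build_input_vectors(signals, np.zeros(200, dtype=int))
```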

**Figure 4.** The structure of EMG vectors fed as input to the artificial neural networks (ANNs).

In our study we attempted to classify gait phases and to detect the timing of phase transitions (HS and TO events). In particular, we targeted previously unseen subjects, i.e., subjects whose gait recordings were not used in the training phase.

Accordingly, we performed a cross-validation using 23 folds, each of which uses data from 22 subjects (LS set) for training and 1 subject (US set) for testing. At each fold, a different subject is used as the test (unseen) subject. In order to measure the phase-classification performance also for learned subjects, we further split the LS set into a training set (LS-train) and a test set (LS-test). More precisely, LS-train includes the first 90% of each subject's strides (approximately 3 min and 30 s, 180 gait cycles) and LS-test the remaining 10% (approximately 30 s, 20 gait cycles).
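The fold structure is a leave-one-subject-out scheme, which can be sketched as follows (function name ours):

```python
def loso_folds(subject_ids):
    """Leave-one-subject-out folds: at each fold one subject forms the
    unlearned (US) test set and the remaining 22 the learned (LS) set,
    whose final 10% of strides is further held out as LS-test."""
    for held_out in subject_ids:
        ls = [s for s in subject_ids if s != held_out]
        yield ls, held_out

folds = list(loso_folds(list(range(1, 24))))   # 23 subjects -> 23 folds
```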

#### 3.4.2. Neural Networks

We experimented with different multi-layer perceptron (MLP) architectures. In Table 1 we summarize the architectures for which we report the results in the following section. The first model (*MLP1*) was a shallow network with a single hidden layer of 128 units (neurons) and a one-dimensional output. The output was fed to a sigmoid function and a 0.5 threshold was used to obtain a binary output: when the output of the sigmoid was >0.5 we assigned the label 1, otherwise we assigned the label 0. We then experimented with deeper networks, composed of 2–5 hidden layers (Table 1). In all the architectures we used rectified linear units (ReLU) to provide non-linearity between hidden layers. As an example, Figure 5 illustrates the structure of the *MLP4* architecture.
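The forward pass just described (ReLU between hidden layers, a one-unit sigmoid output, a 0.5 decision threshold) can be sketched in plain NumPy. This is a structural illustration only: the weights below are random and the layer sizes are an assumed example, since training (SGD with binary cross-entropy, described next) is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """Minimal forward pass mirroring the Table 1 architectures."""
    def __init__(self, sizes):                  # e.g. [160, 128, 1] for MLP1
        self.weights = [rng.normal(scale=0.1, size=(a, b))
                        for a, b in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def predict(self, x):
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(x @ W + b)                 # ReLU between hidden layers
        p = sigmoid(x @ self.weights[-1] + self.biases[-1])
        return int(p[0] > 0.5)                  # 0 = stance, 1 = swing

mlp1 = MLP([160, 128, 1])
label = mlp1.predict(rng.normal(size=160))
```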

In our experiments, we used stochastic gradient descent (SGD) as the optimization algorithm and binary cross-entropy (BCE) as the loss function.

The value 0.1 was experimentally identified as the optimal learning rate for all the tested models and was thus adopted in all the experiments. Finally, all ANN models were trained using an early-stopping technique, according to the following procedure. The networks were trained for a maximum of 100 epochs, stopping when the accuracy on the validation set did not increase for 10 consecutive epochs. The best-performing learned parameters were adopted to evaluate model performance over the LS-test and US sets, using the basographic signal as ground truth.
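The early-stopping rule can be sketched as follows, here driven by a precomputed list of validation accuracies rather than an actual training loop (function name and toy curve are ours):

```python
def train_with_early_stopping(epoch_accuracies, patience=10, max_epochs=100):
    """Train up to `max_epochs`, stop when validation accuracy has not
    improved for `patience` consecutive epochs; return the epoch whose
    parameters would be kept (the best one) and its accuracy."""
    best_acc, best_epoch, stale = -1.0, 0, 0
    for epoch, acc in enumerate(epoch_accuracies[:max_epochs]):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break                            # early stop
    return best_epoch, best_acc

# toy validation curve: improves for four epochs, then plateaus
curve = [0.80, 0.85, 0.90, 0.91] + [0.90] * 20
best_epoch, best_acc = train_with_early_stopping(curve)
```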

**Table 1.** Overview of the multi-layer perceptron (MLP) architectures.


**Figure 5.** The architecture of *MLP4*.

### *3.5. Gait Events Timing Detection*

The predicted foot–floor-contact signal was reconstructed by chronologically arranging the binary output of the network. Thus, a vector was obtained, composed of sequences of 0 (stance phase) alternating with sequences of 1 (swing phase). This vector was chronologically scanned in order to detect the transitions between gait phases: from swing to stance phase (HS) and from stance to swing phase (TO). HS was identified as the sample where the transition from 1 to 0 occurred. TO was identified as the sample where the transition from 0 to 1 occurred.
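The chronological scan for phase transitions can be sketched as follows (function name ours):

```python
def detect_transitions(phase_vector):
    """Scan the chronological binary output (0 = stance, 1 = swing) and
    return the sample indices of heel-strikes (1 -> 0 transitions) and
    toe-offs (0 -> 1 transitions)."""
    hs, to = [], []
    for i in range(1, len(phase_vector)):
        prev, cur = phase_vector[i - 1], phase_vector[i]
        if prev == 1 and cur == 0:
            hs.append(i)        # heel-strike: swing -> stance
        elif prev == 0 and cur == 1:
            to.append(i)        # toe-off: stance -> swing
    return hs, to

hs, to = detect_transitions([0, 0, 1, 1, 1, 0, 0, 1])
```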

Then, the predicted signal was cleaned by removing those phases that were too short according to physiological constraints, as they were probably due to classification errors. We adopted the following procedure. Starting from the first HS, the following 500 samples (250 ms) were scanned to find and remove those having a value of 1. Then, the following HS was identified, the process was repeated, and so on. In the same way, starting from the first TO, the following 500 samples were scanned to find and remove the samples which assumed the value of 0. Then, the following TO was identified, the process was repeated, and so on. Eventually, the cleaned vector was chronologically scanned again in order to detect the transitions between gait phases (from 0 to 1 and from 1 to 0) and thus the timing of the gait events.
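One possible reading of this cleaning procedure is sketched below: after each detected heel-strike the next 500 samples are forced to stance (0), and after each detected toe-off the next 500 samples are forced to swing (1). This is our interpretation of the text, compressed into a single pass; the demo uses a tiny guard window so the effect is visible.

```python
GUARD = 500  # samples (250 ms at 2 kHz)

def clean_prediction(phase_vector, guard=GUARD):
    """Remove physiologically implausible short phases from the predicted
    foot-floor-contact vector (0 = stance, 1 = swing)."""
    v = list(phase_vector)
    i = 1
    while i < len(v):
        if v[i - 1] == 1 and v[i] == 0:          # heel-strike found
            for j in range(i, min(i + guard, len(v))):
                v[j] = 0                          # suppress spurious swing samples
            i += guard                            # resume scan after the guard window
        elif v[i - 1] == 0 and v[i] == 1:        # toe-off found
            for j in range(i, min(i + guard, len(v))):
                v[j] = 1                          # suppress spurious stance samples
            i += guard
        else:
            i += 1
    return v

# tiny demo: a 2-sample swing burst after the toe-off is too short and removed
cleaned = clean_prediction([1, 0, 0, 1, 1, 0, 0], guard=3)
```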

#### *3.6. Evaluation Measures*

In this work, an EMG-segment classifier is ultimately used to predict a basographic signal, that is, to predict the precise timing of gait-phase transitions. We first evaluate the performance of the classifier in assigning the correct label (0 for stance and 1 for swing) to single EMG segments. This is done using standard classification metrics: accuracy, precision, recall, and the *F*1 score, computed as the harmonic mean of precision and recall. However, these measures do not provide enough information to evaluate the quality of the basographic-signal prediction. In fact, even a high accuracy, if errors are concentrated in proximity of the transitions, may lead to unsatisfactory results in terms of the time error of the transition instants. Furthermore, since we apply a post-processing to the classifier output to remove false predictions and improve performance, we need to explicitly evaluate the predicted basographic signal.

For that purpose, we adopt the following procedure, used in the literature to evaluate gait-event prediction, e.g., in [32,33]. We first choose a temporal tolerance *T*, which we set to 600 milliseconds. Then we consider as a true positive each predicted TO or HS event at time *tp* if an event of the same type exists in the ground-truth signal at time *tg* such that |*tg* − *tp*| < *T*. Otherwise, we consider the predicted event a false positive. We then measure the precision, recall, and *F*1 score and, for all the true positives, we calculate the mean average error (MAE) as the average time distance between the predicted event and the ground-truth event of the same type.
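This matching procedure can be sketched as follows. Matching each predicted event greedily to its nearest ground-truth event of the same type is our assumption, since the paper does not specify the matching rule; times here are in samples.

```python
def match_events(predicted, ground_truth, tolerance):
    """Match each predicted event time to the nearest ground-truth event of
    the same type within `tolerance`; return the true-positive absolute
    errors, the false-positive count, and the MAE over true positives."""
    tp_errors, fp = [], 0
    for tp_time in predicted:
        diffs = [abs(tg - tp_time) for tg in ground_truth]
        if diffs and min(diffs) < tolerance:
            tp_errors.append(min(diffs))    # true positive within tolerance
        else:
            fp += 1                         # no ground-truth event close enough
    mae = sum(tp_errors) / len(tp_errors) if tp_errors else None
    return tp_errors, fp, mae

errors, fp, mae = match_events([100, 510, 2000], [105, 500, 900], tolerance=600)
```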

We adopted this evaluation strategy to measure the performance of our approach and to compare it with a feature-based one.

#### **4. Results and Discussion**

To the best of our knowledge, this study is the first attempt to provide a reliable binary classification of level ground walking into stance and swing phases, by means of the application of deep learning techniques to sEMG signal. Starting from gait-phase classification, the study achieves also a prediction of foot-floor-contact signal and a consequent identification of heel-strike and toe-off timing.

#### *4.1. Gait-Phase Classification*

Mean classification accuracy (±SD) obtained over the 23 folds with different MLP architectures for both learned subjects (LS-test set) and unlearned ones (US set) is shown in Table 2, where the best results are in bold. The same convention is used in all the tables.

**Table 2.** Gait phase classification accuracy (± standard deviation (SD)) averaged over the 23 folds for each considered network.


As expected, classification accuracy for learned subjects was higher than for unseen ones. However, the limited gap (around 1–2 percent) suggests that all the networks succeed in learning signal patterns that generalize well to unseen subjects. The best accuracy on unlearned subjects was obtained with *MLP3*. By looking at the standard deviations reported in Table 2, one can notice that, while for learned subjects the accuracy was uniform across the different folds, there was a higher variability when considering unseen subjects. This is partially due to the fact that the LS-test set in each fold contained 22 learned subjects, while the US set was composed of a single subject. Such a variability also suggests that walking patterns might be very different from subject to subject, especially in everyday walking conditions, making the classification harder if a subject has never been seen before. However, looking at the results on the US sets over the 23 folds, as shown in Figure 6, the accuracy does not fall below 87.6% (subject 13) and reaches the highest value of 97.3% (subject 20). Such results are, in our opinion, promising and suggest that the variability could be reduced, and mean accuracy increased, by considering a larger number of subjects to learn from. Tables 3 and 4 report the precision, recall, and *F*1 score of the classification of stance and swing phases for, respectively, unlearned (US set) and learned (LS-test set) subjects. Results are averaged over the 23 folds. It is worth noticing that, in line with what is reported in the literature [2], the segments belonging to a stance phase were more frequent (around 60%) than those belonging to a swing phase. This is because in normal walking the stance phase lasts, on average, 60% of the gait cycle (while the swing phase lasts the remaining 40%). Despite this, the results obtained for swing-labeled segments are better than those obtained for stance-labeled ones, both in unlearned and learned data.


**Table 3.** Gait phase classification performances for stance and swing phases in unlearned subjects (set US). Precision, recall and *F*<sup>1</sup> scores are averaged over the 23 folds.

**Table 4.** Gait phase classification performances for stance and swing phases in learned subjects (set LS-test). Precision, recall and *F*<sup>1</sup> scores are averaged over the 23 folds.


**Figure 6.** Classification accuracy over the 23 folds for the unlearned subjects (US set).

#### *4.2. Comparison with Feature-Based Approach*

Classification of EMG signals from lower-limb muscles is usually based on time/frequency-domain feature extraction [34]. In order to provide a comparison with a feature-based method, we implemented a classifier following the approach described in [13]. We used a window size of 200 samples and, for each sEMG signal, we calculated the following features: standard deviation (SD), root mean square (RMS), mean absolute value (MAV), integrated EMG (IEMG), and waveform length (WL). They correspond to the group-2 features used in [13], which provide the best classification accuracy. We then concatenated the features, obtaining a 40-element input vector (five features for each of the eight muscles). This was fed into a multi-layer perceptron to train a gait-phase classifier. A single hidden layer with 10 units is used in [13], where the input vector length was 10. We ran classification experiments over the 23 folds, training a single-layer network with 10 units, a single-layer network with 40 units (corresponding to the size of our input vector), and all the networks in Table 1. The best average classification accuracy of 87.69 ± 5.9% for unlearned subjects was achieved with *MLP3*, while the accuracy for LS was 88.03 ± 2.7%; thus we used *MLP3* for predicting HS and TO timing, adopting the same procedure applied to our approach and described in Section 3.5. In the following sections we refer to this feature-based method as *FB*.
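The five time-domain features of the FB baseline have standard definitions and can be computed as follows (NumPy sketch; function names are ours):

```python
import numpy as np

def fb_features(segment):
    """Compute SD, RMS, MAV, IEMG, and WL over one 200-sample sEMG window."""
    sd = np.std(segment)                       # standard deviation
    rms = np.sqrt(np.mean(segment ** 2))       # root mean square
    mav = np.mean(np.abs(segment))             # mean absolute value
    iemg = np.sum(np.abs(segment))             # integrated EMG
    wl = np.sum(np.abs(np.diff(segment)))      # waveform length
    return np.array([sd, rms, mav, iemg, wl])

def fb_input_vector(window_8ch):
    """Concatenate the 5 features of each of the 8 muscles -> 40 elements."""
    return np.concatenate([fb_features(window_8ch[:, m])
                           for m in range(window_8ch.shape[1])])

window = np.zeros((200, 8))                    # one 100 ms window, 8 channels
vec = fb_input_vector(window)
```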

#### *4.3. Gait Events Detection*

The analysis of the results identified *MLP3* as the best model for the classification of unlearned data, obtaining an accuracy of 93.41% (Table 2) and an *F*1 score of 92.46% for stance phases and of 94.11% for swing phases (Tables 3 and 4). Thus, *MLP3* was adopted to predict the foot-floor-contact signal and identify HS and TO timing in unlearned subjects, following the procedure described in Section 3.5. The prediction was tested by comparing the HS and TO timing provided by the present approach vs. the heel-strike and toe-off timing measured from the basographic signal. Examples of the prediction of the foot–floor-contact signal provided by the present approach in US subjects (blue line) vs. the ground truth (red line) are depicted in Figure 7. As one can see, our method provides good predictions also in the presence of an irregular walking activity, i.e., non-uniform gait-phase durations, due to the everyday walking conditions addressed in this study.

**Figure 7.** Examples of TO and HS predictions on six random subjects.

Purely data-driven approaches for gait-phase identification have been shown to achieve lower performance in less controlled scenarios, such as conditions similar to everyday walking [32]. Adapting to the larger variability of sEMG signals collected from dynamic environments and activities is more challenging for the classifiers. This could be due to the reported differences [16–19] between treadmill and straight ground walking (i.e., increased number of steps and cadence, decreased preferred walking speed, stride and stance-phase length, and changes in sEMG activation) and to the further variability of spatial and temporal parameters and sEMG signals introduced by the deceleration, reversing, curves and acceleration typical of everyday walking [16]. Despite this, the accuracy of the *MLP3* classification is good for both LS-test subjects and US subjects (93.41%). The high classification accuracy and the effective post-processing of the model output allowed us to achieve an average absolute prediction error over unlearned subjects of 21.42 ± 7.0 ms for HS and 38.1 ± 15.2 ms for TO (Table 5). Both HS and TO predictions may be considered reasonably suitable, since they show an average absolute error <1% and <2% of a gait-cycle duration, respectively. The present approach shows better performance in detecting heel-strikes than toe-offs. This result matches previous EMG-based and accelerometer-based reports [13,32].


**Table 5.** Performance of toe-off (TO) and heel-strike (HS) detection for unlearned subjects over the 23 folds.

To test the reliability of the HS and TO prediction, the results provided by our approach were compared with the results achieved by the feature-based (FB) approach. The comparison is reported in Table 5. Our approach outperformed the FB one, suggesting that the neural network succeeded in learning latent features that are better suited to the task at hand than the ones used in previous studies [13]. Besides the above-reported direct comparison with the FB approach in the same population, an idea of the quality of the present results can also be given by analyzing the results reported in [13] for their own population, during treadmill walking. The classification accuracy reported in [13] was 87.5% for learned subjects and 77% for unlearned ones. The associated average prediction error computed in unlearned subjects was 35 ± 25 ms for HS and 49 ± 15 ms for TO. Compared with those performances, the results provided by the present study are encouraging, despite the challenging conditions of everyday walking. Besides the different approach (different neural network and different processing of the input signal), the elevated classification-accuracy values and the satisfactory HS/TO prediction achieved here are likely also due to the characteristics of the experimental set-up: a high number of strides acquired per subject (about 500), associated with quite a large number of subjects (23), and four different muscles per leg for every subject involved in the training process. Moreover, foot-switches [31,35] represent the gold standard in gait segmentation, since each gait phase can be associated with a specific value of the sensor output [36]. Finally, in the present study, data from three foot-switches were considered, in line with what is reported in the literature [31]. Using three foot-switches, instead of the two used in [13], probably improved the reliability of the basographic signals as ground truth, as well as the reliability of the performance evaluation.

#### **5. Conclusions**

The present study proposed a suitable approach for classifying stance vs. swing phases and predicting the occurrence of the transitions between phases, i.e., heel-strike and toe-off. The main contribution of the study is to provide reliable performance in gait-phase classification and gait-event prediction in natural walking conditions (similar to everyday walking), overcoming the constraints associated with the controlled environments (such as treadmill walking) used by previous studies to address this kind of task. A further methodological contribution consists of proposing a different pre-processing of the sEMG signal in order to better train the neural networks; the direct use of linear envelopes proposed here allows the network to automatically learn relevant higher-level (hidden) features, avoiding hand-crafted ad hoc features and contributing to improved performance.

From the clinical point of view, a relevant contribution is the fact that the present approach is based on sEMG signals only. This could considerably help reduce the number of sensors necessary for a complete gait protocol, limiting clinical encumbrance, time consumption, and cost. Thus, one of the main application domains of the present methodology should be the field of neuromuscular diseases, such as spastic cerebral palsy, where the acquisition and analysis of sEMG signals are essential and special care must be exercised in the treatment of patients.

Moreover, the present classifier may also support the process of gait phase detection in EMG-driven assistive devices [37], such as hip–knee, ankle–foot, and knee–ankle–foot orthoses and exoskeletons, as the identification of gait events is a continuing concern in the use of these devices. Evaluating the performance of the classifier after reducing the complexity of experimental protocol (i.e., the number of monitored muscles) could also be valuable. At the time of writing, all these applications are beyond the goals of the present work. However, future efforts will point in that direction.

**Author Contributions:** conceptualization, F.D.N., C.M., S.F. and A.C.; methodology, F.D.N. and C.M.; software, C.M.; investigation, C.M. and F.D.N.; validation, C.M.; resources, F.D.N. and S.F.; data curation, F.D.N. and C.M.; writing—original draft preparation, F.D.N. and C.M.; writing—review and editing, A.C. and S.F.; visualization, F.D.N. and C.M.; supervision, A.C. and S.F.

**Funding:** This research received no external funding.

**Acknowledgments:** Thanks to Guido Mascia and Lorenzo Principi for the support provided in preparing the experiments.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Ensemble Learning Approach Based on Diffusion Tensor Imaging Measures for Alzheimer's Disease Classification**

**Eufemia Lella 1, Andrea Pazienza 1,\*, Domenico Lofù 1,2, Roberto Anglani 1,† and Felice Vitulano 1,†**


**Abstract:** Recent advances in neuroimaging techniques, such as diffusion tensor imaging (DTI), represent a crucial resource for structural brain analysis and allow the identification of alterations related to severe neurodegenerative disorders, such as Alzheimer's disease (AD). At the same time, machine-learning-based computational tools for early diagnosis and decision support systems are adopted to uncover hidden patterns in data for phenotype stratification and to identify pathological scenarios. In this landscape, ensemble learning approaches, conceived to simulate human behavior in making decisions, are suitable methods in healthcare prediction tasks, generally improving classification performances. In this work, we propose a novel technique for the automatic discrimination between healthy controls and AD patients, using DTI measures as predicting features and a soft-voting ensemble approach for the classification. We show that this approach, efficiently combining single classifiers trained on specific groups of features, is able to improve classification performances with respect to the comprehensive approach of the concatenation of global features (with an increase of up to 9% on average) and the use of individual groups of features (with a notable enhancement in sensitivity of up to 11%). Ultimately, the feature selection phase in similar classification tasks can take advantage of this kind of strategy, allowing one to exploit the information content of data and at the same time reducing the dimensionality of the feature space, and in turn the computational effort.

**Keywords:** diffusion tensor imaging; ensemble learning; decision support systems; healthcare; machine learning; computational intelligence; Alzheimer's disease

### **1. Introduction**

Alzheimer's disease (AD) is the most common type of neurodegenerative disorder causing dementia, generally characterized by loss of memory and a progressive decline of cognitive functions. AD affects millions of people worldwide, and according to the World Alzheimer Report 2015 [1], the number of people affected by dementia will reach 131.5 million in 2050. The in vivo diagnosis of AD is still a hard task because of the diversity of symptoms manifested by patients. In this context, a very challenging goal is the development of innovative computational-intelligence-based diagnostic tools that can support physicians and specialists in the early identification of the pathology and in therapeutic plan decisions. Advances in neuroimaging techniques have been fundamental for structural and functional brain analysis, allowing the identification of AD-related brain alterations [2–4]. Due to the difficulty of integrating data on a large scale, machine learning (ML) methods allowing patient classification driven by large amounts of data have gained increasing interest in recent years in the field of digital healthcare [5,6]. ML algorithms are a collection of computational and statistical models that can learn through experience and make predictions based on new data [7]. Machine learning approaches are able to uncover patterns in the data for differentiating diagnostic groups and identifying pathological scenarios [8,9].

**Citation:** Lella, E.; Pazienza, A.; Lofù, D.; Anglani, R.; Vitulano, F. An Ensemble Learning Approach Based on Diffusion Tensor Imaging Measures for Alzheimer's Disease Classification. *Electronics* **2021**, *10*, 249. https://doi.org/10.3390/ electronics10030249

Academic Editor: Heung-Il Suk Received: 30 December 2020 Accepted: 19 January 2021 Published: 22 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Several recent studies have analyzed the potential of applying ML-based analytical frameworks to MRI data for the characterization and the automatic diagnosis of AD [10–13]. Indeed, the biological hypothesis that the cognitive decline due to AD is related to a connectivity disruption between brain regions caused by white matter (WM) degeneration has been widely investigated in the literature [14,15]. In this context, diffusion tensor imaging (DTI) has emerged in the last fifteen years as a promising technique that measures the diffusion of water along WM fibers, providing information on their integrity [16]. The trajectory and the integrity of the main WM fiber bundles in the brain can be evaluated by tracing the highly anisotropic diffusion of water along axons [17]. Since DTI is a neuroimaging technique capable of characterizing white matter fiber trajectories and of highlighting microscopic WM lesions in these bundles, it can be exploited to uncover signs of connectivity impairment not detectable by means of standard anatomical MRI. Among the different measures that can be calculated from the diffusion tensor [17], fractional anisotropy (FA) and mean diffusivity (MD) have played major roles as AD biomarkers [18]. As a matter of fact, in a healthy axon water diffusion is highly anisotropic, because it is almost completely bound in one direction; consequently, large values of FA paired with small MD values usually describe non-pathological scenarios. From this perspective, DTI allows the investigation of microstructural disease-related changes complementary to the information on brain atrophy highlighted by anatomical MRI.

Recent applications of DTI techniques, together with ML algorithms for the classification of AD, use three possible methods for feature extraction: region of interest (ROI)-based, voxel-based and tractography-based approaches. In a ROI-based approach, the brain is parceled into regions of interest, and the mean of the DTI measures is then calculated for each ROI. The DTI scalar indexes averaged over each ROI are then used as features for feeding ML algorithms to classify AD subjects also at early stages of the disease and for investigating WM integrity alterations [19,20]. Several studies based on this approach have been conducted with multimodal analysis [21]. In tractography-based approaches, DTI fiber tracking algorithms together with a parcelation scheme are used to model the brain as a network and to study its connectivity through graph theory. Network measures turned out to be effective variables to characterize the connectivity alterations due to AD [22–24], and valid features from which to build classification models [25–27]. In voxel-based approaches, starting from fractional anisotropy maps and using the tract-based spatial statistics, a white matter "skeleton" is obtained, containing WM tracts common to all subjects. The diffusion maps of each subject are projected onto the average fractional anisotropy skeleton; hence, all diffusivity measures of the voxels belonging to that skeleton can be exploited for feeding classification algorithms and for performing voxel-wise statistical analyses aimed at localizing brain changes related to the onset and development of the pathology.

Machine learning methods for the identification of AD phenotypes are typically based on individual classifiers [28–30] or ensembles of different classifiers trained on the same set of features [25,31]. Ensemble learning is a ML approach—generally improving classification performances [32,33]—that integrates multiple classifiers fed with the same group of features or with several vectors of variables describing different representations of the same physical phenomenon [34]. Ensemble learning was conceived to simulate human behavior in making decisions, and for this reason it can be a suitable approach in the medical diagnosis context, where humans usually ask the opinions of various doctors to increase the reliability of a diagnosis.

In this paper, we propose a novel classification framework based on ensemble learning for the automatic discrimination between healthy controls (HC) and AD cases, relying on DTI measures as predicting variables. This kind of ensemble method is able to conveniently exploit the informative content of the individual maps, associated with specific aspects of microstructural fiber integrity, and to enhance the generalization ability, taking into account the peculiarities of the different classifiers related to each set of features. Moreover, this methodology is aimed at enhancing computational efficiency, focusing in particular on combinations of single groups of variables instead of the usual approach of global feature concatenation. The paper is organized as follows. Section 2 introduces the diffusion tensor imaging (DTI) techniques able to investigate white matter fiber integrity through the measurement of the anisotropy of WM tracts and of water diffusion along them. In Section 3, after a brief description of the feature extraction procedures and classification models adopted in the present work, a learning experiment is detailed. Finally, Section 4 reports the results of the experiment and Section 5 discusses the main findings together with future research directions.

#### **2. Diffusion Tensor Imaging**

Diffusion, also known as Brownian motion, is the process of random, constant microscopic molecular motion caused by heat. In an anisotropic medium, such as WM, diffusion is characterized by a tensor, called the effective diffusion tensor *D*eff, which fully describes the molecular mobility along the three spatial directions and the correlations between these directions. In the framework of MRI-based neuroimaging, diffusion tensor imaging (DTI) is a technique which evaluates the location, orientation and anisotropy of the brain's WM tracts, providing an estimate of the diffusion tensor for each voxel of the 3D image.

From a geometric point of view, the diffusion tensor completely characterizes the shape of an ellipsoid by means of six variables describing the diffusion coefficient of water molecules at a specific time in each direction. In the case of isotropic diffusion, the diffusion coefficient is equal in every direction and the ellipsoid turns into a sphere. In the case of anisotropic diffusion, instead, the ellipsoid is elongated, with its longest axis aligned with the direction of greatest mean diffusion. The tensor matrix is symmetric according to a property, called "conjugate symmetry", describing the antipodal symmetry of Brownian motion. The diagonal terms of the diffusion tensor quantify the intensity of diffusivity in each of three orthogonal directions. The off-diagonal terms (vanishing in the case of isotropy) indicate the magnitude of diffusion along one direction arising from a concentration gradient in an orthogonal direction.

Therefore, diffusion data are crucial in order to gain information on tissue microstructure and architecture for each voxel [16,17]. In particular, the three eigenvectors and the eigenvalues *λ*1, *λ*2 and *λ*3 of *D*eff describe the directions and lengths of the three diffusion ellipsoid axes, respectively, in descending order of magnitude. The largest (primary) eigenvector and the related eigenvalue *λ*1 represent the direction and magnitude of greatest water diffusion, respectively. The primary eigenvector provides an important contribution to fiber tractography algorithms, since it indicates the orientation of axonal fiber bundles. Eigenvalue *λ*1, called "longitudinal diffusivity" (LD), indicates the diffusion rate along the fibers' orientation. Eigenvalues *λ*2 and *λ*3, associated with the second and third eigenvectors orthogonal to the primary one, represent the magnitude of diffusion in the plane transverse to the axonal bundles. The mean value,

$$RD = \frac{\lambda_2 + \lambda_3}{2},\tag{1}$$

is called "radial diffusivity" (RD). The mean diffusivity (MD) indicates the mean displacement of molecules (average ellipsoid size) and describes the directionally averaged diffusivity of water within a voxel. It is defined as the mean of the three eigenvalues:

$$MD = \frac{\lambda_1 + \lambda_2 + \lambda_3}{3} \tag{2}$$

The fractional anisotropy (FA) measures the degree of directionality of intravoxel diffusivity, i.e., the fraction of the diffusion that is anisotropic:

$$FA = \sqrt{\frac{1}{2} \frac{\left(\lambda_1 - \lambda_2\right)^2 + \left(\lambda_2 - \lambda_3\right)^2 + \left(\lambda_3 - \lambda_1\right)^2}{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}}\tag{3}$$

This measure essentially represents the distance of the tensor's ellipsoidal shape from a perfect sphere. Values of the fractional anisotropy range from 0, meaning isotropic diffusion, to 1, in the case of a linear diffusion occurring only along the primary eigenvector. When *λ*1 ≫ *λ*2, *λ*3, the fractional anisotropy measure is close to 1, indicating a preferred direction of diffusion.
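Equations (1)–(3) are straightforward to evaluate from the three eigenvalues; the following numpy sketch (the function name `dti_measures` is ours, not from the paper) computes all four scalar measures:

```python
import numpy as np

def dti_measures(eigvals):
    """Scalar DTI measures from the three diffusion-tensor eigenvalues."""
    l1, l2, l3 = sorted(eigvals, reverse=True)   # descending order of magnitude
    ld = l1                                      # longitudinal diffusivity
    rd = (l2 + l3) / 2.0                         # radial diffusivity, Eq. (1)
    md = (l1 + l2 + l3) / 3.0                    # mean diffusivity, Eq. (2)
    fa = np.sqrt(0.5 * ((l1 - l2)**2 + (l2 - l3)**2 + (l3 - l1)**2)
                 / (l1**2 + l2**2 + l3**2))      # fractional anisotropy, Eq. (3)
    return fa, md, rd, ld

# A perfectly isotropic voxel gives FA = 0; a purely linear one gives FA = 1.
fa_iso, *_ = dti_measures([1.0, 1.0, 1.0])
fa_lin, *_ = dti_measures([1.0, 0.0, 0.0])
```

The two limiting cases match the text: equal eigenvalues (a sphere) yield zero anisotropy, while diffusion along a single axis yields the maximum value of 1.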

#### **3. Materials and Methods**

#### *3.1. Data Collection*

Real-world data have been gathered from the Alzheimer's Disease Neuroimaging Initiative (ADNI) which has the primary goal of testing whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD) (for up-to-date information, see www.adni-info.org) [35].

The dataset is made of diffusion-weighted scans from a cohort of 92 subjects of both genders, with ages ranging from 55 to 90, from the ADNI-GO and ADNI-2 phases. According to their diagnoses, the subjects were grouped into 49 HC and 43 AD patients. Pre-processed FA, MD, RD and LD maps, available in the ADNI databases, were randomly selected from baseline and follow-up study visits. It is worth mentioning that the healthy subjects did not report symptoms of mild cognitive impairment, dementia, or depression; the subjects with AD were those who met the NINCDS/ADRDA criteria for probable AD. The acquisition of diffusion-weighted scans was carried out with a 3-T GE Medical Systems scanner. In particular, 46 distinct images were collected for each subject, consisting of 41 diffusion-weighted images (*b* = 1000 s/mm<sup>2</sup>) and 5 scans with negligible diffusion effects (*b*0 images).

#### *3.2. Image Processing and Feature Extraction*

The first step of the image processing is a double registration step. It consists of aligning the maps of all subjects so that the same microstructural areas of the anatomical regions correspond to the same voxels in the images. Then the maps are transformed into an existing standard space template image (in this case the MNI152 standard space [36] is used). After the registration, the voxels belonging to the white matter main fiber tracts are extracted from each map.

Following the acquisition of general diffusivity maps (including FA, MD, RD and LD), for each subject, all image processing steps were performed with FMRIB Software Library (*FSL*) [37], and in particular its diffusion toolkit *FDT*. In order to carefully align FA, MD, RD and LD maps to a group-wise space and to focus the analysis only on voxels that belong to the WM fiber bundles, a tract-based spatial statistics (*TBSS*) [38] standard procedure, included in *FSL*, was performed according to the following steps:


4. Projection of all FA maps onto the mean FA skeleton: this allowed us to achieve an alignment among all subjects in the direction orthogonal to the fiber bundle orientation. The same elaboration steps were applied to RD, MD and LD maps.

**Figure 1.** Coronal, sagittal and axial views of the brain obtained in the FSLeyes image viewer: the mean fractional anisotropy (FA) skeleton (green) is overlaid with the mean FA map. For the following analysis all maps were projected onto the white matter skeleton.

The *TBSS* procedure generates, for each subject and for each diffusivity metric (FA, MD, RD, LD), approximately 9 × 10<sup>4</sup> voxels, belonging to the WM skeleton and representing the features of our classification task.

#### *3.3. Classification Methods*

Supervised learning methods are statistical learning techniques aimed at the classification of instances based on labeled training data. In the present paper, in order to build the ensemble approach, we investigate the ML algorithms most commonly used for medical classification tasks: support vector machine, random forest and multi-layer perceptron.

Support vector machine (SVM) [39] is a supervised learning algorithm based on the concept of an optimal hyper-plane separating observations belonging to two different classes. In the case of a linear classification problem, given *n* data points belonging to two linearly separable sets in a *p*-dimensional space, the task is to find a (*p* − 1)-dimensional hyper-plane that separates the two classes with the largest margin, i.e., the largest distance to the boundary from the closest points in each set. When the data are not linearly separable, a possible solution is to map the original data onto a higher-dimensional feature space in order to favor a more effective separation. Support vector classifiers are then generalizations of the linear classifier approach to an "augmented" feature space of significantly higher dimensionality (see left panel in Figure 2). Assuming that the transformed feature vectors are given by the function *h*(**x**), the optimization problem can be conveniently recast as a quadratic programming problem using Lagrange multipliers, in which the transformed vectors *h*(**x**) are involved only in the form of scalar products. Thanks to this trick, it is not necessary to know the transformation itself, but only the kernel function *K*(**x**, **x**′) = ⟨*h*(**x**), *h*(**x**′)⟩. Consequently, the configuration of a SVM classifier is completely characterized by the regularization parameter *C* and the choice of kernel function. In the present work, for the hyper-parameter tuning phase, the chosen kernel functions are: (1) *d*-degree polynomial: *K*(**x**, **x**′) = (1 + ⟨**x**, **x**′⟩)<sup>*d*</sup>; (2) radial basis function (RBF): *K*(**x**, **x**′) = exp(−*γ*‖**x** − **x**′‖<sup>2</sup>), where the values of the parameters *d* and *γ* span specific ranges.
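As an illustration of this tuning phase, the sketch below runs a cross-validated grid search over polynomial and RBF kernels with scikit-learn. The synthetic data stand in for one DTI feature group, and the parameter ranges are illustrative assumptions, not the paper's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for one DTI feature group (real vectors have ~9e4 voxels).
X, y = make_classification(n_samples=92, n_features=50, random_state=0)

# Grid over the two kernels named in the text; the ranges for C, degree (d)
# and gamma below are assumed for illustration.
param_grid = {
    "kernel": ["poly", "rbf"],
    "degree": [2, 3],          # only used by the polynomial kernel
    "gamma": ["scale", 0.01],
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
search.fit(X, y)
```

Here `probability=True` makes the fitted model expose class-probability scores, which a soft-voting ensemble can consume later.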

Random forest (RF) is a supervised learning algorithm based on the construction of a collection of decision trees, known to be one of the best classifiers in terms of prediction accuracy and efficiency for high-dimensional datasets [40,41]. RF models operate by constructing a multitude of decision trees in the training phase and returning as a prediction the class predicted most frequently by the trees composing the forest, with the aim of reducing the variance of the final result. The RF training algorithm applies the general technique of bootstrap aggregating to the trees under training. Let (*X*, *Y*) be the pair of training set *X* and target vector *Y*, where *X* = {**x**1, ... , **x***n*} and *Y* = {*y*1, ... , *yn*}. The strategy applies repeated (*B* times) extraction with replacement of a random sample from *X* and a fit of the trees to these samples. In particular, for *b* = 1, ... , *B*, the procedure is the following: (1) Random sampling with replacement of *n* observations from the training set *X*, obtaining the subsets (*Xb*, *Yb*). Generally, for a classification problem with *p* features, a random subset of features of cardinality of order √*p* is considered at each split, in order to reduce the correlation between the trees originated by bagging. (2) Training of the *b*-th tree *fb* on (*Xb*, *Yb*). (3) The out-of-sample prediction on an unseen data point *x*∗ is the response outcome obtained from the majority of the results generated by the single trees. The number of trees *B* in a forest is the free parameter of the model, usually set at an order of magnitude of at least 10<sup>2</sup>.
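A minimal scikit-learn counterpart of this procedure, again on synthetic stand-in data, might look as follows; the value *B* = 200 is an assumption consistent with the stated order of magnitude of at least 10<sup>2</sup>:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for one DTI feature group.
X, y = make_classification(n_samples=92, n_features=50, random_state=0)

# bootstrap=True gives the bagged samples (X_b, Y_b); max_features="sqrt"
# implements the sqrt(p) rule for split candidates described above.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
pred = rf.predict(X[:5])   # majority vote over the 200 trees
```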

**Figure 2.** (**a**) Cartoon picture of a SVM classifier with nonlinear kernel: dots of different colors represent instances of two different classes; dotted lines represent the decision boundaries. (**b**) Example of a multi-layer perceptron with two hidden layers.

Multi-layer perceptron (MLP) [42] is a supervised learning algorithm using a feedforward neural network. An MLP is composed of an input layer, one or more hidden layers of threshold logic units (TLUs) and an output layer. Each hidden layer is fully connected with the next one, and each TLU computes a weighted sum of its inputs and then applies an activation function to provide a result that is used as input for the next layer (see right panel in Figure 2). The activation function is in general nonlinear and is selected to be *C*<sup>1</sup>-differentiable. The learning process is based on the back-propagation algorithm, which can be summarized as follows [43]: for each training instance, the algorithm generates a prediction and measures the error; each layer is then analyzed in reverse in order to evaluate the contribution of each connection to the error, and the edge weights are tuned in order to improve the performance. In this study, the hyper-parameter tuning phase of the MLP is driven by the choice of the activation function and the number of hidden layers. The classification algorithms and performance metrics analyzed refer to the Python scikit-learn library [44].
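A sketch of the MLP tuning phase just described, using scikit-learn's `MLPClassifier` on synthetic data; the grid spans the two hyper-parameters named in the text (activation function and number of hidden layers), but the specific activations and layer sizes below are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one DTI feature group.
X, y = make_classification(n_samples=92, n_features=50, random_state=0)

# One vs. two hidden layers, two candidate activation functions.
param_grid = {
    "activation": ["relu", "tanh"],
    "hidden_layer_sizes": [(50,), (50, 25)],
}
mlp = GridSearchCV(MLPClassifier(solver="lbfgs", max_iter=1000,
                                 random_state=0), param_grid, cv=3)
mlp.fit(X, y)
```

The `lbfgs` solver is chosen here only because it converges quickly on small datasets; the paper does not specify a solver.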

#### *3.4. Learning Experiment*

Once the image processing and feature extraction procedure was completed, each subject was represented by different feature groups associated with diffusivity metrics (FA, MD, RD and LD), each with dimension in the order of 10<sup>5</sup>. These groups can be used separately or combined in a single high-dimensional feature vector to feed a learning algorithm for the classification of patients with AD. The learning experiment proposed in the present work consists of comparing these two procedures with an ensemble learning approach in which each feature group is used to feed a classification algorithm and all the models are then combined through a voting scheme (see Figure 3). The idea is that different models trained independently can take into account different aspects of the data, and consequently a combination of algorithms can improve the predictions obtained with the single models in the ensemble. The ensemble configurations analyzed in this work are listed in Table 1.



*Mi* is the best classification method associated with the *i*-th feature group and E(*M*1, *M*2, ... , *Mj*) is the ensemble learning method based on the combination of best classifiers *M*1, *M*2, ... , *Mj*. The ensemble of a singleton corresponds to the best classifier, i.e., E(*Mi*) ≡ *Mi*. Finally, configuration 5-1 refers to the best classifier trained on a single high-dimensional vector concatenating all feature groups.

**Figure 3.** Classification framework based on ensemble learning with a soft-voting strategy.

The learning experiment consists of three steps.

1. For each group of features in (FA, MD, RD, LD) and for their combined feature vector, find the best associated classifier among the three algorithms SVM, RF and MLP, as described in Section 3.3. A 5-fold cross-validation grid-search procedure is performed to tune the hyperparameters and select the best performer for each configuration, as shown in Table 2.


**Table 2.** Best model selection procedure.

For instance, for configuration 1-1 the model *M*FA is chosen among SVM*b*<sup>1-1</sup>, RF*b*<sup>1-1</sup> and MLP*b*<sup>1-1</sup>.

2. For each possible configuration listed in Table 1, evaluate the performance of the ensemble learning algorithm, based on the combination of the best classifier selected in step 1. The voting scheme is a soft-voting procedure which is based on averaging the probability scores given by the individual classifiers according to the following equation:

$$\hat{y} = \operatorname*{argmax}_{i} \sum_{j=1}^{n} w_{j} p_{ij} \tag{4}$$

where *ŷ* is the ensemble predicted label, *n* is the number of classifiers, *wj* is the weight assigned to the *j*-th classifier (in the present analysis we consider uniform weights) and *pij* is the probability score assigned to the *i*-th class by the *j*-th classifier. In the case of binary classification, *i* ∈ {0, 1}. The ensemble algorithm analyzed in the present work refers to the ensemble.VotingClassifier method of the Python scikit-learn library [44]. This scheme was chosen because it is more flexible than the hard-voting one: it takes into account the classifiers' uncertainty about the final decision, which is more informative than a simple binary prediction.

3. Repeat steps 1 and 2 on a balanced dataset obtained from the original one (43 AD vs. 49 HC) by removing 6 healthy controls with the instance hardness threshold (IHT) method of Smith et al. [45]. IHT is an under-sampling method that reduces class imbalance by removing the "hard" instances (instance hardness being the likelihood of an instance being misclassified), focusing on the majority-class samples that overlap the minority-class sample space. The balanced dataset is then composed of 43 diseased cases and 43 healthy controls.
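The soft-voting rule of Equation (4) reduces to a weighted average of the classifiers' probability matrices followed by an argmax over classes. A minimal numpy sketch, with made-up probability scores from three hypothetical classifiers, could look like:

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Soft voting as in Eq. (4): weighted average of the classifiers'
    class-probability scores p_ij, then argmax over classes i."""
    probs = np.stack(prob_list)        # (n_classifiers, n_samples, n_classes)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    avg = np.tensordot(w, probs, axes=1) / w.sum()
    return np.argmax(avg, axis=1)

# Made-up probability scores from three hypothetical classifiers,
# each scoring two samples on the two classes {0, 1}:
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.3, 0.7]])
p3 = np.array([[0.2, 0.8], [0.1, 0.9]])
labels = soft_vote([p1, p2, p3])       # uniform weights, as in the paper
```

scikit-learn's `ensemble.VotingClassifier` with `voting='soft'` applies the same rule over fitted estimators.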

The classification performances in step 2 are evaluated through a 10-fold stratified cross-validation (CV), such that each fold is composed of approximately the same number of patients from each diagnostic group. This CV procedure was repeated ten times with different permutations of the training and test samples, in order to make the performance evaluation more robust and generalizable. The metrics used for the performance assessment were accuracy, precision, recall and area under the ROC curve (AUC). For the comparison among ensemble combinations, statistically significant differences between the performances of the classification configurations were assessed through the non-parametric one-tailed Mann–Whitney U-test (MWU) [46]. Given *F* as the distribution function corresponding to population *A* and *G* as the distribution function corresponding to population *B*, MWU tested the null hypothesis *H*0 : *F*(*t*) = *G*(*t*) for every *t* (i.e., the two random variables have the same probability distribution) against the alternative hypothesis that one is stochastically larger (or smaller) than the other [47]. In order to address the problem of multiple comparisons, *p*-values were corrected for multiple testing using the Benjamini–Hochberg (BH) procedure, summarized as follows: (1) Let *H*1, *H*2, ... , *HN* be the sequence of null hypotheses to test, with *p*1, *p*2, ... , *pN* the associated *p*-values. (2) Rank the *p*-values such that *p*(1) ≤ *p*(2) ≤ *p*(3) ≤ ··· ≤ *p*(*N*). (3) Given the level *q*∗, find the largest *k* such that *p*(*k*) ≤ *k* · *q*∗/*N*. (4) Reject all the null hypotheses *H*(*j*) with *j* = 1, 2, ... , *k*. The theorem of Benjamini–Hochberg states that the above procedure controls the false discovery rate at level *q*∗ [48].
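The four-step BH procedure can be written directly in plain Python; this is a sketch for clarity (in practice one would typically rely on an existing implementation, e.g., statsmodels' `multipletests` with `method='fdr_bh'`):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure (steps 1-4 above): returns a
    boolean list marking which null hypotheses are rejected at FDR level q."""
    N = len(pvals)
    order = sorted(range(N), key=lambda i: pvals[i])   # step 2: rank p-values
    # Step 3: largest k such that p_(k) <= k * q / N (1-based rank).
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / N:
            k = rank
    # Step 4: reject the hypotheses with the k smallest p-values.
    reject = [False] * N
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject
```

Note the step-up character: a *p*-value that exceeds its own threshold can still be rejected if a larger *p*-value satisfies its threshold, since *k* is the largest qualifying rank.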

### **4. Results**

In this section, we outline the results of the experiment. Firstly, we discuss the effects of ensemble learning in terms of performances on the original imbalanced dataset; then we show the results for the balanced dataset obtained via instance hardness threshold method. Finally, we discuss the outcomes of nonparametric statistical tests carried out to compare the different configurations and to obtain an overview of the efficacy of the ensemble approach.

The results associated with the imbalanced case (49 HC, 43 AD) are reported in Figure 4. In panel (a), the average performance values are plotted as a function of the possible configurations. Each point in the plot represents the average over the 100 estimates of the performance metrics, obtained from the 10-times-repeated, 10-fold stratified cross-validations. The best configuration, according to the overall metrics, is configuration 3-3, corresponding to the ensemble E(*MFA*, *MMD*, *MRD*). In terms of accuracy, precision and AUC, the singleton E(*MFA*) outperformed the other individual configurations 1-2, 1-3 and 1-4 and all the ensembles that did not contain fractional anisotropy as a feature group. On the other hand, in the case of recall, the ensemble strategy is crucial for enhancing the performances: almost all the ensemble configurations outperformed the single-feature-set configurations, and the best recall value was obtained for the most complex ensemble E(*MFA*, *MMD*, *MRD*, *MLD*) (configuration 4-1). The performance comparisons among all possible configurations were performed through Mann–Whitney U (MWU) tests. The outcomes of the MWU tests, for each performance measure, are reported in panels (b–e). Each square of a heatmap represents the one-tailed MWU test between samples *Y* and *X*, where *Y* and *X* are given by the 100 performance measures of the configurations on the *y*-axis and *x*-axis, respectively. The null hypothesis is that *X* and *Y* have the same probability distribution, against the alternative hypothesis that *Y* is larger than *X*. The colors of the heatmaps are related to the *p*-values of the test, ranging from 0 (red) to 1 (blue). The levels shown in the maps refer to *p*-values corrected for multiple testing using the Benjamini–Hochberg procedure.
Panel (b) shows that recall is generally enhanced by ensemble learning approaches and that an ensemble configuration with *n* groups of features has higher sensitivity than those with *n* − 1 groups. This behavior occurs in the other performance comparisons as well, with the exception that ensemble methods without fractional anisotropy are not affected by a significant improvement. Finally, in order to test whether balancing the dataset through the instance hardness threshold procedure can impact the performances of the ensemble methods, we performed the same comparisons as in the imbalanced case on the fair ground of 43 diseased cases versus 43 healthy controls. The results associated with the balanced case are reported in Figure 5. As expected, we notice in panel (a) that the average performance values as a function of the configurations are generally shifted upwards. Indeed, as shown by Wei et al. in [49], the use of balanced training data can provide the highest balanced performances in classifiers based on support vector machines, neural networks and decision trees. Conversely, the balancing procedure attenuates the ensemble effects in the enhancement of recall and predicting accuracy. From panels (b–e), the ensemble E(*MFA*, *MMD*) emerges as the best configuration over all performance measures, and all the methods that contain fractional anisotropy and mean diffusivity outperformed the E(*M*FA,MD,RD,LD) method that concatenates all features in a single high-dimensional vector.


**Figure 4.** The case of an imbalanced dataset. (**a**) Average performance values for each configuration. (**b**–**e**) Heatmaps of Mann–Whitney tests. Each square represents the *p*-value outcome of a one-tailed Mann–Whitney test between a configuration on the *y*-axis and another on the *x*-axis. Each *p*-value in the heatmap was corrected for multiple tests using the Benjamini–Hochberg procedure.

**Figure 5.** The case of a balanced dataset. (**a**) Average performance values for each configuration. (**b**–**e**) Heatmaps of Mann–Whitney tests. Each square represents the *p*-value outcome of a one-tailed Mann–Whitney test between a configuration on the *y*-axis and another on the *x*-axis. Each *p*-value in the heatmap was corrected for multiple tests using the Benjamini–Hochberg procedure.

#### **5. Discussion and Conclusions**

Computational systems aimed at the automatic classification of Alzheimer's disease patients through voxel-based diffusivity measures have been widely investigated, but mainly with a focus on individual learning methods. The authors of [18,50] used anisotropy and diffusivity voxel values of the main WM tracts as features for HC/MCI discrimination with a single support vector machine, showing very high classification performances. However, as pointed out in [28], the key shortcoming of these approaches is a bias introduced by a non-nested feature selection method affecting the learning procedure. On the other hand, a recent study [30] based on an individual SVM classifier with Fisher-score feature selection reported valid performances while focusing only on anisotropy measures of specific brain areas with well-known AD-related connectivity abnormalities. Consequently, the idea of this work is to circumvent the restriction to a single classifier, or to an a priori selected group of features, by exploiting the full information power of diffusion imaging techniques through a computationally efficient learning strategy based on combinations of several feature groups and different classifiers. As a matter of fact, the simple concatenation of all feature groups (FA, MD, RD, LD) into a single high-dimensional vector would not be convenient in terms of time complexity and computational resources. The proposed approach therefore addresses the problem of handling and selecting variables in settings where the feature dimension is much larger than the sample sizes typically available in medical classification tasks. In this framework, we presented a novel approach based on an ensemble learning strategy that combines classifiers capturing, through each feature group, different perspectives of microstructural white matter integrity.
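The core idea of one classifier per feature group, combined at the prediction level, can be sketched as follows. The base learner (a regularized logistic regression), the group sizes and the soft-voting combination are illustrative assumptions standing in for the authors' actual choice of classifiers and combination rule; the synthetic data plant a weak signal in the FA group only, mirroring the finding that FA carries most of the discriminative information.

```python
# Sketch of the per-feature-group ensemble: fit one model on each diffusion
# feature group (FA, MD, RD, LD) and average their predicted probabilities,
# instead of concatenating all groups into one high-dimensional vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 92
groups = {g: rng.normal(size=(n, 30)) for g in ("FA", "MD", "RD", "LD")}
y = rng.integers(0, 2, n)
groups["FA"][:, 0] += y  # weak discriminative signal in FA only (illustrative)

# One standardized logistic-regression model per feature group
models = {g: make_pipeline(StandardScaler(), LogisticRegression()).fit(Xg, y)
          for g, Xg in groups.items()}

# Soft voting: average class probabilities across the per-group models
proba = np.mean([m.predict_proba(groups[g]) for g, m in models.items()], axis=0)
pred = proba.argmax(axis=1)
print("ensemble training accuracy:", (pred == y).mean())
```

Each model sees only its own low-dimensional group, so the scheme sidesteps the single high-dimensional concatenated vector while still pooling the evidence from all groups.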
The work in [51] applied a similar ensemble methodology, feeding an a priori specified classifier with different tractography network measures describing specific aspects of brain connectivity.

We have investigated the validity of this ensemble learning procedure in the classification of HC vs. AD patients, both on the original imbalanced dataset and on a balanced dataset obtained through the instance hardness threshold under-sampling method. In particular, in the imbalanced case we found that all the ensemble combinations including FA invariants outperformed the singletons E(*MMD*), E(*MRD*) and E(*MLD*), as well as the single vector containing all the feature groups. These results show the crucial contribution of fractional anisotropy to the correct classification of diseased subjects. In fact, fractional anisotropy, defined from the diffusion tensor fit as the degree of directionality of intravoxel diffusivity, is strongly sensitive to variations in fiber density, axonal diameter and myelination of white matter at the onset of neurodegenerative diseases. According to Pierpaoli et al. [52], a hallmark of damage in white matter is the generalized loss of fiber tract integrity. Interestingly, further studies have shown that FA-associated voxel values are able to uncover microstructural alterations in the brains of AD patients even at early stages [18,28,53,54]. Moreover, while for AUC, accuracy and precision the ensemble method did not significantly improve on the performance of FA alone, the ensemble strategy was crucial for enhancing the recall of the classification framework. Furthermore, it is worth mentioning that, in terms of accuracy and sensitivity, the use of ensembles of classifiers associated with the diffusion measures not only turned out to be better than concatenating all measures in a single feature vector, but also provided higher performances as the number of combined feature groups increased. In the balanced scenario, mean diffusivity emerged as the second most informative measure for pathology discrimination.
This evidence is supported by the fact that MD represents the overall mean squared displacement of molecules in the non-collinear directions of free diffusion. Consequently, a variation of mean diffusivity signals an increase in free water diffusion and, in turn, a loss of anisotropy of molecular mobility [52]. In the literature there is evidence supporting the hypothesis that the microstructural alterations in molecular diffusivity along white matter fiber bundles, described by MD, may be of higher predictive value than FA microstructural changes [55,56]. In the balanced case, the effects of the ensemble procedure on the improvement of accuracy and recall were attenuated. However, ensemble combinations that included FA and MD performed better than the other variable sets considered individually, and than the feature vector concatenating all groups together.
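The instance-hardness-threshold under-sampling used to produce the balanced dataset can be sketched in simplified form: score each majority-class sample by its out-of-fold predicted probability of belonging to its own class, then drop the "hardest" (least confidently classified) majority samples until the classes are balanced. The random-forest scorer and synthetic data here are stand-ins; a production implementation is available in the `imbalanced-learn` package.

```python
# Simplified instance-hardness-threshold under-sampling: remove the hardest
# majority-class samples (lowest out-of-fold own-class probability) until the
# two classes have equal size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the imbalanced cohort (roughly 49 vs. 43)
X, y = make_classification(n_samples=92, weights=[49 / 92, 43 / 92],
                           random_state=0)
maj = int(np.bincount(y).argmax())  # label of the majority class

# Out-of-fold class probabilities from a cross-validated classifier
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=5, method="predict_proba")
hardness = 1.0 - proba[np.arange(len(y)), y]  # low own-class prob. = hard

# Drop exactly enough of the hardest majority samples to balance the classes
n_drop = np.bincount(y)[maj] - np.bincount(y)[1 - maj]
maj_idx = np.where(y == maj)[0]
drop = maj_idx[np.argsort(hardness[maj_idx])[::-1][:n_drop]]
keep = np.setdiff1d(np.arange(len(y)), drop)
X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))  # equal class counts after under-sampling
```

Removing the hardest majority instances, rather than random ones, discards the samples most likely to sit near (or across) the class boundary, which is why this procedure can shift average performance upwards, as seen in Figure 5.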

Based on the results emerging from the present analysis, we can conclude that our ensemble classification framework, based on DTI features, is effective in improving HC/AD classification performances, and that ensembles including FA and MD are the best performing, confirming their role in the literature as the most effective DTI measures for AD detection [57–60]. Moreover, although artificial data balancing attenuates the benefits of ensemble learning, the ensemble-based strategy yields significant improvements in classification sensitivity and accuracy with respect to the plain concatenation of all features into a high-dimensional vector. For this reason, the feature selection phase in similar classification tasks can take advantage of this kind of strategy, allowing one to exploit as much information as possible while at the same time reducing the dimensionality of the feature space and, in turn, the computational effort. Hence, ensemble learning can be a promising approach to combining different types of features derived from DTI data, extending its application to DTI tractography network measures and diffusion voxel-based features.

Future advancements of the present work will firstly consider an extension of the dataset size, in order to ensure more robust algorithm calibration and validation procedures. In this scenario, one would be able to analyze feature selection methods together with several families of classifiers in more extensive ensemble strategies. Indeed, the possibility of broader comparisons between pairs of feature selectors and classifiers could lead to the identification of efficient methods for discriminating between diseased cases and healthy controls (for a thorough review of this kind of approach, see the large comparative study performed by Parmar et al. in [61]). Moreover, the availability of a larger number of observations would allow the application of state-of-the-art deep learning methods, which could contribute substantially to uncovering signatures and biomarkers of neurodegenerative disorders by highlighting hidden patterns. The key advantage of deep learning architectures with respect to standard learning approaches is that high classification performance can be achieved without explicit feature selection steps, which are instead embedded in the learning process, yielding more computationally efficient frameworks (for an application of deep convolutional neural networks to MRI data, see the work of Basaia et al. in [62]; for a review of deep learning methods and applications to neuroimaging data in psychiatric and neurological disorders, see [63]). Future investigations may also take into account not only diffusion-derived features, but also additional variables, such as clinical information, morphological measures and other features related to different image processing modalities and methodologies, such as functional and anatomical magnetic resonance imaging.
As a matter of fact, the diversified wealth of biological information generated by different diagnostic modalities not only provides a holistic view of the pathological condition, but can also be exploited in the pre-clinical stage for the early detection of dementia precursors in presymptomatic conditions [64–66].

**Author Contributions:** Conceptualization, E.L.; methodology, E.L., A.P. and R.A.; software, E.L. and A.P.; validation, E.L., A.P., D.L., R.A. and F.V.; formal analysis, E.L. and R.A.; investigation, E.L., A.P., D.L. and R.A.; resources, E.L.; data curation, E.L.; writing—original draft preparation, E.L. and R.A.; writing—review and editing, E.L., A.P., D.L., R.A. and F.V.; visualization, E.L. and R.A.; supervision, R.A. and F.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Ministero dello Sviluppo Economico (MiSE) "Grandi Progetti R&S—PON 2014/2020 Agenda Digitale e Industria sostenibile"—BigImaging Project—Grant number F/080022/01-04/X35.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found at: www.adni-info.org.

**Acknowledgments:** Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson and Johnson Pharmaceutical Research and Development LLC.; Lumosity; Lundbeck; Merck and Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research are providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

