#### *4.2. Classification*

The literature on activity recognition using physiological signals includes numerous types of classifiers with different characteristics in terms of complexity, intelligence, and generalization. In this work, we compare the performance of seven widely used classifiers with different decision rules, aiming at studying the performance of the set of features: the Least Squares Linear Classifier (LSLC), the Least Squares Quadratic Classifier (LSQC), Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), the *k*-Nearest Neighbor (kNN), the Centroid Displacement-Based *k*-Nearest Neighbor (CDNN), and Random Forests (RFs).

#### 4.2.1. Least Squares Linear Classifier (LSLC)

In a linear classifier, given a set of training patterns **x** = [*x*1, *x*2, ..., *xL*]<sup>*T*</sup>, where each pattern has an associated class, denoted *Ci*, *i* = 1, ..., *M*, the decision rule is obtained using a set of *M* linear combinations of the training patterns. In the least squares approach (the LSLC), the weights of the linear combinations are those that minimize the mean squared error (MSE), leading to the *Wiener-Hopf* equations [68]. These classifiers are fast and simple, and they present a good generalization capability.
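As an illustration, the following minimal sketch (in Python with NumPy; variable and function names are hypothetical, since the paper names no toolbox) fits such a classifier by solving a least-squares problem against one-hot class targets, which is equivalent to solving the Wiener-Hopf normal equations:

```python
import numpy as np

def train_lslc(X, y, n_classes):
    """Fit one linear discriminant per class by minimizing the MSE
    between the linear outputs and one-hot class targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias term
    T = np.eye(n_classes)[y]                       # one-hot targets
    # Least-squares solution of Xb @ W = T (Wiener-Hopf solution).
    W, *_ = np.linalg.lstsq(Xb, T, rcond=None)
    return W

def predict_lslc(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W, axis=1)  # class with the largest output
```

The fit reduces to a single linear solve, which is what makes the LSLC attractive as the inner classifier of the feature selection process described in Section 4.3.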

#### 4.2.2. Least Squares Quadratic Classifier (LSQC)

Like the LSLC, the LSQC also renders very good results with a very fast learning process. It slightly increases the intelligence of the LSLC by adding quadratic terms to the linear combinations, improving performance at the cost of increased complexity and some loss of generalization.
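Under the same assumptions as the previous sketch, one way to obtain the quadratic terms is a degree-2 polynomial expansion of the features, after which the same least-squares fit applies:

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion adds the squares and pairwise products of the
# original features; the linear least-squares fit is then reused.
quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(X)           # (n, L + L*(L+1)/2) features
W = train_lslc(X_quad, y, n_classes)     # same fit as the LSLC
```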

#### 4.2.3. Support Vector Machines (SVMs)

An SVM projects the observation vector **x** to a higher dimension space, using a set of kernel functions, where the patterns can be better linearly separated. The patterns of the design set selected to be the center of these functions are denominated "support vectors" [69]. In the present study, we used linear SVM (LINSVM) and nonlinear SVM using Gaussian Radial Basis Function (RBF) kernels, denoted RBFSVM.

SVMs are essentially binary classifiers, and to implement multi-class classifiers a strategy must be defined. In this paper we used a one-against-all strategy. Furthermore, SVMs present mainly two parameters (the kernel scale and the box constraint) that must be optimized. In this paper, a *k*-fold cross-validation strategy over the design set was carried out to determine the best values of these hyper-parameters. RBFSVMs are also sensitive to differences in the scaling of the features; thus, to avoid scale problems, features were normalized by removing the mean value and dividing by the standard deviation, with these values estimated using the design data.
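A possible realization of this setup (sketched here with scikit-learn, which is an assumption; `X_design`, `y_design`, and `X_test` are hypothetical names) chains design-set normalization, a one-against-all RBF SVM, and a *k*-fold grid search over the box constraint and the kernel parameter:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Normalization (design-set mean and standard deviation) followed by a
# one-against-all RBF SVM; C is the box constraint and gamma plays the
# role of the (inverse) kernel scale.
pipe = make_pipeline(StandardScaler(), OneVsRestClassifier(SVC(kernel="rbf")))
grid = {
    "onevsrestclassifier__estimator__C": [0.1, 1, 10, 100],
    "onevsrestclassifier__estimator__gamma": [1e-3, 1e-2, 1e-1, 1],
}
search = GridSearchCV(pipe, grid, cv=5)   # k-fold CV over the design set
search.fit(X_design, y_design)
y_pred = search.predict(X_test)
```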

#### 4.2.4. Multi-Layer Perceptrons (MLPs)

MLPs are composed of one or more layers of neurons/perceptrons arranged sequentially, so that the outputs of the neurons of one layer are the inputs of the neurons of the next layer. It is a feed-forward network; therefore, the outputs of the network can be calculated as explicit functions of the inputs and weights. Each neuron applies a nonlinear function, denominated the activation function, to a linear combination of its inputs. The complexity of the MLP depends on the number of neurons in the hidden layers, allowing easy control of the intelligence of the classifier.

In this paper we considered MLPs with one hidden layer of 8, 12, and 16 neurons. They were trained with the Levenberg-Marquardt algorithm, and 20% of the design data was used to monitor and early-stop the training process, avoiding overfitting.
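A rough equivalent is sketched below, assuming scikit-learn; note that scikit-learn does not implement the Levenberg-Marquardt algorithm, so the `adam` optimizer is a stand-in, while the 20% validation split and early stopping follow the text:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 8, 12 or 16 neurons; 20% of the design data is
# held out to monitor and early-stop the training.
models = {}
for n_hidden in (8, 12, 16):
    mlp = MLPClassifier(
        hidden_layer_sizes=(n_hidden,),
        solver="adam",              # stand-in for Levenberg-Marquardt
        early_stopping=True,        # stop when validation score stalls
        validation_fraction=0.2,    # the 20% hold-out of the design data
        max_iter=1000,
    )
    models[n_hidden] = mlp.fit(X_design, y_design)
```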

#### 4.2.5. *k*-Nearest Neighbor (kNN)

The kNN is a classification method in which no assumptions are made on the underlying data distribution in the learning process [70]. This classifier estimates the posterior probability in **x** using the *k* closest patterns from the design database, where *k* is a hyper-parameter of the classifier. A test pattern **x** is thus assigned to the class *Ci* that maximizes the posterior probability, that is, its class is determined by majority voting over the classes of its *k* nearest neighbors. To define proximity, a distance metric must be chosen; in this paper we consider the Euclidean distance. To determine the best value of *k* in each case, a *k*-fold validation process was carried out over the design data, and the value of *k* that renders the lowest error rate over the *k*-fold process was selected as the final *k* value. Data from the individuals included in the design set were used as folds in this process.

As in the case of the RBFSVM, the distance measurement is sensitive to changes in the scale of the features. Thus, features were normalized using the mean and the standard deviation of the features over the design set. The advantages of the kNN are that it makes no assumptions about the data and that it is an easy-to-understand algorithm. Its disadvantages include high memory requirements, high computational cost, and sensitivity to irrelevant features.
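The normalization and the subject-wise selection of *k* described above could look as follows (a sketch assuming scikit-learn; `subject_ids` is a hypothetical array giving the subject of each pattern, so that each cross-validation fold holds out one design subject):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# Standardized Euclidean kNN; the best k is picked by cross-validation
# in which every fold holds out the data of one design subject.
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(metric="euclidean"))
grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, grid, cv=LeaveOneGroupOut())
search.fit(X_design, y_design, groups=subject_ids)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
```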

#### 4.2.6. Centroid Displacement-Based *k*-Nearest Neighbor (CDNN)

The CDNN is a modified version of the kNN algorithm proposed in [71] that replaces the majority voting scheme of the kNN with a centroid-based classification criterion. Considering the *k* nearest patterns in the database, the centroid of the patterns of each class is evaluated with and without the test pattern included, and the class whose centroid changes least due to the inclusion of the test pattern is selected. As in the kNN method, the value of *k* is a hyper-parameter that must be properly determined. Again, *k*-fold cross-validation over the design data is used to estimate the best value of *k*. Features were also previously normalized.
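Since the CDNN is less standard than the other classifiers, a compact sketch of the decision rule may help (pure NumPy; the function and variable names are illustrative, not from the original reference):

```python
import numpy as np

def cdnn_predict(X_train, y_train, x, k):
    """Sketch of the CDNN rule: among the k nearest neighbors, choose
    the class whose centroid is displaced least when the test pattern
    x is added to it."""
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nn = np.argsort(d)[:k]                       # k nearest patterns
    best_class, best_disp = None, np.inf
    for c in np.unique(y_train[nn]):
        members = X_train[nn][y_train[nn] == c]
        centroid = members.mean(axis=0)
        # Centroid recomputed with the test pattern included in the class.
        centroid_x = (members.sum(axis=0) + x) / (len(members) + 1)
        disp = np.linalg.norm(centroid_x - centroid)
        if disp < best_disp:                     # smallest displacement wins
            best_class, best_disp = c, disp
    return best_class
```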

#### 4.2.7. Random Forests (RFs)

RFs [72] are classifiers consisting of a collection of *T* tree-structured classifiers *ht*(**x**), *t* = 1, ..., *T*, where the decision is taken by majority voting over the *T* independent tree classifiers. Randomization is used in the design of each tree through two factors: first, the design data of each tree is randomly selected without replacement from the design set; second, in each node of the tree a subset of *F* features is randomly selected. In this work we grew the trees using the CART methodology without pruning, and the number of features considered in each node was *F* = log<sub>2</sub>(*M*) + 1, where *M* is here the total number of features, as proposed in [72]. A total of *T* = 200 trees were used to generate each RF classifier.
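A close approximation with scikit-learn is sketched below (names are hypothetical); note that scikit-learn's forests resample the design data with replacement (bootstrap), whereas the paper samples without replacement, so that detail is only approximated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

M = X_design.shape[1]                     # total number of features
rf = RandomForestClassifier(
    n_estimators=200,                     # T = 200 trees
    max_features=int(np.log2(M)) + 1,     # F = log2(M) + 1 features per node
    # scikit-learn resamples the design data *with* replacement
    # (bootstrap), whereas the paper samples without replacement.
)
rf.fit(X_design, y_design)
y_pred = rf.predict(X_test)               # majority vote over the trees
```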

#### *4.3. Feature Selection*

Feature selection is the process of selecting a subset of the most relevant features. There are mainly two reasons to use feature selection: to reduce generalization problems by reducing overfitting, and to simplify the model. The feature selection process used in the present work follows the wrapper approach [73]. This approach selects the subset of features that minimizes the error rate of a predetermined classification algorithm.

In the literature there are numerous algorithms to select the best features of a set, Genetic Algorithms (GAs) being widely used among them. GAs, proposed in [74], apply the principle of survival of the fittest, emulating biological evolution in nature. These algorithms work with a population consisting of several possible solutions to the problem, each of which is called a chromosome. The optimization is carried out by applying modifications to the genes of the chromosomes in the population of possible solutions. GAs constitute a meta-heuristic search algorithm that can be applied to optimization problems in different areas [75], and they can be successfully applied to the problem of feature selection [76,77].

In our problem, we seek the best reduced set of features, i.e., the one that minimizes the error probability of a classifier. For this purpose, a "population" of possible sets of features is evaluated with the goal of minimizing the classification error probability, with a limited number of features (the number of selected features must be lower than *Nmax*). To avoid a loss of generalization in the results, the design set is exclusively used to determine the best subset of features by applying a GA; that is, the classification rate optimized by the GA is determined exclusively with the design data.

Since the GA requires the evaluation of many classifiers during the optimization process, the choice of the classifier used in the optimization is crucial: for each chromosome in each generation, the classifier must be fully trained. Thus, a classifier with a very fast learning process is required. In this work, we rely on the LSLC.
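A minimal sketch of such a GA wrapper is shown below. The population size, number of generations, mutation rate, and the simple hold-out split inside the fitness function are illustrative assumptions, not the paper's settings; `train_lslc` and `predict_lslc` refer to the LSLC sketch in Section 4.2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y, n_classes, n_max):
    """Error rate of an LSLC trained on the selected features
    (a simple hold-out split keeps the sketch short)."""
    cols = np.flatnonzero(mask)
    if cols.size == 0 or cols.size > n_max:   # enforce the feature budget
        return 1.0
    n = X.shape[0]
    tr = rng.permutation(n)[: int(0.8 * n)]
    te = np.setdiff1d(np.arange(n), tr)
    W = train_lslc(X[tr][:, cols], y[tr], n_classes)   # fast LS fit
    return np.mean(predict_lslc(W, X[te][:, cols]) != y[te])

def ga_select(X, y, n_classes, n_max, pop=30, gens=50, p_mut=0.02):
    """Binary-mask GA: each chromosome encodes a subset of features."""
    L = X.shape[1]
    population = (rng.random((pop, L)) < n_max / L).astype(int)
    for _ in range(gens):
        errs = np.array([fitness(m, X, y, n_classes, n_max) for m in population])
        parents = population[np.argsort(errs)][: pop // 2]  # survival of the fittest
        children = []
        while len(parents) + len(children) < pop:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, L)                        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child = child ^ (rng.random(L) < p_mut)         # random mutation
            children.append(child)
        population = np.vstack([parents, children])
    errs = np.array([fitness(m, X, y, n_classes, n_max) for m in population])
    return population[np.argmin(errs)]                      # best feature mask
```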

The full process is described in the following paragraphs.


To reduce the risk of premature stalling of the search, we used a method known as the Elimination Tournament of GAs [78], which combines several small GAs in a tournament in which the initial population of each GA is generated by a random crossover of the "winner" chromosomes from previous GAs.

For this work, the maximum number of selected features was discretized to 5, 10, 20, 40, 60, 80, and the full set (for instance, 174 features in the case of using the ECG measurement).

To avoid overfitting in the results (generalization loss) while maximizing accuracy in the estimation of the classification error rate, *k*-fold cross-validation was used in the experiments, where *k* is the number of subjects available in the design database (40 subjects). Thus, the data were divided into *k* folds or subsets containing the data from each subject and, each time, the registers from one given subject were used as the test set, with the data from the remaining *k* − 1 subjects used for the design task. For each fold, the design process is carried out, including the feature selection process, the choice of the parameters, and the training of the classifier. That is, for each fold, features are normalized by estimating the mean and standard deviation over the design set (the remaining *k* − 1 folds in the dataset), the GA selects the best subset of features, the classifier is trained with the corresponding methodology, and the hyper-parameters of the classifier are estimated (note that the hyper-parameters were estimated using exclusively the design set). Once this process is completed, the estimated mean and standard deviation are used to normalize the features selected by the GA, and the classifier is evaluated with the previously determined hyper-parameters. The classification error is then estimated as the ratio of patterns wrongly classified in the test fold.

The final classification error rate is estimated by averaging the error rates obtained for all the *k* folds. Since data from the same subject is not used for designing and testing at the same time, this method guarantees the generalization of the results to subjects different from the ones included in the database.
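Putting the pieces together, the subject-wise evaluation loop could be sketched as follows (assuming scikit-learn's `LeaveOneGroupOut` and the hypothetical helpers `ga_select`, `train_lslc`, and `predict_lslc` from the previous sketches):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Each fold tests on one subject; normalization statistics, the GA
# feature selection, and the classifier training use only the design
# folds (the remaining 39 subjects).
fold_errors = []
for design_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
    mu = X[design_idx].mean(axis=0)          # design-set statistics only
    sd = X[design_idx].std(axis=0)
    Xd = (X[design_idx] - mu) / sd
    Xt = (X[test_idx] - mu) / sd             # test fold uses design statistics
    mask = ga_select(Xd, y[design_idx], n_classes, n_max=40)
    W = train_lslc(Xd[:, mask == 1], y[design_idx], n_classes)
    y_hat = predict_lslc(W, Xt[:, mask == 1])
    fold_errors.append(np.mean(y_hat != y[test_idx]))

error_rate = np.mean(fold_errors)            # average over the 40 folds
```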

This whole process is also repeated 20 times to analyze the statistical significance of the results. Thus, the error rate measures the average ratio of classification errors over 40 different test folds (the 40 individuals of the dataset) and 20 full repetitions of the design process (including feature selection and training of the classifier). To study the significance of the results, we also carry out a hypothesis test, where the null hypothesis is that the method with the lowest error rate (taken as reference) is not really better than the considered method. The performance obtained with different methods and parameters is thus statistically compared using a single-tail paired-sample t-test over the estimated errors. From this t-test we obtain the *p*-value, defined as the level of marginal significance within the statistical hypothesis test [79]: the probability of obtaining a result equal to or "more extreme" than the one actually observed when the null hypothesis is true. It is a number between 0 and 1, and the null hypothesis is rejected when the *p*-value is below the significance level (*α*), normally set to 0.05.
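The test itself is straightforward with SciPy (a sketch; `errors_compared` and `errors_reference` stand for the hypothetical per-fold error arrays of the compared and reference methods):

```python
from scipy import stats

# Single-tail paired-sample t-test on the per-fold errors: H0 states
# that the reference (lowest-error) method is not really better.
res = stats.ttest_rel(errors_compared, errors_reference,
                      alternative="greater")
reject_h0 = res.pvalue < 0.05   # reference method significantly better
```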


#### **5. Results and Analysis**

This section analyzes the results obtained in the experiments described in the previous section, including a detailed study of the window length selection, the classifier, the combination of signals, the number of features, and the most frequently selected features.

#### *5.1. Window Length Selection*

The first parameter to determine is the window length. In order to analyze the performance of the system with different window sizes, we consider windows of 10 s, 20 s, 40 s, and 60 s. Please note that the shift between decisions is fixed at 10 s, independently of the window length. This means that the size of the database and the number of decisions are not affected by the variation in the window length. To determine which window length is the most appropriate to extract the features, several experiments were carried out for each feature set. Table 1 shows the results obtained using the simplest classifier (LSLC) for the different signal combinations considered in this work, as a function of the window length. The table includes the best error probability and the number of selected features *Nmax* that generates this result. To assess the significance of the results with respect to the window length, the *p*-value [79] has also been included in the table, comparing the best result with the remaining results for each combination of signals.

The results indicate that the window length with the lowest error probability is 40 s for all the cases in which the TEB signal is used. For the ECG signal, the best result was obtained with a window length of 60 s, and for the EDA, with 10 s. When using all the signals, the best result is also obtained with a window length of 40 s. For this reason, we have fixed the window length at 40 s.


**Table 1.** Error probability using a Least Squares Linear Classifier (LSLC) for the best number of features, as a function of the window length.

#### *5.2. Classifier Selection*

To select the best classifier, we have trained the different types of classifiers with different combinations of signals and different maximum numbers of features to be selected. Table 2 contains the error probability (%) obtained for each classifier using the different combinations of signals. The best combination of signals is the one including all the physiological signals (ECG+TEB+EDA) with *Nmax* = 40 features, achieving a 22.2% error rate; the second best is the combination of ECG and TEB with *Nmax* = 60 features, with a 24.5% error rate.

Figure 4 shows the error probability for each feature set and each activity with the LSLC classifier and *Nmax* = 40 features, where the lowest value is obtained using all signals (ECG+TEB+EDA). Furthermore, we can appreciate that the most recognizable activity for all feature sets is the physical activity, and the least recognizable is the mental activity.

For a more detailed analysis, Figure 5 shows four plots in which each activity can be observed separately. The first one (top left) shows the error probability for the neutral activity for each feature set, where the best performance of 19.11% is obtained using all the signals (ECG+TEB+EDA). The second one (top right) shows the error probability for the emotional activity, where the lowest error probability, 27.14%, is obtained for TEB+EDA. The third one (bottom left) shows the error probability for the mental activity, where the minimum error probability is 41.07%, obtained using the ECG+TEB+EDA feature set. Finally, the fourth one (bottom right) shows the error probability for the physical activity, with errors ranging from 2.95%, obtained with only EDA features, to 5.45% for the ECG. The error obtained for ECG+TEB+EDA is 4.20%, which is very close to the minimum value.


**Table 2.** Error probability (%) obtained for each classifier using the different combinations of signals, with a window length of 40 s.

**Figure 4.** Error probability for each feature set and activity.

**Figure 5.** Error probability for each feature set and activity. Neutral (**top left**), Emotional (**top right**), Mental (**bottom left**) and Physical Activities (**bottom right**).

On the other hand, if we analyze the signals separately, we can see that the individual signal that renders the best results is the TEB (29.50% with *Nmax* = 40 features and an MLP with 8 hidden neurons).

In order to study the main differences in the identification of the activities, the confusion matrix shown in Figure 6 indicates the misclassification between classes obtained using an LSLC and *Nmax* = 40 features extracted from all three signals (ECG+TEB+EDA). The classes that present the most misclassification are the emotional and mental activities.

For a more detailed analysis of the performance of the classifiers when the number of features is varied, three figures are presented below. They represent the performance of the classifiers in the three most significant cases: the combination of all feature sets (ECG+TEB+EDA), the two-signal combination that obtains the best result (the ECG+TEB feature set), and the single signal that obtains the best result on its own (the TEB feature set).

Figure 7 presents the results obtained with the combination including all signals (ECG+TEB+EDA). We can see that the linear classifiers render the best results, and that the GA-based feature selection process that limits the number of features helps improve the performance of the classifiers. The fact that the more complex classifiers (MLPs and RBFSVMs) do not match the results of the linear classifiers might imply the presence of strong generalization problems.

Figure 8 shows the performance of the classifiers when the ECG+TEB feature set is used. In this case, again, the best results are provided by linear classifiers. However, the classifier that renders the best result for the ECG+TEB feature set is the LINSVM, with an error probability of 24.5%.

**Figure 6.** Confusion matrix between classes.

**Figure 7.** Classifiers comparison using *All feature set* (ECG+TEB+EDA).

**Figure 8.** Classifiers comparison using the *ECG+TEB feature set.*

Finally, when considering just one signal, the best choice is the TEB. Figure 9 shows the performance of the classifiers under study with features from the TEB signal only. In this case the results are somewhat different from the previous ones, since the classifier that gives the best results is the RF, with an error probability of 28.9%.

**Figure 9.** Classifiers comparison using the *TEB feature set.*

#### *5.3. Frequently Selected Features*

In order to complete the study, we show which features and measurements are the most frequently selected and their percentages of selection. Table 3 shows the average number of features selected by the GAs from each measurement and each signal, considering a maximum number of selected features *Nmax* = 40, for the different combinations of signals. As we can see, the most frequently selected measurement from the ECG is the RR. In general, the measurements extracted in the frequency domain from the ECG are not very useful. Concerning the TEB, the RF and BRV measurements present high selection ratios in the case of considering all the signals in the combination. The most selected measurement from the EDA is the processed measurement taken on the hand.

**Table 3.** Average number of features selected from the measurements of the different signals, with *Nmax* = 40 features.


To go deeper into the analysis, Table 4 shows the top-40 selected features, again in the case of selecting a maximum of *Nmax* = 40 features. In this case, we show the percentage of occurrence in the three best combinations of signals: the TEB alone, the TEB and the ECG, and the case of using all the biosignals. We can see that, in general, the mean baseline is one of the most frequently selected parameters. The most selected features from each signal in the case of considering all possible features in the GA are:

• From the ECG signal: the geometric mean of the HRV, the mean baseline of the RR, the logarithm of the SD of the RR, and the DFA1 of the HR.


**Table 4.** Top-40 selected features from the different signals, and percentage of occurrence with *Nmax* = 40 features.


#### **6. Discussion and Conclusion**

Nowadays, activity recognition based on physiological signals is a relevant research field with a promising future. This paper presents an evaluation of the classification performance of different sensing modes (ECG, TEB, and EDA) for the detection of four different activities. The evaluation includes typical characterization features for the measured signal within each sensing mode, selected from a thorough review of the available literature. The evaluation has been carried out from several perspectives: the sensing mode, the type of activity targeted, and other parameters related to feature extraction and classifier training. Consequently, numerous conclusions can be derived from this work:


As a final conclusion, we have demonstrated the suitability of GAs to select the best features from a wide dataset containing most of the features identified as useful in the literature. The present study allows us to extract significant conclusions concerning the information in each measurement, and determines a set of relevant measurements and features that can guide future research. On the other hand, the generalization capability of the classifiers has been identified as crucial to further improve the results in activity recognition through physiological signals, which opens new opportunities for research in the field.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1424-8220/19/24/5524/s1, Supplementary Data.

**Author Contributions:** Conceptualization, I.M.-H. and R.G.-P.; methodology, I.M.-H. and R.G.-P.; software, I.M.-H.; validation, I.M.-H. and R.G.-P.; formal analysis, I.M.-H. and R.G.-P.; investigation, I.M.-H.; resources, I.M.-H. and R.G.-P.; data curation, I.M.-H. and R.G.-P.; writing–original draft preparation, I.M.-H.; writing–review and editing, I.M.-H., R.G.-P., M.R.-Z. and F.S.; visualization, I.M.-H. and F.S.; supervision, I.M.-H. and F.S.; project administration, M.R.-Z. and R.G.-P.; funding acquisition, M.R.-Z. and R.G.-P.

**Funding:** This research was funded by the Spanish Ministry of Economy and Competitiveness/FEDER under Project RTI2018-098085-B-C42.

**Conflicts of Interest:** The authors declare no conflict of interest.
