### *3.5. Evaluation*

The performance evaluation of our FDS under the selected classifiers was done using *k-fold cross-validation*. The dataset is split into *k* sets: *k* − 1 sets are used for training and the remaining one for testing. The process is repeated *k* times, each time holding out a different set for testing. Given that FDSs must be able to detect falls for new people (i.e., unseen data), the test set should not contain data from people the algorithm has been trained on.

We chose a value of *k* = 5, yielding a training set of 80% and a test set of 20%. We filtered *SisFall* to keep only subjects who performed all activities. Thus, although our motivation is to develop an FDS for the elderly, we found it necessary to remove the data of the elderly subjects, as they had not performed simulated falls. Similarly, we removed three young subjects' data due to missing records. This leaves us with data from 20 subjects. This number turned out to be ideal, as it allowed us to guarantee that no data from a given subject is used for both training and testing (in an 80/20 split). In other words, the trained models are always tested with data from new subjects. Consequently, we have 1900 ADLs (19 ADLs × 5 trials × 20 subjects) and 1500 falls (15 falls × 5 trials × 20 subjects), resulting in a more balanced dataset of 3400 records.
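For illustration, the sketch below shows how such a subject-independent split can be enforced with *scikit-learn*'s `GroupKFold`; the variable names and the synthetic stand-in data are ours, not taken from the original implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Illustrative stand-ins: 3400 records, 10 features, 20 subject IDs.
rng = np.random.default_rng(0)
X = rng.normal(size=(3400, 10))           # feature vectors
y = rng.integers(0, 2, size=3400)         # 1 = fall, 0 = ADL
subjects = np.repeat(np.arange(20), 170)  # subject ID of each record

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    # Each test fold holds 4 of the 20 subjects; no subject's data
    # appears in both the training and the test sets.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```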

During the evaluation of ML algorithms, each prediction falls into one of the following categories:

- True Positive (TP): a fall correctly classified as a fall;
- False Positive (FP): an ADL incorrectly classified as a fall;
- True Negative (TN): an ADL correctly classified as a non-fall;
- False Negative (FN): a fall incorrectly classified as a non-fall.

Each prediction is added to the count of its category, which then allows various metrics, such as the accuracy, to be calculated. A usual representation of these categories is a confusion matrix.

In fall detection, two metrics are especially important: Sensitivity (SE) (Equation (1)) and Specificity (SP) (Equation (2)) [7]. The SE (or *recall*) corresponds to how many relevant elements are actually selected; it is essentially the detection probability, i.e., how many falls have actually been detected. The SP corresponds to how many actual non-falls are correctly classified as such, i.e., the proportion of non-fall events that are not mistaken for falls.

$$Sensitivity = \frac{TP}{TP + FN} \tag{1}$$

$$Specificity = \frac{TN}{TN + FP} \tag{2}$$

We also calculated the accuracy (Equation (3)) and the F1-score (Equation (4)). Additionally, we calculated the Area Under the Receiver Operating Characteristics Curve (AUROC) as provided in *scikit-learn*. The AUROC is a widely used measure of classifier performance in pattern recognition and ML [44]. In simple terms, an AUROC close to one indicates a well-performing algorithm that consistently achieves high true-positive and true-negative rates.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$F1\text{-score} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{4}$$
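As a minimal sketch, Equations (1)–(4) and the AUROC can be computed from a confusion matrix as follows; the toy labels and scores are illustrative only.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = fall, 0 = ADL (ground truth)
y_pred  = [1, 1, 0, 0, 0, 1, 0, 1]   # classifier decisions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7]  # predicted fall probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # Equation (1)
specificity = tn / (tn + fp)                   # Equation (2)
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # Equation (3)
f1_score    = 2 * tp / (2 * tp + fp + fn)      # Equation (4)
auroc       = roc_auc_score(y_true, y_score)   # as provided by scikit-learn
```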

### *3.6. Multi-Class Approach Considerations*

To answer our third research question, i.e., "What is the difference in performance across various types of ML algorithms by adopting a multi-class approach for identifying phases of a fall?", one more step was needed to prepare the data for the ML algorithms. The goal of this additional step was to divide each fall sample into three parts: pre-fall, impact and post-fall. In doing so, two related questions arise: where should we split the fall sample, and what duration should each part have? Given that a fall has been defined as an uncontrolled, high acceleration [7], especially around the impact point, we chose the latter as our reference point to split the fall data sample. Based on this definition, we calculated the acceleration magnitude across the accelerometer axes along the sample and selected the highest magnitude as the impact point of each sample. The average time between the moment of loss of balance and the impact point is 0.715 s, with a standard deviation of 0.1 s [25]. Consequently, we defined the impact part of the fall as a 2 s interval of the sample which includes the impact point, with 1.5 s leading up to it and the remaining 0.5 s after it. This interval is labeled as *impact*. The part of the sample before the impact interval is labeled as *pre-fall* and the remaining, final part is labeled as *post-fall* (note that, based on the result of RQ2, we selected a sampling frequency of 50 Hz). Thus, each 10 s fall sample creates three feature vectors, one for each phase. The impact phase always represents a 2 s window; the remaining 8 s are shared between the pre- and post-fall phases. Since the acceleration peak is not always at the same timestamp, the pre- and post-fall phase durations vary: if the fall happens early in the sample, the pre-fall phase will be much shorter than the post-fall phase, and the opposite holds if the fall happens late in the sample.
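The sketch below illustrates this division, assuming a 10 s tri-axial sample at 50 Hz stored as a 500 × 3 array; reading "magnitude" as the Euclidean norm of the three axes is our interpretation, not a detail confirmed by the text.

```python
import numpy as np

FS = 50                                    # sampling rate (Hz), per RQ2
PRE, POST = int(1.5 * FS), int(0.5 * FS)   # 1.5 s before the peak, 0.5 s after

def split_fall(sample: np.ndarray):
    """Split a (n_samples, 3) accelerometer array into the three phases."""
    magnitude = np.linalg.norm(sample, axis=1)  # per-timestamp magnitude
    peak = int(np.argmax(magnitude))            # impact point
    # Clamp so the 2 s impact window stays inside the sample.
    start = max(0, min(peak - PRE, len(sample) - (PRE + POST)))
    return (sample[:start],                     # pre-fall (variable length)
            sample[start:start + PRE + POST],   # impact (always 2 s)
            sample[start + PRE + POST:])        # post-fall (variable length)

# Toy 10 s sample at 50 Hz:
pre, impact, post = split_fall(np.random.default_rng(1).normal(size=(500, 3)))
```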

To illustrate the above process, Figure 3a presents a fall sample from the *SisFall* dataset [31]. Each line represents one of the accelerometer's axes. The highlighted peak in the middle is the impact point (shown as a dotted line in Figure 3b). The dashed lines delimit the three parts of the fall, including the 2 s window of the impact interval; the left-hand part is the pre-fall and the right-hand part is the post-fall. The feature extraction step is applied to each phase of the fall as well as to the ADLs.

**Figure 3.** Division of a fall sample into pre-fall, impact and post-fall phases; panel (**b**) zooms in on the impact phase.

By identifying the three different phases of the fall in the manner described above, fall detection becomes a multi-class problem; more specifically, when ADLs are taken into consideration, it becomes a four-class classification problem. The motivation behind it lies in the importance of differentiating between ADLs and any phase of a fall, as labeled in the *SisFall* dataset. To do that, we apply the same ML algorithms as in Section 3.4. As SVM is inherently a binary classifier, we extended it by choosing a one-vs.-one scheme.
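A minimal sketch of this setup is shown below; note that *scikit-learn*'s `SVC` already applies a one-vs.-one strategy internally for multi-class inputs, and the explicit wrapper simply makes the choice visible. The labels and data are illustrative.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))                                      # toy feature vectors
y = rng.choice(["ADL", "pre-fall", "impact", "post-fall"], size=40)

# Four classes yield 6 pairwise binary SVMs under one-vs.-one.
clf = OneVsOneClassifier(SVC()).fit(X, y)
```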

To evaluate the performance of such a classification, it is possible to use metrics such as SE, SP, F1-score and AUROC, presented in Section 3.5. However, these are typically defined for two-class classification, so it is important to show how we have extended them to multi-class problems. We evaluated the performance with the same metrics, using the *macro* score for SE, SP, F1-score and AUROC. This is the average of the metric per class, which gives the same importance to each class. The alternative is the *micro* score, which averages the metric while giving more weight to classes with more data. As falls happen rarely, the dataset is unbalanced, yet it is crucial to detect falls correctly; hence the need to give equal importance to this class. In our multi-class problem, we calculated the SE for a specific class against all the others together, as if they were one class: matches for this specific class represent the positive cases and matches for the combined class represent the negative cases. Applying this step to each class yields four different Sensitivities, which are then averaged using the macro score explained above, as per Equation (5). A similar process is applied for SP and F1-score, as shown in Equations (6) and (7).

$$SE_{macro} = \frac{1}{|Class|} \times \sum_{i=1}^{|Class|} \frac{TP_i}{TP_i + FN_i} \tag{5}$$

$$SP_{macro} = \frac{1}{|Class|} \times \sum_{i=1}^{|Class|} \frac{TN_i}{TN_i + FP_i} \tag{6}$$

$$F1\text{-score}_{macro} = \frac{1}{|Class|} \times \sum_{i=1}^{|Class|} \frac{2 \times TP_i}{2 \times TP_i + FP_i + FN_i} \tag{7}$$
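Equations (5)–(7) can be computed as sketched below; since *scikit-learn* offers no dedicated specificity function, the macro SP is derived here from per-class one-vs.-rest confusion matrices. The toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, multilabel_confusion_matrix

labels = ["ADL", "pre-fall", "impact", "post-fall"]
y_true = ["ADL", "impact", "pre-fall", "post-fall", "ADL", "impact"]
y_pred = ["ADL", "ADL",    "pre-fall", "post-fall", "ADL", "impact"]

se_macro = recall_score(y_true, y_pred, average="macro")  # Equation (5)
f1_macro = f1_score(y_true, y_pred, average="macro")      # Equation (7)

# Equation (6): per-class TN / (TN + FP), averaged over the classes.
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=labels)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
sp_macro = np.mean(tn / (tn + fp))
```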

### **4. Results and Discussion**

This section presents and discusses the results for each of the research questions listed in Section 1: Section 4.1 presents the comparison of various Machine Learning (ML) algorithms; Section 4.2 discusses the effect of the sensors' sampling rate on the detection performance; and Section 4.3 presents the results of splitting each fall into its phases.

### *4.1. Fall Detection System (FDS) Performance*

Tables 5–9 present the results of the evaluation of our FDS under the five selected ML algorithms, showing that we successfully developed a reliable FDS. The Sensitivity (SE) reached 98.4% and the Specificity (SP) 99.68%, with Gradient Boosting (GB) and k-Nearest Neighbor (KNN), respectively. These results outperform those reported by Sucerquia et al. [31]. From our review of classification algorithms (Section 3.4), we expected ensemble learning algorithms to achieve better performance than the others. In practice, this trend was confirmed, with some exceptions (see Table 6). This is because they combine multiple ML models, although the improvement in performance comes at the expense of more resources. Support Vector Machine (SVM) had more difficulty distinguishing the activities; however, tuning some hyper-parameters may improve its results.
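For reference, a sketch of such a comparison with the five classifiers under *scikit-learn* defaults (no hyper-parameter tuning, as in our experiments) might look as follows; the evaluation loop is illustrative rather than the original code.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# The five algorithms compared in this study, with default settings.
classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT":  DecisionTreeClassifier(),
    "RF":  RandomForestClassifier(),
    "GB":  GradientBoostingClassifier(),
}
# for name, clf in classifiers.items():          # hypothetical loop
#     clf.fit(X_train, y_train)
#     evaluate(name, clf.predict(X_test))        # SE, SP, accuracy, F1, AUROC
```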

The high quality of these results was unexpected, especially without any optimization such as hyper-parameter tuning. We infer that Activities of Daily Living and falls in the *SisFall* dataset are easily discriminable by default, similar to [16]; thus, any algorithm can perform very well. However, in real-life conditions, the SE and SP would very likely drop because of the heterogeneity of falls, as highlighted by Krupitzer et al. [33,34]. The difficulty of obtaining real falls data is the main shortcoming in FDS studies, given that it is challenging to capture them in realistic settings with the elderly, as noted by Bagalà et al. [26], who compiled a database of only 29 real-world falls.


**Table 5.** Comparison of the Sensitivity across the ML algorithms, with the highest values in bold.

**Table 6.** Comparison of the Specificity across the ML algorithms, with the highest values in bold.


**Table 7.** Comparison of the accuracy across the ML algorithms, with the highest values in bold.


**Table 8.** Comparison of the F1-score across the ML algorithms, with the highest values in bold.



**Table 9.** Comparison of the AUROC across the ML algorithms, with the highest values in bold.

### *4.2. Sensors' Sampling Rate Effect*

Regarding the sensors' sampling rate, the trend is that the higher the rate, the better the results, which is intuitive since more data are considered when creating the feature vector. However, SVM behaves differently from the other algorithms, as shown in Figure 4, which presents the variation of the different metrics of each algorithm over the sensors' sampling rate. SVM peaks at a sampling rate of 20 Hz, indicating that a higher sampling rate does not necessarily improve performance, especially since a high sampling rate comes with disadvantages such as higher computational cost and battery consumption. Moreover, the results do not suggest that increasing the sampling rate any further would bring a meaningful improvement: in our case, the performance no longer increases significantly beyond 50 Hz. This sampling rate is in fact the typical one used in the reviewed literature, offering the best reported results (Table 1).
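As an aside, lower rates can be emulated by decimating the raw signal, as in the hedged sketch below; it assumes SisFall's native 200 Hz rate and simple sample dropping, which may differ from the resampling method actually used.

```python
import numpy as np

def downsample(signal: np.ndarray, native_hz: int = 200, target_hz: int = 50):
    """Emulate a lower sampling rate by keeping every n-th sample."""
    step = native_hz // target_hz   # e.g., keep every 4th sample for 50 Hz
    return signal[::step]

raw = np.random.default_rng(2).normal(size=(2000, 3))  # toy 10 s at 200 Hz
assert len(downsample(raw, target_hz=20)) == 200       # 10 s at 20 Hz
```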

### *4.3. Multi-Class Approach Performance*

The multi-class approach to identifying the different phases of falls achieved promising results, with an accuracy close to 99% for two algorithms, as shown in Figure 5. The figure also presents the variability of the results over each fold of the cross-validation for each algorithm. The RF and GB algorithms consistently produced good results across the different metrics except for a single fold, which appears as an outlier in Figure 5a–d. One explanation might be that it relates to the data of a subject who performed the ADLs and falls differently from the other subjects. The DT algorithm has the biggest variability across the algorithms, followed by KNN. Overall, the variability is low, close to 5%, from which a high confidence in the algorithms can be inferred. This is the desired behavior for this type of application, where consistently high SE and SP are important to facilitate the adoption and usefulness of the FDS. Furthermore, the results of this experiment also confirm the expectation about the performance of ensemble learning algorithms, already observed in the results presented in Section 4.1.

Figure 6 presents a deeper insight into the classification results with the confusion matrices of each split of the k-fold cross-validation for the KNN algorithm. The accuracy of this algorithm is the median amongst all algorithms' accuracies, therefore it is useful to discuss in depth. We can see that the pre-fall and post-fall phases were consistently correctly classified. The main source of misclassifications comes from the other two classes, i.e., ADL and impact. This negative tendency is stronger in the SVM and DT algorithms but lessened in the RF and GB ones. These confusion matrices are interesting because the patterns of misclassification are consistent with those expected in a binary detection (i.e., ADL vs. fall). Therefore, one approach could involve removing the data associated with the correctly identified pre-fall and post-fall phases and treating the problem as a binary classification. However, having correctly isolated and identified these phases, they could instead be used as a supplementary input to confirm the prediction. Suppose that, for a given sample, a pre-fall and a post-fall are correctly identified, but the impact is predicted as an ADL. Then, by means of a threshold on a confidence interval, the misclassified impact could be overridden and corrected. Another solution could simply consider the identification of a pre-fall and a post-fall phase as sufficient to always raise an alarm for an impact, given the high confidence of the prediction of both phases.
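A possible form of this override logic is sketched below; the threshold value and the (label, confidence) representation are illustrative assumptions, not details of our implementation.

```python
def fix_impact(pre, mid, post, threshold=0.9):
    """pre/mid/post: (label, confidence) pairs for the three windows,
    in temporal order. Returns the possibly corrected middle label."""
    pre_ok = pre[0] == "pre-fall" and pre[1] >= threshold
    post_ok = post[0] == "post-fall" and post[1] >= threshold
    if pre_ok and post_ok and mid[0] == "ADL":
        return "impact"   # override the suspect classification
    return mid[0]

# A confidently framed fall whose impact window was mislabeled as ADL:
assert fix_impact(("pre-fall", 0.97), ("ADL", 0.55), ("post-fall", 0.95)) == "impact"
```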

The usefulness of this novel approach lies in its provision of an added guarantee that the fall is correctly detected, by offering a mechanism to "fix" a potential misclassification. For a given fall sample, the algorithm should identify each part of a fall exactly once; otherwise, one or several classifications must be incorrect, as sketched below. Additionally, the ability to recognize the pre-fall stage has many useful applications for fall prevention systems, including airbags for example. This could reduce the likelihood of injuries caused by falls.
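A possible form of this consistency check, with illustrative labels:

```python
def phases_consistent(window_labels):
    """True iff each fall phase occurs exactly once among a sample's
    predicted window labels (illustrative sketch)."""
    return all(window_labels.count(p) == 1
               for p in ("pre-fall", "impact", "post-fall"))

assert phases_consistent(["pre-fall", "impact", "post-fall"])
assert not phases_consistent(["pre-fall", "ADL", "post-fall"])
```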

**Figure 4.** Metrics variation over the sampling rates of five algorithms; panel (**e**) shows the AUROC variation. The highest average metrics across all algorithms are obtained with a sampling rate of 50 Hz.

**Figure 5.** Comparison of various metrics, including Sensitivity, Specificity, accuracy, F1 and AUROC (panel (**e**)), of each k-fold split across the ML algorithms.

The obtained results are of high quality in terms of accuracy, SE, SP, F1 and AUROC. This may not be the case when applying the system to data collected in the wild, as we identified during the first experiment. Like many other datasets in the FDS community, the *SisFall* dataset is highly discriminating between ADLs and falls. Because their samples lack realism, studies under laboratory conditions will always outperform those in the real world. In particular, from inspecting the *SisFall* data, subjects remained still after a fall, but it is unclear whether an older person would act in this way during a real fall, particularly if there was no loss of consciousness.

In our experiment, the pre-fall part was very often correctly classified. However, under real conditions, misclassifications may arise (for example, as an ADL). This is due to the fact that, in reality, falls are *unexpected* events, occurring perhaps in the middle of an ADL; therefore, the pre-fall phase may be very short, following immediately from the ADL part of the sample. In the *SisFall* dataset, by contrast (as shown in Figure 3), the pre-fall part is not an ADL; instead, the subject is "preparing" to fall (i.e., the fall is not unexpected). In addition, the setup of an experiment in the wild will not be the same as in the lab: it would lack the annotations, and therefore the behavior of the algorithm may not be the same (in particular, with regard to dividing samples). With real-life non-annotated data, it is unknown whether the received data contains a fall, and hence a sample associated with an ADL would also be divided into various parts. This would require further investigation.

**Figure 6.** Confusion matrices of the k-Nearest Neighbor Machine Learning algorithm, whose accuracy is the median amongst all other algorithms' accuracies.

### **5. Conclusions and Future Work**

In this paper, we present our development of a Fall Detection System (FDS) using wearable technologies, to investigate and answer the following three research questions:

RQ1 *What is the difference in performance across various types of Machine Learning (ML) algorithms in a FDS?*

Our FDS implemented several ML algorithms for comparison: k-Nearest Neighbors, Support Vector Machine, Decision Trees, Random Forest and Gradient Boosting. Our results are an improvement over those reported by Musci et al. [36] and Sucerquia et al. [31], with a final Sensitivity and Specificity over 98%. The system is reliable, as we were able to test it on a large dataset containing several thousands of Activities of Daily Living (ADLs) and falls. We obtained these results using various ML algorithms, which we were able to compare; we observed that ensemble learning algorithms perform better than lazy or eager learning ones.


RQ2 *What is the effect of the sensors' sampling rate on the detection performance?*

We further investigated the effect of the sensors' sampling rate on the detection rate. Overall, higher sampling rates produced better results, but the improvement became insignificant beyond 50 Hz, which is also the typical rate used in the reviewed literature.

RQ3 *What is the difference in performance across various types of ML algorithms by adopting a multi-class approach for identifying phases of a fall?*

We found that the multi-class approach to identify the phases of a fall showed promising results, with an accuracy close to 99%. In addition, it offers the possibility of improved performance by adding subsequent logic to the ML algorithm to address possible misclassifications. Given this performance, we advocate this multi-class approach as being useful in different contexts, such as fall prevention systems.

There is scope for future work. With the high computation resources available nowadays, it would be interesting to explore Deep Learning (DL) algorithms. In our case, however, the size of the cleaned dataset is insufficient for this method to be appropriate, given the data requirements of DL. The much larger OPPORTUNITY dataset [45] for ADLs has been shown to be appropriate for DL methods [46]. One study [36] uses Recurrent Neural Networks, but other algorithms are available, such as Convolutional Neural Networks, which have the advantage of automatic feature extraction from time series [46]; this reduces the number of implementation steps and removes the question of how many and which features need to be extracted. Additionally, it would be very interesting to reproduce the experiment on the sensors' sampling rate with DL algorithms, as the results may differ from those of traditional ML algorithms. The *SisFall* dataset allows plenty of experiments. However, the lack of available falls data in realistic settings is a common challenge in FDS studies, which also affected our study; in particular, currently available datasets with falls in realistic settings (such as in [26]) are far too small for ML approaches to be successful, most particularly for state-of-the-art DL methods.

Further work would benefit from exploring the use of a multi-class approach for FDS using realistic datasets, in order to compare against the performance in the lab and further address any misclassification issues arising in that context. The results presented in this work suggest this is worthwhile, and the use of such a system shows promise to make a difference in assisting people who sustain falls.

**Author Contributions:** Conception, design, experimentation, N.Z.; supervision, review, edition, A.W. and P.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by HES-SO University of Applied Sciences and Arts Western Switzerland.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors wish to express their gratitude to Juan Ye from the School of Computer Science at the University of St Andrews; Adam Prugel-Bennett and Jonathon Hare from the University of Southampton for their insightful comments on early stages of this work; the anonymous reviewers for their interesting and constructive comments.

**Conflicts of Interest:** The authors declare no conflict of interest.
