1. Introduction
Human activity recognition has been a significant focus of research and development in wearable and ubiquitous computing for the last several decades due to its ability to provide a real-time understanding of human behavior and its potential to inform intelligent and personalized user-centered technologies. This recognition has been powered by a combination of sensors, located either on the body or in the environment, and machine learning techniques that have become increasingly adept at distinguishing among a variety of human behaviors and activities. Researchers have applied human activity recognition techniques to a diverse range of applications, including healthcare and well-being [1,2,3,4], weightlifting and sports [5,6,7,8], sign language translation [9], and car manufacturing and safety [10,11]. Within the area of healthcare and well-being, researchers have devoted particular attention to the recognition of activities of daily living (ADLs), as ADL performance is a key indicator of day-to-day health and wellness [12,13,14]. Over the years, researchers have developed pipelines capable of recognizing a handful of ADLs [15,16,17,18,19,20,21] or specific ADLs of interest, such as washing hands [22,23,24,25,26], taking medication [2,27], brushing teeth [28], and eating and drinking [29,30,31,32,33,34]. Such systems could be beneficial in practice to a number of populations: parents could verify that their children are learning and maintaining good health habits, caregivers could ensure that older adults are safely and successfully taking care of themselves, and the average person could track how well they are maintaining their day-to-day health. Furthermore, such systems would offer greater flexibility and autonomy than the existing paradigm for monitoring day-to-day health.
However, for these systems to be widely adopted by these populations, they need to be accurate, reliable, and robust in real-world settings. This requires that they be built and evaluated on data captured during the real-world performance of ADLs, as ADL performance in controlled and semi-naturalistic settings often differs from ADL performance in real-world settings [15,35,36]. Additionally, such systems must be able to differentiate between the performance of ADLs of interest and the performance of every other activity an individual performs. Often, the ADLs of interest constitute a small minority of the activities performed on a given day; as a result, reliable human activity recognition in the wild requires overcoming a significant data imbalance issue. For example, ADLs such as washing hands, taking medication, and brushing teeth, which are indicators of day-to-day health and hygiene, occur at most a few times per day, with each instance lasting only seconds to a few minutes. Even within the broader field of machine learning, class imbalance remains a challenging open problem [37].
Although recent studies in the field of human activity recognition have acknowledged the challenge of in-the-wild recognition, the practicality of using many of the proposed models in deployed systems remains limited by sub-optimal performance, the nature of the datasets these models are trained on, or both [21,38,39,40]. The dataset limitation is perhaps the most endemic to the field, as the majority of studies rely on publicly available datasets that have a limited number of labeled activities and comprise data collected in controlled or semi-naturalistic environments. Thus, these models remain untested on data produced under real-world conditions (i.e., on in-the-wild data), which contain more activities and feature a significantly higher level of class imbalance. Vaizman et al. [41] highlight the importance of ensuring that models perform well in real-world settings, emphasizing that deployed applications need to work in a variety of contexts and when behaviors are performed irregularly. More recently, Bhattacharya et al. [21] evaluated their activity recognition pipeline on in-the-wild data in an effort to assess real-world performance. However, their work did not consider the large NULL class (i.e., all activities that are not of interest), which predominantly comprises in-the-wild data. The authors acknowledge this and note that the problem of in-the-wild ADL recognition requires more attention, as there is still significant room for performance improvement.
To address these limitations, we investigated the design of a human activity recognition system that has been trained on in-the-wild data in order to detect a set of ADLs. Specifically, we consider standard methods for handling imbalanced data, introduce a postprocessing technique to improve prediction precision, and assess the performance of classical feature-based models and deep learning models within these data-processing pipelines. These results establish a baseline for user-independent, in-the-wild activity recognition for a set of common ADLs. The main contributions of our work are as follows:
A fully in-the-wild dataset. First, we present an annotated in-the-wild dataset consisting of accelerometer and gyroscope data from off-the-shelf smartwatches worn on both wrists. The dataset comprises 106.74 h of data from nine participants behaving naturally and following their personal daily routines in their homes and workplaces.
An evaluation of existing techniques to handle imbalances in human activity recognition data. Second, we investigate techniques to improve classification performance on activity recognition systems trained on in-the-wild data. These techniques include common methods for dealing with imbalanced classes (e.g., undersampling and oversampling), as well as model training strategies such as cost-sensitive learning. Our experiments show that, in the case of in-the-wild data, these techniques improve the recall of the model at the cost of its precision. As a result, we find that these techniques in isolation are not enough to address the challenges associated with in-the-wild recognition.
A novel postprocessing technique. Third, we propose a context-based prediction correction method to improve prediction stream stability. We evaluate the performance of this algorithm with five different weighting functions using the best-performing models from the previous experiments with and without preprocessing. Our model achieved an event-based F1-score of over 0.9 for the activities of brushing teeth, combing hair, walking, and washing hands in a user-independent evaluation using both preprocessing and postprocessing techniques.
4. Results
While most works in human activity recognition utilize some form of cross-validation, given our dataset’s size, we evaluated our pipeline’s efficacy using a 70%/20%/10% training–validation–evaluation split. For these splits, we used a stratified approach at the participant level to ensure that ADL data were distributed as evenly as possible among the sets and that data streams remained contiguous. This design also results in a user-independent evaluation.
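As an illustration of this splitting strategy, the sketch below greedily assigns whole participants to the three sets so that each set’s share of ADL instances tracks the 70/20/10 targets. This is a minimal sketch: the participant IDs, per-participant ADL counts, and the greedy heuristic are illustrative assumptions, not our actual assignment procedure.

```python
def stratified_participant_split(adl_counts, fractions=(0.7, 0.2, 0.1)):
    """Greedily assign whole participants to train/val/test so that each
    set's share of ADL instances tracks the target fractions. Keeping
    whole participants together makes the evaluation user-independent
    and keeps each participant's data stream contiguous."""
    total = sum(adl_counts.values())
    targets = [f * total for f in fractions]
    splits = [[], [], []]
    filled = [0.0, 0.0, 0.0]
    # Place the largest participants first so smaller ones can fine-tune balance.
    for pid in sorted(adl_counts, key=adl_counts.get, reverse=True):
        # Choose the set with the largest remaining deficit.
        i = max(range(3), key=lambda k: targets[k] - filled[k])
        splits[i].append(pid)
        filled[i] += adl_counts[pid]
    return {"train": splits[0], "val": splits[1], "test": splits[2]}

# Hypothetical per-participant ADL instance counts for nine participants.
counts = {f"P{i}": c for i, c in enumerate([40, 35, 30, 28, 25, 22, 20, 18, 15])}
split = stratified_participant_split(counts)
```

Assigning whole participants, rather than individual windows, is what keeps the evaluation user-independent; the greedy deficit rule simply approximates the target fractions given indivisible participants.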
4.1. Impact of Preprocessing Techniques
We present the validation-set results of three common preprocessing techniques, defined in Section 3.6, alongside the baseline performance of each model to demonstrate their impact on classification performance. We report the macro precision, macro recall, macro F1-score, event-level precision, event-level recall, and event-level F1-score in Table 4. We report the macro definitions instead of the weighted definitions, as data imbalances inflate weighted performance metrics. Naive Bayes is not included in the cost-sensitive learning trial, as this model does not have a concept of class weights.
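The inflation of weighted metrics under heavy imbalance is easy to see on toy numbers (a minimal sketch with hypothetical class counts, not values from our dataset): a classifier that predicts NULL everywhere still earns a high weighted F1-score, while the macro F1-score exposes the missed minority class.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1-score for one class, computed from one-vs-rest counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy stream: 95 NULL samples all correct, 5 ADL samples all missed.
y_true = ["NULL"] * 95 + ["ADL"] * 5
y_pred = ["NULL"] * 100

classes = ["NULL", "ADL"]
f1s = [per_class_f1(y_true, y_pred, c) for c in classes]
support = [y_true.count(c) for c in classes]

# Macro averaging weights each class equally; weighted averaging
# weights each class by its support, hiding the minority failure.
macro_f1 = sum(f1s) / len(f1s)                                        # ~0.49
weighted_f1 = sum(f * s for f, s in zip(f1s, support)) / len(y_true)  # ~0.93
```

The weighted score suggests a strong classifier even though every ADL instance was missed, which is exactly why we report macro metrics.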
Of the models and configurations tested, XGBoost with no preprocessing performed the best, achieving an event-level F1-score of 0.52. XGBoost with ROS achieved the second-highest performance, with an event-level F1-score of 0.50. Looking at the results overall, preprocessing techniques generally led to decreases in performance; although they largely improved the event-level recall, they also caused decreases in the event-level precision. It is worth noting that Naive Bayes consistently achieved near-perfect event-level recall; however, this was always offset by abysmal event-level precision.
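The configurations compared above differ only in how the training data are prepared. Random oversampling (ROS) and the class weights used for cost-sensitive learning can both be sketched in a few lines (an illustrative stdlib-only sketch of the general techniques, not our exact implementation; libraries such as imbalanced-learn and scikit-learn provide production versions):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """ROS: duplicate minority-class samples (with replacement) until
    every class matches the majority class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(cls)
    return X_out, y_out

def class_weights(y):
    """Inverse-frequency weights for cost-sensitive learning (the same
    idea as scikit-learn's class_weight='balanced')."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical toy labels: a dominant NULL class and two rare ADLs.
y = ["NULL"] * 90 + ["wash_hands"] * 7 + ["brush_teeth"] * 3
X = list(range(len(y)))
X_ros, y_ros = random_oversample(X, y)
weights = class_weights(y)
```

Both approaches push the decision boundary toward the minority classes, which is why they raise recall but, on in-the-wild data, tend to sacrifice precision.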
4.2. Impact of Postprocessing
We evaluated our postprocessing technique on the highest-performing classifiers with and without preprocessing from the previous experiment: baseline XGBoost and XGBoost with ROS. We investigated all weighting functions given in Table 3 with window (W) sizes ranging from 10 s to 240 s in increments of 10 s. Precision and recall tended to increase with window size up to a plateau point; beyond that point, recall remained stable while precision began to decrease, lowering the overall F1-score. Table 5 shows the highest-performing postprocessing configuration for the baseline and preprocessing conditions on the validation set, alongside the original performance metrics for direct reference. Overall, XGBoost with ROS and our postprocessing algorithm achieved the highest results, with an event-level F1-score of 0.64, an improvement of 0.12 over the highest-performing model without postprocessing.
In contrast to the effects of the preprocessing techniques, postprocessing increased the system’s precision at the cost of its recall. Furthermore, the effects of preprocessing and postprocessing did not cancel each other out, as they resulted in net gains for both precision and recall and produced the best overall pipeline for in-the-wild ADL recognition.
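The weighting functions we evaluated are defined in Table 3; as a simplified illustration of the underlying idea, the sketch below relabels each prediction with the weighted majority label of its surrounding window. The uniform weighting function, the toy prediction stream, and the boundary handling are assumptions for this example, not our exact algorithm.

```python
def smooth_predictions(preds, window, weight=lambda d: 1.0):
    """Context-based prediction correction: each time step is relabeled
    with the weighted majority label inside a surrounding window.
    `weight(d)` scores a neighbor by its distance d (in steps) from the
    center; the uniform default reduces to a plain majority vote."""
    half = window // 2
    out = []
    for t in range(len(preds)):
        votes = {}
        lo, hi = max(0, t - half), min(len(preds), t + half + 1)
        for j in range(lo, hi):
            w = weight(abs(j - t))
            votes[preds[j]] = votes.get(preds[j], 0.0) + w
        out.append(max(votes, key=votes.get))
    return out

# A noisy stream: one true wash_hands event with spurious flickers.
stream = (["NULL"] * 10
          + ["wash_hands", "NULL", "wash_hands", "wash_hands",
             "NULL", "wash_hands"]
          + ["NULL"] * 10)
smoothed = smooth_predictions(stream, window=5)
```

On this toy stream, the vote fills the single-step gaps inside the wash_hands event while trimming its flickering edges, which mirrors the behavior reported above: a more stable prediction stream with higher precision and a slight recall cost.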
4.3. Evaluation
We present the results of the highest-performing models from the preprocessing and postprocessing experiments on the held-out evaluation set. Performance metrics are given in Table 6, and confusion matrices are given in Figure 3. While the gains were not as large as those on the validation set, the evaluation set still saw improvements in the event-level F1-score. Additionally, the confusion matrices demonstrate an increase in model specificity, with a decrease in false positives on the ADL classes. The effect of postprocessing on per-class event-based performance metrics is given in Table 7. The model was not able to detect the drinking or taking-medication classes but recognized the other classes to varying extents; for these detected classes, postprocessing improved precision without lowering recall. Notably, the activities of brushing teeth, combing hair, and washing hands achieved event-level F1-scores of at least 0.97, and walking achieved an event-level F1-score of 0.91.
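Event-based metrics reward detecting each occurrence of an activity rather than each individual sample. As a rough sketch of how such a score can be computed, the code below uses a simple any-overlap matching rule adopted purely for illustration; the precise event-level definitions used in our evaluation are those of Section 3.6.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def event_f1(true_events, pred_events):
    """Event-level F1 under an any-overlap matching rule:
    recall    = fraction of true events overlapped by some prediction,
    precision = fraction of predicted events overlapping some true event."""
    tp_r = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    tp_p = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    rec = tp_r / len(true_events) if true_events else 0.0
    prec = tp_p / len(pred_events) if pred_events else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical (start, end) segments in seconds for one activity class.
truth = [(0, 30), (100, 140)]
preds = [(5, 25), (60, 70), (105, 150)]
score = event_f1(truth, preds)  # recall 2/2, precision 2/3 -> F1 = 0.8
```

Under this rule, a short spurious detection counts as a full false-positive event, which is why the stream stabilization provided by postprocessing translates directly into higher event-level precision.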
5. Discussion
5.1. Challenge of Non-Distinct Minority Classes
Recognizing activities performed in real-world scenarios is both the ultimate goal of human activity recognition and the most difficult version of this classification task. Our baseline results show that both traditional and deep learning models, even when combined with preprocessing steps, struggle to detect some of the classes. This outcome implies that these ADLs are not distinct from the NULL class. In general, imbalances hinder the recall of minority classes, which is problematic when the goal of the activity recognition model is to detect instances of infrequent behaviors such as ADLs. Techniques for handling imbalances, such as data preprocessing or cost-sensitive learning, can improve the recall for these classes of interest. However, these techniques also have unintended consequences for model performance: by learning decision boundaries that favor the detection of minority classes, models incur false positives on nearby samples belonging to other classes. In this case, gains in recall can be expected to be counterbalanced by equal if not worse losses in precision. We observed this phenomenon in the first experiment, where we investigated the impacts of preprocessing; recall improved, but overall performance in terms of the F1-score decreased.
While preprocessing alone does not improve performance, it does synergize with our proposed postprocessing technique. We designed the postprocessing technique to correct the prediction stream based on the local context, allowing it to intuitively clean up false positives and false negatives. Because of the imbalances in the dataset and the nature of the data themselves, predicted events tend to be short. As such, postprocessing can convert true positives into false negatives when a true event is detected with low coverage. However, because preprocessing increases recall, true events become less likely to be incorrectly adjusted, while spurious detections are still cleaned up just as they would be without preprocessing. In short, preprocessing and postprocessing contribute more to the overall performance of the model than they detract from it, and their respective benefits offset each other’s costs.
5.2. Intraclass Variability in Certain ADLs
The nature of certain ADLs makes them difficult to detect in in-the-wild data, especially under user-independent evaluation. Exemplifying this, the final evaluation model was unable to detect the activities of drinking or taking medication, even with preprocessing. Although eating was detected, its performance was notably lower than that of the remaining ADLs. One attribute these activities share is that they all involve consuming various substances. The more telling similarity behind their low performance, however, is the number of ways these activities can be performed. They can involve one or two hands, used sequentially or simultaneously, resulting in significant intraclass variability. Participants ate different foods with or without utensils, opened different types of packaging when taking medication, and used different styles of cups or mugs for drinking. Due to a lack of similar samples in the training set, these ADLs were commonly misclassified as part of the NULL class. Future work can focus on these activities to determine techniques for addressing this intraclass variability inherent to several ADLs.
5.3. Towards Real-World Human Activity Recognition
As with the real-world deployment of any machine learning system, deploying a human activity recognition system requires deciding which type of error is less acceptable. In this work, we developed a postprocessing technique that improved the overall performance of the system by dramatically increasing precision at the cost of a decrease in recall. In other words, we decreased the number of false positives at the cost of an increase in the number of false negatives. For an application such as elderly care (i.e., where a caregiver remotely monitors the ADL performance of an older adult), this is likely the preferred trade-off, as the system would be highly certain when an older adult is performing their ADLs, at the cost of caregivers sometimes still having to check in manually. Naturally, if the recall of such a system were too low, caregivers would likely feel that it was not saving them much work, as they would still have to check manually most of the time. In contrast, for an application such as surveillance, this trade-off would likely not be preferred, as human monitors would rather check a false alarm than miss suspicious behavior indicative of malicious intent. In this use case, a system with precision that was too low would create an onerous amount of additional work.
5.4. Hardware Considerations
Regardless of the classification performance of a human activity recognition system in real-world situations, such a system is likely to fail in practice unless equal attention is paid to the specific consumer hardware on which these algorithms will run. In this work, we utilized Polar M600 watches [74], Android smartwatches available in the United States that run Google’s Wear OS, an operating system used by a number of popular (and more modern) smartwatches currently on the market. A few users in our study found this particular watch cumbersome; thus, developing algorithms for sensors and OSs that are widely available will be essential to ensure that users can choose the watch that fits their personal preferences. The other major hardware consideration is power consumption, as these algorithms will likely run in the background at all times to detect activities as they occur. In our study, users reported that the watch lasted a full day (that is, from when they put it on in the morning to when they took it off at night). Users who wore the watch a second day did have to charge it overnight, as the battery was low by the end of the first day.
5.5. Future Work
There are a number of areas in which future work is necessary for real-world human activity recognition systems to become accurate and ubiquitous. In this work, our models were designed to differentiate among seven activities of daily living and a NULL class based on accelerometer and gyroscope data from a Polar M600. Future work should look at a broader range of activities, as well as other sensors and sensor locations, specifically in real-world settings. Additionally, given the potential of this field to aid different populations with specific use cases (e.g., children, older adults), future work should collect data from these populations and explore how solutions for handling the NULL class need to be adapted for these populations. Finally, in this work, we conducted a day-long study; future work can utilize data from longitudinal studies, exploring how to build models on weeks and months of data and how those models might be deployed and maintained effectively in real-world settings.
6. Conclusions
Recognition of human activities in a real-world environment is a difficult pattern recognition task. Over the course of a day, people perform an abundance of activities and motions, most of which will not be of importance to a human activity recognition system. As a natural consequence, these systems have trouble distinguishing uncommon activities from the swaths of data collected. In an effort to address this problem, we looked specifically at the recognition of human activities in real-world environments.
Our first contribution was the collection of a fully in-the-wild dataset in which the real-world performance of activities of daily living was annotated. The data were collected with commodity smartwatches and cover a diverse and representative set of activities that researchers can utilize to build richer human activity recognition systems. Our second contribution was the development of a novel postprocessing technique that improves classification performance for human activity recognition systems trained on in-the-wild data. The technique addresses the challenge of overlapping classes and can help researchers build more robust human activity recognition systems. Our third contribution was a direct investigation into the class imbalance and class overlap problems that arise when applying standard algorithms and data preprocessing techniques. This investigation can encourage other researchers in the domain of human activity recognition to expand upon our work on this open problem. Collectively, these contributions represent a significant step towards practical real-world recognition of human activities such as ADLs.