1. Introduction
Human activity recognition has been a significant focus of research and development in wearable and ubiquitous computing for the last several decades due to its ability to provide a real-time understanding of human behavior and its potential to inform intelligent and personalized user-centered technologies. This recognition has been powered by a combination of sensors, located either on the body or in the environment, and machine learning techniques that have become increasingly adept at distinguishing among a variety of human behaviors and activities. Researchers have applied human activity recognition techniques to a diverse range of applications, including healthcare and well-being [1,2,3,4], weightlifting and sports [5,6,7,8], sign language translation [9], and car manufacturing and safety [10,11]. Within the area of healthcare and well-being, researchers have devoted particular attention to the recognition of activities of daily living (ADLs), as ADL performance is a key indicator of day-to-day health and wellness [12,13,14]. Over the years, researchers have developed pipelines capable of recognizing a handful of ADLs [15,16,17,18,19,20,21] or specific ADLs of interest, such as washing hands [22,23,24,25,26], taking medication [2,27], brushing teeth [28], and eating and drinking [29,30,31,32,33,34]. Such systems could be beneficial in practice to a number of populations: parents could verify that their children are learning and maintaining good health habits, caregivers could ensure that older adults are safely and successfully taking care of themselves, and the average person could track how well they are maintaining their day-to-day health. Furthermore, such systems would offer greater flexibility and autonomy than the existing paradigm for monitoring day-to-day health.
However, for these systems to be widely adopted by these populations, they need to be accurate, reliable, and robust in real-world settings. This requires that they be built and evaluated on data captured during the real-world performance of ADLs, as ADL performance in controlled and semi-naturalistic settings often differs from ADL performance in real-world settings [15,35,36]. Additionally, such systems must be able to differentiate between the performance of ADLs of interest and the performance of every other activity an individual performs. Often, the ADLs of interest constitute a small minority of the activities performed on a given day; as a result, reliable human activity recognition in the wild requires overcoming a significant data imbalance issue. For example, ADLs such as washing hands, taking medication, and brushing teeth, which are indicators of day-to-day health and hygiene, occur at most a few times per day, with each instance lasting only seconds to a few minutes. Even within the broader field of machine learning, class imbalance remains a challenging open problem [37].
Although recent studies in the field of human activity recognition have acknowledged the challenge of in-the-wild recognition, the practicality of using many of the proposed models in deployed systems remains limited by sub-optimal performance, the nature of the datasets these models are trained on, or both [21,38,39,40]. The dataset limitation is perhaps the most endemic to the field, as the majority of studies rely on publicly available datasets that have a limited number of labeled activities and comprise data collected in controlled or semi-naturalistic environments. Thus, these models remain untested on data produced under real-world conditions (i.e., on in-the-wild data), which contain more activities and feature a significantly higher level of class imbalance. Vaizman et al. [41] highlight the importance of ensuring that models perform well in real-world settings, emphasizing that deployed applications need to work in a variety of contexts and when behaviors are performed irregularly. More recently, Bhattacharya et al. [21] evaluated their activity recognition pipeline on in-the-wild data in an effort to assess real-world performance. However, their work did not consider the large NULL class (i.e., all activities that are not of interest), which predominantly comprises in-the-wild data. The authors acknowledge this and note that the problem of in-the-wild ADL recognition requires more attention, as there is still significant room for performance improvement.
To address these limitations, we investigated the design of a human activity recognition system that has been trained on in-the-wild data in order to detect a set of ADLs. Specifically, we consider standard methods for handling imbalanced data, introduce a postprocessing technique to improve prediction precision, and assess the performance of classical feature-based models and deep learning models within these data-processing pipelines. These results establish a baseline for user-independent, in-the-wild activity recognition for a set of common ADLs. The main contributions of our work are as follows:
A fully in-the-wild dataset. First, we present an annotated in-the-wild dataset consisting of accelerometer and gyroscope data from off-the-shelf smartwatches worn on both wrists. The dataset comprises 106.74 h of data from nine participants behaving naturally and following their personal daily routines in their homes and workplaces.
An evaluation of existing techniques to handle imbalances in human activity recognition data. Second, we investigate techniques to improve classification performance on activity recognition systems trained on in-the-wild data. These techniques include common methods for dealing with imbalanced classes (e.g., undersampling and oversampling), as well as model training strategies such as cost-sensitive learning. Our experiments show that, in the case of in-the-wild data, these techniques improve the recall of the model at the cost of its precision. As a result, we find that these techniques in isolation are not enough to address the challenges associated with in-the-wild recognition.
A novel postprocessing technique. Third, we propose a context-based prediction correction method to improve prediction stream stability. We evaluate the performance of this algorithm with five different weighting functions using the best-performing models from the previous experiments with and without preprocessing. Our model achieved an event-based F1-score of over 0.9 for the activities of brushing teeth, combing hair, walking, and washing hands in a user-independent evaluation using both preprocessing and postprocessing techniques.
4. Results
While most works in human activity recognition utilize some form of cross-validation, given our dataset’s size, we evaluated our pipeline’s efficacy using a 70%/20%/10% training–validation–evaluation split. For these splits, we used a stratified approach at the participant level to ensure that ADL data were distributed as evenly as possible among the sets and that data streams remained contiguous. This design also results in a user-independent evaluation.
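As an illustration of this splitting strategy, the sketch below greedily assigns whole participants to the three sets so that each set’s share of ADL instances tracks the 70/20/10 targets. This is a minimal sketch: the participant IDs, per-participant ADL counts, and the greedy heuristic are illustrative assumptions, not our actual assignment procedure.

```python
def stratified_participant_split(adl_counts, fractions=(0.7, 0.2, 0.1)):
    """Greedily assign whole participants to train/val/test so that each
    set's share of ADL instances tracks the target fractions. Keeping
    whole participants together makes the evaluation user-independent
    and keeps each participant's data stream contiguous."""
    total = sum(adl_counts.values())
    targets = [f * total for f in fractions]
    splits = [[], [], []]
    filled = [0.0, 0.0, 0.0]
    # Place the largest participants first so smaller ones can fine-tune balance.
    for pid in sorted(adl_counts, key=adl_counts.get, reverse=True):
        # Choose the set with the largest remaining deficit.
        i = max(range(3), key=lambda k: targets[k] - filled[k])
        splits[i].append(pid)
        filled[i] += adl_counts[pid]
    return {"train": splits[0], "val": splits[1], "test": splits[2]}

# Hypothetical per-participant ADL instance counts for nine participants.
counts = {f"P{i}": c for i, c in enumerate([40, 35, 30, 28, 25, 22, 20, 18, 15])}
split = stratified_participant_split(counts)
```

Assigning whole participants, rather than individual windows, is what keeps the evaluation user-independent; the greedy deficit rule simply approximates the target fractions given indivisible participants.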
4.1. Impact of Preprocessing Techniques
We present the validation-set results of three common preprocessing techniques, defined in Section 3.6, alongside the baseline performance of each model to demonstrate their impact on classification performance. We report the macro precision, macro recall, macro F1-score, event-level precision, event-level recall, and event-level F1-score in Table 4. We report the macro definitions instead of the weighted definitions, as data imbalances inflate weighted performance metrics. Naive Bayes is not included in the cost-sensitive learning trial, as this model does not have a concept of class weights.
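The inflation of weighted metrics under heavy imbalance is easy to see on toy numbers (a minimal sketch with hypothetical class counts, not values from our dataset): a classifier that predicts NULL everywhere still earns a high weighted F1-score, while the macro F1-score exposes the missed minority class.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1-score for one class, computed from one-vs-rest counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy stream: 95 NULL samples all correct, 5 ADL samples all missed.
y_true = ["NULL"] * 95 + ["ADL"] * 5
y_pred = ["NULL"] * 100

classes = ["NULL", "ADL"]
f1s = [per_class_f1(y_true, y_pred, c) for c in classes]
support = [y_true.count(c) for c in classes]

# Macro averaging weights each class equally; weighted averaging
# weights each class by its support, hiding the minority failure.
macro_f1 = sum(f1s) / len(f1s)                                        # ~0.49
weighted_f1 = sum(f * s for f, s in zip(f1s, support)) / len(y_true)  # ~0.93
```

The weighted score suggests a strong classifier even though every ADL instance was missed, which is exactly why we report macro metrics.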
Of the models and configurations tested, XGBoost with no preprocessing performed the best, achieving an event-level F1-score of 0.52. XGBoost with ROS achieved the second-highest performance, with an event-level F1-score of 0.50. Looking at the results overall, preprocessing techniques generally led to decreases in performance; although they largely improved the event-level recall, they also caused decreases in the event-level precision. It is worth noting that Naive Bayes consistently achieved near-perfect event-level recall; however, this was always offset by abysmal event-level precision.
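The configurations compared above differ only in how the training data are prepared. Random oversampling (ROS) and the class weights used for cost-sensitive learning can both be sketched in a few lines (an illustrative stdlib-only sketch of the general techniques, not our exact implementation; libraries such as imbalanced-learn and scikit-learn provide production versions):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """ROS: duplicate minority-class samples (with replacement) until
    every class matches the majority class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(cls)
    return X_out, y_out

def class_weights(y):
    """Inverse-frequency weights for cost-sensitive learning (the same
    idea as scikit-learn's class_weight='balanced')."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical toy labels: a dominant NULL class and two rare ADLs.
y = ["NULL"] * 90 + ["wash_hands"] * 7 + ["brush_teeth"] * 3
X = list(range(len(y)))
X_ros, y_ros = random_oversample(X, y)
weights = class_weights(y)
```

Both approaches push the decision boundary toward the minority classes, which is why they raise recall but, on in-the-wild data, tend to sacrifice precision.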
4.2. Impact of Postprocessing
We evaluated our postprocessing technique on the highest-performing classifiers with and without preprocessing from the previous experiment: baseline XGBoost and XGBoost with ROS. We investigated all weighting functions given in Table 3 with window (W) sizes ranging from 10 s to 240 s in increments of 10 s. Precision and recall tended to increase with window size up to a plateau point; beyond that point, recall remained stable while precision began to decrease, lowering the overall F1-score. Table 5 shows the highest-performing postprocessing configuration for the baseline and preprocessing conditions on the validation set, alongside the original performance metrics for direct reference. Overall, XGBoost with ROS and our postprocessing algorithm achieved the highest results, with an event-level F1-score of 0.64, an improvement of 0.12 over the highest-performing model without postprocessing.
In contrast to the effects of the preprocessing techniques, postprocessing increased the system’s precision at the cost of its recall. Furthermore, the effects of preprocessing and postprocessing did not cancel each other out, as they resulted in net gains for both precision and recall and produced the best overall pipeline for in-the-wild ADL recognition.
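The weighting functions we evaluated are defined in Table 3; as a simplified illustration of the underlying idea, the sketch below relabels each prediction with the weighted majority label of its surrounding window. The uniform weighting function, the toy prediction stream, and the boundary handling are assumptions for this example, not our exact algorithm.

```python
def smooth_predictions(preds, window, weight=lambda d: 1.0):
    """Context-based prediction correction: each time step is relabeled
    with the weighted majority label inside a surrounding window.
    `weight(d)` scores a neighbor by its distance d (in steps) from the
    center; the uniform default reduces to a plain majority vote."""
    half = window // 2
    out = []
    for t in range(len(preds)):
        votes = {}
        lo, hi = max(0, t - half), min(len(preds), t + half + 1)
        for j in range(lo, hi):
            w = weight(abs(j - t))
            votes[preds[j]] = votes.get(preds[j], 0.0) + w
        out.append(max(votes, key=votes.get))
    return out

# A noisy stream: one true wash_hands event with spurious flickers.
stream = (["NULL"] * 10
          + ["wash_hands", "NULL", "wash_hands", "wash_hands",
             "NULL", "wash_hands"]
          + ["NULL"] * 10)
smoothed = smooth_predictions(stream, window=5)
```

On this toy stream, the vote fills the single-step gaps inside the wash_hands event while trimming its flickering edges, which mirrors the behavior reported above: a more stable prediction stream with higher precision and a slight recall cost.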
4.3. Evaluation
We present the results of the highest-performing models from the preprocessing and postprocessing experiments on the held-out evaluation set. Performance metrics are given in Table 6, and confusion matrices are given in Figure 3. While the gains were not as large as those on the validation set, the evaluation set still saw improvements in the event-level F1-score. Additionally, the confusion matrices demonstrate an increase in model specificity, with a decrease in false positives on the ADL classes. The effect of postprocessing on per-class event-based performance metrics is given in Table 7. The model was not able to detect the drinking or taking-medication classes but recognized the other classes to varying extents; for these detected classes, postprocessing improved precision without lowering recall. Notably, the activities of brushing teeth, combing hair, and washing hands achieved event-level F1-scores of at least 0.97, and walking achieved an event-level F1-score of 0.91.
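Event-based metrics reward detecting each occurrence of an activity rather than each individual sample. As a rough sketch of how such a score can be computed, the code below uses a simple any-overlap matching rule adopted purely for illustration; the precise event-level definitions used in our evaluation are those of Section 3.6.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def event_f1(true_events, pred_events):
    """Event-level F1 under an any-overlap matching rule:
    recall    = fraction of true events overlapped by some prediction,
    precision = fraction of predicted events overlapping some true event."""
    tp_r = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    tp_p = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    rec = tp_r / len(true_events) if true_events else 0.0
    prec = tp_p / len(pred_events) if pred_events else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical (start, end) segments in seconds for one activity class.
truth = [(0, 30), (100, 140)]
preds = [(5, 25), (60, 70), (105, 150)]
score = event_f1(truth, preds)  # recall 2/2, precision 2/3 -> F1 = 0.8
```

Under this rule, a short spurious detection counts as a full false-positive event, which is why the stream stabilization provided by postprocessing translates directly into higher event-level precision.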
5. Discussion
5.1. Challenge of Non-Distinct Minority Classes
Recognizing activities performed in real-world scenarios is both the ultimate goal of human activity recognition and the most difficult version of this classification task. Our baseline results show that both traditional and deep learning models, even when combined with preprocessing steps, struggle to detect some of the classes. This outcome implies that these ADLs are not distinct from the NULL class. In general, imbalances hinder the recall of minority classes, which is problematic when the goal of the activity recognition model is to detect instances of infrequent behaviors such as ADLs. Techniques for handling imbalances, such as data preprocessing or cost-sensitive learning, can improve the recall for these classes of interest. However, these techniques also have unintended consequences for model performance: by learning decision boundaries that favor the detection of minority classes, models incur false positives on nearby samples belonging to other classes. In this case, gains in recall can be expected to be counterbalanced by equal if not worse losses in precision. We observed this phenomenon in the first experiment, where we investigated the impacts of preprocessing; recall improved, but overall performance in terms of the F1-score decreased.
While preprocessing alone does not improve performance, it does synergize with our proposed postprocessing technique. We designed the postprocessing technique to correct the prediction stream based on the local context, allowing it to intuitively clean up false positives and false negatives. Because of the imbalances in the dataset and the nature of the data themselves, predicted events tend to be short. As such, postprocessing can convert true positives into false negatives when a true event is detected with low coverage. However, because preprocessing increases recall, true events become less likely to be incorrectly adjusted, while spurious detections are still cleaned up just as they would be without preprocessing. In short, preprocessing and postprocessing contribute more to the overall performance of the model than they detract from it, and their respective benefits offset each other’s costs.
5.2. Intraclass Variability in Certain ADLs
The nature of certain ADLs makes them difficult to detect in in-the-wild data, especially under user-independent evaluation. Exemplifying this, the final evaluation model was unable to detect the activities of drinking or taking medication, even with preprocessing. Although eating was detected, its performance was notably lower than that of the remaining ADLs. One attribute these activities share is that they all involve consuming various substances. The more telling similarity behind their low performance, however, is the number of ways these activities can be performed. They can involve one or two hands, used sequentially or simultaneously, resulting in significant intraclass variability. Participants ate different foods with or without utensils, opened different types of packaging when taking medication, and used different styles of cups or mugs for drinking. Due to a lack of similar samples in the training set, these ADLs were commonly misclassified as part of the NULL class. Future work can focus on these activities to determine techniques for addressing this intraclass variability inherent to several ADLs.
5.3. Towards Real-World Human Activity Recognition
As with the real-world deployment of any machine learning system, deploying a human activity recognition system requires deciding which type of error is less acceptable. In this work, we developed a postprocessing technique that improved the overall performance of the system by dramatically increasing precision at the cost of a decrease in recall. In other words, we decreased the number of false positives at the cost of an increase in the number of false negatives. For an application such as elderly care (i.e., where a caregiver remotely monitors the ADL performance of an older adult), this is likely the preferred trade-off, as the system would be highly certain when an older adult is performing their ADLs, at the cost of caregivers sometimes still having to check in manually. Naturally, if the recall of such a system were too low, caregivers would likely feel that it was not saving them much work, as they would still have to check manually most of the time. In contrast, for an application such as surveillance, this trade-off would likely not be preferred, as human monitors would rather check a false alarm than miss suspicious behavior indicative of malicious intent. In this use case, a system with precision that was too low would create an onerous amount of additional work.
5.4. Hardware Considerations
Regardless of the classification performance of a human activity recognition system in real-world situations, such a system is likely to fail in practice unless equal attention is paid to the specific consumer hardware on which these algorithms will run. In this work, we utilized Polar M600 watches [74], Android smartwatches available in the United States that run Google’s Wear OS, an operating system used by a number of popular (and more modern) smartwatches currently on the market. A few users in our study found this particular watch cumbersome; thus, developing algorithms for sensors and OSs that are widely available will be essential to ensure that users can choose the watch that fits their personal preferences. The other major hardware consideration is power consumption, as these algorithms will likely run in the background at all times to detect activities as they occur. In our study, users reported that the watch lasted a full day (that is, from when they put it on in the morning to when they took it off at night). Users who wore the watch a second day did have to charge it overnight, as the battery was low by the end of the first day.
5.5. Future Work
There are a number of areas in which future work is necessary for real-world human activity recognition systems to become accurate and ubiquitous. In this work, our models were designed to differentiate among seven activities of daily living and a NULL class based on accelerometer and gyroscope data from a Polar M600. Future work should look at a broader range of activities, as well as other sensors and sensor locations, specifically in real-world settings. Additionally, given the potential of this field to aid different populations with specific use cases (e.g., children, older adults), future work should collect data from these populations and explore how solutions for handling the NULL class need to be adapted for these populations. Finally, in this work, we conducted a day-long study; future work can utilize data from longitudinal studies, exploring how to build models on weeks and months of data and how those models might be deployed and maintained effectively in real-world settings.
6. Conclusions
Recognition of human activities in a real-world environment is a difficult pattern recognition task. Over the course of a day, people perform an abundance of activities and motions, most of which will not be of importance to a human activity recognition system. As a natural consequence, these systems have trouble distinguishing uncommon activities from the swaths of data collected. In an effort to address this problem, we looked specifically at the recognition of human activities in real-world environments.
Our first contribution was the collection of a fully in-the-wild dataset in which the real-world performance of activities of daily living was annotated. The data were collected with commodity smartwatches and cover a diverse and representative set of activities that researchers can utilize to build richer human activity recognition systems. Our second contribution was the development of a novel postprocessing technique that improves classification performance for human activity recognition systems trained on in-the-wild data. The technique addresses the challenge of overlapping classes and can help researchers build more robust human activity recognition systems. Our third contribution was a direct investigation into the class imbalance and class overlap problems that arise when applying standard algorithms and data preprocessing techniques. This investigation can encourage other researchers in the domain of human activity recognition to expand upon our work on this open problem. Collectively, these contributions represent a significant step towards practical real-world recognition of human activities such as ADLs.