6.1. Experimental Setup
The computational environment consisted of a single NVIDIA Tesla K80 GPU, 12 GB of RAM, and a 1 TB NVMe SSD. The software environment included Python 3.13.0 as the programming language and TensorFlow 2.6.0 with the Keras API as the DL framework [69].
The batch size was set to 128, and the learning rate was initialized to 0.001 using the Adam optimizer, with the default values for the other parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-7}$. The loss function employed was categorical cross-entropy, and the models were trained for 100 epochs. To ensure optimal performance and reduce the risk of overfitting, we applied several techniques: early stopping monitored the validation loss with a patience of 10 epochs, dropout regularization with rates ranging from 0.2 to 0.5 was employed across different layers, and Xavier initialization was used to set the initial weights.
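As a rough illustration, this configuration could be set up in Keras as follows (the variable names and the restore_best_weights choice are assumptions, not taken from the original code):

```python
import tensorflow as tf

# Adam with the stated initial learning rate; the remaining
# parameters are the Keras defaults mentioned above.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)

# Early stopping on the validation loss with a patience of 10 epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# model.compile(optimizer=optimizer,
#               loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=128, epochs=100,
#           validation_data=(X_val, y_val), callbacks=[early_stopping])
```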
For the CNN, a sequential model was used because it allows a model to be created and layers to be added to it afterwards, which greatly simplifies the construction of DL models. 1D convolution layers, a 1D max-pooling layer, a flattening layer, dropout, and dense layers were added to the sequential model. The parameters of each of these layers were explained previously.
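A minimal sketch of such a model, with illustrative (not reported) filter counts and kernel sizes, might look like:

```python
from tensorflow.keras import Sequential, layers

# Hypothetical 1D CNN using the layer types listed above.
cnn = Sequential([
    layers.Conv1D(64, kernel_size=3, activation="relu",
                  input_shape=(200, 3)),   # window of 200 tri-axial samples
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.3),                   # assumed rate in the 0.2-0.5 range
    layers.Flatten(),
    layers.Dense(100, activation="relu",
                 kernel_initializer="glorot_uniform"),  # Xavier initialization
    layers.Dense(6, activation="softmax"),  # six activity classes
])
```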
The LSTM RNN was a sequential model comprising several LSTM, dense, dropout, and batch normalization layers. The input layer consisted of 100 neurons. Three more LSTM layers were then stacked on top, allowing the patterns and dependencies between the input data and the corresponding activities to be built up. Finally, an output layer of 6 neurons covered the 6 different classes. The final model was trained with a mean squared error loss and a learning rate of 0.002, using the Adam optimizer for 100 epochs.
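A sketch of this architecture under the stated layer counts (the dropout rate and the exact position of batch normalization are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import Sequential, layers

lstm_model = Sequential([
    # Input LSTM layer with 100 neurons over windows of 200 tri-axial samples.
    layers.LSTM(100, return_sequences=True, input_shape=(200, 3)),
    layers.LSTM(100, return_sequences=True),
    layers.LSTM(100, return_sequences=True),
    layers.LSTM(100),              # final LSTM returns only the last state
    layers.BatchNormalization(),
    layers.Dense(100, activation="relu"),
    layers.Dropout(0.3),           # assumed rate within the 0.2-0.5 range
    layers.Dense(6, activation="softmax"),  # one neuron per activity class
])
lstm_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
                   loss="mse", metrics=["accuracy"])
```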
The network input was fed using a sliding time window that extracts separate data segments while preserving temporal dependencies. To improve accuracy, the window width and step size were tuned. Because an activity label is assigned to every time step, each segment was labeled with the most frequent label among its time steps. Here, the window width (time segment) is 200 samples and the step size is 100.
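The segmentation can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np
from collections import Counter

def sliding_windows(values, labels, width=200, step=100):
    """Cut the signal into overlapping segments of `width` samples,
    advancing by `step`; each segment takes the most frequent
    activity label among its time steps."""
    segments, segment_labels = [], []
    for start in range(0, len(values) - width + 1, step):
        segments.append(values[start:start + width])
        segment_labels.append(
            Counter(labels[start:start + width]).most_common(1)[0][0])
    return np.asarray(segments), np.asarray(segment_labels)
```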
The data is split into two subsets, training data and test data, in the ratio 80:20. The training data is then further divided into training and validation data. The generated HAR model is trained on the training data and validated on the validation data. We then tested the trained model on the randomly sampled 20% test split to assess the performance of the network.
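For instance, with scikit-learn (the validation fraction and random seed are assumptions, and the variable names follow the windowing sketch above):

```python
from sklearn.model_selection import train_test_split

# 80:20 split into training and test data, then a further split of the
# training portion into training and validation subsets.
X_train, X_test, y_train, y_test = train_test_split(
    segments, segment_labels, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
```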
6.2. Results
Figure 6 shows the confusion matrix for the CNN model's performance on the validation set for classifying human activities. The matrix visualizes the model's accuracy in predicting the different activities (Walking, Jogging, Upstairs, Downstairs, Sitting, and Standing). On the validation set, the CNN model achieves an accuracy of up to 92.1%.
As shown in Figure 6, the model shows high accuracy in identifying Walking, Sitting, and Standing (0.99, 0.94, and 0.96). It struggles more with the stair activities, Upstairs and Downstairs (0.83 and 0.77). Even for Jogging, where the score is relatively good at 0.88, the model misclassifies a notable number of instances as other activities, mostly as Walking: although it recovers a good portion of the Jogging examples, a remarkable share is misclassified as Walking and, at a much lower rate, as other activities.
It is interesting to notice that stationary activities such as sitting and standing have much higher accuracy than mobile activities such as climbing stairs. This is possibly because a person who is sitting or standing moves the wrist, where the accelerometer was placed, very little, whereas a moving person's wrist position can change considerably, causing the accelerometer values to change drastically and making it hard to recognize a pattern.
The confused activities, Upstairs and Downstairs, are predicted as Walking or Jogging, both of which are activities in which people's wrist positions may be similar not only to each other but to the stair activities as well. The positive result here is that neither Upstairs nor Downstairs is classified as Sitting, a position where the posture and wrist position must be very different compared to the rest. The classifier can therefore find a pattern in activities that involve stationariness but fails to do the same for mobile activities.
Figure 7a shows how well the model learned as the number of epochs increased. The model learns well over time: the accuracy on the training set increases from 80% to 96% over 90 epochs, and the accuracy on the testing set also improves in this range. The model is consistent in its test accuracy, which remains at approximately 91%.
The training and testing losses decrease over time, as seen in Figure 7b. The smaller the loss, the better the model (unless it has overfitted to the training set). The loss, computed on both the training and validation sets, shows the model's performance on each. Unlike accuracy, loss is not a percentage; it is the accumulation of the errors made on every example in the training or validation set.
Our primary goal in training is to adjust the weight vector using various optimization techniques so as to minimize the loss function with respect to the model's parameters. The resulting loss values indicate how well or poorly a model performs after each optimization cycle; ideally, the loss decreases after one or more iterations.
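For plain gradient descent with parameters $\theta$, loss $\mathcal{L}$, and learning rate $\eta$, one such update step is
$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} \mathcal{L}(\theta),$$
and Adam follows the same principle while scaling the gradient by bias-corrected estimates of its first and second moments.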
The CNN-based autoencoder model achieves an accuracy of up to 90.5% on the testing dataset. The confusion matrix of the CNN-based autoencoder classifier is given in Figure 8.
Similar to the standard CNN, the results shown in Figure 8 are also very strong for Walking, Sitting, and Standing (1.0, 0.96, and 0.95). Compared to Figure 6, this model shows slightly lower accuracy for Upstairs and Downstairs (0.71 and 0.79), and the confusion between these stair activities is also higher than in the standard CNN, with a greater number of their instances classified incorrectly, particularly as Walking.
Both models show generally good performance and are very good at classifying Walking, Sitting, and Standing; both struggle with Upstairs and Downstairs. Although the autoencoder-based model is similarly strong for several classes, its performance is slightly lower, as indicated by a larger number of misclassifications for Upstairs and Downstairs.
Both models classify the data for this HAR task well, but neither provides superior classification for the Upstairs and Downstairs classes. The confusion matrices make the kinds of mistakes each model makes more visible, which further motivates model refinement through feature engineering or architecture modification.
Figure 9a shows how well the model learned as the number of epochs increased. The model learns well over time: the accuracy on the training set increases from 92.4% to 97% over 100 epochs, and the accuracy on the testing set improves in this range as well (from 86.5% to 91%). The model is consistent in its test accuracy, which settles at approximately 90.5%.
The accuracy of the LSTM RNN classifier is 96.1%; it could potentially be enhanced somewhat by reducing the sliding window step size. The confusion matrix in Figure 10 is used to analyze the results for each particular activity.
The same trend continues for the LSTM RNN model: activities such as Walking, Jogging, Sitting, and Standing are classified quite well, whereas the Upstairs and Downstairs activities are not.
The LSTM RNN model performs better overall on the validation dataset, which verifies the theoretical assumption, made before these models were created, that LSTM RNNs conceptually do a better job of classifying time series data than CNNs. However, contrary to the CNN's performance on the validation dataset, the model did not succeed in identifying the Upstairs and Downstairs activities more correctly: for the best-recognized activity, the LSTM and CNN models perform at 99% and 96%, respectively, but for Upstairs and Downstairs the CNN model performs much better than the LSTM RNN.
Figure 11 shows the changes in model accuracy on both the training and test sets as training progresses. The model's training accuracy peaks at 40–50 epochs, where the gap between training and test accuracy is smallest. Training and validation accuracy both increase, with the training accuracy somewhat higher than the validation accuracy but not extremely so, indicating only mild overfitting: the model learns well during training but does not fully generalize to unseen data. Training and validation accuracy plateau at approximately 0.95 and 0.92, respectively.
The loss plot shows a decrease in both training and validation loss during training. Again, the training loss is lower than the validation loss, which is typical, but the difference is reasonably small, further supporting the conclusion of moderate overfitting rather than significant overfitting. The preprocessing steps described previously contributed to improved model performance by ensuring that only high-quality, labeled samples were used.
While the primary focus of this study was to optimize DL models specifically for the WISDM dataset, we acknowledge the importance of evaluating generalizability across diverse datasets. The WISDM dataset offers a comprehensive set of accelerometer-based readings from 36 users performing six distinct activities, making it a benchmark dataset widely used in HAR research. This allows for direct comparison with prior studies and provides a robust baseline for model evaluation. The data preprocessing steps, as well as model optimizations (CNN, CNN-based autoencoder, and LSTM RNN), were tailored to maximize performance on this dataset while ensuring consistency and reproducibility.
To better understand where the models differ most, we calculated standardized residuals in Table 4 to highlight the greatest deviations. Standardized residuals measure how far each observed count is from the expected count in the Chi-square test.
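For cell $(i, j)$ of the contingency table, with observed count $O_{ij}$ and expected count $E_{ij}$, the standardized residual is
$$r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}},$$
so values far from zero (conventionally $|r_{ij}| > 2$) flag the cells that deviate most from the expected counts.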
According to Table 4, the LSTM RNN has the best values: more correct classifications than expected and far fewer misclassifications than expected, confirming that it outperforms the others. The CNN-based autoencoder performs worse than expected, with fewer correct classifications and significantly more errors than predicted. The CNN is fairly neutral, never deviating far from its expected counts. In summary, the LSTM RNN performs much better than the CNN and the CNN-based autoencoder, confirming it as the top model; the CNN-based autoencoder is the worst, with a much worse-than-expected error rate; and the CNN sits in between, performing as expected.
Table 5 presents the performance comparison results on the WISDM dataset. With the suggested optimizations applied to the baseline architectures, the obtained accuracy (LSTM RNN) is higher than that reported in previous research.
The types of sensors used in the two datasets significantly affect the nature of the data and, in turn, model performance. UCI-HAR mainly uses the smartphone's embedded tri-axial inertial sensors (accelerometer and gyroscope) for activity recognition, whereas WSDN considers wireless sensing technologies such as WiFi and ultra-wideband signals. This represents a wide range of sensors, from the precise motion tracking provided by inertial sensors to the wider environmental contexts possible with wireless sensing. There is also considerable variation in the environments and activities that these datasets capture.
Although it may include some unpredictability and noise, WSDN considers less controlled surroundings and hence captures intricate and varied activities that may better mimic real-world circumstances. Despite sharing the same HAR goal, the difference in approach and sensors between the two datasets leads to different classification accuracies and practical usefulness. UCI-HAR represents data from structured environments with well-defined activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. The dataset was collected from 30 people performing these activities with a smartphone mounted at the waist, with the data recorded by the smartphone's sensors (accelerometer and gyroscope).
Table 6 presents a performance comparison between the 4-layer CNN-LSTM model and the optimal LSTM-RNN model for human activity recognition (HAR) on the UCI-HAR dataset, where the optimal LSTM-RNN reaches an overall accuracy of 99.6%.
Both models achieve perfect classification (accuracy = 1) for five of the six activities. For the remaining activity, the 4-layer CNN-LSTM achieves an accuracy of 0.983, slightly lower than the 0.987 achieved by the optimal LSTM-RNN, indicating that the optimal LSTM-RNN has a marginal advantage in distinguishing that activity.
The 4-layer CNN-LSTM model achieves an overall accuracy of 99.71%, while the optimal LSTM-RNN model achieves a slightly lower accuracy of 99.6%. This reflects that both models are highly effective for human activity detection using the UCI-HAR dataset, with negligible differences in their overall performance.
While the primary focus of this article was to optimize DL models specifically for the WISDM dataset, we also demonstrate the efficiency of the proposed architecture in detecting activities using the UCI-HAR dataset.
6.3. Discussion
Activities such as walking, jogging, and sitting have very high accuracy, whereas a significant number of misclassifications occur among the remaining three classes. This is not very surprising, because there is an inherent similarity in the sensor information collected for those activities.
In all models, the confusion matrices show a considerable number of instances where sitting is misclassified as standing and vice versa. This is probably due to similar postural characteristics: both activities involve relatively static body positions and hence change the acceleration only minimally, which the model fails to capture. Sensor limitations with respect to subtle movements, along with the difficulty of distinguishing minimal changes in acceleration, may also present challenges.
The errors involving the upstairs and downstairs classifications arise from the similarity in the rhythmic nature of the movements and their similar frequency spectra. Both involve changes in posture and motion but may have overlapping frequency ranges that the model has trouble separating without additional, perhaps more subtle, features.
While our models are good at recognizing structured human activities such as walking, jogging, and sitting, they may require further adaptation or supplementation to accurately classify short-term gestures. Potential future research directions, such as multi-modal sensor fusion or finer time segmentation, could improve classification for transient activities. This limitation is particularly relevant for gestures, which often require higher sampling rates, additional sensor modalities (e.g., rotation from a gyroscope), or domain-specific models such as attention-based models to improve recognition performance.
Although our suggested DL architectures (CNN, CNN-based autoencoder, and LSTM RNN) performed well in HAR on the WISDM dataset, the following limitations should be taken into account:
Dataset-specific performance: The models were trained and tested on the WISDM dataset, which may lack the variability present in real HAR scenarios. Variability in sensor placement, user groups, and environments between datasets can affect the model's generalizability.
Limited sensor modalities: Only accelerometer data are used in this paper. Though adequate for HAR, incorporating other sensors such as gyroscopes and magnetometers would provide additional feature representations and hence improved classification performance for movements with subtler differences in motion.
Sensor placement variability: The data assume a consistent smartphone placement (typically in the pocket). In reality, usage is less consistent: phones are held in different positions (e.g., hand, wrist, backpack), which affects data patterns. Domain adaptation methods for making models robust to placement should be researched in the future.
Class imbalance: The WISDM dataset is class-imbalanced, with standing and sitting under-represented. Preprocessing has already been performed, but other methods such as synthetic data augmentation or a weighted loss could be employed to enhance performance further (see the sketch after this list).
Computational complexity: Deep models, particularly LSTM RNNs, are computationally intensive. Optimization techniques such as model pruning, quantization, or knowledge distillation could be utilized when deploying such models on edge devices or low-capacity phones.
Real-time constraints: Although accuracy was our priority, real-time inference speed and latency were not investigated in depth. Investigating the trade-offs between model complexity and deployment efficiency is an important area of future work.
Future work will address these limitations through cross-dataset validation, multimodal sensor fusion, adaptive learning methods, and computationally efficient DL models for real-time deployment.
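As an example of the weighted-loss option mentioned under class imbalance above, a sketch with scikit-learn and Keras (the variable names carry over from the earlier sketches and are assumptions):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Give under-represented classes (e.g., standing, sitting) larger weights
# so that their errors contribute more to the loss during training.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
# model.fit(X_train, y_train, epochs=100, batch_size=128,
#           class_weight=class_weight, validation_data=(X_val, y_val))
```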