**3. Results**

The performance achieved using various machine learning classifiers for the seven experiments (E1–E7) is presented in Figures 3–9, and the corresponding per-class performance is presented in Tables A1–A7 in Appendix A.

**Figure 3.** Performance analysis of classifiers using the train/test split in experiment 1 (E1).

**Figure 4.** Performance analysis of classifiers using the train/test split in experiment 2 (E2).

**Figure 5.** Performance analysis of classifiers using the train/test split in experiment 3 (E3).

**Figure 6.** Performance analysis of classifiers using the train/test split in experiment 4 (E4).

**Figure 7.** Performance analysis of classifiers using the train/test split in experiment 5 (E5).

**Figure 8.** Performance analysis of classifiers using the train/test split in experiment 6 (E6).

**Figure 9.** Performance analysis of classifiers using the train/test split in experiment 7 (E7).

It is evident from Figure 3 that all the classifiers achieved a performance above 85% in classifying the ADLs (sitting, standing, walking, lying, walking upstairs and walking downstairs). These findings show the strength of the proposed machine-learning-based activity classification methods in classifying ADLs. The best performer among all classifiers is SVM, which achieved a performance of 96.38% (please see Figure 3). The SVM also outperformed the activity classification method proposed by Anguita et al. [28], thus confirming a performance improvement over the existing works. The second-best performer is GB, with a performance of 93.82%. All other classifiers also performed considerably well except CB, whose performance is the worst among all (85.99%). The detailed per-class performance is visualized in Table A1. It is evident from Table A1 that most low-performing classifiers largely struggled to distinguish between the sitting and standing activities and between the walking upstairs and walking downstairs activities. This could be because the smartphone is waist-mounted during data collection, so the sitting and standing postures produce relatively similar accelerometer and gyroscope signals given the smartphone orientation. The same applies to the upstairs and downstairs activities, which makes it hard for the classifiers to distinguish between the different postures and locomotive activities. However, the SVM performed well in this scenario, which could be because the SVM uses maximum-margin hyperplanes to separate the classes during the training stage, assisting in better distinguishing these ADLs (sit vs. stand, upstairs vs. downstairs).
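The multi-class SVM setup described above can be sketched as follows. This is an illustrative example only, not the authors' exact pipeline: the data is synthetic and merely stands in for the windowed accelerometer/gyroscope feature vectors used in the experiments.

```python
# Illustrative sketch only (not the authors' pipeline): a multi-class SVM
# trained on synthetic feature vectors that stand in for windowed
# accelerometer/gyroscope statistics. All data below is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
ACTIVITIES = ["sitting", "standing", "walking",
              "lying", "walking upstairs", "walking downstairs"]

# 500 synthetic 10-dimensional feature vectors per class; each class is
# centred on a different mean so the problem is separable but noisy.
X = np.vstack([rng.normal(loc=2.0 * i, scale=1.0, size=(500, 10))
               for i in range(len(ACTIVITIES))])
y = np.repeat(np.arange(len(ACTIVITIES)), 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# An RBF-kernel SVM fits maximum-margin hyperplanes in a kernel-induced
# feature space, which helps separate closely related postures.
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
print(f"macro F1 on synthetic data: {macro_f1:.3f}")
```

On real ADL data, the feature extraction step (windowing and statistics over the inertial signals) would replace the synthetic draws above.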

In experiment 2 (E2), only the walking class is imbalanced during the training stage, with a total of 100 samples, while the number of samples of all other classes remained as per the original, balanced distribution (please see Table 2). As expected, and as evident from Figure 4 and Table A2, most classifiers struggled to classify the walking activity due to its low representation in the training stage. This suggests that class imbalance has serious consequences for the overall performance of the activity classification system and for the performance of the underrepresented/imbalanced class(es). Our findings show that the best performance of 90.21% is obtained using the SVM classifier, while none of the other classifiers achieved a performance above 80%. This is an interesting finding and suggests the strength of the SVM in handling class imbalance. The SVM inherently possesses the property of adaptive weighting, which assigns more weight to the imbalanced classes and less weight to the overrepresented or balanced classes [10], thus improving classification performance.
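One common way to realise this kind of class reweighting, sketched here with a scikit-learn `SVC` on synthetic data (an assumption, not the authors' implementation), is `class_weight="balanced"`, which rescales the penalty C for each class inversely to its training frequency:

```python
# Illustrative sketch, assuming a scikit-learn SVC: `class_weight="balanced"`
# rescales the penalty C for each class inversely to its training frequency,
# approximating the adaptive weighting described above. Data is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
# Majority class: 500 training samples; minority ("walking"-like): 100.
X_train = np.vstack([rng.normal(0.0, 1.0, (500, 5)),
                     rng.normal(1.5, 1.0, (100, 5))])
y_train = np.array([0] * 500 + [1] * 100)

X_test = np.vstack([rng.normal(0.0, 1.0, (200, 5)),
                    rng.normal(1.5, 1.0, (200, 5))])
y_test = np.array([0] * 200 + [1] * 200)

plain = SVC(kernel="rbf").fit(X_train, y_train)
weighted = SVC(kernel="rbf", class_weight="balanced").fit(X_train, y_train)

rec_plain = recall_score(y_test, plain.predict(X_test), pos_label=1)
rec_weighted = recall_score(y_test, weighted.predict(X_test), pos_label=1)
print(f"minority-class recall, unweighted: {rec_plain:.3f}")
print(f"minority-class recall, weighted:   {rec_weighted:.3f}")
```

The reweighting shifts the decision boundary toward the majority class, which typically improves recall on the underrepresented class.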

In experiment 3 (E3), walking and walking upstairs are the imbalanced classes at the training stage, with a total of 100 samples each. The results in Figure 5 and Table A3 suggest that the best performer in classifying the different ADLs is still the SVM, with a performance of around 80%, while the second best is ADA (RF), with a performance of 75.69%.

In experiment 4 (E4), the underrepresented or imbalanced classes are walking, walking upstairs and walking downstairs. The best performance of 87.88% is achieved by the SVM classifier, and the second-best performers are ADA (DT) and ADA (RF), with performances above 83%, as shown in Figure 6 and Table A4. The worst performer is GB, with an F-score of 62.96%. It is important to note that the performance of all the classifiers generally improved in E4 compared to E3. This could be because only walking and walking upstairs are imbalanced in E3, while in E4, walking, walking upstairs and walking downstairs are all imbalanced. This suggests that in E3, the sample mismatch between walking upstairs and walking downstairs could have induced more bias in the trained classifiers, since only the walking upstairs class is imbalanced and not the walking downstairs class. This bias is reduced in E4, since both walking upstairs and walking downstairs are imbalanced in equal proportion relative to the other classes, giving most of the classifiers an equal opportunity to train properly.

In experiment 5 (E5), walking, walking upstairs, walking downstairs and sitting are underrepresented and imbalanced compared to the other, majority classes. The results shown in Figure 7 and Table A5 suggest that the SVM again outperformed all the other classifiers with an F-score of 81.7%, and ADA (RF) is the second-best classifier with an F-score of 78.02%.

During experiment 6 (E6), all classes are underrepresented except the majority class, which is lying. The performance analysis of the classifiers using the E6 train/test split is shown in Figure 8 and Table A6. All the classifiers achieved a performance above 70%. Similar to the previous experiments' results, the SVM outperformed all the other classifiers with a performance of 85.03%, and the second-best performance was obtained by the ADA (RF) classifier.

In experiment 7 (E7), all the classes are balanced with equal samples; however, the number of samples is very low (100, please see Table 2) compared to the original samples in E1 (around 500 per class, please see Table 2). A lower number of training samples can degrade the performance of machine learning classifiers, since supervised machine learning depends on feeding sufficient data to the classifiers. Fewer samples therefore mean fewer training opportunities for the classifiers to estimate and quantify the underlying trends in the data. The performances of the different classifiers on the E7 dataset are depicted in Figure 9 and Table A7. The SVM and ADA (RF) classifiers performed well, with performances of 84.97% and 83.15%, respectively, while the lowest performance of 62.4% was achieved by the GB classifier.
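The effect of the training budget can be illustrated with a small sketch. This is synthetic and only indicative of the trend, not a reproduction of the E1/E7 splits: the same classifier is trained on 500 and then 100 samples per class and scored on a common held-out set.

```python
# Illustrative sketch: shrinking the per-class training budget (as in E7's
# 100 samples per class) tends to reduce test accuracy. Data is synthetic
# and stands in for the handcrafted features used in the experiments.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_classes, dim = 6, 10

def make_split(n_per_class):
    """Draw n_per_class noisy feature vectors for each of the 6 classes."""
    X = np.vstack([rng.normal(loc=0.8 * i, scale=1.5, size=(n_per_class, dim))
                   for i in range(n_classes)])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_test, y_test = make_split(300)
accs = {}
for n in (500, 100):  # full budget (E1-like) vs reduced budget (E7-like)
    X_tr, y_tr = make_split(n)
    accs[n] = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_test, y_test)
    print(f"{n} samples/class -> test accuracy {accs[n]:.3f}")
```

With overlapping classes, the smaller training set gives the classifier fewer opportunities to estimate the class boundaries, mirroring the drop observed between E1 and E7.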
