We investigated the main factors behind the low BMI classification accuracy reported in [28]. The main reason for this low accuracy was that Andersson et al. [28] used an undersampling technique to overcome the class imbalance problem in their dataset. For benchmarking, we tested five machine learning algorithms, namely C-SVM, nu-SVM, k-NN, Naive Bayes (NB) and decision tree (DT), in conjunction with the undersampling technique. C-SVM and nu-SVM were implemented using LIBSVM (version 3.21), which is an open-source library for SVMs [51]. Following the recommendations in [52], the radial basis function (RBF) kernel was used for the C-SVM and nu-SVM models. C-SVM has a cost parameter denoted as $c$, whose value ranges from zero to infinity. nu-SVM has a regularization parameter denoted as $\nu$, whose value lies within $(0, 1]$. The RBF kernel has a gamma parameter denoted as $g$. The grid search method was used to find the best parameter combination for the C-SVM and nu-SVM models. The remaining k-NN, NB and DT models were implemented using the MATLAB functions “fitcknn”, “fitcnb” and “fitctree”, respectively, from the Statistics and Machine Learning Toolbox. To determine the best hyperparameter configuration for each model, we executed the hyperparameter optimization processes supported by these functions. All models were implemented and tested in MATLAB R2018a (9.4.0.813654).
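For illustration, a minimal grid search over the cost and gamma parameters of the C-SVM using LIBSVM’s MATLAB interface could be sketched as follows; the search ranges and the variables `features` and `labels` are assumptions for illustration, and `svmtrain` called with the `-v` option returns the cross-validation accuracy.

```matlab
% Hypothetical grid search for the C-SVM (RBF kernel) using LIBSVM's
% MATLAB interface. "features" (N x D) and "labels" (N x 1) are assumed
% to hold the training data; the search ranges are illustrative only.
bestAcc = 0; bestC = NaN; bestG = NaN;
for log2c = -5:2:15                      % candidate cost values 2^-5 .. 2^15
    for log2g = -15:2:3                  % candidate gamma values 2^-15 .. 2^3
        opts = sprintf('-s 0 -t 2 -c %g -g %g -v 5 -q', 2^log2c, 2^log2g);
        acc = svmtrain(labels, features, opts);   % 5-fold CV accuracy
        if acc > bestAcc
            bestAcc = acc; bestC = 2^log2c; bestG = 2^log2g;
        end
    end
end
% Retrain on all data with the best parameters found.
model = svmtrain(labels, features, ...
    sprintf('-s 0 -t 2 -c %g -g %g -q', bestC, bestG));
```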
6.1. Results of Five-Fold Cross-Validation
During the development of the proposed ensemble model, determining the optimal number of k-NN models was a significant challenge. To address this, we constructed several ensemble models with different numbers of k-NN models and compared the resulting performance metrics. Each k-NN model in an ensemble was implemented using the MATLAB function “fitcknn” from the Statistics and Machine Learning Toolbox. To determine the optimal hyperparameter configuration for each k-NN model, we performed hyperparameter optimization using this function. There were five optimizable parameters: the value of k, the distance metric, the distance weighting function, the Minkowski distance exponent and a flag to standardize the predictors. During the training phase of each model, these five parameters were tuned by the hyperparameter optimization process. The proposed ensemble model was implemented and tested in MATLAB R2018a (9.4.0.813654). The MATLAB code is available at
https://sites.google.com/view/beomkwon/bmi-classification.
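For reference, a sketch of this optimization step is shown below; the variables `X` and `Y` are assumed placeholders for the training features and labels, and the five option names correspond to the five parameters listed above.

```matlab
% Sketch of the hyperparameter optimization step for one k-NN model.
% X (N x D) and Y (N x 1) are assumed placeholders for the training
% features and class labels, respectively.
params = {'NumNeighbors', 'Distance', 'DistanceWeight', ...
          'Exponent', 'Standardize'};
mdl = fitcknn(X, Y, ...
    'OptimizeHyperparameters', params, ...
    'HyperparameterOptimizationOptions', ...
    struct('AcquisitionFunctionName', 'expected-improvement-plus', ...
           'ShowPlots', false, 'Verbose', 0));
```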
Table 2 lists the performance metrics for the ensemble models with different numbers of k-NN models. A detailed description of the experimental setup is given in Table 3. In Table 2, the proposed ensemble model consists of 15 k-NN models. Comparison models #1 through #9 consist of 3, 6, 10, 16, 17, 18, 19, 20 and 21 k-NN models, respectively.
One can see that the performance of the ensemble model increased as the number of k-NN models increased. However, when the number of k-NN models exceeded 15, the performance of the ensemble model did not improve further. Based on these results, we selected the ensemble model consisting of 15 k-NN models for additional testing. Nevertheless, as shown in Table 2, the average running time of the ensemble model also increased with the number of k-NN models, so the trade-off between classification accuracy and running time needs to be considered in real-world applications.
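For concreteness, the structure of such an ensemble can be sketched as follows, assuming each k-NN member is trained on its own feature set and the member predictions are combined by majority voting (the cell arrays `Xtrain` and `Xtest`, the numeric class labels and the voting rule are assumptions for illustration):

```matlab
% Sketch of an ensemble of M k-NN models with majority voting. Xtrain{m}
% and Xtest{m} are assumed to hold the feature matrix used by the m-th
% member; Ytrain holds the (numeric) training labels.
M = 15;                                   % number of k-NN members
preds = zeros(size(Xtest{1}, 1), M);
for m = 1:M
    mdl = fitcknn(Xtrain{m}, Ytrain, ...
        'OptimizeHyperparameters', 'auto', ...
        'HyperparameterOptimizationOptions', ...
        struct('ShowPlots', false, 'Verbose', 0));
    preds(:, m) = predict(mdl, Xtest{m});
end
Ypred = mode(preds, 2);                   % majority vote across members
```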
Table 4 lists the optimized parameter settings for the ensemble model consisting of 15
k-NN models in each cross-validation fold.
Table 5 lists the five-fold cross-validation results of the five benchmark methods discussed above. Here, instead of SMOTE, the undersampling technique was used for the performance evaluations. Therefore, during the training phase of each cross-validation fold, for class 2, 268 (i.e., 292 − 24) anthropometric feature vectors were randomly selected from among the 292 total vectors and removed from the training dataset. For class 3, 104 (i.e., 128 − 24) vectors were randomly selected from among the 128 total vectors and removed from the training dataset. The remaining 72 (i.e., 24 × 3) vectors were used to train the algorithm for each method. As shown in the table, for each method, there were significant variations in the TPR, PPV and $F_1$ values over the three classes. These results demonstrate that the undersampling technique used in [28] is not effective in overcoming the class imbalance problem.
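The undersampling step described above can be sketched as follows; this is a minimal illustration, assuming numeric class labels in `Y` and a per-class target size of 24 matching the counts given above.

```matlab
% Minimal random undersampling sketch: keep at most "target" vectors per
% class. X (N x D) and Y (N x 1) are assumed to hold the training data.
target = 24;                              % size of the minority class
keep = false(size(Y));
for c = unique(Y).'
    idx = find(Y == c);
    idx = idx(randperm(numel(idx)));      % shuffle within the class
    keep(idx(1:min(target, numel(idx)))) = true;
end
Xbal = X(keep, :);                        % balanced training features
Ybal = Y(keep);                           % balanced training labels
```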
Table 6 lists the five-fold cross-validation results of the five benchmark methods when SMOTE is used to alleviate the class imbalance problem. The results in this table demonstrate that SMOTE improves the TPR, PPV and $F_1$ score values of the five methods compared to the results in Table 5. By using (12)–(14), we computed the average metric values over five-fold cross-validation, as shown in Table 7. The results in this table demonstrate that it is better to use SMOTE than the undersampling technique to alleviate the class imbalance problem. For all benchmark methods, average performance improvements are achieved when SMOTE is applied. In particular, the k-NN model outperforms the other methods, achieving TPR = 0.9276, PPV = 0.8512, $F_1$ = 0.8798 and accuracy = 0.9279. Based on these results, we decided to use k-NN models to construct the proposed ensemble model.
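Since MATLAB R2018a has no built-in SMOTE, we note for reference how the core SMOTE interpolation step can be sketched; this is an illustration of the general technique under assumed inputs (`Xmin`, `k`, `nSynth`), not the exact implementation used in our experiments.

```matlab
% Sketch of the core SMOTE step: synthesize new minority-class samples by
% interpolating between each minority sample and one of its k nearest
% minority neighbors. Xmin (n x D) is assumed to hold the minority class.
k = 5;                                    % number of nearest neighbors
nSynth = 100;                             % number of synthetic samples
[n, D] = size(Xmin);
% Nearest neighbors within the minority class (the first column is the
% point itself, so request k+1 neighbors and drop it).
nbrs = knnsearch(Xmin, Xmin, 'K', k + 1);
nbrs = nbrs(:, 2:end);
Xsynth = zeros(nSynth, D);
for s = 1:nSynth
    i = randi(n);                         % random minority sample
    j = nbrs(i, randi(k));                % one of its k neighbors
    lambda = rand;                        % interpolation factor in [0, 1]
    Xsynth(s, :) = Xmin(i, :) + lambda * (Xmin(j, :) - Xmin(i, :));
end
```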
Figure 6 presents the confusion matrices for the proposed ensemble model for each cross-validation fold. The diagonal elements in each confusion matrix indicate the numbers of correctly classified samples, and the off-diagonal elements indicate the numbers of misclassified samples. One can see that the proposed ensemble model classifies class 1, the minority class among the three classes, without any misclassification. Additionally, for class 3 (the second-smallest class), the ensemble model exhibits a low misclassification rate.
Based on the confusion matrices in Figure 6, we computed the TPR, PPV and $F_1$ score values of each class for five-fold cross-validation. Additionally, we computed the macro-average values of these metrics as well as the classification accuracy. The performance evaluation results for the proposed ensemble model are summarized in Table 8. As shown in this table, the proposed model exhibits robust and accurate classification performance for both the minority and majority classes, achieving approximately 98.2% average classification accuracy.
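For reference, these quantities follow directly from a confusion matrix; a minimal sketch, assuming `C` is the 3 × 3 confusion matrix with true classes in rows and predicted classes in columns:

```matlab
% Per-class TPR, PPV and F1 from a confusion matrix C (rows = true class,
% columns = predicted class), followed by their macro-averages.
tp  = diag(C);
tpr = tp ./ sum(C, 2);                    % recall (TPR) per class
ppv = tp ./ sum(C, 1).';                  % precision (PPV) per class
f1  = 2 * (tpr .* ppv) ./ (tpr + ppv);    % F1 score per class
macroTPR = mean(tpr); macroPPV = mean(ppv); macroF1 = mean(f1);
accuracy = sum(tp) / sum(C(:));
```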
Table 9 lists the macro-average values of the TPR, PPV and $F_1$ scores, as well as the accuracy of each method over five-fold cross-validation. One can see that the proposed ensemble model performs best in terms of all evaluation metrics. This is because the ensemble model is trained using anthropometric features calculated over long-, mid- and short-term periods. In other words, the use of such features enables the ensemble model to be trained effectively by minimizing the adverse effects of variance in the extracted features. As a result, the proposed model achieves robust and accurate BMI classification performance. Among the benchmark methods, the k-NN model achieves the best performance, whereas DT achieves the worst. The classification accuracy of the proposed model is approximately 5.23% greater than that of a single k-NN model.
To verify the benefits of using anthropometric features calculated over various periods in the BMI classification task, we analyzed the standard deviations of the anthropometric features. Let $\mathbf{a}_r$ denote the anthropometric feature vector for the $r$th skeleton sequence in the dataset; according to the definition in (6), the dimension of $\mathbf{a}_r$ is 20. In addition, let $a_{r,u}$, $u = 1, \ldots, 20$, be the $u$th element of $\mathbf{a}_r$. Then, the average value of the $u$th feature over all sequences can be obtained as

$\bar{a}_u = \frac{1}{R} \sum_{r=1}^{R} a_{r,u}$, (15)

where $R$ is the total number of skeleton sequences. Based on (15), the standard deviation of each of the 20 anthropometric features is calculated as

$\sigma_u = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( a_{r,u} - \bar{a}_u \right)^2}$. (16)
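In MATLAB, (15) and (16) amount to a per-feature mean and population standard deviation over all sequences; a minimal sketch, assuming `A` is an $R \times 20$ matrix whose $r$th row is the feature vector of the $r$th sequence:

```matlab
% A is assumed to be an R x 20 matrix of anthropometric feature vectors,
% one row per skeleton sequence.
abar  = mean(A, 1);     % Equation (15): per-feature average
sigma = std(A, 1, 1);   % Equation (16): population standard deviation
```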
Figure 7 presents the standard deviations of the 20 anthropometric features for five different periods (i.e., $W$, $W/2$, $W/3$, $W/4$ and $W/5$). In this figure, for the cases of $W/2$, $W/3$, $W/4$ and $W/5$, we calculated the average of the standard deviations over the corresponding segments. For example, for $W/5$, there are five equal-length segments according to the frame division process of the proposed method. We calculated the standard deviations in (16) for each of the segments and then averaged them over the five segments. As shown in Figure 7a, the features have high standard deviations when they are calculated over all frames (i.e., $W$). In contrast, the features calculated for $W/2$, $W/3$, $W/4$ and $W/5$ have relatively low standard deviations. In particular, the features calculated for $W/5$ have the lowest standard deviation values. A low standard deviation indicates that the features are clustered around the average, whereas a high standard deviation means that the features are spread out over a wide range. Because the anthropometric features are calculated as average lengths over a given segment sequence, the high standard deviations shown in Figure 7a may adversely affect the performance of machine learning algorithms. In contrast, in our ensemble learning method, anthropometric features with low standard deviations are used to train/test the multiple k-NN models. Through the use of these features, the proposed ensemble model can be trained effectively by minimizing the adverse effects of variance in the extracted features, resulting in state-of-the-art performance.
6.2. Results of Leave-One-Person-Out Cross-Validation
Figure 8 shows the leave-one-person-out cross-validation process used in this work. As shown in the figure, in each validation round, the skeleton sequences of one person were used to test the classifier, and the sequences of the remaining 111 people were used for training. In each round, the predicted results (i.e., the predicted BMI classes) for the test skeleton sequences were obtained. After all 112 validation rounds were completed, we calculated a confusion matrix from all predicted results. Based on this confusion matrix, we calculated the TPR, PPV and $F_1$ values over the three classes in order to evaluate the performance of the classifier.
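A minimal sketch of this protocol is given below; `personID` is assumed to label each sequence with its subject, and `trainAndPredict` is a hypothetical placeholder for training a given classifier and predicting on the held-out person.

```matlab
% Leave-one-person-out cross-validation sketch. X holds one feature
% vector per sequence, Y the class labels, and personID the subject each
% sequence belongs to. trainAndPredict is a placeholder for the actual
% train/predict step of a given classifier.
persons = unique(personID);
Ytrue = []; Ypred = [];
for p = persons.'
    testIdx  = (personID == p);           % all sequences of one person
    trainIdx = ~testIdx;                  % sequences of the other people
    yhat = trainAndPredict(X(trainIdx, :), Y(trainIdx), X(testIdx, :));
    Ytrue = [Ytrue; Y(testIdx)];          %#ok<AGROW>
    Ypred = [Ypred; yhat];                %#ok<AGROW>
end
C = confusionmat(Ytrue, Ypred);           % pooled confusion matrix
```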
Table 10 shows the leave-one-person-out cross-validation results of the five benchmark methods when the undersampling technique was used, whereas Table 11 shows the corresponding results when SMOTE was used. By using (12)–(14), we computed the macro-average values of these metrics over the three classes. In addition, we calculated the classification accuracy of each benchmark method. The performance evaluation results for the five benchmark methods are summarized in Table 12. From the table, it can be seen that SMOTE improves the macro-averages of the TPR, PPV and $F_1$ of each benchmark method compared with the results obtained with the undersampling technique. In addition, the BMI classification accuracy of each method also improved when SMOTE was used. These results demonstrate that SMOTE is more effective than the undersampling technique at overcoming the class imbalance problem in the dataset used in [28].
To find the optimal number of k-NN classifiers in the proposed ensemble model, we evaluated the performance of ensemble models with different numbers of k-NN classifiers. Table 13 lists the performance metrics for each model. From this table, it can be seen that the best performance was achieved when the ensemble model consisted of 15 k-NN models. Based on these results, we selected the ensemble model consisting of 15 k-NN models for additional testing.
Table 14 lists the macro-average values of the TPR, PPV and $F_1$ scores, as well as the accuracy of each method over leave-one-person-out cross-validation. As shown in the table, the proposed method outperforms the other methods on all evaluation metrics, achieving a BMI classification accuracy of 73%. In the proposed method, anthropometric features calculated over long-, mid- and short-term periods were used to train the ensemble model. The use of such features in the training phase reduces the adverse effects of variance in the extracted features. As a result, the model could be trained effectively and achieved the best performance among the compared methods.