1. Introduction
Obesity has become a widespread health problem worldwide and is associated with an increased risk of diseases such as cardiovascular disease, diabetes, and stroke. The body mass index (BMI), developed in the 19th century by the Belgian statistician and anthropometrist Adolphe Quetelet, is a simple and reliable index based on weight and height that is commonly used to classify adults as underweight, normal weight, overweight (pre-obesity), or obese. By its conventional definition, it is computed from body mass (kg) and height (m) as
BMI = mass (kg) / height (m)².
For adults, BMI falls into one of the categories defined by the World Health Organization, presented in
Table 1.
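The conventional definition above can be sketched in a few lines of Python; the cut-offs used here (18.5, 25, and 30 kg/m²) are the standard WHO adult thresholds corresponding to Table 1:

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

def who_category(bmi_value):
    """WHO adult nutritional status from a BMI value (kg/m^2)."""
    if bmi_value < 18.5:
        return "underweight"
    elif bmi_value < 25.0:
        return "normal weight"
    elif bmi_value < 30.0:
        return "pre-obesity"
    else:
        return "obesity"

# Example: the MobiAct averages, 76.80 kg at 1.7575 m.
value = bmi(76.80, 1.7575)
print(round(value, 1), who_category(value))  # → 24.9 normal weight
```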
The conventional way to obtain BMI is to carefully measure both body weight and height; however, patients with pre-obesity or obesity may delay medical care out of concern about disparagement by physicians and health care staff or a fear of being weighed [
1]. Alternatively, users can obtain their BMI from online applications on smart mobile terminals. However, these applications require personal body data, including height, weight, age, and sex, which are usually very sensitive. The measurement process requires the active participation of subjects to ensure the authenticity of the data, and improper handling of these personal body data easily leads to significant privacy and security threats.
Sensor-rich smartphones are popular worldwide. Built-in sensors such as the camera, microphone, touch screen, and motion sensors, originally introduced to enhance the phone itself, are now used for a variety of sensing applications from which many aspects of human behavior and traits can be inferred. Consequently, some studies have begun to develop automatic BMI measuring methods for large-scale, long-term remote monitoring using smartphones. Several existing studies have focused on learning based on human facial [
2,
3,
4,
5] and speech signals [
6,
7,
8,
9]. However, there are several potential risks in using privacy-sensitive sensors such as cameras and microphones to collect users’ private data. These sensors are also environmentally sensitive, so their application scenarios are limited.
The human gait arises from the interaction of hundreds of muscles and joints in the body, and motion sensors can capture these interactions and translate them into characteristic patterns linked to individual traits. In practice, it is more advantageous to obtain physical traits from motion sensors, for the following reasons:
- (1)
first and foremost, motion sensors are generally considered less privacy sensitive and more acceptable to users;
- (2)
different from environmentally sensitive sensors, such as cameras and microphones, motion sensors largely avoid the limitations imposed by the environment;
- (3)
motion sensors are more popular than other sensors, not only in smartphones, but also in other smart devices, such as smart bracelets.
Motion sensor-based BMI prediction nevertheless poses great challenges. Firstly, sensor data of the human gait have a comparatively low signal-to-noise ratio, i.e., sources that carry BMI-relevant information affect the signals less than sources that do not. Moreover, motion sensor data are multi-dimensional with a special temporal-spatial structure, which is difficult to exploit for conventional feature-based approaches [
10,
11]. Furthermore, the design of a specific feature extractor that transforms raw data into feature vectors relies on heuristic hand-crafted feature engineering and considerable domain expertise.
In the past decade, deep learning (DL) has achieved great success in many areas. A deep architecture with multiple layers is built up for automating feature design. Specifically, each layer in a deep architecture performs a nonlinear transformation of the outputs of the previous layer [
12], so that DL can automatically make use of much more high-level and meaningful hidden features. The convolutional neural network (CNN) is a well-known DL model with the ability to learn complex, high-dimensional, nonlinear mappings from large collections of examples [
13], and this makes them obvious candidates for image classification [
14], speech recognition [
15], and other recognition tasks related to time series [
16,
17,
18]. Long short-term memory (LSTM) recurrent networks are also widely adopted for general-purpose sequence modeling, and they have proven stable and sturdy for long-range modeling dependencies in previous studies in such fields as audio analysis [
19,
20], video captioning [
21,
22], and sensor-based human activity recognition (HAR) [
23]. Recently, the combination of CNNs and LSTM in a unified stack framework has already offered state-of-the-art results in speech recognition [
24] and some motion sensor-based tasks [
25].
The main contributions of the paper can be summarized as follows:
- (1)
To the best of our knowledge, we are the first to design a hybrid deep neural network with a CNN-LSTM architecture to learn spatial features and temporal features from sensor data for identifying salient patterns related to the BMI.
- (2)
We define motion entropy (MEn), a measure of the regularity and complexity of motion sensor signals, and propose a novel MEn-based filtering strategy that selects a subset of these sub-sequences for training the prediction model.
- (3)
We evaluate the hybrid deep neural network model with the MEn-based filtering strategy using two public datasets in comparison with baseline conventional feature-based approaches that have been applied to infer simple human traits from gaits [
10,
11]. Experimental results show that the proposed model significantly outperforms the baseline methods.
- (4)
We also investigate which types of activities of daily living (ADLs) are more suitable for online BMI prediction.
The remainder of this paper is structured as follows:
Section 2 gives a brief overview of related work. In
Section 3, we present the proposed hybrid deep neural network model and the corresponding MEn-based filtering strategy. Experimental results from the evaluation of two public datasets are presented in
Section 4. Conclusions and a discussion are in
Section 5.
4. Experiments
4.1. Dataset Description
In the experiments, we considered two public datasets, MobiAct [
35] and Motion-Sense [
36]. On these datasets, we compared the proposed deep learning model with traditional feature-based approaches.
In the following, we provide an introduction to MobiAct and Motion-Sense. Unlike other public datasets that require the smartphone to be rigidly placed on the human body with a specific orientation, the datasets we selected were recorded with devices located freely in pants pockets in a random orientation. These data corresponded to daily life, which also distinguished them from data acquired from one or more wearable sensors strapped to specific parts of the human body.
4.1.1. MobiAct
The MobiAct dataset consisted of recordings from 67 subjects. The average age of the subjects was 25.19 years, the average height 175.75 cm, and the average weight 76.80 kg. Motion sensor data were recorded by a Samsung Galaxy S3 with the LSM330DLC inertial module.
4.1.2. Motion-Sense
The Motion-Sense dataset consisted of recordings from 24 subjects (14 men and 10 women). The subjects’ ages ranged from 18 to 46 years, heights from 161 to 190 cm, and weights from 48 to 102 kg. Each participant carried an iPhone 6s freely in a pants pocket in a random orientation. The motion sensors were logged by SensingKit, which recorded accelerometer, gravity sensor, gyroscope, and attitude sensor data.
4.1.3. BMI values and the Nutritional Status of the Two Datasets
The BMI values of the two benchmark datasets are shown in
Figure 2. According to the conversion method of
Table 1, the nutritional status of the subject associated with BMI is shown in
Figure 3. We found that the normal weight and pre-obesity populations accounted for a large proportion, mainly because the data samples were collected on a university campus and the average age of the volunteers was less than 30.
4.2. Comparison with Existing Methods
We compared our proposed hybrid deep neural network model with some state-of-the-art feature-based methods using WEKA [
37], which included a k-nearest neighbor algorithm (kNN) that is called an instance-based learner (IBk) in WEKA [
10], a support vector machine (SVM), and a C4.5 decision tree that is called J48 in WEKA [
10]. Among them, the data contained in one example duration were converted into a single example, described by 43 features [
10], which were variations of six essential features, including the average acceleration value, the standard deviation, the average absolute difference, the average resultant acceleration, the time between peaks, and the binned distribution.
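As an illustration, a minimal sketch of these feature families might look as follows. The helper below is hypothetical, not the exact extractor of [10]; the time-between-peaks features (3 more, which bring the count to 43) are omitted for brevity:

```python
import math

def essential_features(ax, ay, az):
    """Per-window features in the spirit of the 43-feature set: per-axis
    average, standard deviation, average absolute difference, and a
    10-bin binned distribution, plus the average resultant acceleration.
    ax, ay, az: equal-length lists of accelerometer samples for one window."""
    feats = {}
    for name, axis in (("x", ax), ("y", ay), ("z", az)):
        n = len(axis)
        mean = sum(axis) / n
        feats[f"mean_{name}"] = mean                                  # average value
        feats[f"std_{name}"] = math.sqrt(sum((v - mean) ** 2 for v in axis) / n)
        feats[f"aad_{name}"] = sum(abs(v - mean) for v in axis) / n   # avg abs diff
        # Binned distribution: fraction of samples in 10 equal-width bins.
        lo, hi = min(axis), max(axis)
        width = (hi - lo) / 10 or 1.0
        bins = [0] * 10
        for v in axis:
            bins[min(int((v - lo) / width), 9)] += 1
        for i, b in enumerate(bins):
            feats[f"bin{i}_{name}"] = b / n
    # Average resultant acceleration sqrt(x^2 + y^2 + z^2) over the window.
    feats["resultant"] = sum(
        math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)
    ) / len(ax)
    return feats
```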
IBk uses a distance measure to locate k “close” instances in the training data for each test instance and uses those selected instances to make a prediction.
An SVM with a radial basis function (RBF) kernel was used as the classifier.
J48 is an algorithm used to generate a decision tree that can be used for prediction, and for this reason, J48 (C4.5) is often referred to as a statistical classifier.
4.3. Leave-One-Subject-Out Cross-Validation
Suppose a dataset with N subjects. In leave-one-subject-out (LOSO) cross-validation, the subjects were first partitioned into N groups, and the samples were then partitioned by group into N sub-samples. For each experiment, a single sub-sample was retained for testing the model, and the remaining N − 1 sub-samples were used as the training data, so that no subject’s sensor data appeared in both training and testing. The CV process was repeated N times, with each of the N sub-samples used exactly once as the validation data.
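The LOSO partitioning can be sketched as follows, assuming (for illustration) that samples are stored as (subject_id, features, target) tuples:

```python
def loso_splits(samples):
    """Leave-one-subject-out splits. `samples` is a list of
    (subject_id, features, target) tuples; yields one fold per subject,
    so each subject's data is used exactly once as the test set."""
    subjects = sorted({s[0] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]  # N-1 subjects
        test = [s for s in samples if s[0] == held_out]   # the held-out subject
        yield held_out, train, test
```

With N subjects this produces exactly N train/test folds, and a subject’s data never leaks between training and testing.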
4.4. Evaluation Measures
To have a fair comparison, we used two kinds of regression measures, namely, mean absolute error (MAE) and root mean squared error (RMSE), to evaluate the performance of different methods on the test data in the experiments.
- (1)
Mean absolute error (MAE):
The MAE is the absolute value of the difference between the predictions and the targets (L1 norm). It is a linear score, which means that all the individual differences are weighted equally in the average.
- (2)
Root mean squared error (RMSE):
The RMSE is a quadratic scoring rule that also measures the average magnitude of the error (L2 norm). It is the square root of the average of the squared differences between the predictions and the actual observations. The RMSE is more sensitive to outliers, while the MAE is more stable.
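Both measures are straightforward to compute; the snippet below also illustrates the outlier sensitivity noted above:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error (L1): average of |prediction - target|."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error (L2): sqrt of the average squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

# A single large error moves the RMSE more than the MAE:
y_true = [20.0, 22.0, 24.0, 26.0]
y_pred = [20.0, 22.0, 24.0, 30.0]  # one 4-unit outlier
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0
```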
4.5. Results
In this section, we evaluate the performance of the proposed prediction model on two datasets and compare it with three other feature-based methods. Additionally, we verify the effectiveness of the MEn-based filtering strategy, and we discuss what types of ADLs might be suitable for the prediction of BMI.
4.5.1. Experiments without Data Filtering
To demonstrate the performance of our proposed hybrid deep neural network (CNN-LSTM), we used LOSO cross-validation on MobiAct and Motion-Sense.
Firstly, to give a reliable performance comparison, we trained the model for a relative optimal network structure and hyper-parameters. Since CNN-LSTM is deep in both the spatial domain and the time domain, there were different structures that may have an impact on the performance of the model. For example, considering only the number of network layers, the number of convolutional layers may affect the ability of the model to learn the spatial structures. The number of recursive hidden layers may affect the intensity of the model’s learning of temporal relationships. The number of fully connected hidden layers may change the learning feature transformation. The optimization of hyper-parameters also improved the overall performance and generalization capacity of the model. Tuning hyper-parameters for the DL model, however, was a time consuming and challenging task because there were numerous parameters to be configured. In this work, we used the grid search method to determine the optimal hyper-parameters of the proposed model. Six common hyper-parameters, namely the optimizer, the learning rate, the number of epochs, the batch size, the dropout rate, and the regularizer, were optimized. To implement a grid search more efficiently, the search spaces of hyper-parameters were manually selected in the initial stage after experiments on the datasets. The optimized parameters, their search spaces, and their determined optimal values are presented in
Table 2.
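The exhaustive search over the six hyper-parameters can be sketched as follows. The search spaces shown are illustrative placeholders, not the values of Table 2, and `evaluate` stands in for training and validating the model on one configuration:

```python
from itertools import product

# Hypothetical search spaces for the six tuned hyper-parameters.
search_space = {
    "optimizer": ["adam", "rmsprop"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "epochs": [50, 100],
    "batch_size": [64, 128],
    "dropout": [0.25, 0.5],
    "l2_regularizer": [1e-3, 1e-4],
}

def grid_search(evaluate):
    """Evaluate every combination in the grid and return the configuration
    with the lowest validation error (e.g., LOSO validation MAE)."""
    keys = list(search_space)
    best_cfg, best_err = None, float("inf")
    for values in product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        err = evaluate(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err
```

Because the grid grows multiplicatively with each parameter, pruning the search spaces manually in an initial stage, as done here, keeps the search tractable.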
Secondly, the results presented in
Table 3 and
Table 4 show that our proposed hybrid deep neural network (CNN-LSTM) with an optimal structure led to significant performance improvements compared with conventional feature-based models in all cases.
Finally, both
Table 3 and
Table 4 show that jogging was the most suitable activity for BMI prediction, based on a comparison of four ADLs. In practice, an acceptable BMI prediction value could also be obtained while walking and walking upstairs.
4.5.2. Experiments with Data Filtering
To verify the effectiveness of our proposed MEn-based filtering strategy, we performed comparisons among three different data filtering settings, varying the threshold parameter and the tolerance parameter, as mentioned in
Section 3.1.4. The experimental results showed that, with properly set parameters, the entropy filtering strategy improved the performance of the CNN-LSTM model.
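Since the exact definition of MEn belongs to Section 3.1.4 and is not reproduced here, the following is only a hypothetical sketch under the assumption that MEn behaves like a sample-entropy-style regularity measure with template length m and tolerance r; the filtering direction (keeping lower-entropy, more regular segments) is likewise an assumption:

```python
import math

def motion_entropy(signal, m=2, r=0.05):
    """Hypothetical sample-entropy-style regularity measure; the paper's
    actual MEn definition may differ. m: template length; r: tolerance
    expressed as a fraction of the signal's standard deviation."""
    n = len(signal)
    mean = sum(signal) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in signal) / n)
    tol = r * sd if sd > 0 else r

    def count(mm):
        # Count template pairs of length mm within tolerance (Chebyshev dist.)
        c = 0
        for i in range(n - mm):
            for j in range(i + 1, n - mm):
                if max(abs(signal[i + k] - signal[j + k]) for k in range(mm)) <= tol:
                    c += 1
        return c

    b, a = count(m), count(m + 1)
    return -math.log(a / b) if a > 0 and b > 0 else float("inf")

def select_subsequences(subseqs, threshold):
    """Keep only sub-sequences whose entropy falls below the threshold,
    i.e., discard overly irregular segments before training."""
    return [s for s in subseqs if motion_entropy(s) < threshold]
```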
Table 5 shows that the MAEs of the BMI were 2.461 ± 1.000 in the jogging state in the MobiAct dataset and 3.137 ± 1.300 in the jogging state in the Motion-Sense dataset. This indicated that the error of our proposed model was low, considering the wide range of BMI from 17 to 35, visually shown in
Figure 2.
As shown in
Figure 3, both datasets were quite unbalanced, as subjects in the underweight and obesity categories were scarce.
Table 6 shows the total number of sub-sequences (samples) in the jogging state selected by data filtering. Taking Motion-Sense as an example, the normal weight category with the threshold parameter
= 0.1 contained the highest number of sub-sequences (samples), which was about 57.6% of the whole set. The pre-obesity category was the second largest, which contained 431 sub-sequences (samples). These two categories possessed about 92.5% of the whole set.
The MAE was defined as the average of the absolute errors between the predicted BMIs and the ground-truth BMIs.
Figure 4 shows the MAEs of our proposed CNN-LSTM model in the four BMI categories: underweight, normal weight, pre-obesity, and obesity, with threshold parameter
= 0.1 and tolerance parameter
= 0.05 in the jogging state. As a result, we obtained a higher accuracy, 94.8% ± 1.5%, in predicting BMI according to the recordings of motion sensors in the normal and pre-obesity BMI categories. The lower prediction accuracies in the underweight and obesity categories were probably because of the limited sub-sequences in both training and testing, as shown in
Table 6.
5. Conclusions and Discussion
In this paper, we proposed a hybrid deep neural network with a novel MEn-based filtering strategy for predicting the BMI of smartphone users using only built-in motion sensors. Through extensive experiments on two public datasets, we also showed that the accuracy of our proposed prediction model is highest in the jogging state and that an acceptable BMI prediction can also be obtained while walking and walking upstairs. We believe that the conclusions of this study will help in developing a long-term remote BMI monitoring system. Long-term remote monitoring here refers to collecting a small amount of motion sensor data every day over many years to trace a person’s physical condition, which differs from the short-term real-time recognition of human activity.
Despite the progress made in this work, sensor-based BMI prediction remains challenging. The performance of deep learning models still depends heavily on labeled samples, and obtaining enough activity labels is expensive and time consuming. Therefore, to predict the BMI value corresponding to sensor data without activity labels, unsupervised HAR is urgently needed. In subsequent research, we will try to use transfer learning to perform data annotation by leveraging labeled data from other auxiliary domains.