Article

Multi-Sensor Fusion Approach to Drinking Activity Identification for Improving Fluid Intake Monitoring

1 Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei City 112, Taiwan
2 Bachelor’s Program in Medical Informatics and Innovative Applications, Fu Jen Catholic University, New Taipei City 242, Taiwan
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(11), 4480; https://doi.org/10.3390/app14114480
Submission received: 20 March 2024 / Revised: 18 May 2024 / Accepted: 21 May 2024 / Published: 24 May 2024
(This article belongs to the Special Issue Intelligent Electronic Monitoring Systems and Their Application)

Abstract

People nowadays often overlook the importance of proper hydration. Water is indispensable to the human body's functions, including maintaining normal temperature, eliminating waste, and preventing kidney damage. When fluid intake is lower than fluid loss, the body struggles to metabolize waste. Furthermore, insufficient fluid intake can also cause headaches, dizziness, and fatigue. Fluid intake monitoring therefore plays an important role in preventing dehydration. In this study, we propose a multimodal approach to drinking activity identification to improve fluid intake monitoring. The movement signals of the wrist and container, as well as the acoustic signals of swallowing, are acquired. After pre-processing and feature extraction, typical machine learning algorithms are used to determine whether each sliding window corresponds to a drinking activity. The recognition performance of the single-modal and multimodal methods is then compared through event-based and sample-based evaluation. In the sample-based evaluation, the proposed multi-sensor fusion approach performs best with the support vector machine and extreme gradient boosting classifiers, achieving F1-scores of 83.7% and 83.9%, respectively. Similarly, in the event-based evaluation, the proposed method achieves its best F1-score of 96.5% with the support vector machine. The results demonstrate that the multimodal approach outperforms the single-modal approaches in drinking activity identification.

1. Introduction

Water plays a key role in many of the human body's functions, including maintaining normal temperature, eliminating waste, and preventing kidney damage. Typically, an adult male body contains approximately 59% water, while a female body contains roughly 50% water. Even when sedentary, a healthy adult loses 1.05 to 3.1 L of water every day through breathing, urine, feces, or skin diffusion [1]. Therefore, people need to stay hydrated by eating food and drinking fluids. According to the National Academy of Medicine, adequate water intake is at least 3.7 L/day for men and 2.7 L/day for women [1]. However, the U.S. National Health and Nutrition Examination Survey shows that, from 2011 to 2016, around 60% of people did not consume the recommended amount of water [2]. In addition, the Youth Risk Behavior Survey found that in 2017, 48.7% of senior high school students drank water less than three times a day, and 24.6% drank water less than once a day [3]. When fluid intake is lower than fluid loss, the body struggles to metabolize waste. The accumulation of waste in the body increases the burden on the kidneys and can even lead to chronic kidney disease [4]. Furthermore, insufficient fluid intake can also cause headaches, dizziness, and fatigue [5]. Hence, fluid intake monitoring can help people understand their fluid consumption and remind them to hydrate appropriately [6].
The traditional methods of monitoring and assessing water intake are manually filling out questionnaires or using mobile applications to record the number of drinks and the amount of water consumed each day [7]. Nevertheless, these manual methods have disadvantages: they are time-consuming, prone to personal error, and subjective [8,9]. With the rapid development of technology, more and more researchers have turned to intelligent identification of fluid intake activities. Scholars use different approaches to automatically identify complete episodes of drinking activity. Common sensing devices can be divided into two categories according to their characteristics: ambient-based and wearable-based systems. Ambient-based systems monitor personal drinking behavior through environmental sensing devices, such as cameras [10,11] and smart containers with built-in sensors or wireless communication modules [12,13,14,15,16]. Wearable-based systems extract characteristics of drinking behavior from sensors placed on the body. Typical wearable devices include wrist-worn inertial measurement units [17] and microphones placed near the neck [18,19].
The ambient-based systems mentioned above are easy to use and convenient. Nonetheless, when a camera is used to capture drinking behavior, the user must perform the activity within a confined space. This not only restricts the scope of monitoring but may also interfere with the privacy of users or other people. Although wireless communication modules can solve privacy issues, they still suffer from limited monitoring space and radiofrequency shielding. Smart containers with built-in sensors are not restricted by the environment, but people may forget to use them because they are not accustomed to carrying such devices. Compared with ambient-based devices, wearable-based devices are not affected by the environment. In addition, with the maturity of microelectromechanical systems (MEMS) technology, sensors have gradually been miniaturized and integrated into devices such as smartwatches, mobile phones, and earphones. Wearable-based approaches have the advantages of being small, lightweight, and mobility-friendly. Therefore, many previous studies [20,21,22,23,24,25,26] have demonstrated the feasibility and effectiveness of wearable devices for fluid intake assessment and monitoring.
Gomes et al. [21] recorded motion signals and detected drinking activities with a wrist-worn inertial measurement unit (IMU). Considering the computational cost of future applications, the authors used a random forest classifier to identify the movement of the container toward the mouth. The classifier achieved 97.4% precision, 97.1% recall and 97.2% F1-score on a data set containing 312 drinking actions and 216 other daily actions. Additionally, Rahman et al. [27] used a microphone worn near the throat to recognize drinking events. A linear discriminant classifier was used to determine whether each segmented frame was a fluid intake behavior, and a recall of 72.09% was achieved. These wearable-based methods can still misjudge certain situations. For wrist-worn inertial sensors, the activity design of previous experiments usually contains only simple drinking gestures and common daily activities [26]; it does not include events analogous to drinking, such as eating, combing hair, or pushing up glasses. For wearable microphones, the signals of swallowing water and swallowing saliva are very similar in the frequency domain. Overall, in the task of fluid intake identification, single-modal signals may cause recognition errors for such analogous activities.
To make fluid intake monitoring practical, better recognition performance must be achieved in realistic situations. The purpose of this study is to develop a drinking activity identification approach that uses multimodal signals for fluid intake monitoring. This method identifies drinking events through the signals of wrist-worn IMUs, a smart container with a built-in sensor, and an in-ear microphone. In addition, the experimental design involves non-drinking behaviors that can easily lead to misjudgment, as well as various drinking postures, to better correspond to real-world conditions. Finally, the performance of single-modal and multimodal signals in the drinking activity recognition task is compared.
The main contributions of this study are listed as follows:
  • Experimental settings: In order to make the identification approach closer to the real situation, this study introduces non-drinking behaviors that are easily confused with drinking activities, such as eating, pushing glasses, scratching necks, etc.
  • Approach of drinking activity identification: When the movement or acoustic signals of drinking and non-drinking activities are similar, the abundant information provided by multimodal signals can effectively enhance activity recognition performance. Hence, this study develops a multi-sensor fusion approach that considers the signal characteristics of drinking movements and swallowing sounds to improve the identification performance of drinking activities.

2. Materials and Methods

This study aims to propose a fluid intake monitoring approach to identify drinking and non-drinking events. The functional diagram of the proposed multimodal method is shown in Figure 1. It consists of four main stages: data acquisition, signal pre-processing, machine learning-based classification and post-processing. Firstly, the inertial measurement units (IMUs) and an in-ear microphone are used to record motion and acoustic data, respectively. In order to capture the characteristics of the time series, methods such as sliding windows, feature extraction and normalization are used in the pre-processing stage. Then, machine learning models are applied to identify the drinking events. In post-processing, the window-based predicted sequences are transformed back into sample-based sequences. Finally, the output is a time series labeling each sample as drinking or non-drinking.

2.1. Data Acquisition

Twenty participants were recruited into this study, including ten males and ten females (age: 22.91 ± 1.64 years, height: 167.82 ± 7.29 cm, weight: 61.73 ± 16.25 kg). Three Opal sensors (APDM, Portland, OR, USA) are utilized to acquire motion signals. Two of them are worn on the subject's wrists, and the other is attached to the bottom of a 3D-printed container, as shown in Figure 2a–c. Each sensor consists of a triaxial accelerometer, gyroscope, and magnetometer; however, only the accelerometer (range ±16 g) and gyroscope (range ±2000 degrees/s), sampled at 128 Hz, are utilized for drinking activity identification. In addition, a condenser in-ear microphone with a sampling rate of 44.1 kHz is placed in the right ear to acquire the acoustic signal, as shown in Figure 2d.

2.2. Experimental Protocol

In this study, we design eight everyday drinking situations, as shown in Table 1. The main differences between these situations are the posture adopted (standing or sitting), the hand holding the cup (left or right), and the amount of water consumed (small or large sips). Moreover, it is crucial to consider non-drinking activities while developing a fluid intake monitoring system to ensure broad variability. Hence, we design seventeen non-drinking activities, as shown in Table 2. All activities are then assigned to four trials, and drinking and non-drinking activities are interleaved within each trial, as shown in Figure 3.

2.3. Data Pre-Processing

In the data acquisition stage, triaxial acceleration, triaxial angular velocity and acoustic signals are recorded through wearable devices. In order to extract features from signals, we further process the motion and acoustic data in the signal pre-processing stage.

2.3.1. Motion Signals

After the subjects performed the experiment, we acquired motion signals from the triaxial accelerometer and gyroscope. Firstly, to describe the spatial variation of acceleration and angular velocity, the Euclidean norms of acceleration ($a_{norm}$) and angular velocity ($\omega_{norm}$) are defined as in Equations (1) and (2):

$$a_{norm} = \sqrt{a_x^2 + a_y^2 + a_z^2},\qquad (1)$$

$$\omega_{norm} = \sqrt{\omega_x^2 + \omega_y^2 + \omega_z^2},\qquad (2)$$

where $a_x$, $a_y$, and $a_z$ are the accelerations along the x-, y-, and z-axes, and $\omega_x$, $\omega_y$, and $\omega_z$ are the angular velocities about the x-, y-, and z-axes.
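As a minimal illustration, Equations (1) and (2) can be computed directly from the triaxial samples; the sketch below assumes the acceleration and angular-velocity channels are stored as NumPy arrays of shape (N, 3).

```python
import numpy as np

def vector_norms(acc, gyro):
    """Euclidean norms of triaxial acceleration and angular velocity.

    acc, gyro: arrays of shape (N, 3) holding the x, y, z channels.
    Returns two arrays of shape (N,) corresponding to a_norm and w_norm.
    """
    a_norm = np.linalg.norm(acc, axis=1)   # sqrt(ax^2 + ay^2 + az^2)
    w_norm = np.linalg.norm(gyro, axis=1)  # sqrt(wx^2 + wy^2 + wz^2)
    return a_norm, w_norm
```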
A fixed-size sliding window technique has been widely utilized in sensor-based activity studies [28]. Hence, a sliding window of 5 s with 75% overlap is applied to the continuous sensing data. Then, time-domain features are extracted, including the mean, variance, standard deviation, maximum, minimum, range, kurtosis, skewness, and correlation coefficients. In this way, 70 features are extracted for a single sensor, as shown in Table 3, and a total of 210 features (70 features per sensor × 3 sensors) are extracted in the experiment.
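The following sketch illustrates the windowing and time-domain feature extraction for one sensor, assuming a 128 Hz sampling rate (a 5 s window is 640 samples and 75% overlap is a 160-sample step). The feature ordering here is grouped per channel rather than per statistic as in Table 3, and is only illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew

FS = 128            # IMU sampling rate (Hz)
WIN = 5 * FS        # 5 s window -> 640 samples
STEP = WIN // 4     # 75% overlap -> 160-sample step

def motion_features(acc, gyro, a_norm, w_norm):
    """70 time-domain features for one window of one sensor (cf. Table 3)."""
    channels = np.column_stack([acc, a_norm, gyro, w_norm])  # (WIN, 8)
    feats = []
    for ch in channels.T:
        feats += [ch.mean(), ch.var(), ch.std(), ch.max(), ch.min(),
                  ch.max() - ch.min(), kurtosis(ch), skew(ch)]
    # correlation coefficients within the accelerometer and gyroscope triads
    for sig in (acc, gyro):
        for i, j in [(0, 1), (1, 2), (0, 2)]:
            feats.append(np.corrcoef(sig[:, i], sig[:, j])[0, 1])
    return np.array(feats)                                   # length 70

def windowed_features(acc, gyro, a_norm, w_norm):
    """Slide a 5 s window with 75% overlap over one sensor's recording."""
    return np.array([
        motion_features(acc[s:s + WIN], gyro[s:s + WIN],
                        a_norm[s:s + WIN], w_norm[s:s + WIN])
        for s in range(0, len(a_norm) - WIN + 1, STEP)
    ])
```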
Because the features are on different scales, they can have very different distributions, and this variability can degrade the performance of machine learning classification. Therefore, it is necessary to scale features proportionally, ensuring that all features fall within a specific range. Normalization is a common approach; in this work, min–max normalization is utilized to scale features, as calculated by Equation (3):

$$n_{ij} = \frac{x_{ij} - x_{min}^{i}}{x_{max}^{i} - x_{min}^{i}},\qquad (3)$$

where $\{ x_{ij} \mid i \in [1, 210],\ j \in [1, \text{number of sliding windows}] \}$ is the feature matrix for the motion data, $x_{max}^{i}$ is the maximum of row $i$, and $x_{min}^{i}$ is the minimum of row $i$.
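A minimal sketch of Equation (3), applied row-wise over the motion feature matrix:

```python
import numpy as np

def min_max_normalize(features):
    """Row-wise min-max scaling of a (210, n_windows) feature matrix
    (Equation (3)); each feature is mapped into [0, 1]."""
    f_min = features.min(axis=1, keepdims=True)
    f_max = features.max(axis=1, keepdims=True)
    return (features - f_min) / (f_max - f_min + 1e-12)  # guard against constant rows
```

In a leave-one-subject-out setting, the minima and maxima would typically be computed from the training folds only to avoid information leakage; the small epsilon simply guards against constant features.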

2.3.2. Acoustic Signals

The acoustic signal is collected by a microphone placed in the right ear. Firstly, we down-sample the acoustic signal from 44.1 kHz to 16 kHz [29]. Then, a spectrogram-based denoising method is utilized [30]. This method takes the original signal and a segment of pure noise as input; after converting both into spectrograms, the noise segment is used to compute a threshold mask, and similar noise in the original acoustic signal is removed with this mask. Because the microphone may pick up continuous environmental noise as well as the subject's breathing or heartbeat, a segment recorded while the subject is motionless is defined as noise. The denoising effect is shown in Figure 4.
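Reference [30] corresponds to the spectral-gating approach implemented in the open-source noisereduce package by the same authors; whether this exact implementation was used is not stated, so the sketch below is an assumption, with illustrative parameter choices.

```python
import numpy as np
from scipy.signal import resample_poly
import noisereduce as nr  # spectral-gating denoiser related to Sainburg et al. [30]

def preprocess_audio(audio_44k, noise_clip_44k):
    """Down-sample from 44.1 kHz to 16 kHz, then remove noise using a
    clip recorded while the subject was motionless as the noise profile."""
    audio = resample_poly(audio_44k, up=160, down=441)  # 44.1 kHz -> 16 kHz
    noise = resample_poly(noise_clip_44k, up=160, down=441)
    return nr.reduce_noise(y=audio, sr=16000, y_noise=noise)
```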
The pre-processing of acoustic signals is similar to that of motion signals. A sliding window of 5 s with 75% overlap is utilized. To avoid scale variance between different subjects and different activities, z-score standardization is used, as calculated by Equation (4):

$$z = \frac{x - \mu}{\sigma},\qquad (4)$$

where $x$ is the time-domain acoustic signal within the sliding window, $\mu$ is the mean of the signal amplitude, and $\sigma$ is the standard deviation of the signal amplitude. In addition, a fast Fourier transform (FFT) is applied to obtain the power spectrum, and a Hamming window [31,32] with zero padding is applied to each window to improve computational efficiency. Four types of frequency-domain features are extracted from the segmented data: total spectrum power [33], four sub-band powers [32,33], brightness [32,33], and spectral roll-off [32]. In addition, Mel-frequency cepstral coefficients, which capture the human ear's sensitivity to different frequencies, are commonly used in acoustic recognition; hence, we also take 17 Mel-frequency cepstral coefficients as features. Finally, we obtain 24 acoustic features, as shown in Table 4, and min–max normalization is again utilized to scale them.
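A sketch of the acoustic feature extraction for one 5 s window is given below. The sub-band boundaries, the spectral-centroid definition of brightness, the 85% roll-off fraction, the omission of zero padding, and averaging the MFCC frames over the window are all assumptions, since the paper does not report these details; the MFCCs are computed with librosa.

```python
import numpy as np
import librosa

SR = 16000  # sampling rate after down-sampling (Hz)

def acoustic_features(segment, n_subbands=4, rolloff=0.85):
    """24 frequency-domain features per window (cf. Table 4)."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-12)   # z-score (Eq. (4))
    spec = np.abs(np.fft.rfft(seg * np.hamming(len(seg)))) ** 2  # Hamming-windowed power spectrum
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / SR)

    total_power = spec.sum()
    subband_powers = [band.sum() for band in np.array_split(spec, n_subbands)]
    brightness = (freqs * spec).sum() / (total_power + 1e-12)    # spectral centroid as "brightness"
    cumulative = np.cumsum(spec)
    rolloff_freq = freqs[np.searchsorted(cumulative, rolloff * total_power)]
    mfcc = librosa.feature.mfcc(y=seg.astype(float), sr=SR, n_mfcc=17).mean(axis=1)

    return np.concatenate([[total_power], subband_powers,
                           [brightness, rolloff_freq], mfcc])    # length 24
```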
Through the pre-processing stage, we acquire a motion feature matrix of size 210 × n and an acoustic feature matrix of size 24 × n from each trial, where n is the number of sliding windows. Concatenating them yields a 234 × n feature matrix consisting of motion and acoustic features.

2.4. Machine Learning Classification Models

Four classic machine learning models are utilized to identify drinking events. Previous works have shown reliable performance based on these models. In this section, the machine learning algorithms are described briefly.

2.4.1. Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm. SVM performs well not only in processing high-dimensional data but also on small sample data. Its main goal is to find an optimal hyperplane and effectively use this plane to separate different data types. In addition, SVM can also use different kernel functions to cope with linear and nonlinear problems, including linear, polynomial, radial basis functions, and sigmoid functions. According to the training data distribution, the polynomial kernel function is applied to the model in this study.

2.4.2. k-Nearest Neighbors (k-NN)

k-Nearest Neighbors (k-NN) is also a supervised machine learning algorithm. Its principle is to assign test data to the category of its nearest neighbors based on the distance between data points: the nearer the distance, the higher the probability that the samples belong to the same class. An important parameter of this model is the k value, which determines the number of nearest neighbors used for prediction and directly affects the prediction results. Specifically, a smaller k value focuses on local details, while a larger k value captures the overall structure. Therefore, choosing an appropriate k value is a critical step in k-NN; it requires techniques such as cross-validation to evaluate the model's performance under different k values and to choose a value that performs well on training and test data. This work explores k values from 1 to 9 during the training stage, and the best performance appears when k = 5.
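A sketch of this k search using scikit-learn is shown below; the use of subject-wise (leave-one-group-out) splitting and F1 scoring is an assumption consistent with the evaluation protocol in Section 2.6, not a detail reported for the grid search itself.

```python
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, subject_ids):
    """Search k = 1..9 with subject-wise cross-validation and F1 scoring;
    the paper reports k = 5 as the best-performing value."""
    grid = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": range(1, 10)},
        scoring="f1",
        cv=LeaveOneGroupOut(),
    )
    grid.fit(X, y, groups=subject_ids)
    return grid.best_params_["n_neighbors"]
```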

2.4.3. Random Forest (RF)

Random forest (RF) is an ensemble learning algorithm built from multiple decision trees (DTs). It uses bootstrapping to draw random samples from the original training data; the samples not selected for a given tree are termed out-of-bag (OOB) samples. Each DT is then trained on its bootstrap sample using a randomly selected subset of features, which ensures the independence and diversity of the DTs, while the OOB samples can be used to estimate the generalization error. Finally, each DT makes a prediction, and the results are combined by voting. RF can handle high-dimensional data and large volumes of training samples, demonstrating good scalability. In this work, we obtained the best performance with 50 decision trees.

2.4.4. Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is a supervised learning algorithm based on the Gradient Boosting Decision Tree (GBDT) with algorithmic and efficiency improvements. XGBoost constructs an ensemble of trees sequentially: each tree is built by the gradient boosting algorithm, which optimizes the model by minimizing a loss function, and each iteration adds a new tree that refines the predictions of the previous ones. In this study, the best performance was obtained with 150 trees.
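The four classifiers, configured with the only hyperparameters reported in Section 2.4 (a polynomial SVM kernel, k = 5, 50 trees, and 150 boosting rounds), might be instantiated as follows with scikit-learn and the xgboost package; all unreported settings are assumed to be library defaults.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Settings stated in Section 2.4; everything else is left at library defaults,
# which the paper does not specify.
CLASSIFIERS = {
    "SVM":     SVC(kernel="poly"),
    "k-NN":    KNeighborsClassifier(n_neighbors=5),
    "RF":      RandomForestClassifier(n_estimators=50),
    "XGBoost": XGBClassifier(n_estimators=150),
}
```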

2.5. Data Post-Processing

The unidentified event data are processed into feature matrices following the method described above and are then classified as drinking or non-drinking events by the trained drinking identification classifier. In post-processing, it is necessary to transform window-based sequences back into sample-based sequences to facilitate performance evaluation and align with practical applications. In this work, we propose a method based on the sliding and overlapping properties of the sliding window technique. Since the windows overlap by 75%, between one and four windows cover the period corresponding to a particular sliding step. Therefore, a majority voting strategy is employed on each sliding step to determine the event in the sample-based sequence, as shown in Figure 5. When a tie vote occurs, the step is labeled as a non-drinking event.
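A minimal sketch of this window-to-sample transformation, assuming a 640-sample window and 160-sample step (5 s at 128 Hz with 75% overlap) and binary window predictions; the tie-breaking rule follows the text.

```python
import numpy as np

def windows_to_samples(win_preds, n_samples, win_len=640, step=160):
    """Convert window-level predictions (1 = drinking, 0 = non-drinking)
    back to a sample-level sequence by majority voting over the windows
    covering each sample; ties are resolved as non-drinking."""
    drink_votes = np.zeros(n_samples, dtype=int)
    total_votes = np.zeros(n_samples, dtype=int)
    for i, pred in enumerate(win_preds):
        start = i * step
        end = min(start + win_len, n_samples)
        total_votes[start:end] += 1
        drink_votes[start:end] += int(pred)
    # strict majority required for a drinking label (tie -> non-drinking)
    return (2 * drink_votes > total_votes).astype(int)
```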

2.6. Performance Evaluation Criteria

Two evaluation approaches are applied to validate the performance of the drinking identification. The first is the traditional sample-based (frame-based) evaluation, which assesses performance by computing a confusion matrix and deriving recall, precision, and F1-score, as calculated by Equations (5)–(7):

$$\text{Recall} = \frac{TP}{TP + FN},\qquad (5)$$

$$\text{Precision} = \frac{TP}{TP + FP},\qquad (6)$$

$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}},\qquad (7)$$

where TP, FP, TN, and FN denote correctly identified drinking events (true positives), incorrectly identified drinking events (false positives), correctly identified non-drinking events (true negatives), and incorrectly identified non-drinking events (false negatives), respectively.
The other approach is an event-based performance evaluation. It is used to assess the performance of activity recognition [16,34]. This method evaluates performance according to different error types, as shown in Figure 6.
A ground truth event can be scored as correct (C/C′), deletions (D), fragmented (F), fragmented and merged (FM), or merged (M). Similarly, a predicted event can be scored as merging (M′), fragmenting and merging (FM′), fragmenting (F′), or an insertion (I′). In the event-based performance evaluation, the most widely used metrics are again recall, precision and F1-score, as calculated by Equations (8)–(10):

$$\text{Recall} = \frac{C}{C + F + M + FM + D},\qquad (8)$$

$$\text{Precision} = \frac{C}{C + F + M + FM + I'},\qquad (9)$$

$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}.\qquad (10)$$
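Given the per-trial counts of each event score, the event-based metrics reduce directly to Equations (8)–(10); assigning each event to one of these categories (following Ward et al. [34]) is assumed to have been done beforehand and is not shown here.

```python
def event_based_scores(C, F, M, FM, D, I_prime):
    """Event-based recall, precision, and F1-score (Equations (8)-(10))
    computed from the counts of each matched event error type."""
    recall = C / (C + F + M + FM + D)
    precision = C / (C + F + M + FM + I_prime)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```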
In this study, a Leave-One-Subject-Out (LOSO) cross-validation approach is utilized for both frame-based and event-based evaluation methods to validate performance.
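A sketch of the LOSO protocol with scikit-learn's LeaveOneGroupOut, where the group labels are participant IDs; the polynomial-kernel SVM shown as the default model is only an example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def loso_f1(X, y, subject_ids, model=None):
    """Leave-one-subject-out cross-validation: each fold holds out all
    windows belonging to one participant and reports the F1-score."""
    model = model if model is not None else SVC(kernel="poly")
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores), np.std(scores)
```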

3. Results

This section describes the experimental signals and the performance analysis of the drinking activity identification.

3.1. Experimental Signal

Figure 7, Figure 8, Figure 9 and Figure 10 depict multimodal signals collected in each trial. The figures respectively show the acceleration and angular velocity signals collected from the sensors of both wrists and the smart container, the time-domain acoustic signals from an in-ear microphone, and the spectrogram converted from the time-domain of acoustic signals.
Participants are asked to return to the initial posture before and after each motion in each trial; therefore, there is a period of stillness between consecutive motions. In Figure 7, Figure 8, Figure 9 and Figure 10, complete drinking activities and non-drinking activities are marked with blue and green dashed outlines, respectively. The action numbers are marked at the top of the figures.
Figure 11 and Figure 12 depict the multimodal information diagrams of taking a small and large sip with the right hand while sitting (D05 and D06) in the third trial. Since the motion does not utilize the left hand, the wrist motion signal diagram only presents the signals collected from the inertial sensor on the right wrist. The blue background area in the figures shows the intervals labeled as drinking in this study, which presents the period from touching the cup to the lips to swallowing. Other segments of non-drinking activities and periods of stillness are labeled as non-drinking in the classification algorithm. There is an obvious peak in the spectrogram of Figure 11, and four similar peaks can also be observed in the spectrogram of Figure 12. These peaks are the fragments of swallowing water.
The acoustic signal diagram of the participant’s natural swallowing behavior captured in trials is highly similar to the signal of swallowing water, as shown in Figure 13.

3.2. Drinking Activity Identification Using Multimodal Signal

The sample-based evaluation results are shown in Table 5, with the best value in each column marked in bold. All classifiers demonstrate better performance when classifying multimodal signals than single-modal signals. Among the single-modal results, motion signals perform better than acoustic signals. Among the multimodal results, the support vector machine (SVM) and extreme gradient boosting (XGBoost) show the best performance, with average F1-scores of 83.7% and 83.9%, respectively; these are 3.1 and 2.7 percentage points higher than motion signals alone, and 47.5 and 46.2 percentage points higher than acoustic signals alone. The standard deviations of SVM and XGBoost are only 3.8% and 3.4%, respectively, which are 0.3 and 1.3 percentage points lower than motion signals alone and 10.2 and 12.2 percentage points lower than acoustic signals alone.
Table 6 shows the results of the event-based method, which better indicates whether the classifier correctly identifies drinking activities and allows further evaluation of misclassified events. Similar to the sample-based results, the event-based evaluation also shows better performance when classifying multimodal signals. Among them, the support vector machine is superior, with precision, recall, and F1-score of 95.0%, 98.1%, and 96.5%, respectively. The lowest F1-score among the other classifiers is 94.7%.
From the above, it can be inferred that using multimodal signals for drinking activity recognition performs better than using motion or acoustic signals alone. To further investigate the necessity of the sensor at each position, the motion signals are divided into those from both wrists (W) and from the cup (C) according to the placement of the inertial measurement units. Table 7 and Table 8 present the results of drinking activity recognition using a support vector machine with different combinations of the inertial signals from both wrists and the cup and the acoustic signal from the microphone. The combination of signals from both wrists and the cup (WC) corresponds to the motion signals (M) in Table 5 and Table 6, whereas the combination of both wrists, the cup, and the acoustic signal (WCA) corresponds to the multimodal signals (MA). The sample-based results are shown in Table 7: WCA obtains the highest F1-score of 83.7% and the lowest standard deviation, while WA achieves the second-highest F1-score of 82.9%. The event-based results are shown in Table 8, which also demonstrates that WCA performs best, with WA second.

4. Discussion

In this study, we applied two evaluation methods to assess drinking activity recognition; the event-based method shows significantly better outcomes. This difference may stem from the fact that the sample-based approach is more susceptible to discrepancies between the predicted and actual event start points. In the trial shown in Figure 14, all drinking activities are identified successfully, and an F1-score of 100% is achieved with the event-based method. However, in the sample-based approach, a small number of misclassifications at event boundaries affects the confusion matrix and lowers the evaluation metrics. In comparison, the event-based method gives a more accurate picture of drinking activity recognition performance.
Based on the results of this study, the performance of acoustic signals alone is much lower than that of motion signals alone. This may be because swallowing is one of the most prominent acoustic cues for identifying drinking activities, yet the characteristics of swallowing differ across individuals. Additionally, other activities designed in this study, such as speaking and eating, increase the opportunities for participants to swallow, and these non-drinking behaviors share similar acoustic features with swallowing during drinking. Hence, when classifiers are trained on acoustic signals alone, these similar features may cause the classifiers to confuse the two.
In Figure 15, the ground truth and predicted drinking activity segments in a trial are marked with different colors. In the recognition results based on acoustic signals alone, a segment mistakenly identified as a drinking activity is outlined in red. This segment actually corresponds to the swallowing at the end of eating, and the sole use of acoustic signals leads to misclassification. Since there is no action of raising the cup and pouring water into the mouth, this swallowing during a non-drinking behavior is not misclassified as drinking when multimodal signals are used. Furthermore, when acoustic signals alone are used for drinking activity recognition, some drinking activities in the window sequence appear discontinuous: when these window sequences are converted into sample sequences, only one or two of the four overlapping windows are labeled as drinking, so these drinking samples are removed during the voting process, which affects the recognition results. These factors may explain why acoustic signals alone perform poorly in drinking activity recognition.
The event-based method aims to statistically compute the number of labels for each event category in the sample, including drinking and non-drinking activities, and then calculate evaluation metrics based on it. Take Figure 14 as an example; after statistical analysis, the trial will obtain four labels of type C, while the quantities of other labels are all zero. Since each participant in this study needs to complete sixteen drinking activities, the completely correct result should have 16 labels for type C, with the quantities of other labels all being zero. When utilizing a support vector machine for drinking activity recognition based on multimodal signals, the event-based method recognizes an average of 15.7 labels of type C per participant, with a standard deviation of 0.56. The average quantities and standard deviations (mean ± deviation) of other label categories are 0.25 ± 0.54 for F, 0.5 ± 1.07 for F′, 0.05 ± 0.22 for D, and 0.4 ± 0.66 for I′. The quantities of all other labels are zero. After calculation, a high F1-score of up to 96.5% can be achieved.
From these results, we can observe that almost all drinking activities are successfully identified, with only 0.05 drinking activities missed on average; in other words, only one out of 320 drinking activities is misclassified as a non-drinking activity. The remaining cases that are not entirely correctly classified are split into several predicted events. Judging from the quantities of F and F′, in most cases one true event is split into two predicted events. This issue could be further mitigated with additional post-processing.
The multi-sensor fusion approach is only used to identify drinking activity, and this work has several limitations. For example, the proposed method cannot estimate the intake volume or recognize the type of liquid in the container. Previous research [25] has used wrist-worn sensors and containers with built-in sensors to estimate the amount of fluid intake; in the future, we can assess the hydration status of individuals through long-term monitoring of fluid intake volume. Additionally, the acoustic signals are easily influenced by surrounding sounds. Having shown that the multimodal fusion approach improves drinking activity recognition, we will further explore noise removal in future work to make the identification system more complete.

5. Conclusions

Adequate fluid intake is essential for the normal functioning of the human body, yet proper hydration is often neglected in daily life. A system for long-term monitoring of drinking activities can therefore help people improve their fluid intake habits and prevent symptoms related to insufficient water intake. Currently, common methods for automatically recording drinking activities use IMUs worn on the body or built into containers, or acoustic sensors worn near the neck, and identify drinking events from the resulting motion or sound signals. Nevertheless, these single-modal methods still have limitations. Hence, this study proposes a fluid intake identification approach using multimodal signals, which include the motion signals of the wrist-worn devices and the container as well as the acoustic signal from a right-ear microphone. The recognition performance of single-modal and multimodal signals is compared using traditional machine learning algorithms. In both the event-based and sample-based evaluations, the multi-sensor fusion method achieves the best performance: the best sample-based F1-score of 83.9% is achieved with XGBoost, and the best event-based F1-score of 96.5% with SVM. The results demonstrate that using multimodal signals to identify drinking activities is better than using a single motion or acoustic signal, illustrating the feasibility of the proposed approach and its ability to reduce the misjudgments caused by single-modal signals.

Author Contributions

Conceptualization, J.-H.L. and P.-W.Y.; methodology, J.-H.L., P.-W.Y., H.-C.W., C.-Y.L., Y.-C.L., C.-P.L., C.-Y.H. and C.-T.C.; software, J.-H.L. and P.-W.Y.; validation, J.-H.L., P.-W.Y., C.-Y.H. and C.-T.C.; formal analysis, J.-H.L. and P.-W.Y.; data curation, P.-W.Y., C.-Y.H., and C.-T.C.; writing—original draft preparation, J.-H.L., P.-W.Y., H.-C.W. and C.-Y.L.; writing—review and editing, C.-Y.H. and C.-T.C.; supervision, C.-Y.H. and C.-T.C.; project administration, C.-Y.H. and C.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, grant number NSTC 112-2221-E-A49-013-MY2. The APC was funded by NSTC 112-2221-E-A49-013-MY2.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of National Yang Ming Chiao Tung University (protocol code: YM111087E and date of approval: 14 June 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.

Acknowledgments

The authors would like to thank the volunteers who participated in the experiments for their efforts and time.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Standing Committee on the Scientific Evaluation of Dietary Reference Intakes, Panel on Dietary Reference Intakes for Electrolytes, and Water. Dietary Reference Intakes for Water, Potassium, Sodium, Chloride, and Sulfate; National Academies Press: Cambridge, MA, USA, 2005.
  2. Vieux, F.; Maillot, M.; Rehm, C.D.; Barrios, P.; Drewnowski, A. Trends in tap and bottled water consumption among children and adults in the United States: Analyses of NHANES 2011–2016 data. Nutr. J. 2020, 19, 10.
  3. Park, S.; Onufrak, S.; Cradock, A.; Patel, A.; Hecht, C.; Merlo, C.; Blanck, H.M. Correlates of infrequent plain water intake among US high school students: National youth risk behavior survey, 2017. Am. J. Health Promot. 2020, 34, 549–554.
  4. Popkin, B.M.; D’Anci, K.E.; Rosenberg, I.H. Water, hydration, and health. Nutr. Rev. 2010, 68, 439–458.
  5. Shaheen, N.A.; Alqahtani, A.A.; Assiri, H.; Alkhodair, R.; Hussein, M.A. Public knowledge of dehydration and fluid intake practices: Variation by participants’ characteristics. BMC Public Health 2018, 18, 1346.
  6. Chiu, M.-C.; Chang, S.-P.; Chang, Y.-C.; Chu, H.-H.; Chen, C.C.-H.; Hsiao, F.-H.; Ko, J.-C. Playful bottle: A mobile social persuasion system to motivate healthy water intake. In Proceedings of the 11th International Conference on Ubiquitous Computing, Orlando, FL, USA, 30 September–3 October 2009; pp. 185–194.
  7. Welch, J.L.; Astroth, K.S.; Perkins, S.M.; Johnson, C.S.; Connelly, K.; Siek, K.A.; Jones, J.; Scott, L.L. Using a mobile application to self-monitor diet and fluid intake among adults receiving hemodialysis. Res. Nurs. Health 2013, 36, 284–298.
  8. Schoeller, D.A.; Bandini, L.G.; Dietz, W.H. Inaccuracies in self-reported intake identified by comparison with the doubly labelled water method. Can. J. Physiol. Pharmacol. 1990, 68, 941–949.
  9. Vance, V.A.; Woodruff, S.J.; McCargar, L.J.; Husted, J.; Hanning, R.M. Self-reported dietary energy intake of normal weight, overweight and obese adolescents. Public Health Nutr. 2009, 12, 222–227.
  10. Chua, J.-L.; Chang, Y.C.; Jaward, M.H.; Parkkinen, J.; Wong, K.-S. Vision-based hand grasping posture recognition in drinking activity. In Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Kuching, Malaysia, 1–4 December 2014; pp. 185–190.
  11. Kassim, M.F.; Mohd, M.N.H.; Tomari, M.R.M.; Suriani, N.S.; Zakaria, W.N.W.; Sari, S. A non-invasive and non-wearable food intake monitoring system based on depth sensor. Bull. Electr. Eng. Inform. 2020, 9, 2342–2349.
  12. Dong, B.; Gallant, R.; Biswas, S. A self-monitoring water bottle for tracking liquid intake. In Proceedings of the 2014 IEEE Healthcare Innovation Conference (HIC), Cancun, Mexico, 9–11 November 2014; pp. 311–314.
  13. Griffith, H.; Shi, Y.; Biswas, S. A dynamic partitioning algorithm for sip detection using a bottle-attachable IMU sensor. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 1–10.
  14. Jayatilaka, A.; Ranasinghe, D.C. Towards unobtrusive real-time fluid intake monitoring using passive UHF RFID. In Proceedings of the 2016 IEEE International Conference on RFID (RFID), Shunde, China, 21–23 September 2016; pp. 1–4.
  15. Jayatilaka, A.; Ranasinghe, D.C. Real-time fluid intake gesture recognition based on batteryless UHF RFID technology. Pervasive Mob. Comput. 2017, 34, 146–156.
  16. Liu, K.-C.; Hsieh, C.-Y.; Huang, H.-Y.; Chiu, L.-T.; Hsu, S.J.-P.; Chan, C.-T. Drinking event detection and episode identification using 3D-printed smart cup. IEEE Sens. J. 2020, 20, 13743–13751.
  17. Wellnitz, A.; Wolff, J.-P.; Haubelt, C.; Kirste, T. Fluid intake recognition using inertial sensors. In Proceedings of the 6th International Workshop on Sensor-Based Activity Recognition and Interaction, Rostock, Germany, 16–17 September 2019; pp. 1–7.
  18. Jayatilake, D.; Ueno, T.; Teramoto, Y.; Nakai, K.; Hidaka, K.; Ayuzawa, S.; Eguchi, K.; Matsumura, A.; Suzuki, K. Smartphone-based real-time assessment of swallowing ability from the swallowing sound. IEEE J. Transl. Eng. Health Med. 2015, 3, 1–10.
  19. Kalantarian, H.; Alshurafa, N.; Pourhomayoun, M.; Sarin, S.; Le, T.; Sarrafzadeh, M. Spectrogram-based audio classification of nutrition intake. In Proceedings of the 2014 IEEE Healthcare Innovation Conference (HIC), Cancun, Mexico, 9–11 November 2014; pp. 161–164.
  20. Cohen, R.; Fernie, G.; Roshan Fekr, A. Fluid intake monitoring systems for the elderly: A review of the literature. Nutrients 2021, 13, 2092.
  21. Gomes, D.; Sousa, I. Real-time drink trigger detection in free-living conditions using inertial sensors. Sensors 2019, 19, 2145.
  22. Hamatani, T.; Elhamshary, M.; Uchiyama, A.; Higashino, T. FluidMeter: Gauging the human daily fluid intake using smartwatches. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–25.
  23. Ortega Anderez, D.; Lotfi, A.; Pourabdollah, A. A deep learning based wearable system for food and drink intake recognition. J. Ambient Intell. Humaniz. Comput. 2021, 12, 9435–9447.
  24. Weiss, G.M.; Timko, J.L.; Gallagher, C.M.; Yoneda, K.; Schreiber, A.J. Smartwatch-based activity recognition: A machine learning approach. In Proceedings of the 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Las Vegas, NV, USA, 24–27 February 2016; pp. 426–429.
  25. Hsieh, C.-Y.; Huang, H.-Y.; Chan, C.-T.; Chiu, L.-T. An Analysis of Fluid Intake Assessment Approaches for Fluid Intake Monitoring System. Biosensors 2023, 14, 14.
  26. Huang, H.-Y.; Hsieh, C.-Y.; Liu, K.-C.; Hsu, S.J.-P.; Chan, C.-T. Fluid intake monitoring system using a wearable inertial sensor for fluid intake management. Sensors 2020, 20, 6682.
  27. Rahman, T.; Adams, A.T.; Zhang, M.; Cherry, E.; Zhou, B.; Peng, H.; Choudhury, T. BodyBeat: A mobile system for sensing non-speech body sounds. In Proceedings of the MobiSys, Bretton Woods, NH, USA, 16–19 June 2014; pp. 2–592.
  28. Banos, O.; Galvez, J.-M.; Damas, M.; Pomares, H.; Rojas, I. Window size impact in human activity recognition. Sensors 2014, 14, 6474–6499.
  29. Subramani, S.; Rao, M.A.; Giridhar, D.; Hegde, P.S.; Ghosh, P.K. Automatic classification of volumes of water using swallow sounds from cervical auscultation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1185–1189.
  30. Sainburg, T.; Thielk, M.; Gentner, T.Q. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLoS Comput. Biol. 2020, 16, e1008228.
  31. Merck, C.A.; Maher, C.; Mirtchouk, M.; Zheng, M.; Huang, Y.; Kleinberg, S. Multimodality sensing for eating recognition. In Proceedings of the PervasiveHealth, Cancun, Mexico, 16–19 May 2016; pp. 130–137.
  32. Yatani, K.; Truong, K.N. Bodyscope: A wearable acoustic sensor for activity recognition. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, Pittsburgh, PA, USA, 5–8 September 2012; pp. 341–350.
  33. Guo, G.; Li, S.Z. Content-based audio classification and retrieval by support vector machines. IEEE Trans. Neural Netw. 2003, 14, 209–215.
  34. Ward, J.A.; Lukowicz, P.; Gellersen, H.W. Performance metrics for activity recognition. ACM Trans. Intell. Syst. Technol. (TIST) 2011, 2, 1–23.
Figure 1. The functional diagram of the drinking activity identification.
Figure 2. The orientation and position of wearable devices in the experiment: (a) The inertial sensors worn on both wrists; (b) the prototype of the 3D-printed container; (c) the sensor attached to the bottom of the 3D-printed container; (d) the condenser in-ear microphone placed in the right ear.
Figure 3. The procedure for each trial: (a) first trial; (b) second trial; (c) third trial; (d) fourth trial.
Figure 4. Denoising effect: (a) time domain signal diagram before denoising; (b) time domain signal diagram after denoising; (c) spectrogram transform by time domain signal before denoising; (d) spectrogram transform by time domain signal after denoising.
Figure 5. The illustration of transformation sequences.
Figure 6. The demonstration of matched event error types for event-based evaluation: correct (C), deletions (D), fragmented (F), fragmented and merged (FM), merged (M), merging (M′), fragmenting and merging (FM′), fragmenting (F′), and insertions (I′).
Figure 7. Multimodal information diagram of the first trial. The blue frame and the green frame, respectively, indicate the intervals when the participant performs drinking motion and non-drinking motion. The remaining time corresponds to the periods while the participant is motionless. Action numbers of the periods are annotated above the image.
Figure 8. Multimodal information diagram of the second trial. The blue frame and the green frame, respectively, indicate the intervals when the participant performs drinking motion and non-drinking motion. The remaining time corresponds to the periods while the participant is motionless. Action numbers of the periods are annotated above the image.
Figure 9. Multimodal information diagram of the third trial. The blue frame and the green frame, respectively, indicate the intervals when the participant performs drinking motion and non-drinking motion. The remaining time corresponds to the periods while the participant is motionless. Action numbers of the periods are annotated above the image.
Figure 10. Multimodal information diagram of the fourth trial. The blue frame and the green frame, respectively, indicate the intervals when the participant performs drinking motion and non-drinking motion. The remaining time corresponds to the periods while the participant is motionless. Action numbers of the periods are annotated above the image.
Figure 11. Multimodal information diagram of taking a small sip with the right hand while sitting (D05). The blue background area presents the intervals labeled as drinking in the trial, while the remaining white areas are marked as non-drinking.
Figure 12. Multimodal information diagram of taking a large sip with the right hand while sitting (D06). The blue background area presents the intervals labeled as drinking in the trial, while the remaining white areas are marked as non-drinking.
Figure 13. The acoustic frequency diagram of saliva swallowing.
Figure 14. Recognition results of drinking activities with multimodal signals.
Figure 15. Recognition results of drinking activities with multimodal signals and single-modal signals.
Table 1. The list of drinking activities.

No.   Description
D01   Take a small sip with right hand while standing
D02   Take a large sip with right hand while standing
D03   Take a small sip with left hand while standing
D04   Take a large sip with left hand while standing
D05   Take a small sip with right hand while sitting
D06   Take a large sip with right hand while sitting
D07   Take a small sip with left hand while sitting
D08   Take a large sip with left hand while sitting
Table 2. The list of non-drinking activities.

No.   Description
N01   Grasp cup with right hand and speak while standing
N02   Grasp cup with right hand while walking
N03   Pre-sip with right hand and speak while standing (no fluid intake)
N04   Cough while standing
N05   Grasp cup with left hand and speak while standing
N06   Grasp cup with left hand while walking
N07   Pre-sip with left hand and speak while standing (no fluid intake)
N08   Sniff while standing
N09   Type while sitting
N10   Scratch neck while sitting
N11   Comb hair from front to back while sitting
N12   Move cup aside while sitting
N13   Take a note with a pen (using the dominant hand) while sitting
N14   Answer a phone call and speak while sitting
N15   Push glasses while sitting
N16   Eat while sitting (open the cookie and eat it)
N17   Pour water into a cup while sitting
Table 3. The feature vectors for motion data.

Feature Vector          Description
$f_1^M$–$f_8^M$         Mean of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_9^M$–$f_{16}^M$      Variance of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{17}^M$–$f_{24}^M$   Standard deviation of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{25}^M$–$f_{32}^M$   Maximum of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{33}^M$–$f_{40}^M$   Minimum of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{41}^M$–$f_{48}^M$   Range of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{49}^M$–$f_{56}^M$   Kurtosis of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{57}^M$–$f_{64}^M$   Skewness of $a_x$, $a_y$, $a_z$, $a_{norm}$, $\omega_x$, $\omega_y$, $\omega_z$, and $\omega_{norm}$
$f_{65}^M$              Correlation coefficient between $a_x$ and $a_y$
$f_{66}^M$              Correlation coefficient between $a_y$ and $a_z$
$f_{67}^M$              Correlation coefficient between $a_x$ and $a_z$
$f_{68}^M$              Correlation coefficient between $\omega_x$ and $\omega_y$
$f_{69}^M$              Correlation coefficient between $\omega_y$ and $\omega_z$
$f_{70}^M$              Correlation coefficient between $\omega_x$ and $\omega_z$
M—motion signal; $a_{norm}$—Euclidean norm of acceleration; $\omega_{norm}$—Euclidean norm of angular velocity.
Table 4. The feature vectors for acoustic data.

Feature Vector        Description
$f_1^A$               Total spectrum power ($SP$)
$f_2^A$–$f_5^A$       Sub-band powers ($SP_k$, where $k = 0, 1, 2, 3$)
$f_6^A$               Brightness
$f_7^A$               Spectral roll-off
$f_8^A$–$f_{24}^A$    17 Mel-frequency cepstral coefficients
A—acoustic signal.
Table 5. Comparison of the sample-based approach results for drinking activity recognition utilizing different classifiers on multimodal signals and single-modal signals.

Classifier   Modality   Recall (%)    Precision (%)   F1-Score (%)
SVM          M          87.0 ± 8.5    75.8 ± 5.3      80.6 ± 4.1
SVM          A          27.2 ± 12.2   69.9 ± 17.7     36.2 ± 14.0
SVM          MA         93.2 ± 5.1    76.2 ± 5.4      83.7 ± 3.8
k-NN         M          86.4 ± 6.6    73.7 ± 6.0      79.3 ± 4.7
k-NN         A          18.7 ± 9.9    58.9 ± 22.8     27.4 ± 13.2
k-NN         MA         88.5 ± 5.8    74.2 ± 5.3      80.5 ± 3.9
RF           M          81.1 ± 10.6   77.3 ± 6.4      78.6 ± 6.2
RF           A          15.4 ± 8.7    73.2 ± 22.1     24.7 ± 12.6
RF           MA         84.2 ± 11.5   77.3 ± 6.1      80.2 ± 7.5
XGBoost      M          87.5 ± 8.7    76.5 ± 5.7      81.2 ± 4.7
XGBoost      A          28.1 ± 13.8   70.8 ± 17.1     37.7 ± 15.6
XGBoost      MA         92.1 ± 5.4    77.3 ± 5.0      83.9 ± 3.4
M—motion signal from the cup and both wrists; A—acoustic signal; MA—both motion and acoustic signals; bold number—best performance in this column.
Table 6. Comparison of the event-based method results for drinking activity recognition utilizing different classifiers on multimodal signals and single-modal signals.

Classifier   Modality   Recall (%)    Precision (%)   F1-Score (%)
SVM          M          97.2 ± 4.6    91.0 ± 8.5      93.9 ± 6.0
SVM          A          45.6 ± 16.8   79.5 ± 18.5     54.5 ± 15.2
SVM          MA         98.1 ± 3.5    95.0 ± 7.7      96.5 ± 5.5
k-NN         M          98.1 ± 4.5    90.3 ± 9.7      93.9 ± 7.4
k-NN         A          35.0 ± 16.1   65.1 ± 28.4     46.1 ± 15.6
k-NN         MA         97.2 ± 5.0    92.6 ± 10.5     94.7 ± 7.8
RF           M          96.6 ± 8.7    94.2 ± 12.8     95.2 ± 10.9
RF           A          29.7 ± 14.9   85.3 ± 21.2     44.2 ± 16.0
RF           MA         97.2 ± 7.3    94.6 ± 10.9     95.7 ± 9.0
XGBoost      M          96.6 ± 5.8    90.6 ± 10.5     93.4 ± 8.2
XGBoost      A          45.6 ± 18.7   78.1 ± 20.7     53.8 ± 17.2
XGBoost      MA         96.9 ± 5.0    93.6 ± 10.5     95.1 ± 7.9
M—motion signal from the cup and both wrists; A—acoustic signal; MA—both motion and acoustic signals; bold number—best performance in this column.
Table 7. Comparison table of the sample-based approach results for drinking activities recognition using support vector machine with different input signal combinations.

Modality   Recall (%)    Precision (%)   F1-Score (%)
WCA        93.2 ± 5.1    76.2 ± 5.4      83.7 ± 3.8
WC         87.0 ± 8.5    75.8 ± 5.3      80.6 ± 4.1
WA         90.9 ± 7.5    76.6 ± 5.3      82.9 ± 4.6
CA         89.4 ± 9.7    73.9 ± 5.8      80.7 ± 6.3
W          82.4 ± 10.4   76.0 ± 6.6      78.4 ± 4.7
C          75.6 ± 15.0   73.2 ± 8.6      73.8 ± 10.9
A          27.2 ± 12.2   69.9 ± 17.7     36.2 ± 14.0
W—motion signal of both wrists; C—motion signal of the cup; A—acoustic signal; bold number—best performance in this column.
Table 8. Comparison table of the event-based method results for drinking activities recognition using support vector machine with different input signal combinations.

Modality   Recall (%)    Precision (%)   F1-Score (%)
WCA        98.1 ± 3.5    95.0 ± 7.7      96.5 ± 5.5
WC         97.2 ± 4.6    91.0 ± 8.5      93.9 ± 6.0
WA         96.9 ± 5.0    94.0 ± 7.0      95.3 ± 5.5
CA         95.0 ± 10.8   91.7 ± 14.9     93.2 ± 12.9
W          95.0 ± 7.6    90.5 ± 7.7      92.3 ± 5.3
C          94.7 ± 12.2   90.9 ± 14.6     92.5 ± 13.3
A          45.6 ± 16.8   79.5 ± 18.5     54.5 ± 15.2
W—motion signal of both wrists; C—motion signal of the cup; A—acoustic signal; bold number—best performance in this column.

Share and Cite

MDPI and ACS Style

Li, J.-H.; Yu, P.-W.; Wang, H.-C.; Lin, C.-Y.; Lin, Y.-C.; Liu, C.-P.; Hsieh, C.-Y.; Chan, C.-T. Multi-Sensor Fusion Approach to Drinking Activity Identification for Improving Fluid Intake Monitoring. Appl. Sci. 2024, 14, 4480. https://doi.org/10.3390/app14114480
