1. Introduction
With the improvement of human living standards, more and more people realize the importance of exercise to health and engage in regular physical exercise to keep healthy. However, proper exercise is good for health, while excessive exercise may bring harm to the body(e.g., muscular and skeletal injuries [
1], overtraining syndrome [
2], atrial fibrillation [
3] and immune function reduction). Excessive exercise is defined as a relative term, implying that a bout(s) of exercise is fine for some individuals while excessive for other individuals due to the differences in a variety of factors, such as physical fitness and genetics [
4]. Physical fatigue, as a common physiological phenomenon during exercise, is a direct reflection of the degree of exercise. Accurate detection and evaluation of physical fatigue levels can effectively prevent excessive exercise and further reduce physical injury caused by excessive exercise.
Fatigue is a term used to describe a subjective feeling of tiredness or lack of energy. Objectively evaluating fatigue is a challenging task as a person’s subjective feelings cannot be easily assessed by other people [
5]. The rating of perceived exertion (RPE) [
6] is a measure of fatigue that has been widely used for fatigue assessment in sport science, specifically in running research [
7,
8,
9]. As the RPE has advantages, such as being noninvasive, unobtrusive, noninterruptive and easy to use, many previous studies on fatigue assessment have used it as the ground truth. Moreover, previous studies [
10,
11] indicated that the RPE represented feedback from cardiovascular, respiratory and musculoskeletal systems and it provides an overall fatigue state assessment of a subject, while a single biomechanical or physiological parameter usually provides very limited information. So, in this study we collected the RPE of subjects at each experiment stage and used it as the physical fatigue ground truth for model training and testing.
With the development of wearable devices [
12], more and more physiological parameters could be easily collected. Many researchers have been trying to use various physiological parameters collected by wearable devices to assess physical fatigue. One of the physiological signals for physical fatigue detection is an electromyogram (EMG) [
13,
14,
15,
16]. EMGs can reflect the electrical activity of local muscle, which is related to the physical fatigue of the local muscle. Physical fatigue not only reduces the body’s exercise ability but also causes neurological function decline. So, electroencephalograms (EEG) are another physiological index used for physical fatigue assessment [
17,
18]. Unfortunately, both EMGs and EEGs are weak bioelectric signals and are easily disturbed by many kinds of noise. Additionally, their acquisitions require professional operations to paste the acquisition electrodes to relevant body parts, so they are not widely used.
Heart rate variability (HRV), a tiny time variation between adjacent heartbeats, was demonstrated to have a relationship with autonomic nervous activities [
19]. In recent years, machine learning algorithms based on HRV signals have become a research hotspot in various applications, such as noise detection [
20], cuff-less blood pressure measurement [
21], mental fatigue evaluation [
22] and exercise-induced physical fatigue evaluation [
23]. Compared with exercise-induced physical fatigue evaluation, there have been more achievements in mental fatigue evaluation. For HRV-based mental fatigue evaluation, one research team used the kernel principal component method to select important HRV features which have a strong relationship with fatigue states [
24]. Their study results show that selected features can easily distinguish between normal samples and fatigue samples. Another research team concentrated on using neural networks and HRV analysis with a power spectral density algorithm to build a driver fatigue detection model, and an accuracy of 90% was achieved [
25]. Moreover, decision trees, support vector machines and K-nearest neighbor classifiers have been also proposed to quantify mental fatigue [
26].
As a comparison, a few recent efforts have been made towards HRV-based physical fatigue assessment. Ramos et al. [
27] combined EMG features and HRV features to build a binary SVM classifier. By analyzing HRV from blood volume pulse signals, Cosoli et al. [
28] evaluated the performance of two machine learning algorithms in distinguishing between nonfatigue and fatigue conditions and presented a fatigue-related index to quantify the physical fatigue. Guan et al. [
29] proposed a bidirectional long- and short-term memory neural network to classify physical fatigue. The model used HRV features and inertial sensor signals as inputs and achieved 80.55% accuracy. Nevertheless, most of these models either used multiple kinds of signals or coarsely classified physical fatigue into two levels. Therefore, the purpose of this study was to investigate whether the machine learning method combined with HRV, which has been proven to be useful in mental fatigue assessment, is also effective for continuous and real-time monitoring of physical fatigue during exercise. Furthermore, the study also aimed to investigate which HRV features were the most significant in the classification. The selection and analysis of relevant features may also be important for improving the interpretability of the physical fatigue assessment model.
Based on the results of previous studies and the advancements in machine learning technology, we proposed a novel method for the automatic and objective classification of physical fatigue. First, we proposed a feature selection method to remove useless HRV features that have a low correlation with physical fatigue and redundant HRV features that have a high correlation with the selected features. Then, four machine learning algorithms were trained to classify fatigue using the selected features. Experimental results for 80 healthy subjects indicate that the model trained using the selected features could classify physical fatigue with a high accuracy of 85.5%.
The remainder of this paper is organized as follows.
Section 2 introduces the data collection experiment and the physical fatigue evaluation modeling methods.
Section 3 provides the physical-fatigue-related HRV feature selection results and the physical fatigue classification results. The obtained fatigue classification results with different machine learning methods are presented and discussed in
Section 4. Finally, the study is concluded in
Section 5.
2. Materials and Methods
2.1. Data Collection
A total of 80 healthy subjects were recruited for participation in the data collection experiments; the statistical anthropometric characteristics of all subjects are summarized in
Table 1. The subjects were asked to perform a preset treadmill exercise, which was modified from the Bruce protocol [
30], and during the test, their ECG data were collected. The experiment process is shown in
Table 2; it started with a 5 min pre-rest, during which the subjects were asked to stand still on the treadmill. Following this was the exercise stage, during which the subjects began to run at a speed of 3 km/h, and the speed increased to the next preset value every 5 min until reaching the maximum preset speed, and the subjects would run at this maximum preset speed until they were physically exhausted. It was not necessary to reach the maximum speed during the exercise stage, and the exercise could be terminated at any time the participant signaled that he was exhausted.
The data collection scenario is shown in
Figure 1. The subjects were asked to wear a 12-lead ECG device (GE Medical System Information Technologies, INC, CardioSoft Cardiac Testing System). The ECG device used in our study was specially designed for exercise ECG monitoring, and hardware anti-noise design and software filtering algorithms were made to ensure the high quality of the collected exercise ECG signal. In addition, we further compared and analyzed the quality of 12-lead ECG signals, and finally selected a V3-lead ECG signal, which had the best signal quality for subsequent HRV extraction and fatigue evaluation.
The RPE scale ranged from 6 to 20, and the subject was free to choose any integer value within this range. Prior to starting the run, we explained the RPE scale listed in
Table 3 to each subject. During the running experiments, the fatigue states of subjects were recorded with the RPE. At the end of each stage, the subjects were asked to report their RPE. Three classes were defined based on the RPE values: (i) “Rested” for values from 6 to 10; (ii) “A bit tired” for values from 11 to 16; and (iii) “Tired” for values from 17 to 20.
Each participant participated in at least 1 session and at most 3 sessions (with an interval of 1 week) of data acquisition experiments. Each session lasted 20 min to 60 min, and a total of 207 sessions were collected.
Figure 2 shows the fatigue state distribution of the dataset collected in the 207 sessions of experiments. It can be seen that the perception of tiredness had individual differences. Under the same exercise intensity and time, e.g., at the EX-3 stage, participants reported being “Rested” in 20 sessions, “A bit tired” in 128 sessions and “Tired” in 59 sessions. In addition, all 80 subjects finished the first 3 exercise stages (to EX-3 stage), while from the Ex-4 stage onwards, there were subjects who stopped the exercise because of exhaustion.
The study was approved by the Institutional Review Board of Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences. All subjects signed their written informed consent before the experiments.
2.2. Preprocessing and Feature Extraction
The sampling rate of ECG data was 200 Hz. In order to eliminate electronic noise and motion artifacts, the sampled signals were preprocessed using a Butterworth low-pass filter with a cutoff frequency of 50 Hz and a nine-level wavelet decomposition with the order 8 Daubechies wavelet. As the guideline [
31] recommended that the ECG records used for HRV analysis should last for at least 5 min, we segmented the raw ECG with a 5 min sliding window without overlap and extracted HRV features from each segment.
Over a specific time period, time domain features are just measurements of the mean and variability in the time interval between heartbeats, which is alternately referred to as normal-to-normal intervals (NN). The common time domain features include the mean of NN interval sequence (meanNN), the mean of heart rate sequence (meanHR), the standard deviation of NN interval sequence (SDNN) and the root mean square of successive differences in NN interval sequence (RMSSD). Another two features calculated by successive differences in NN interval sequence are the number of these differences greater than 50 ms (NN50) and the percentage of NN50 in total intervals (pNN50). By segmenting the long NN interval sequence into several nonoverlapping chunks with a chosen time window (1 min in this work), two types of HRV features, including the standard deviation of the averages of segmented chunks (SDANN) and the average of the standard deviations of segmented chunks (SDNNi), can be calculated. In addition to these statistical features, there are two geometric HRV features based on the NN interval sequence histogram with bins of 1/128 s. The HRV triangular index (HRVTi) is the ratio of the total number of all intervals to the height of the histogram. Additionally, the triangular interpolation of the histogram (TINN) is the baseline width of the minimum square difference triangular interpolation of the highest peak of the histogram.
For HRV frequency domain analysis, power spectrum density (PSD) was computed using the Lomb–Scargle method. Three main components were derived from the heart rate power spectrum, namely the very low frequency band (VLF) ranging between 0.0033 Hz and 0.04 Hz, the low = frequency band (LF) ranging between 0.04 Hz and 0.15 Hz and the high-frequency band (HF) ranging between 0.15 Hz and 0.4 Hz. The VLF component has been reported to be associated with arrhythmic death [
32] and high inflammation [
33]. The LF component appears to be sensitive to both sympathetic and parasympathetic activities, whereas the HF component is primarily mediated by the parasympathetic nervous activity [
34]. Therefore, the LF/HF ratio has been regarded as a measure of physical workload and stress [
35].
Aimed at the nonlinearity of heart rate signal, a number of nonlinear techniques have been applied to HRV analysis, which was thought to be an effective way to describe the changes in the biological signal. Three nonlinear methods were used for HRV analysis in this work, namely sample entropy (sampen), Poincare plot (SD1, SD2 and SD1/SD2) and detrended fluctuation analysis (α, α1 and α2).
Using MATLAB R2018a with the help of Physionet Cardiovascular Signal toolbox [
36], a total of 24 features, including 10 time domain features, 7 frequency domain features and 7 nonlinear features, were computed for further processing. All the obtained HRV features are listed in
Table 4.
2.3. Feature Selection
Feature selection is essential for training an effective model. There is no doubt that any unnecessary features, including unimportant features and redundant features, will increase the computational cost of model training and the risk of model overfitting, decrease the interpretability of the model and reduce the generalization performance of the model on the test set. Thus, dropping features with a weak correlation with physical fatigue (unimportant features) and removing highly redundant features are two steps needed during feature selection. The proposed feature selection method is described in Algorithm 1.
Algorithm 1 Feature selection. |
Input: Original features , input data |
Output: Selected features |
1: Initialize thresholds , |
2: Initialize number of repeats |
3: for do |
4: Compute actual Gini importance from according to Equation (1) |
5: end for |
6: for do |
7: Shuffle the labels , which is referred to as |
8: for do |
9: Compute new Gini importance of from according to Equation (1) |
10: end for |
11: end for |
12: for do |
13: Compute score of the feature according to Equation (4), which is referred to as |
14: end for |
15: Select the features with a score lower than , which is referred to as |
16: Delete the features from |
17: Compute the correlation matrix of features according to Equation (5) |
18: Select the features with less actual Gini importance in each pair of features with a correlation above which is referred to as |
19: Delete the features from |
20: Obtain the remaining features in , which is referred to as selected features |
2.3.1. Dropping Unimportant Features
Dropping low-correlated features requires the scores of all the features at first. The method based on feature importance of random forest (RF) and permutation importance [
37] was proposed to score the importance of features. RF provides a Gini importance for the assessment of feature importance. Suppose that we have an input dataset with
instances
where each
is a vector with
features and
is the corresponding label. First, RF uses the dataset to establish its model. Then, Gini importance of feature
is defined as the sum of the impurity improvement of all the nodes
in all trees
using the feature during the training phase, according to Equation (1):
The decrease in Gini impurity resulting from optimal split
is defined as
where
denotes Gini impurity at the node
;
and
denote the child nodes of
;
and
denote the ratio of the child nodes’ sample size to the total sample size; and
denotes the ratio of class
in node
.
Unlike the common permutation importance, the label rather than the feature was permuted in this method. After shuffling the labels for the first time, the new dataset can be expressed as where is a permutation of the actual labels. After training the RF model with the new dataset, we could compute the Gini importance of feature , which is referred to as . By repeating times on permutation of labels at random, the null importance distributions of various features were created to demonstrate how the model can make sense of a feature disregarding the original labels.
Randomly reordering labels could reduce the Gini importance of all features, because the input data no longer correspond to the real labels obtained in the real world. If the model relies heavily on a feature in its prediction, its importance is particularly affected. Thus, a metric to score a feature is calculating the percentage of the feature’s null importance distribution as less than the actual importance. The formula for calculating this score of the feature
is given by Equation (4).
where count() denotes counting the elements that meet the criteria,
is a set of null importance distributions of
with the number of repeats of
, and
is the actual Gini importance of
.
From another viewpoint, the score is a kind of quantitative index adapted from the original feature importance. Compared with the original Gini importance, the score is more effective at selecting features with high importance. The features with a score lower than the threshold will be dropped as “unimportant features”, which contribute little to the model.
2.3.2. Removing Redundant Features
The Pearson correlation coefficient was implemented to compute the correlation matrix and measure the redundancy between two features. Pearson correlation coefficient r provides an indicator to quantitatively evaluate the linear correlation between two variables [
38]. It has a value ranging from −1 to +1, where −1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and +1 indicates a perfect positive linear relationship. The closer the absolute value of
r to 1, the stronger the correlation. Given two variables,
X and
Y, the Pearson correlation coefficient r between
X and
Y is defined as Equation (5):
where
N is the number of the variable,
and
are the values of
and
for the i
th individual, and
and
are the averages of
and
.
In consequence, the larger the absolute value of the correlation coefficient between two features, the higher the mutual substitutability and the redundancy of the two features, and vice versa. As for each pair of features whose correlation is higher than a threshold , the less important one will be regarded as the “redundant feature” and removed.
2.4. Physical Fatigue Classification
The last step is using machine learning algorithms to accurately classify physical fatigue levels. The classification model adopted four supervised machine learning algorithms, namely decision tree (DT), support vector machine (SVM), K-nearest neighbor (KNN) and light gradient boosting machine (LightGBM). These classification models were trained with the best features obtained by the feature selection method described in
Section 2.3.
The performance metrics of the evaluation model were accuracy (ACC), precision, recall and F1 score (F1), and their definitions are listed in Equations (6)–(9)
where TP refers to the number of correctly classified samples in a certain class, FP refers to the number of samples misclassified as a certain class when they belong to other classes, TN refers to the number of correctly classified samples in other classes, and FN refers to the number of samples belonging to a certain class that was misclassified as other classes. The average of these metrics among classes was calculated to obtain a final evaluation of the model’s performance.
The 10-fold cross validation method was used to evaluate the performance of these models. In order to prevent a subject’s data from being used partly for training and partly for testing, each iteration was trained with the data of 72 subjects and tested with the data of the remaining 8 subjects. The average performance of the 10 iterations was used as the final result.
5. Conclusions
The application of HRV and machine learning algorithms in physical fatigue assessment was studied in this paper. First, 24 HRV features of four domains were computed from the original ECG. Then, a two-step feature selection method was proposed to select the best features. After feature selection, 13 original features were removed, and 11 optimal features were selected and used as the input of the model. These selected HRV features identified for physical fatigue detection are meanHR, NN50, SDANN, HRVTi, TINN, aVLF, sampen, SD2, SD1/SD2, α and α1. Four machine learning algorithms, DT, SVM, KNN and LightGBM, were used to build classifiers that automatically detect the fatigue state. LightGBM achieved the best performance and had an accuracy of 0.855 and an F1 score of 0.801. The results verify the feasibility of using HRV to evaluate physical fatigue states. By using the features selected by our feature selection method, the proposed model achieved superior performance in assessing the physical fatigue state. Furthermore, the selected features can be applied to wearable ECG devices for physical fatigue assessment during exercise in real time.
In future works, more subjects with diseases (e.g, diabetes, hypertension, heart disease) and subjects with different levels of physical fitness and different ages will be included to increase the reliability of physical fatigue assessment. Finally, other machine learning or deep learning models and features based on other physiological signals would be considered.