1. Introduction
Blood pressure is one of the most important vital signs and directly reflects health status, since high blood pressure is among the most serious risk factors for cardiovascular disease (CVD), a leading cause of death worldwide [1]. Blood pressure monitoring has therefore become an essential procedure that should be performed periodically. However, the most common methods for accurately measuring blood pressure are cuff-based, expensive, and cumbersome, and they should not be used frequently without a doctor’s supervision [2]; as a result, noncontact blood pressure sensing has recently received considerable attention, because it enables simple, convenient, and nonrestrictive estimation of blood pressure. One technique for noninvasive measurement of vital signs is the photoplethysmogram (PPG), which captures volumetric variations in blood circulation using a light source and a photodetector that measures color variations at the skin’s surface [3,4]. Because changes in the color intensities of the image channels carry information about the variation of the blood flowing beneath the skin, we propose a novel approach that estimates blood pressure using only a video of the subject’s face. Our approach crops regions of interest (ROIs), the portions of each frame that we want to operate on, and feeds them directly into a convolutional neural network (CNN) [5], a type of artificial neural network that automatically learns and extracts features from each frame channel, followed by long short-term memory (LSTM) [6], a type of recurrent neural network (RNN) able to learn long-term dependencies, especially in sequence prediction problems. The LSTM learns how the intensity changes over the recording duration lead to estimates of systolic and diastolic blood pressure (SBP and DBP, respectively). We designed our approach to be simple, convenient, and straightforward, requiring no equipment other than a camera. It suits scenarios such as vehicle driver or operator monitoring, where attaching special devices (e.g., an electrocardiography (ECG) monitor) to a person is not convenient. For this reason, we emphasized a cuff-less, contactless, and comfortable approach that depends only on image analysis. In other words, our method is inexpensive and manageable, and it differs from previous methods in that it does not extract any signal from the face or any other region and does not use additional information from ECG or arterial blood pressure (ABP) signals, which carry information about cardiac status; hence, our approach is less complex and less time-consuming than other techniques.
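The CNN-then-LSTM pipeline described above can be sketched as follows. This is a minimal PyTorch illustration with placeholder layer sizes; it is not the exact architecture used in this paper (which is detailed in Section 3), only the general pattern of a per-frame spatial feature extractor feeding a temporal model that regresses a single SBP or DBP value:

```python
import torch
import torch.nn as nn

class BPEstimator(nn.Module):
    """Illustrative CNN + LSTM regressor: per-frame CNN features -> LSTM -> one BP value.
    All layer sizes are placeholders, not the paper's actual architecture."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        # Shared spatial feature extractor applied to every cropped ROI frame
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal model over the sequence of per-frame feature vectors
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Fully connected head producing a single value (SBP or DBP)
        self.head = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, frames):                    # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))    # (batch*time, feat_dim)
        seq = feats.view(b, t, -1)                # (batch, time, feat_dim)
        out, _ = self.lstm(seq)                   # (batch, time, hidden)
        return self.head(out[:, -1])              # (batch, 1): one BP value

model = BPEstimator()
out = model(torch.zeros(2, 8, 3, 32, 32))         # 2 clips of 8 ROI frames each
```

Separate models of this shape are trained for SBP and for DBP, so each model outputs exactly one value.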
The rest of the paper is organized as follows.
Section 2 contains an overview of existing approaches and methods that are proposed to achieve noninvasive estimation of blood pressure.
Section 3 consists of two main subsections; the first one includes information about the datasets we used, whereas the second one contains all the details about our experiments on building, training, and testing our models.
Section 4 shows the results of testing our models and comparing them with other published models.
Section 5 discusses our approach and results, along with the limitations of our technique. Finally,
Section 6 presents the conclusions and future scope of the proposed approach.
2. Related Work
Recent years have seen a substantial increase in interest in blood pressure estimation from remote PPG (rPPG), an advancement of PPG technology that requires no physical contact: it relies on ambient light reflected from the skin, collected remotely by a complementary metal–oxide–semiconductor (CMOS) camera, and processed by supervised machine learning algorithms [7]. This section therefore outlines recently proposed methods and algorithms for cuff-less prediction of blood pressure from videos of the subject.
Luo et al. [
8] used transdermal optical imaging (TOI) to process imperceptible facial blood flow changes from 17 ROIs on the face, followed by an advanced machine learning algorithm. They achieved a mean error ± SD of 0.39 ± 7.30 mmHg for SBP and −0.20 ± 6.00 mmHg for DBP, with mean accuracies of 94.81% and 95.71% for SBP and DBP, respectively. Jain et al. [
2] assumed that any intensity variation observed in the red channel should be due to variation in the blood that flows beneath the face skin under constant lighting conditions and camera settings throughout the recording; hence, they used principal component analysis (PCA) to extract these variations from the video. They used the detected peaks of the preprocessed signal to extract the time and frequency domain parameters, based on which SBP and DBP were estimated using a linear regression model. Their model achieved mean absolute errors (MAEs) of 3.9 mmHg and 3.7 mmHg for SBP and DBP, respectively.
Other approaches such as that of Secerbegovic et al. [
3] used the pulse transit time to estimate blood pressure after applying independent component analysis (ICA) [
9] on the raw source signals extracted from the forehead; their linear regression model achieved an MAE of 9.48 mmHg for SBP and 4.48 mmHg for mean arterial pressure. ICA and linear regression were also used in [
10], where Oiwa et al. tried to increase the accuracy of the estimated blood pressure by using ICA for processing the RGB signals of five obtained ROIs and by using facial PPG amplitude and nasal skin temperature as inputs of linear regression models, which predicted blood pressure with an MAE within the range of 1.5–4.5 mmHg and 1.72–4.75 mmHg, respectively.
Since we are dealing with videos and images, using a convolutional neural network is a natural choice, and Iuchi et al. [11] did exactly that: they used a spatial description of the subject’s face as the input of their model, extracting time–space information of pulse waves on the face. They predicted SBP and DBP with MAEs of 6.7 mmHg and 5.4 mmHg, respectively. Another interesting approach was described in [
12], where Wu et al. proposed a blood pressure estimator based on two-channel rPPG signals of the upper and the lower face obtained by chrominance-based (CHROM) [
13] rPPG extraction besides a generative adversarial network [
14] to overcome the lack of data. In addition, they proposed using multiple models, each corresponding to a small nonintersecting interval of blood pressure values, and used a combination of the subject’s age and body mass index (BMI) to choose the appropriate model for estimating blood pressure. Schrumpf et al. [
15] adopted different neural network architectures such as AlexNet [
16], ResNet [
17], and the architecture published by Slapnicar et al. to predict blood pressure (BP) values [
18]. The training stage had multiple phases in order to find a suitable strategy to crop the signal into windows followed by transfer learning to train the models on rPPG signals extracted by the plane-orthogonal-to-skin (POS) algorithm [
19]. As a result, the ResNet model achieved the lowest SBP MAE of 13.02 mmHg, while AlexNet had the lowest DBP MAE of 8.27 mmHg.
Some researchers studied the correlation between blood pressure and image-based pulse transit time (iPTT), calculated as the time lag between two rPPGs obtained from simultaneous recordings of two body locations; these experiments aimed to extract features that may work efficiently for predicting blood pressure. Jeong et al. [
20] observed that SBP correlates strongly with the PTT extracted from the green color intensities of the regions of interest (face and palm) in videos captured by a high-speed camera (420 fps). Some years later, the same authors reimplemented the methodology of [
20] with an infrared light source in addition to a high-speed camera [
21]. In addition, they summed the red-component pixels of the regions of interest, and the resulting signal was detrended, filtered, and differentiated to find the correct maximum-derivative points of the photoplethysmogram and obtain information relevant for BP estimation.
Other works used additional equipment to obtain the PPG signal for blood pressure estimation, as in [
22], where Gaurav et al. used the PPG sensor of a Samsung Galaxy Note 5. The signals were preprocessed in several stages to extract 46 features in total, which were fed into three weighted artificial neural network (ANN) regression models to determine DBP; the predicted DBP, together with the aforementioned 46 features, was then used as input to three other weighted regression models to predict SBP. Their models achieved an MAE of 4.47 mmHg for SBP and 3.21 mmHg for DBP.
Finally, some authors treated this task as a classification problem instead of regression; an example is the work of Visvanathan et al. [
23]. They used demographic features such as height, weight, and age, besides 14 features extracted from the PPG signal, to improve prediction accuracy with both linear regression and SVM algorithms, which classified the output into five categories (from very low BP to very high BP) and achieved accuracies of 100% and 99.29% for SBP and DBP, respectively.
Since most of the papers above focus on deriving pulse waves from one or more regions of the subject’s face, in this paper we developed a novel method for blood pressure estimation that does not need to extract any signal from the face or any other region. We propose a less computationally costly and less time-consuming approach that predicts blood pressure by feeding images of regions of interest, cropped from each frame of the video, into our models, which have a special architecture (CNN + LSTM + fully connected layers). In addition, our approach needs no auxiliary signals such as ECG and no special equipment except a smartphone or any other device with a camera.
Since none of the papers discussed above worked with raw frames, without preprocessing or signal extraction, we had to find suitable models against which to validate our work. After searching, we obtained the models described in [
15], applied their signal extraction method to the datasets used in this paper, and tested their models to compare their performance with the results achieved by our models on the same test set.
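For context, the signal extraction method adopted from [15] is the plane-orthogonal-to-skin (POS) algorithm [19]. The sketch below is our own simplified NumPy rendering of the POS projection with overlap-add (the window length and the small stabilizing epsilon are illustrative choices), not the exact code of [15]:

```python
import numpy as np

def pos_rppg(rgb, fps=30, win_sec=1.6):
    """Extract an rPPG signal from spatially averaged RGB traces using the
    plane-orthogonal-to-skin (POS) projection.
    rgb: array of shape (N, 3), one mean skin-pixel colour per frame."""
    n = rgb.shape[0]
    l = int(win_sec * fps)                        # sliding-window length in frames
    h = np.zeros(n)
    P = np.array([[0.0, 1.0, -1.0],
                  [-2.0, 1.0, 1.0]])              # POS projection plane
    for t in range(n - l + 1):
        c = rgb[t:t + l].T                        # (3, l) window
        cn = c / c.mean(axis=1, keepdims=True)    # temporal normalisation
        s = P @ cn                                # (2, l) projected signals
        p = s[0] + (s[0].std() / (s[1].std() + 1e-12)) * s[1]
        h[t:t + l] += p - p.mean()                # overlap-add of zero-mean windows
    return h

# Demo on a synthetic pulsatile colour trace (illustrative values only)
frames = np.arange(100)
rgb = 0.5 + 0.05 * np.sin(2 * np.pi * 1.2 * frames / 30)[:, None] * np.array([0.3, 0.6, 0.2])
signal = pos_rppg(rgb, fps=30)
```

The resulting one-dimensional signal is what the models of [15] consume, in contrast to our models, which operate on the raw ROI frames directly.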
4. Results
While training the models, there were many options to try in the feature extraction part. Based on the results of
Section 3.2.2, our work focused on finding the best combination of different versions of EfficientNet and ResNet50V2 to include in the architecture of our models. In
Table 3, we list six of our best models (three SBP models and three DBP models) ranked by MAE and mean accuracy. The mean accuracy is computed as follows: for each test sample, the absolute difference between the predicted and actual values is divided by the actual value to obtain the relative error; the relative errors are averaged over all test samples, and the result is subtracted from 1. This metric gives a general idea of the accuracy of our predictions and helps decide whether the subject’s health condition is stable or may be at risk.
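As a concrete illustration, the two evaluation metrics can be written as follows (a NumPy sketch; function and variable names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted BP values (mmHg)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def mean_accuracy(y_true, y_pred):
    """1 minus the mean relative absolute error over the test set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(1.0 - np.mean(np.abs(y_pred - y_true) / y_true))
```

For example, predictions of 90 and 110 mmHg against true values of 100 and 100 mmHg give a mean relative error of 0.1 and hence a mean accuracy of 90%.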
The architectures of these models are described in
Figure 6 and
Figure 7 for SBP models and DBP models, respectively.
Since we had a decent number of good estimators, and inspired by the ensemble learning concept [
34], we decided to combine some of our models in order to obtain a better “learner” that estimates blood pressure (SBP or DBP) with a lower MAE. As
Table 3 shows, the performances of the models are very close by the criteria we used; our combinations therefore simply average the estimates of two models of the same type, both for simplicity and because no single model was clearly best enough to deserve a higher weight. We list the best four combinations, according to the lowest MAE and highest mean accuracy, in
Table 4.
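The averaging combinations can be sketched as follows. The predictions here are made-up values chosen to illustrate the effect, not numbers from Table 3 or Table 4:

```python
import numpy as np

def combine(predictions):
    """Unweighted average of the prediction vectors of same-type models
    (rows: models, columns: test samples)."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

# Hypothetical SBP estimates from two models whose errors partially cancel
y_true  = np.array([120.0, 130.0, 110.0])
model_1 = np.array([124.0, 127.0, 112.0])   # MAE 3.0 mmHg
model_2 = np.array([118.0, 131.0, 106.0])   # MAE ~2.33 mmHg
combo = combine([model_1, model_2])          # MAE 1.0 mmHg
```

In this toy example the combination’s MAE (1.0 mmHg) is lower than that of either model alone, which is exactly the effect the combinations in Table 4 aim for.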
For comparison with other models, “V4V test set 1” had to be modified for compatibility with those models. We therefore gathered an objective set, “V4V test set 2”, which includes both common and unusual values of SBP and DBP and satisfies the SNR condition required by the models in [
15]. The results are shown in
Table 5. They indicate that three of our models performed better than all the others for DBP, whereas one SBP model had a lower MAE and higher mean accuracy than the other models.
Regarding the correlation with respiratory rate, we tested our 10 models (5 SBP models and 5 DBP models) on the 60 videos of the Operator dataset, along with the 4 models provided by [
15] (which predict SBP and DBP together). As a result, we have 9 models that predict SBP and 9 that predict DBP; by averaging the resulting Pearson’s correlation coefficients over the 60 videos, we obtained the results shown in
Table 6.
All five SBP models showed higher correlation than the other models, and four of the five DBP models had stronger correlation than all the other models.
Table 6 also shows that one SBP model and two DBP models had a high correlation with respiratory rate [
32] (coefficient > 0.5 and
p-value less than 0.05). A p-value below 0.05 means that a correlation at least this strong would arise by chance less than 5% of the time if there were no true relationship, which is strong evidence against the null hypothesis. We therefore reject the null hypothesis and accept the alternative: there is a relationship between the respiratory rate and the blood pressure values predicted by our models. This indicates a good response from our models, which confirmed the connection between blood pressure and respiratory rate more strongly, in terms of Pearson’s correlation coefficient, than the models we compared them with.
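The correlation test used here can be reproduced with SciPy’s `pearsonr`. The per-minute values below are synthetic placeholders, not data from the Operator dataset:

```python
import numpy as np
from scipy.stats import pearsonr

def bp_rr_correlation(bp_estimates, respiratory_rate):
    """Pearson's r between predicted BP values and respiratory rate, with the
    two-sided p-value for the null hypothesis of no correlation."""
    r, p = pearsonr(np.asarray(bp_estimates, float),
                    np.asarray(respiratory_rate, float))
    return r, p

# Synthetic per-minute values for one video (illustrative only)
bp = [118.0, 121.0, 125.0, 124.0, 129.0, 131.0]   # predicted SBP, mmHg
rr = [12.0, 13.0, 15.0, 14.0, 16.0, 17.0]          # respiratory rate, breaths/min
r, p = bp_rr_correlation(bp, rr)
```

Per-video coefficients computed this way are then averaged over the 60 videos to produce the figures in Table 6.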
Considering the results in the tables and the fact that “V4V test set 2” contains new subjects from the V4V dataset, we regard SBP Model_3 and DBP Model_1 as the best models, for the following reasons. First, they performed almost identically on “V4V test set 1” and “V4V test set 2”, showing stable performance. Second, they both achieved Pearson’s correlation coefficients indicating moderate to high correlation, higher than the coefficients obtained by the models we compared our performance with. Finally, they are lighter to deploy in an application or website than combinations containing more than one trained model. Choosing SBP Model_3 involved a tradeoff: we gave stability of performance higher priority than correlation, but the correlation coefficient of SBP Model_3 still indicates a moderate correlation, which compares well with the other models (e.g., SBP Model_2).
5. Discussion
This paper proposes a novel approach to estimating the blood pressure of subjects without any equipment but the camera of a smartphone. Our methodology extracts ROIs from each frame of each video and passes these sequential images into a convolutional neural network to obtain spatial features; the outputs are then passed into an LSTM to extract the temporal features within the image sequence. Our results indicate that changes in blood pressure can be detected from the right and left cheeks, which is reasonable given that these regions contain the main and largest blood vessels in the face [
35]. Using these regions as inputs to our models, we predicted SBP and DBP with a decent MAE and acceptable mean accuracy, considering that no signal is extracted and no equipment but the camera is needed. Before training began, we determined the best pretrained models to use and the most useful regions of interest to extract, and we upsampled the data so that our models could learn to estimate unusual values of SBP and DBP. Because the SBP value is not tied to the DBP value, we trained our models separately: one set of models was trained on the true SBP and the other on the true DBP. There are therefore no shared weights or layers between the models that might affect the results, and each model outputs a single value, either SBP or DBP.
The main challenge was the dataset. There is not enough diversity in the subjects’ skin color, since most subjects have light to medium skin tones, so our models may not accurately predict the blood pressure of subjects with dark skin tones. Another limitation is that the dataset is unbalanced: unusual values of SBP and DBP are scarce. We partially overcame this by upsampling the dataset, but even so, our models were trained on a limited range of features indicating uncommon SBP or DBP, which may affect their estimates for subjects with unusual values of SBP or DBP, or when the filming or lighting conditions differ from those of the V4V dataset.
We also had difficulty comparing our results with other models, for three reasons. First, no previous work has handled raw images as input as we did; the main focus has been on extracting signals from a cascade of frames, which made extracting the signal from the videos a mandatory stage of the comparison. Second, the datasets used in the papers differ, with dissimilar ranges of ground truth: some papers built their models on very limited ranges of SBP and DBP, while others used wider ranges. Third, most of the algorithms and models mentioned in the related work section are not available or published, which was a major obstacle to an objective comparison. Nevertheless, we were able to compare our results with models trained over ranges of SBP and DBP similar to the ground truth of the V4V dataset. In general, some of our models estimated blood pressure better than these models, and their predictions correlated more strongly with respiratory rate. This may be because the CNN obtains spatial features from all three channels, which carry more useful information for estimating blood pressure than the signal extracted by the POS algorithm, where spatial averaging and other mathematical operations may cause loss of information; in addition, the chosen SNR threshold may not have been the best option for our dataset, leading to the inclusion of many noisy signals from all our datasets. As for the comparison among our own models, the MAEs of the SBP models and the DBP models are very close, but we may attribute the small difference to the distributions of the training samples. As shown in
Figure 2, the samples used to train the DBP models follow a normal distribution even after upsampling the uncommon values; hence, the SBP models were trained on a more generalized dataset than the DBP models.
Concerning the use of Pearson’s correlation coefficient, the metric itself is not new, but using it to compare blood pressure values predicted by an algorithm or model against another vital sign (respiratory rate in our case) has not, to the best of our knowledge, been done before.
Finally, we proposed in this paper a method that gives an approximate value of the subject’s blood pressure without attaching any devices to his/her skin. It is not intended for medical diagnosis (e.g., of hypertension). The main idea is to alert the subject or a responsible person (e.g., a car park dispatcher) to fluctuations in the subject’s blood pressure that might indicate fatigue, stress, etc. Nevertheless, our approach can easily be turned into an application, which would be very useful for blood pressure monitoring at home, at school, in the office, or anywhere else.
6. Conclusions
In this paper, we presented an innovative, inexpensive, and time-efficient method of estimating blood pressure using only a smartphone camera. Hybrid deep learning models combining a convolutional neural network and an LSTM were trained on cropped images of the right and left cheeks of the subjects of the V4V dataset. The results showed that our models estimate blood pressure with a lower MAE and better performance than other published models. We also proposed a new evaluation criterion that engages the relation between blood pressure and respiratory rate, using our own dataset, which provides the respiratory rate every minute. We tested the correlation of the estimates of our models and of the published models with the respiratory rate; our models achieved a stronger correlation than the models we compared our work with. Hence, the task of estimating blood pressure has been accomplished according to both the standard measures (MAE and mean accuracy) and our additional metric (Pearson’s correlation coefficient). In the future, we plan to extend the training set with subjects who are older, have darker skin tones, or have unusual SBP and DBP values due to medical conditions that the V4V dataset does not cover. In addition, our approach may be extended to estimate other vital signs such as heart rate, oxygen saturation, and body temperature.