Article
Peer-Review Record

3D Convolutional Neural Networks for Remote Pulse Rate Measurement and Mapping from Facial Video

Appl. Sci. 2019, 9(20), 4364; https://doi.org/10.3390/app9204364
by Frédéric Bousefsaf *, Alain Pruski and Choubeila Maaoui
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 31 August 2019 / Revised: 10 October 2019 / Accepted: 11 October 2019 / Published: 16 October 2019
(This article belongs to the Special Issue Contactless Vital Signs Monitoring)

Round 1

Reviewer 1 Report

In the manuscript, a 3D CNN is proposed to estimate pulse rates from facial video streams. The authors compared their method with four previous ones, and the proposed method showed better performance.

 

MAJOR REVISION

1. In the abstract, numerical data (i.e., performance indexes or accuracy) regarding the proposed method should be presented to show its advantage over other methods.

2. Correct sentence tenses should be used in both “3. Materials and Methods” and “4. Results and discussion.” In general, the past tense should be employed in the Methods and Results parts, except in sentences that directly describe figures, tables, or equations.

3. How can the authors make sure that the outcome obtained with synthetic data is similar to that obtained with real data?

 

MINOR REVISION (only some of them are listed here)

1.      LINE 7

“the network ensure…” should be “the network ensures….”

   

2.      LINE 76

“and 2008 respectively” should be “and 2008, respectively.”

 

3.      LINE 81

Is it ‘four’ or ‘five’ basic procedures?

 

4.      LINE 172

“If not choose properly, the …” should be “If not chosen properly, the…”

 

5.      LINE 195

“We develop, ….” should be “We developed, ….”

6.      LINE 248

“The slope (curve parameters) were….” should be “The slope (curve parameters) was….”

7.      LINE 255

“This value gradually rise and fall…” should be “This value gradually rises and falls…”

Author Response

In the abstract, numerical data (i.e., performance indexes or accuracy) regarding the proposed method should be presented to show its advantage over other methods.

Additional information and RMSE values were reported in the abstract.

 

Correct sentence tenses should be used in both “3. Materials and Methods” and “4. Results and discussion.” In general, the past tense should be employed in the Methods and Results parts, except in sentences that directly describe figures, tables, or equations.

The manuscript has been revised according to the comment.

 

How can the authors make sure that the outcome obtained with synthetic data is similar to that obtained with real data?

The precise definition of the imaging PPG (iPPG) phenomenon is still a matter of debate (several authors, such as Kamshilin et al. and de Haan et al., have published papers on this topic). In addition, we currently cannot assess and model all the physiological and physical parameters (e.g., diffuse and specular light distribution over face regions) involved in iPPG.

At first sight, and to the best of our knowledge, there is no direct way to check whether the synthetic generator produces video patches similar to real ones. We nevertheless control the synthetic output by re-computing the PPG signal from some synthetic video patches and checking whether it resembles the original PPG signal (used to produce the video). To this end, we developed a curve-fitting procedure to ensure that the generator delivers PPG videos in which the corresponding PPG signal (computed over the whole video patch using a spatial averaging operation) resembles real signals (measured on the face). An illustration is presented in the paper (Figure 5).
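As a rough illustration of this check, the sketch below recovers a PPG signal from a synthetic video patch by spatial averaging and compares it to the source signal with a Pearson correlation. It is a minimal example written for this response; the array shapes, sampling rate, and function names are assumptions and do not correspond to the actual code of the paper.

```python
import numpy as np

def recover_ppg(patch):
    """Recover a PPG signal from a video patch of shape (frames, height, width)
    by spatially averaging each frame."""
    return patch.mean(axis=(1, 2))

def patch_similarity(patch, source_ppg):
    """Pearson correlation between the signal recovered from the patch and the
    source PPG signal used to generate it (both standardized beforehand)."""
    recovered = recover_ppg(patch)
    recovered = (recovered - recovered.mean()) / recovered.std()
    source = (source_ppg - source_ppg.mean()) / source_ppg.std()
    return float(np.corrcoef(recovered, source)[0, 1])

# Toy example: a 60-frame, 25x25 pixel synthetic patch built from a sine-like
# source signal (~72 bpm at 30 fps) plus a small amount of pixel noise.
source_ppg = np.sin(2 * np.pi * 1.2 * np.arange(60) / 30.0)
patch = 0.5 + 0.01 * source_ppg[:, None, None] + 0.001 * np.random.randn(60, 25, 25)
print(patch_similarity(patch, source_ppg))   # close to 1 for a faithful patch
```

A correlation close to 1 indicates that the generated patch carries the intended pulsatile component.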

We also tried to generate data based on simple sinusoidal signals instead of fitted-to-data signals. The results are significant, but we chose not to present them in the paper in order to avoid overburdening it. Niu et al. [61] developed a similar approach: their generator produces images simulated using simple sine functions.

The relevance of the results presented in Section 4 also indicates that the synthetic generator produces data that are close enough to real data; otherwise, the model would not be able to predict pulse rate values with such accuracy.

 

 

MINOR COMMENTS

LINE 7

“the network ensure…” should be “the network ensures….”

 

LINE 76

“and 2008 respectively” should be “and 2008, respectively.”

 

LINE 81

Is it ‘four’ or ‘five’ basic procedures?

 

LINE 172

“If not choose properly, the …” should be “If not chosen properly, the…”

 

LINE 195

“We develop, ….” should be “We developed, ….”

We have kept the present form: in this context, “develop” is equivalent to “present” (e.g., “We present in this section…”). It refers to the presentation of the method, not to the fact that we developed the method. Note that we are not native English speakers, though; we can of course modify the verb if the reviewer judges that the past tense is mandatory here.

 

LINE 248

“The slope (curve parameters) were….” should be “The slope (curve parameters) was….”

 

LINE 255

“This value gradually rise and fall…” should be “This value gradually rises and falls…”

 

We integrated the corrections (6 out of 7) proposed above. Thanks for this feedback.

Reviewer 2 Report

This is an interesting paper that applies the deep learning model to measure pulse rate from facial video.

I have a question about the CNN the authors used. The authors divided the [55, 240] bpm range into 75 intervals, each with a step of 2.5 bpm, and then used the 75 pulse-rate intervals plus an extra "No PPG" class as the 76 classes for prediction. I was wondering why not use the continuous pulse rate value as the predicted output. This could be implemented by replacing the cross-entropy loss of the CNN model with the MSE loss or, more fancily, by using XGBoost to run a regression based on the features extracted from the CNN. Have you tried it? What is the advantage of treating the pulse rate as a categorical variable?

Author Response

I have a question about the CNN the authors used. The authors divided the [55, 240] bpm range into 75 intervals, each with a step of 2.5 bpm, and then used the 75 pulse-rate intervals plus an extra "No PPG" class as the 76 classes for prediction. I was wondering why not use the continuous pulse rate value as the predicted output. This could be implemented by replacing the cross-entropy loss of the CNN model with the MSE loss or, more fancily, by using XGBoost to run a regression based on the features extracted from the CNN. Have you tried it? What is the advantage of treating the pulse rate as a categorical variable?

 

We tried to predict numerical pulse rate values using an almost identical network architecture: we changed the last layer (one neuron), set linear or ReLU as the activation function, and set MSE as the loss function. We created synthetic videos with random pulse rate values (range: 55–240 bpm) and used these numerical (integer) values as output. Training the model is nevertheless not straightforward because we observed no convergence (see figure below). An in-depth study is necessary to fully understand the cause of this non-convergence. We did not try XGBoost for the classification part of the network. It could indeed be an effective avenue for future improvements by proposing a hybrid model (3D CNN + XGBoost). We have integrated this point into the discussion (see Section 4.3 and ref. 84).

 

Left: training using a linear activation function. Right: training using a ReLU activation function. We can observe that learning does not converge because the loss and the accuracy remain completely flat.
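To make the modification described above concrete, here is a Keras-style sketch of the two output heads (categorical versus regression). The backbone is a deliberately tiny placeholder, and all layer choices and hyper-parameters are assumptions for illustration; this is not the architecture of the paper. Only the input patch size (60 frames of 25×25 pixels) and the 76-class output are taken from the discussion above.

```python
from tensorflow.keras import layers, models

def build_model(mode="classification", n_classes=76):
    """Placeholder 3D CNN with either a categorical head (softmax over
    pulse-rate classes, cross-entropy loss) or the regression head tested
    in this response (one neuron, MSE loss)."""
    inputs = layers.Input(shape=(60, 25, 25, 1))            # 60 frames of 25x25 pixels
    x = layers.Conv3D(32, kernel_size=3, activation="relu")(inputs)
    x = layers.GlobalAveragePooling3D()(x)
    if mode == "classification":
        outputs = layers.Dense(n_classes, activation="softmax")(x)
        loss = "categorical_crossentropy"
    else:                                                    # continuous pulse rate
        outputs = layers.Dense(1, activation="linear")(x)    # "relu" was also tried
        loss = "mse"
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model

regression_model = build_model(mode="regression")
```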

 

A comparison with the results provided by a categorical representation with a step of 1 bpm could be of interest if we successfully train a NN model with continuous pulse rate values. Here, only the output dimension differs: one neuron in the case of MSE versus 185+1 neurons in the case of categorical cross-entropy.

Allowing the neural network to recognize noise (i.e., video patches without PPG fluctuations) is an important point of the method, and it is not straightforward in the case of a continuous approach. We should investigate this direction more deeply. Please note that we plan to release the code on GitHub so that other researchers in the community can test and/or improve our developments (https://github.com/frederic-bousefsaf).
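For reference, a small helper illustrating the categorical encoding discussed above, including the extra noise class. The function name and the convention of placing the "No PPG" class after the last pulse-rate bin are our assumptions; the exact number of bins depends on whether the range endpoints are included.

```python
import numpy as np

def bpm_to_class(bpm, step=2.5, low=55.0, high=240.0):
    """Map a pulse rate (bpm) to the index of its nearest bin centre
    (low, low + step, ..., high); with step=2.5 this yields 75 bins.
    The index after the last bin is reserved for the "No PPG" class,
    i.e. patches that contain no pulsatile fluctuation."""
    n_bins = int(round((high - low) / step)) + 1   # 75 bins for step = 2.5
    if bpm is None:                                # noise patch
        return n_bins
    idx = int(round((bpm - low) / step))
    return int(np.clip(idx, 0, n_bins - 1))

bpm_to_class(72.5)   # -> 7, the bin centred on 72.5 bpm
bpm_to_class(None)   # -> 75, the "No PPG" class
```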

Round 2

Reviewer 1 Report

MINOR REVISION

The Abstract part can be modified as follows:

Remote pulse rate measurement from facial video has gained particular attention over the last few years. Research exhibits significant advancements and demonstrates that common video cameras correspond to reliable devices that can be employed to measure a large set of biomedical parameters without any contact with the subject. A new framework for measuring and mapping pulse rate from video was presented in this pilot study. The method, which relied on convolutional 3D networks, was fully automatic and did not require any special image pre-processing. In addition, the network ensured concurrent mapping by producing a prediction for each local group of pixels. A particular training procedure that employed only synthetic data was proposed. Preliminary results demonstrated that the convolutional 3D network could effectively extract pulse rate from video without the need of any processing on frames. The trained model was compared with other state-of-the-art methods on public data. Results exhibited significant agreement between estimated and ground-truth measurements: the root mean square error computed from pulse rate values assessed with the convolutional 3D network was equal to 8.64 bpm while being above 10 bpm for the other state-of-the-art methods. Robustness of the method towards natural motion and an increase in performance correspond to the two main avenues that will be considered in future works.

 

In Figures 8 and 11, the subjects’ faces are clearly shown and each subject can be easily identified. So, it may be necessary to mask or blur some parts of the faces to make them unidentifiable.

 

LINE 247

“using an uniform” should be “using a uniform….”

LINE 250

“Linear, quadratic or cubic tendency were added …” should be “Linear, quadratic or cubic tendency was added .…”

 

LINE 318

“Map of predictions were formed” should be “Maps of predictions were formed…”

 

LINE 414

“Only video patch of size 25×25×60 were analyzed by the neural network.” should be “Only video patches of size 25×25×60 were analyzed by the neural network.”

Comments for author File: Comments.doc

Author Response

The Abstract part can be modified as follows:

Remote pulse rate measurement from facial video has gained particular attention over the last few years. Research exhibits significant advancements and demonstrates that common video cameras correspond to reliable devices that can be employed to measure a large set of biomedical parameters without any contact with the subject. A new framework for measuring and mapping pulse rate from video was presented in this pilot study. The method, which relied on convolutional 3D networks, was fully automatic and did not require any special image pre-processing. In addition, the network ensured concurrent mapping by producing a prediction for each local group of pixels. A particular training procedure that employed only synthetic data was proposed. Preliminary results demonstrated that the convolutional 3D network could effectively extract pulse rate from video without the need of any processing on frames. The trained model was compared with other state-of-the-art methods on public data. Results exhibited significant agreement between estimated and ground-truth measurements: the root mean square error computed from pulse rate values assessed with the convolutional 3D network was equal to 8.64 bpm while being above 10 bpm for the other state-of-the-art methods. Robustness of the method towards natural motion and an increase in performance correspond to the two main avenues that will be considered in future works.

After comparing with abstracts recently published in the journal, we chose to keep the present tense in the entire abstract.

 

In Figures 8 and 11, the subjects’ faces are clearly shown and each subject can be easily identified. So, it may be necessary to mask or blur some parts of the faces to make them unidentifiable.

We have contacted Y. Benezeth, who manages the dataset, and he confirmed that the subjects’ faces can be used in scientific articles (the participants gave their approval).

 

LINE 247: “using an uniform” should be “using a uniform….”

We revised the manuscript in accordance with the comment. In the same way, we changed “an unique” to “a unique”.

 

LINE 250: “Linear, quadratic or cubic tendency were added …” should be “Linear, quadratic or cubic tendency was added .…”

We revised the manuscript in accordance with the comment.

 

LINE 318: “Map of predictions were formed” should be “Maps of predictions were formed…”

We revised the manuscript in accordance with the comment.

 

LINE 414: “Only video patch of size 25×25×60 were analyzed by the neural network.” should be “Only video patches of size 25×25×60 were analyzed by the neural network.”

We revised the manuscript in accordance with the comment.

 

Once more, many thanks for this feedback.
