3.1. Region of Interest Extractor
The algorithm was designed to crop the regions of interest (ROIs) from the videos so that they resemble the data captured by the cameras of an HMD. In addition, a thresholding technique was used to stabilize the videos processed by the algorithm. Thresholding also reduced the occurrence of unwanted motion, providing more stable images for processing without introducing new problems in the cropped area.
Upon starting the algorithm, the frames of the original video are streamed one at a time, with each previous frame removed from memory to avoid overloading the system. After a frame is loaded, the facial landmarks are detected with either FaceMesh or the Support Vector Machine (SVM) detector, as shown in Figure 2.
The ROIs of the face of the individual in the video are represented in Figure 2: the blue square marks the ROI of the right eye, the red square the left eye and the green square the mouth region. The yellow rectangle selects the region in the middle of the face, extending from the bottom of the eyes to the top of the mouth; this region was not used in the experiments of the present work and only illustrates the ROI used by the EVM authors in [26]. During the ROI detection phase, in case of error, the algorithm keeps the last detected coordinates and follows the normal flow of execution, except when the error occurs consecutively, in which case the video processing is interrupted.
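A minimal sketch of this loop is shown below, assuming a hypothetical `detect_landmarks` wrapper around FaceMesh or the SVM detector and an illustrative tolerance for consecutive errors; none of the names or limits are the authors' exact implementation.

```python
import cv2

MAX_CONSECUTIVE_ERRORS = 5  # illustrative tolerance, not from the paper

def process_video(path, detect_landmarks):
    """Stream frames one at a time; keep the last valid landmarks on
    sporadic detection errors, abort on consecutive failures."""
    capture = cv2.VideoCapture(path)
    last_landmarks = None
    consecutive_errors = 0
    while True:
        ok, frame = capture.read()  # previous frame is no longer referenced
        if not ok:
            break  # end of video
        landmarks = detect_landmarks(frame)  # FaceMesh or SVM detector
        if landmarks is None:
            consecutive_errors += 1
            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
                break  # interrupt processing after consecutive errors
            landmarks = last_landmarks  # reuse the last detected coordinates
        else:
            consecutive_errors = 0
            last_landmarks = landmarks
        if landmarks is not None:
            yield frame, landmarks
    capture.release()
```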
The detector receives the frame and predicts the facial points, where each point is composed of two values indicating its coordinate in the image. The center point of each ROI is calculated as the average of the coordinates of the points in the region. With $C$ being the list of points of an ROI, $(x_i, y_i)$ being the coordinates of each point and $N$ being its total size, we can represent the center point $c$ of an ROI as in Equation (1).

$$c = \left( \frac{1}{N} \sum_{i=1}^{N} x_i,\; \frac{1}{N} \sum_{i=1}^{N} y_i \right) \tag{1}$$
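A direct NumPy transcription of Equation (1); the function name is illustrative:

```python
import numpy as np

def roi_center(points):
    """Center point of an ROI as the mean of its landmark coordinates
    (Equation (1)); `points` is an (N, 2) array of (x, y) pairs."""
    points = np.asarray(points, dtype=float)
    return points.mean(axis=0)  # (mean of x_i, mean of y_i)
```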
The Euclidean distance calculation is performed with $c_1$ in Equation (2) and $c_2$ in Equation (3), corresponding to the center points of the same ROI in the current and in the previous detection.

$$c_1 = (x_1, y_1) \tag{2}$$

$$c_2 = (x_2, y_2) \tag{3}$$

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
The greater the calculated distance, the stronger the evidence of face motion, which should cause the ROI to move in the same direction. The smaller the calculated distance, the stronger the evidence that the face is stable. However, the distance never drops to exactly zero nor remains constant, due to natural movement of the face, camera shake, or variation in the prediction of the facial landmarks. Therefore, even if the face remains stationary, the cropped ROIs will jitter if stabilization is not applied.
After determining the center point $c$ of the ROI, the algorithm uses half of the ROI length on each side to build a square centered at $c$, which is used to crop the frame, as shown in Figure 3. After cropping, an approximation distortion is applied to the mouth region: using the distance between the mouth and the nose as a radius around the center, the region is warped so that the face appears closer to the camera. Figure 3 shows the ROI of the mouth without the distortion and the same region with the distortion. Once cropped, the eye and mouth ROIs are resized to their final sizes of 100 × 100 and 400 × 400 pixels, respectively.
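A sketch of the square crop and final resize, ignoring the mouth distortion, whose exact warp is not fully specified here; names are illustrative:

```python
import cv2

EYE_SIZE, MOUTH_SIZE = (100, 100), (400, 400)  # final ROI sizes

def crop_roi(frame, center, roi_length, out_size):
    """Crop a square of side `roi_length` centered at `center`
    (half the length on each side) and resize to `out_size`."""
    half = int(roi_length / 2)
    cx, cy = int(center[0]), int(center[1])
    h, w = frame.shape[:2]
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return cv2.resize(frame[y0:y1, x0:x1], out_size)
```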
The position that the frame occupies in the video is recorded in the name of the saved file, which is of type Portable Gray Map (PGM) when the image is captured with an infrared sensor and Portable Network Graphics (PNG) when the image is captured with a visible light sensor.
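A small illustration of this naming scheme; the exact name pattern is an assumption, only the frame index and the sensor-dependent extension come from the text:

```python
def frame_filename(frame_index, infrared):
    """Embed the frame's position in the video in the file name; PGM for
    infrared captures, PNG for visible-light captures."""
    ext = "pgm" if infrared else "png"
    return f"{frame_index:06d}.{ext}"  # e.g., '000042.pgm'
```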
To define when the coordinates of the ROI should be updated, the calculated Euclidean distance d is compared against a threshold that determines the behavior of the algorithm. If the distance is below the threshold, the last points of the ROI are kept, ignoring the new detection that would otherwise cause jitter. If the distance exceeds the threshold, the newly detected points are used to move the ROI along with the face. The behavior of the algorithm must also account for the distance from the camera: for example, the movement of a face five meters away produces a smaller Euclidean distance than the same movement of a face half a meter from the camera. A fixed threshold would therefore not perform equally well at all distances and, to account for the subject's distance from the camera, the threshold is set to 5% of the ROI length.
When the threshold is exceeded, there are two possible actions for updating the facial landmarks. The first averages the old and the new detection points, resulting in a smooth transition. The second replaces the previous points with the new detection directly, without averaging, resulting in a more abrupt transition.
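The stabilization rule can be sketched as follows, combining the Euclidean distance d, the 5% threshold and the two update actions; the `smooth` flag and names are illustrative:

```python
import numpy as np

def update_center(prev_center, new_center, roi_length, smooth=True):
    """Keep, average, or replace the ROI center depending on whether the
    Euclidean distance exceeds 5% of the ROI length."""
    d = float(np.linalg.norm(np.asarray(new_center) - np.asarray(prev_center)))
    threshold = 0.05 * roi_length  # scales with the apparent face size
    if d < threshold:
        return prev_center  # below threshold: ignore detection, avoid jitter
    if smooth:
        # first action: average old and new points for a smooth transition
        return tuple((np.asarray(prev_center) + np.asarray(new_center)) / 2)
    # second action: adopt the new detection directly (abrupt transition)
    return new_center
```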
Video magnification [31] was applied in the ROI extractor as an optional parameter, in order to highlight subtle movements in the videos used after extraction. Amplification of periodic color variations was not used in the videos of our experiments. The motion amplification used is based only on the frames already seen, without the influence of future frames. This technique follows the Eulerian approach, which consists of three stages: the decomposition of the frames into an alternative representation, its manipulation, and the reconstruction of the manipulated representation into the magnified frames. The decomposition of the frames results in edge-free videos with better amplification characteristics.
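A minimal causal sketch of such a pipeline, assuming the spatial decomposition is reduced to the raw intensity (the full technique in [31] uses a multi-scale decomposition) and using two IIR low-pass filters whose difference isolates the temporal band to amplify; `alpha`, `r1` and `r2` are illustrative values:

```python
import numpy as np

class CausalMotionMagnifier:
    """Online Eulerian magnification sketch: temporally band-pass each
    pixel with two IIR low-pass filters, amplify the band and add it
    back. Uses only past and present frames, never future ones."""

    def __init__(self, alpha=10.0, r1=0.4, r2=0.05):
        self.alpha = alpha          # amplification factor (assumed value)
        self.r1, self.r2 = r1, r2   # IIR rates defining the temporal band
        self.low1 = self.low2 = None

    def step(self, frame):
        f = frame.astype(np.float32)
        if self.low1 is None:       # initialize filter states
            self.low1 = self.low2 = f
            return frame
        self.low1 = self.r1 * f + (1 - self.r1) * self.low1
        self.low2 = self.r2 * f + (1 - self.r2) * self.low2
        band = self.low1 - self.low2    # temporal band-pass component
        out = f + self.alpha * band     # amplify and reconstruct
        return np.clip(out, 0, 255).astype(np.uint8)
```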
3.2. Adaptation of Techniques from Related Work to the Regions of Interest
The implementation of the EVM-CNN adaptation was divided into two steps: extraction of the features and processing through the adapted MobileNet network. The feature-image technique proposed in [26] was reproduced based on the pseudocode and descriptions provided.
The adaptation of the MobileNet network was implemented based on the summary available in the EVM article [26]. In the absence of information on the choice of optimizer, the Adam algorithm was chosen. The only divergence from the proposed technique was the use of our thresholding technique in place of the Scalable Kernel Correlation Filter Tracking technique [32].
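A hedged Keras sketch of such an adaptation, assuming a single-output regression head; the input shape and loss are assumptions, since [26] only provides a summary of the network:

```python
import tensorflow as tf

def build_model(input_shape=(64, 64, 3)):  # input shape is an assumption
    """MobileNet backbone with a regression head, compiled with Adam
    (chosen here because the source article omits the optimizer)."""
    backbone = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=None)
    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),  # predicts a single target value
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```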
After implementing the EVM model, the first improvement was changing the original prediction target: instead of predicting BPM directly, the model predicts the PPG signal. This change was made possible by converting PPG to BPM with the HeartPy library, which handles heart rate analysis of noisy data. In addition, the videos used are resampled to 30 FPS. Beyond that, the adaptations made to EVM enabled remote heart rate prediction from the newly cropped ROIs, namely: the right eye, the left eye and the lower part of the face.
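The conversion can be sketched with HeartPy's `process` function; the synthetic signal below stands in for the model's predicted PPG:

```python
import numpy as np
import heartpy as hp

fs = 30.0  # videos are resampled to 30 FPS, so the PPG is sampled at 30 Hz
t = np.arange(0, 60, 1 / fs)
# Synthetic stand-in for the model's predicted PPG (~72 BPM plus noise)
predicted_ppg = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)

# HeartPy filters the noisy signal and extracts heart rate measures
working_data, measures = hp.process(predicted_ppg, sample_rate=fs)
print(measures['bpm'])  # converted heart rate in beats per minute
```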
Meta-rPPG has its source code available on the GitHub platform. However, despite being promising, its training and testing were tied to a single dataset, making any change laborious. To make it work with different datasets, two modifications were made: the first in the data input and the second in the code scalability and in the separation of pre-training and training data.
The input is linked to the output of the regions of interest extractor, which provides three frames: one from the lower face region and two from the eye regions. Upon receiving these frames, the code concatenates the frames of the eyes horizontally, resizes the result to match the lower face frame, and stacks the two vertically, so that they are positioned according to the human face. Once they are all in a single frame, a final resize to 64 × 64 pixels is performed.
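A sketch of this assembly with OpenCV/NumPy, assuming the extractor's 100 × 100 eye crops and 400 × 400 lower-face crop; function and variable names are illustrative:

```python
import cv2
import numpy as np

def assemble_input(right_eye, left_eye, lower_face):
    """Concatenate the eyes horizontally, match widths with the lower
    face, stack vertically (eyes on top), and resize to 64 x 64."""
    eyes = np.hstack([right_eye, left_eye])          # e.g., 100 x 200
    eyes = cv2.resize(eyes, (lower_face.shape[1],    # match lower-face width
                             eyes.shape[0]))
    face = np.vstack([eyes, lower_face])             # eyes above the mouth
    return cv2.resize(face, (64, 64))                # final network input
```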
Another point of adaptation of Meta-rPPG was the adjustment of the input duration, previously fixed at 60 s and now dynamic, since the static input length implied data loss during training: videos shorter than 60 s were discarded and videos longer than 60 s were truncated. In addition, the videos used are resampled to 30 FPS.
For the training and testing steps, the dataset was split into 88% and 12%, respectively. A further separation is performed for the pre-training: 22% of the training dataset is forwarded to pre-training, where it is divided into 55% for the query set and 45% for the support set, because the pre-training stage uses the few-shot methodology for initial learning.
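These proportions can be checked with a short calculation (variable names are illustrative):

```python
total = 1.0
train, test = 0.88 * total, 0.12 * total           # 88% / 12% split
pretrain = 0.22 * train                            # 22% of training data
query, support = 0.55 * pretrain, 0.45 * pretrain  # few-shot split
print(f"train={train:.3f}, test={test:.3f}, pretrain={pretrain:.4f}, "
      f"query={query:.4f}, support={support:.4f}")
# train=0.880, test=0.120, pretrain=0.1936, query=0.1065, support=0.0871
```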
To demonstrate this diversity, tests were performed on two datasets with the FaceMesh facial landmark detection technique. Three additional versions of the UBFC-rPPG dataset were also generated in order to obtain better results: applying video magnification, using the SVM facial landmark detection technique, and converting the RGB channels to grayscale.
All assessment metrics are based on the error (E), measuring the difference between the predicted (P) and ground truth (GT) values, respectively: Mean Error (ME), Standard Deviation (SD), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Pearson’s correlation (ρ).
Error is measured as the difference between predicted values and ground truth values, as expressed in Equation (4).

$$E = P - GT \tag{4}$$
The mean error is the average of all errors in a set; therefore, results closer to zero are better for this metric. The equation can be seen in Equation (5).

$$ME = \frac{1}{n} \sum_{i=1}^{n} E_i \tag{5}$$
The standard deviation measures how far, on average, each value is from the mean, that is, the average amount of variability in the dataset used. An SD close to zero indicates that the data are close to the mean. The equation is expressed in Equation (6).

$$SD = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (E_i - ME)^2} \tag{6}$$
RMSE can be considered the standard deviation of the prediction errors: it measures the average distance between the model's predicted values and the actual values in the dataset, so values closer to zero indicate better results. The equation is expressed in Equation (7).

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (P_i - GT_i)^2} \tag{7}$$
MAPE measures the average absolute percentage difference between the predicted and the actual values of each entry in a dataset. Thus, the closer to zero, the better the results. The equation can be seen in Equation (8).

$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{P_i - GT_i}{GT_i} \right| \tag{8}$$
Pearson’s correlation ranges from −1 to 1, with values of larger magnitude indicating a stronger linear relationship between the variables. The equation can be seen in Equation (9).

$$\rho = \frac{\sum_{i=1}^{n} (P_i - \bar{P})(GT_i - \overline{GT})}{\sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2}\,\sqrt{\sum_{i=1}^{n} (GT_i - \overline{GT})^2}} \tag{9}$$
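For reference, the five metrics can be computed from the predicted and ground truth arrays as in the sketch below, a plain NumPy transcription of Equations (4)–(9):

```python
import numpy as np

def evaluate(P, GT):
    """Compute ME, SD, RMSE, MAPE and Pearson's correlation from the
    predicted values P and ground truth values GT (Equations (4)-(9))."""
    P, GT = np.asarray(P, dtype=float), np.asarray(GT, dtype=float)
    E = P - GT                                   # Equation (4)
    me = E.mean()                                # Equation (5)
    sd = np.sqrt(np.mean((E - me) ** 2))         # Equation (6)
    rmse = np.sqrt(np.mean(E ** 2))              # Equation (7)
    mape = 100.0 * np.mean(np.abs(E / GT))       # Equation (8)
    rho = np.corrcoef(P, GT)[0, 1]               # Equation (9)
    return {"ME": me, "SD": sd, "RMSE": rmse, "MAPE": mape, "rho": rho}
```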