**4. Experiments**

This section describes the experiments performed to evaluate the proposed approach. First, the datasets used for the evaluation are briefly described, followed by the conducted data preprocessing steps. The experimental settings as well as the performed experiments are described subsequently. The section concludes with a description and discussion of the experimental results.

#### *4.1. Datasets Description*

The presented approach is evaluated on both the *BioVid Heat Pain Database (Part A)* (BVDB) [4] and the *SenseEmotion Database* (SEDB) [6]. Both datasets were recorded with the principal goal of fostering research in the domain of pain recognition. In both cases, several healthy participants were subjected to a series of individually calibrated heat-induced painful stimuli, following the exact same procedure. Whereas the BVDB comprises 87 individuals subjected to four individually calibrated and gradually increasing levels of heat-induced painful stimuli (*T*1, *T*2, *T*3 and *T*4), the SEDB comprises 40 individuals subjected to three such levels (*T*1, *T*2 and *T*3). Each level of heat-induced pain stimulation was randomly elicited a total of 20 times for the BVDB and 30 times for the SEDB. Each elicitation lasted 4 s and was followed by a recovery phase of a random length of 8 to 12 s, during which a baseline temperature *T*0 (32 °C) was applied (see Figure 5). Whereas the elicitations were performed on a single hand for the BVDB, the experiments were conducted twice for the SEDB, once on each forearm (left and right). Therefore, with the inclusion of the baseline temperature *T*0, the BVDB consists of a total of 87 × 20 × 5 = 8700 samples, whereas the SEDB consists of a total of 40 × 30 × 4 × 2 = 9600 samples. During the experiments, the demeanour of each participant was recorded using several modalities, consisting of video and bio-physiological channels for the BVDB, and audio, video and bio-physiological channels for the SEDB. The current work focuses uniquely on the video modality; the reader is referred to the work in [10,14–16,33,59–64] for further experiments including the other recorded modalities.
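The per-database sample counts stated above follow from a quick back-of-the-envelope computation:

```python
# Sample counts as stated above: subjects × repetitions per level × stimulation
# levels (including the baseline T0); the SEDB additionally covers both forearms.
bvdb_samples = 87 * 20 * 5
sedb_samples = 40 * 30 * 4 * 2

print(bvdb_samples, sedb_samples)  # 8700 9600
```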

#### *4.2. Data Preprocessing*

The evaluation performed in the current work is undertaken in both cases (BVDB and SEDB) on video recordings captured by a frontal camera. The recordings were performed at a frame rate of 25 frames per second (fps) for the BVDB and 30 fps for the SEDB. Furthermore, the evaluation is performed uniquely on windows of length 4.5 s with a shift of 4 s from the elicitation's onset, as proposed in [16] (see Figure 5). Once these specific windows are extracted, the facial behaviour analysis toolkit OpenFace [65] is used for the automatic detection, alignment and normalisation of the facial area (with a fixed size of 100 × 100 pixels) in each video frame. Subsequently, MHI sequences and OFI sequences are extracted using the OpenCV library [66]. Both MHIs and OFIs are generated relative to a reference frame, which in this case is the very first frame of each video sequence. Concerning MHIs, the temporal extent parameter *τ* (see Equation (1)) was set to the length of the sequence of images (25 × 4.5 ≈ 113 frames for the BVDB and 30 × 4.5 = 135 frames for the SEDB). Furthermore, the threshold parameter *ξ* (see Equation (2)) was set to 1, so as to capture even the slightest motion (in this case, the fluctuation of pixel intensities between the reference frame and the *t*-th frame). Finally, to reduce the computational requirements, the number of samples in each sequence is reduced by sequentially selecting every second frame of an entire sequence for the BVDB (resulting in sequences with a total length of 57 frames), and every third frame of an entire sequence for the SEDB (resulting in sequences of length 45 frames). The dimensionality of the tensor input specific to the BVDB is, respectively, (*bs*, 57, 100, 100, 3) for OFI sequences and (*bs*, 57, 100, 100, 1) for MHI sequences (*bs* representing the batch size). The dimensionality of the tensor input specific to the SEDB is, respectively, (*bs*, 45, 100, 100, 3) for OFI sequences and (*bs*, 45, 100, 100, 1) for MHI sequences.
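As a sketch of the MHI extraction step described above (a minimal NumPy reconstruction, not the authors' exact OpenCV pipeline), the following routine computes a motion history sequence relative to the first (reference) frame, with temporal extent *τ* and threshold *ξ* = 1:

```python
import numpy as np

def motion_history_sequence(frames, tau, xi=1):
    """Compute an MHI sequence relative to the first (reference) frame.

    frames: array of shape (T, H, W), grayscale intensities.
    tau: temporal extent (Eq. (1)); xi: motion threshold (Eq. (2)).
    Pixels whose intensity differs from the reference by at least xi are
    set to tau; all other pixels decay by 1 per frame (floored at 0).
    Returns an array of shape (T - 1, H, W).
    """
    ref = frames[0].astype(np.int32)
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    seq = []
    for frame in frames[1:]:
        moving = np.abs(frame.astype(np.int32) - ref) >= xi
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        seq.append(mhi)
    return np.stack(seq)
```

For the BVDB, the subsampling mentioned above (every second frame, yielding 57-frame sequences) could be applied either to the input frames (`frames[::2]`) or, equivalently in spirit, to the resulting MHI sequence.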

**Figure 5.** Video Signal Segmentation (BioVid Heat Pain Database (Part A)). Experiments are carried out on windows of length 4.5 s with a temporal shift of 4 s from the elicitations' onsets.

#### *4.3. Experimental Settings*

The evaluation performed in the current work consists of the discrimination between high and low stimulation levels. Therefore, two binary classification tasks are performed for each database: *T*0 vs. *T*4 and *T*1 vs. *T*4 for the BVDB, and *T*0 vs. *T*3 and *T*1 vs. *T*3 for the SEDB. Furthermore, the assessment of the proposed approach is conducted by applying a *Leave-One-Subject-Out* (LOSO) cross-validation evaluation: a total of 87 experiments were conducted for the BVDB and 40 for the SEDB, during each of which the data specific to one participant is used to evaluate the performance of the classification architecture optimised on the data of the remaining participants (86 for the BVDB, 39 for the SEDB).
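The LOSO protocol can be sketched with a minimal NumPy implementation (`subject_ids` is a hypothetical per-sample subject label array; scikit-learn's `LeaveOneGroupOut` provides the same behaviour):

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield one (train_idx, test_idx) fold per subject: the held-out
    subject's samples form the test set, all remaining samples the
    training set."""
    ids = np.asarray(subject_ids)
    for subject in np.unique(ids):
        yield np.flatnonzero(ids != subject), np.flatnonzero(ids == subject)
```

Applied to the BVDB this yields 87 folds (one per participant), and 40 folds for the SEDB.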

The feature embedding CNN used for the evaluation is adapted from the *VGG16* model proposed by the Visual Geometry Group of the University of Oxford [67]. The depth of the CNN is substantially reduced to a total of 10 convolutional layers (instead of 13 as in the *VGG16* model), and the number of convolutional filters is gradually increased from one convolutional block to the next, starting from 8 filters up to a maximum of 64 filters. The activation function in each convolutional layer is the *elu* activation function (instead of the rectified linear unit (*relu*) activation function used in the *VGG16* model). Max-pooling and Batch Normalisation [68] are performed after each convolutional block. A detailed description of the feature embedding CNN architecture is given in Table 1. The coupled BiLSTM layer consists of two LSTM RNNs with 64 units each. The resulting sequence of spatio-temporal features is further fed into the attention layer in order to generate a single aggregated representation of the input sequence. The classification is then performed based on this representation; the architecture of the classification model is described in Table 2. The exact same architecture is used for the two input sequences (MHIs and OFIs). The outputs of the classifiers are further aggregated based on both Equation (8) and Equation (9). The whole architecture is subsequently trained in an end-to-end manner, using the Adaptive Moment Estimation (Adam) [69] optimisation algorithm with a fixed learning rate set empirically to 10<sup>−5</sup>. The categorical cross-entropy loss function is used for each network ($\mathcal{L}_{mhi} = \mathcal{L}_{ofi} = \mathcal{L}_{agg} = \mathcal{L}$), and is defined as follows,

$$\mathcal{L} = -\sum_{j=1}^{c} y_j \log(\hat{y}_j) \tag{21}$$

where $\hat{y}_j$ represents the classifier's output, $y_j$ is the ground-truth label value and $c \in \mathbb{N}$ is the number of classes for a specific classification task.
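Equation (21) can be illustrated numerically; for a binary task ($c = 2$), a confident correct prediction incurs only a small loss:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (21): L = -sum_j y_j * log(y_hat_j).

    y_pred is clipped away from 0 for numerical stability.
    """
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(y_true, dtype=float) * np.log(y_pred))

print(round(categorical_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
```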


**Table 1.** Feature embedding CNN architecture.

The size of the kernels is identical for all convolutional layers and is set to 3 × 3, with the convolutional stride set to 1 × 1. Max-pooling is performed after each block of convolutional layers over a 2 × 2 window, with a 2 × 2 stride.
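The effect of these settings on the feature map size can be traced with a short sketch: 3 × 3 convolutions with stride 1 × 1 and `same` padding preserve the spatial dimensions, while each 2 × 2 max-pooling with stride 2 × 2 halves them. The per-block filter counts used here (doubling from 8 to 64, as implied by the description above) are an assumption for illustration:

```python
def trace_embedding_shape(h=100, w=100, block_filters=(8, 16, 32, 64)):
    """Trace spatial dimensions through hypothetical convolutional blocks:
    3x3 'same' convolutions keep H and W unchanged; each 2x2/2 max-pool
    halves them (integer division). Returns the final feature-map shape."""
    for _filters in block_filters:
        h, w = h // 2, w // 2
    return h, w, block_filters[-1]

print(trace_embedding_shape())  # (6, 6, 64)
```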

The regularisation parameters of the loss function in Equation (10) are set as follows: $\lambda_{mhi} = \lambda_{ofi} = 0.2$ and $\lambda_{agg} = 0.6$. The value of the regularisation parameter specific to the aggregation layer's loss is set higher than the others in order to enable the architecture to compute robust aggregation weights. The whole architecture is trained for a total of 20 epochs, with the batch size set to 40 for the BVDB and 60 for the SEDB. The implementation and evaluation of the whole architecture are carried out with the Python libraries Keras [70], Tensorflow [71] and Scikit-learn [72].
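Under the regularisation weights above, the overall objective of Equation (10) can be sketched as a weighted sum of the three network losses (a minimal reconstruction, assuming Equation (10) takes this convex-combination form):

```python
def total_loss(l_mhi, l_ofi, l_agg, lam_mhi=0.2, lam_ofi=0.2, lam_agg=0.6):
    """Weighted sum of the MHI, OFI and aggregation losses; the weights
    sum to 1, so the aggregation branch dominates the gradient signal."""
    return lam_mhi * l_mhi + lam_ofi * l_ofi + lam_agg * l_agg

print(total_loss(0.5, 0.5, 0.5))  # 0.5
```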


**Table 2.** Classifier Architecture.

The dropout rate is empirically set to 0.25. The first fully connected layer uses the *elu* activation function, while the last fully connected layer consists of a *softmax* layer (where *c* denotes the number of classes of the classification task).
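The final *softmax* layer mentioned above maps the *c* logits to a probability distribution over the classes; a standard numerically stable sketch:

```python
import numpy as np

def softmax(logits):
    """Stable softmax: subtracting the maximum logit leaves the result
    unchanged but avoids overflow in the exponential."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax([2.0, 0.0]).round(4))  # [0.8808 0.1192]
```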
