#### **1. Introduction**

An individual's affective disposition is often expressed through facial expressions. Human beings are therefore able to assess someone's current mood or state of mind by observing his or her facial demeanour, and an analysis of facial expressions can thus provide valuable insight into one's emotional and psychological state. Facial expression recognition (FER) has consequently attracted substantial interest from the research community in recent decades and constitutes a steadily growing area of research, particularly in the domains of machine learning and computer vision. The current work focuses on the analysis of facial expressions for the assessment and recognition of pain in video sequences. More specifically, a two-stream attention network is designed with the objective of combining both temporal and spatial aspects of facial expressions, based on sequences of motion history images [1] and optical flow images [2], to accurately discriminate between neutral, low and high levels of nociceptive pain. The current work is organised as follows. An overview of pain recognition approaches based on facial expressions is provided in Section 2, followed by a thorough description of the proposed approach in Section 3. In Section 4, the datasets used for the assessment of the proposed approach and the performed experiments are described, followed by a description of the corresponding results. The current work is concluded in Section 5 with a short discussion and an outline of potential future work.

#### **2. Related Work**

Recent advances in both domains of computer vision and machine learning, combined with the release of several datasets designed specifically for pain-related research (e.g., the *UNBC-McMaster Shoulder Pain Expression Archive Database* [3], *BioVid Heat Pain Database* [4], *Multimodal EmoPain Database* [5] and *SenseEmotion Database* [6]), have fostered the development of a multitude of automatic pain assessment and classification approaches. These methods range from unimodal approaches, characterised by the optimisation of an inference model based on a single, specific input signal (e.g., video sequences [7,8], audio signals [9,10] and bio-physiological signals [11–13]), to multimodal approaches, characterised by the optimisation of an information fusion architecture based on parameters stemming from a set of distinct input signals [14–16].

Regarding pain assessment based on facial expressions, several approaches have been proposed, ranging from conventional supervised learning techniques based on specific sets of handcrafted features to deep learning techniques. These approaches rely on an effective preprocessing of the input signal (which in this case consists of a set of images or video sequences), involving the localisation, alignment and normalisation of the facial area in each input frame. Further preprocessing steps include the localisation and extraction of several fiducial points characterising specific facial landmarks and, in some cases, the continuous extraction of facial Action Units (AUs) [17,18]. The preprocessed input signal, as well as the extracted parameters, is subsequently used to optimise a specific inference model based on different methods. In [19], the authors use an ensemble of linear Support Vector Machines (SVMs) [20] (each trained on a specific AU), whose inference scores are subsequently combined using Logistical Linear Regression (LLR) [21] for the detection of pain at the frame level. The authors in [22] apply a *k*-Nearest Neighbours (*k*-NN) [23] model to geometric features extracted from a specific set of facial landmarks for the recognition of AUs; the pain intensity in a particular frame is then evaluated from the detected AUs by using the pain evaluation scale proposed by Prkachin and Solomon [24]. More recently, the authors in [25] improved the performance of a pain detection system based on automatically detected AUs by applying a neural-network-based transfer learning approach that maps automated AU codings to a subspace of manual AU codings. The encoded AUs are subsequently used to perform pain classification using an Artificial Neural Network (ANN) [26].

In addition to AU-based pain assessment approaches, several techniques based on facial texture, shape, appearance and geometry, or on a combination of several such facial descriptors, have been proposed. Yang et al. [27] assess several appearance-based facial descriptors by comparing the pain classification performance of each feature with that of its spatio-temporal counterpart using SVMs. The assessed spatial descriptors consist of Local Binary Patterns (LBP) [28], Local Phase Quantization (LPQ) [29] and Binarized Statistical Image Features (BSIF) [30], as well as each descriptor's spatio-temporal counterpart extracted from video sequences on three orthogonal planes (LBP-TOP, LPQ-TOP and BSIF-TOP). In [8,31], the authors propose several sets of spatio-temporal facial action descriptors based on both appearance- and geometry-based features extracted from the facial area as well as the head pose. These descriptors are further used to classify several levels of pain intensity using a Random Forest (RF) [32] model. Similarly, the authors in [7,14,15,33] propose several spatio-temporal descriptors extracted either from the localised facial area or from the estimated head pose, including, among others, Pyramid Histograms of Oriented Gradients (PHOG) [34] and Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [35], to classify several levels of nociceptive pain. These classification experiments are likewise performed with RF models and ANNs.


Alongside handcrafted feature-based approaches, several techniques based on deep neural networks have also been proposed for the assessment of pain-induced facial expressions. Such approaches are characterised by the joint extraction of relevant descriptors (from the preprocessed raw input data) and optimisation of an inference model, based on neural networks, in an end-to-end manner. In [36–38], the authors propose a hybrid deep neural network pain detection architecture characterised by the combination of a feature embedding network consisting of a Convolutional Neural Network (CNN) [39] with a Long Short-Term Memory (LSTM) [40] Recurrent Neural Network (RNN), in order to take advantage of both spatial and temporal aspects of facial pain expressions in video sequences. Soar et al. [41] propose a similar approach by combining a CNN with a Bidirectional LSTM (BiLSTM) and using a Variable-State Latent Conditional Random Field (VSL-CRF) [42] instead of a conventional Multi-Layer Perceptron (MLP) to perform the classification. In [43], the authors also use a hybrid approach similar to [36,37]; however, in this case, the feature embedding CNN is coupled to two distinct LSTM networks. The outputs of the LSTM networks are further concatenated and an MLP is used to perform the classification of the pain intensities in video sequences. Furthermore, Zhou et al. [44] propose a Recurrent Convolutional Neural Network (RCNN) [45] architecture for the continuous estimation of pain intensity in video sequences at the frame level, whereas Wang et al. [46] propose a transfer learning approach consisting of the regularisation of a face verification network, which is subsequently applied to a pain intensity regression task.

The current work focuses on the analysis of facial expressions for the discrimination of neutral, low and high levels of nociceptive pain in video sequences. To this end, an end-to-end hybrid neural network is proposed, characterised by the integration of spatial and temporal information at both the representational level of the input data (OFIs and MHIs) and the structural level of the proposed architecture (hybrid CNN-BiLSTM). Furthermore, frame attention parameters [47] are integrated into the proposed architecture to generate an aggregated representation of the input data based on an estimation of the representativeness of each individual input frame in relation to the corresponding level of nociceptive pain. An extensive assessment of the proposed architecture is performed on both the *BioVid Heat Pain Database (Part A)* [4] and the *SenseEmotion Database* [6].

#### **3. Proposed Approach**

A video sequence can be characterised by both its spatial and temporal components. The spatial component represents the appearance (i.e., texture, shape and form) of each frame's content, whereas the temporal component represents the perceived motion across consecutive frames due to dynamic changes of the content's appearance through time. Most deep neural network approaches designed for the assessment of pain-related facial expressions generate spatio-temporal descriptors of the input data in two distinct and conjoint stages: a specific feature embedding neural network (commonly a pre-trained CNN) first extracts appearance-based descriptors from the individual input frames (greyscale or colour images), and a recurrent neural network is subsequently used to integrate the input's temporal aspect based on the sequences of previously extracted appearance features, thus generating spatio-temporal representations of video sequences that are used for classification or regression tasks. Therefore, both the temporal and spatial aspects of video sequences are integrated solely at the structural level (i.e., the architecture of the neural network) of such approaches. The current approach extends this technique by additionally integrating motion information at the representational level (i.e., the input data) of the architecture, through sequences of motion history images [1] and optical flow images [2].
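To illustrate the generic two-stage scheme described above (not the specific architecture proposed here, which is detailed in Section 3.3), the following is a minimal PyTorch sketch of a frame-wise CNN embedding followed by an LSTM over the resulting feature sequence; the backbone, feature dimension and class count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CnnRnnClassifier(nn.Module):
    """Generic two-stage video classifier: a CNN embeds each frame,
    an LSTM integrates the frame embeddings over time. The backbone
    `cnn` and all sizes are illustrative placeholders."""

    def __init__(self, cnn, feat_dim=512, hidden=128, classes=3):
        super().__init__()
        self.cnn = cnn                       # any frame-level feature extractor
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, classes)

    def forward(self, frames):               # frames: (batch, time, C, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))   # frame-wise embeddings
        feats = feats.view(b, t, -1)              # re-group into sequences
        out, _ = self.rnn(feats)                  # temporal integration
        return self.fc(out[:, -1])                # classify from last state
```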

#### *3.1. Motion History Image (MHI)*

Introduced by Bobick and Davis [48], an MHI consists of a scalar-valued image depicting both the location and direction of motion in a sequence of consecutive images, based on the changes of pixel intensities of each image through time. The intensity of a pixel in an MHI is a function of the temporal motion history at that specific point. An MHI $H_{\tau}$ is computed using an update function $\Psi(x, y, t)$ and is defined as follows,

$$H_{\tau}(x, y, t) = \begin{cases} \tau & \text{if } \Psi(x, y, t) = 1 \\ \max\left(0, H_{\tau}(x, y, t-1) - \delta\right) & \text{otherwise} \end{cases} \tag{1}$$

where $(x, y)$ represents the pixel's location, $t$ the time and $\tau$ the temporal extent of the observed motion (e.g., the length of a sequence of images); $\Psi(x, y, t) = 1$ indicates motion at the location $(x, y)$ at time $t$; and $\delta$ represents a decay parameter. The update function $\Psi(x, y, t)$ is defined as follows,

$$\Psi(x, y, t) = \begin{cases} 1 & \text{if } D(x, y, t) \ge \xi \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\xi$ is a threshold and $D(x, y, t)$ represents the absolute difference of pixel intensity values between consecutive frames, defined as follows,

$$D(x, y, t) = \left| I(x, y, t) - I(x, y, t \pm \Delta t) \right| \tag{3}$$

where $I(x, y, t)$ represents the pixel intensity at the location $(x, y)$ at time $t$, and $\Delta t$ represents the temporal distance between the frames.

Therefore, the computation of an MHI consists of first performing image differencing [49] between a specific preceding frame and the current $t$th frame, and detecting the pixel locations where a substantial amount of movement has occurred (depending on the value assigned to the threshold $\xi$) based on Equation (2). Subsequently, Equation (1) is used to assign pixel values to the MHI. If motion has been detected at the location $(x, y)$ of the $t$th frame, a pixel value of $\tau$ is assigned at that location. Otherwise, the previous pixel value at that location is reduced by $\delta$, thereby encoding how recently motion occurred at that specific location relative to the current time $t$. This process is conducted iteratively until the entire sequence of images has been processed, so that the temporal history of motion is encoded into the resulting MHI. A whole sequence of images can therefore be encoded into a single MHI. In the current work, however, a sequence of MHIs is generated from each sequence of images by saving the intermediate MHI produced at each step of the iterative process described above. The resulting sequence of images is used as input for the designed deep neural networks. A depiction of such a sequence of MHIs can be seen in Figure 1b, with the corresponding sequence of greyscale images depicted in Figure 1a.

(**a**) Greyscale input sequence.

(**b**) MHI input sequence.

(**c**) OFI input sequence.

**Figure 1.** Data preprocessing. Following the detection, alignment, normalisation and extraction of the facial area in each frame of a video sequence, the images are converted into greyscale. MHI and OFI sequences are subsequently generated.
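The iterative procedure described above can be sketched as follows in NumPy. The parameters `tau`, `delta` and `xi` mirror Equations (1)–(3); the default values, as well as the function and variable names, are illustrative assumptions rather than settings taken from this work.

```python
import numpy as np

def mhi_sequence(frames, tau=None, delta=1.0, xi=15):
    """Generate a sequence of motion history images (MHIs) from greyscale
    frames, following Equations (1)-(3).

    frames : list of 2-D uint8 arrays (aligned facial images).
    tau    : temporal extent; defaults to the sequence length.
    delta  : decay subtracted from stale pixels at each step.
    xi     : threshold on the absolute frame difference (assumed value).
    """
    tau = float(len(frames)) if tau is None else float(tau)
    h = np.zeros(frames[0].shape, dtype=np.float32)  # H_tau at t = 0
    mhis = []
    for t in range(1, len(frames)):
        # Equation (3): absolute difference between consecutive frames.
        d = np.abs(frames[t].astype(np.float32) - frames[t - 1].astype(np.float32))
        # Equation (2): update function Psi marks locations of motion.
        psi = d >= xi
        # Equation (1): set moving pixels to tau, decay the rest by delta.
        h = np.where(psi, tau, np.maximum(0.0, h - delta))
        # Keep every intermediate MHI, as done in the current work.
        mhis.append(h.copy())
    return mhis
```

Saving `h` at every iteration, rather than only after the final frame, is what turns a single MHI into the MHI sequence used as network input here.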

#### *3.2. Optical Flow Image (OFI)*

Optical flow refers to the apparent motion of visual features (e.g., corners, edges, textures and pixels) in a sequence of consecutive images. It is characterised by a set of vectors (optical flow vectors) defined either at each location $(x, y)$ of an entire image (dense optical flow [50,51]) or at specific locations of a predefined set of visual features (sparse optical flow [2,52]). The orientation of an optical flow vector depicts the direction of the apparent motion, whereas its magnitude depicts the velocity of the apparent motion of the corresponding visual feature across consecutive frames. Thus, an OFI provides a compact description of the location, direction and velocity of a specific motion occurring in consecutive frames. The estimation of the optical flow is based on the brightness constancy assumption, which stipulates that pixel intensities are constant between consecutive frames. If $I(x, y, t)$ is the pixel intensity at the location $(x, y)$ at time $t$, the brightness constancy assumption can be formulated as follows,

$$I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t) \tag{4}$$

where $(\delta x, \delta y, \delta t)$ represents a small motion. By applying a first-order Taylor expansion, $I(x + \delta x, y + \delta y, t + \delta t)$ can be written as follows,

$$I(x + \delta x, y + \delta y, t + \delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x} \delta x + \frac{\partial I}{\partial y} \delta y + \frac{\partial I}{\partial t} \delta t. \tag{5}$$

Thus,

$$\frac{\partial I}{\partial x} \delta x + \frac{\partial I}{\partial y} \delta y + \frac{\partial I}{\partial t} \delta t \approx 0 \tag{6}$$

and by dividing each term by *δt*, the optical flow constraint equation can be written as follows,

$$\frac{\partial I}{\partial x} \frac{dx}{dt} + \frac{\partial I}{\partial y} \frac{dy}{dt} + \frac{\partial I}{\partial t} \approx 0. \tag{7}$$

Resolving the optical flow constraint equation (Equation (7)) consists of estimating the two parameters $u = \frac{dx}{dt}$ and $v = \frac{dy}{dt}$. Several methods have been proposed to perform this specific task; the authors in [53,54] provide an overview of such approaches. In the current work, dense optical flow is computed using the method of Farnebäck [50] to generate OFIs from consecutive greyscale images. A depiction of such a sequence of images can be seen in Figure 1c (both motion direction and motion velocity are colour encoded).
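For illustration, the following sketch computes colour-encoded OFIs from consecutive greyscale frames with OpenCV's implementation of Farnebäck's method, using the common HSV encoding (hue for direction, value for velocity). The numeric parameters are typical defaults rather than values taken from this work.

```python
import cv2
import numpy as np

def ofi_sequence(frames):
    """Compute colour-encoded optical flow images (OFIs) from consecutive
    greyscale frames using Farneback's dense optical flow."""
    ofis = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense optical flow (Farneback); parameter values are common
        # defaults, not settings reported in this work.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            pyr_scale=0.5, levels=3,
                                            winsize=15, iterations=3,
                                            poly_n=5, poly_sigma=1.2, flags=0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros((*prev.shape, 3), dtype=np.uint8)
        hsv[..., 0] = ang * 180 / np.pi / 2   # direction -> hue
        hsv[..., 1] = 255                      # full saturation
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # velocity -> value
        ofis.append(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
    return ofis
```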

#### *3.3. Network Architecture*

As opposed to still images, both MHIs and OFIs integrate the motion component of a video sequence, thereby providing more valuable information for facial expression analysis. The proposed architecture therefore consists of a multi-view learning [55] neural network with both OFIs and MHIs as input channels. An overall illustration of the proposed two-stream neural network can be seen in Figure 2. In a nutshell, an attention network specific to each input channel (OFIs and MHIs) first generates a weighted representation from the $j$th input sequence ($h_j^{ofi}$ and $h_j^{mhi}$). The generated representation is subsequently fed into a channel-specific classification model (which in this case is an MLP). The resulting class probabilities of each channel ($\mathbf{score}_j^{ofi}$ and $\mathbf{score}_j^{mhi}$) are further fed into an aggregation layer with a linear output function, where a weighted aggregation of the provided scores is performed as follows,

$$\mathbf{score}_j = \alpha_{ofi} \cdot \mathbf{score}_j^{ofi} + \alpha_{mhi} \cdot \mathbf{score}_j^{mhi} \tag{8}$$

with the constraint

$$\alpha_{ofi} + \alpha_{mhi} = 1. \tag{9}$$

The entire architecture is trained in an end-to-end manner by using the following loss function,

$$\mathcal{L} = \lambda_{ofi} \cdot \mathcal{L}_{ofi} + \lambda_{mhi} \cdot \mathcal{L}_{mhi} + \lambda_{agg} \cdot \mathcal{L}_{agg} \tag{10}$$

where the loss functions of each input channel and of the aggregation layer are denoted by $\mathcal{L}_{ofi}$, $\mathcal{L}_{mhi}$ and $\mathcal{L}_{agg}$, respectively. The parameters $\lambda_{ofi}$, $\lambda_{mhi}$ and $\lambda_{agg}$ correspond to the regularisation parameters of the respective loss functions. Once the network has been trained, unseen samples are classified based on the output of the aggregation layer.
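A minimal PyTorch sketch of the score aggregation and composite loss of Equations (8)–(10) could look as follows. Here `ofi_net` and `mhi_net` are placeholders for the channel-specific attention networks with their MLP classifiers, the sigmoid parameterisation of $\alpha_{ofi}$ is one possible way of enforcing Equation (9), and cross-entropy over logits is an assumed choice for the individual losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAggregation(nn.Module):
    """Weighted score aggregation of Equations (8) and (9)."""

    def __init__(self, ofi_net, mhi_net):
        super().__init__()
        self.ofi_net = ofi_net
        self.mhi_net = mhi_net
        # A single logit; sigmoid yields alpha_ofi, and alpha_mhi is
        # 1 - alpha_ofi, so the constraint of Equation (9) holds by
        # construction (an assumed parameterisation).
        self.alpha_logit = nn.Parameter(torch.zeros(1))

    def forward(self, ofi_seq, mhi_seq):
        score_ofi = self.ofi_net(ofi_seq)   # channel scores: (batch, classes)
        score_mhi = self.mhi_net(mhi_seq)
        a = torch.sigmoid(self.alpha_logit)
        score = a * score_ofi + (1.0 - a) * score_mhi   # Equation (8)
        return score, score_ofi, score_mhi

def total_loss(score, score_ofi, score_mhi, target,
               lam_ofi=1.0, lam_mhi=1.0, lam_agg=1.0):
    # Equation (10): weighted sum of the channel and aggregation losses.
    # Cross-entropy on logits is an assumption; the channel outputs are
    # described as probabilities in the text, which would instead call
    # for a negative log-likelihood on the normalised scores.
    return (lam_ofi * F.cross_entropy(score_ofi, target)
            + lam_mhi * F.cross_entropy(score_mhi, target)
            + lam_agg * F.cross_entropy(score, target))
```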

**Figure 2.** Two-Stream Attention Network with Weighted Score Aggregation.

The attention network (see Figure 3) consists of a CNN coupled to a BiLSTM with a frame attention module [47]. The CNN consists of a time-distributed feature embedding network which takes a single facial image $im_{k,j}$ as input and generates a fixed-dimension feature representation $X_{k,j}$ specific to that image. Therefore, the output for the $j$th input sequence of facial images $\{im_{k,j}\}_{k=1}^{l}$ consists of a set of facial features $\{X_{k,j}\}_{k=1}^{l}$. The temporal component of the sequence of images is further integrated by using a BiLSTM layer. A BiLSTM [56] RNN is an extension of a regular LSTM [40] RNN that enables the use of context representations in both forward and backward directions.

It consists of two LSTM layers, one processing the input sequence $X_{1,j}, X_{2,j}, \ldots, X_{l,j}$ sequentially forward in time (from $X_{1,j}$ to $X_{l,j}$) and the second processing the input sequence sequentially backward in time (from $X_{l,j}$ to $X_{1,j}$). An LSTM RNN is capable of learning long-term dependencies in sequential data while avoiding the vanishing gradient problem of standard RNNs [57]. This is achieved through the use of cell states (see Figure 4), which regulate the amount of information flowing through an LSTM network by means of three principal gates: the forget gate ($f_t$), the input gate ($i_t$) and the output gate ($o_t$). The cell's output $h_t$ (at each time step $t$) is computed, given a specific input $x_t$, the previous hidden state $h_{t-1}$ and the previous cell state $C_{t-1}$, as follows,

$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{11}$$

$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{12}$$

$$\tilde{C}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \tag{13}$$

$$C_t = f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t \tag{14}$$

$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{15}$$

$$h_t = o_t \otimes \tanh(C_t) \tag{16}$$

where $\sigma$ represents the sigmoid activation function $\sigma(x) = (1 + \exp(-x))^{-1}$ and $\tanh$ represents the hyperbolic tangent activation function. The element-wise multiplication operator is represented by the symbol $\otimes$. The weight matrices for the input $x_t$ are represented by $W_i$, $W_f$, $W_o$ and $W_c$ for the input gate, forget gate, output gate and cell state, respectively. Analogously, the weight matrices for the previous hidden state $h_{t-1}$ are represented by $U_i$, $U_f$, $U_o$ and $U_c$. The amount of information to be further propagated through the network is controlled by the forget gate (Equation (11)), the input gate (Equation (12)) and the computed cell state candidate $\tilde{C}_t$ (Equation (13)). These parameters are subsequently used to update the cell state $C_t$ based on the previous cell state $C_{t-1}$ (Equation (14)). The output of the cell can subsequently be computed using Equations (15) and (16).

**Figure 4.** Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN).
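For illustration, a single cell update implementing Equations (11)–(16) can be written directly in NumPy; the dictionary-based parameter layout is an assumption made for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell update following Equations (11)-(16).
    W, U, b are dicts holding the weight matrices and biases for the
    forget gate 'f', input gate 'i', cell candidate 'c' and output gate 'o'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # Eq. (11)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # Eq. (12)
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (13)
    c_t = f_t * c_prev + i_t * c_tilde                          # Eq. (14)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # Eq. (15)
    h_t = o_t * np.tanh(c_t)                                    # Eq. (16)
    return h_t, c_t
```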

In the current work, the hidden representations stemming from the forward pass $\overrightarrow{h}_{1,j}, \overrightarrow{h}_{2,j}, \ldots, \overrightarrow{h}_{l,j}$ and from the backward pass $\overleftarrow{h}_{1,j}, \overleftarrow{h}_{2,j}, \ldots, \overleftarrow{h}_{l,j}$ are subsequently concatenated into $[\overrightarrow{h}_{1,j}, \overleftarrow{h}_{1,j}], [\overrightarrow{h}_{2,j}, \overleftarrow{h}_{2,j}], \ldots, [\overrightarrow{h}_{l,j}, \overleftarrow{h}_{l,j}]$ and fed into the next layer. For the sake of simplicity, the output of the BiLSTM layer will be denoted as $h_{1,j}, h_{2,j}, \ldots, h_{l,j}$ (with $h_{k,j} = [\overrightarrow{h}_{k,j}, \overleftarrow{h}_{k,j}]$). The next layer consists of an attention layer, where self-attention weights $\{a_k\}_{k=1}^{l}$ are computed and subsequently used to generate a single weighted representation of the input sequence. The self-attention weights are computed as follows,

$$\alpha_k = \operatorname{elu}\left(W_k h_{k,j} + b_k\right) \tag{17}$$

$$a_k = \frac{\exp(\alpha_k)}{\sum_{i=1}^{l} \exp(\alpha_i)} \tag{18}$$

where $W_k$ and $b_k$ are the weights and bias specific to the input feature representation $h_{k,j} = [\overrightarrow{h}_{k,j}, \overleftarrow{h}_{k,j}]$, and $\operatorname{elu}$ represents the exponential linear unit activation function [58], which is defined as

$$\operatorname{elu}_{\alpha}(x) = \begin{cases} \alpha \cdot (\exp(x) - 1) & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases} \tag{19}$$

with $\alpha = 1$. Each self-attention weight expresses the relevance of a specific image for the corresponding emotional state expressed within the video sequence. Thereby, relevant images should be assigned significantly higher weight values than irrelevant images. The final representation of the input sequence is subsequently computed by performing a weighted aggregation of the BiLSTM outputs $h_{1,j}, h_{2,j}, \ldots, h_{l,j}$ based on the computed self-attention weights $\{a_k\}_{k=1}^{l}$ as follows,

$$h_j = \sum_{k=1}^{l} a_k \cdot h_{k,j} \tag{20}$$

and is further used to perform the classification task.
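The attention pooling of Equations (17)–(20) can be sketched as a small PyTorch module; the module name and the use of a single linear scoring layer per time step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAttention(nn.Module):
    """Self-attention pooling over a sequence of BiLSTM outputs:
    a scalar score per frame (Eq. 17), softmax-normalised over the
    sequence (Eq. 18), then a weighted sum over time (Eq. 20)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # W_k and b_k of Equation (17)

    def forward(self, h):                # h: (batch, l, dim) BiLSTM outputs
        alpha = F.elu(self.score(h))            # Equation (17)
        a = torch.softmax(alpha, dim=1)         # Equation (18)
        return (a * h).sum(dim=1)               # Equation (20): h_j
```

Here `h` would be the sequence of concatenated forward and backward BiLSTM states, so `dim` corresponds to twice the hidden size of each directional LSTM.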
