**3. Methodology**

In this work, a comparison of feature extraction techniques and classification algorithms is presented to find the best combination for emotion intensity recognition. Figure 2 shows an overview of the experiment in the form of a generalized architecture, in which the training and testing layers are shown in detail. The first layer, called the training layer, has the following stages: image sequencing, pre-processing, feature extraction & selection, and model training.

The second layer, the testing layer, primarily has two stages. First, similar to the training stages, image sequencing, pre-processing, and feature extraction & selection were performed. Second, these features were passed through the trained model, and finally an emotion intensity decision was made based on the AUs. Before discussing the intricate details of our comparative study, it is important to note that it evaluates multiple techniques both at the intermediary step of feature extraction and at the final step of classification.

HOG and LBP both express features as histograms. HOG uses gradients to build spatial and orientation cells and assembles histograms of these gradients over overlapping spatial blocks, while LBP considers a neighborhood block and computes and normalizes histograms by converting the binary-threshold code to an integer. Gabor features, on the other hand, are extracted using Gabor filters, which use frequency patterns of regions of interest to extract features for segmentation and texture analysis. The Gabor filter uses functions that relate filter size, oscillation frequency/phase, and orientation. Although the Gabor filter is technically the closest to the human visual perception system, LBP is known to be computationally simpler and to work better under varying illumination. HOG, in turn, offers the advantage of configurable block sizes and numbers of histogram bins, unlike LBP.
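
To make the contrast concrete, the sketch below computes all three descriptor types on a sample grayscale image using scikit-image; the parameters shown are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import hog, local_binary_pattern
from skimage.filters import gabor

# Stand-in grayscale image; in our pipeline this would be a registered face
image = rgb2gray(data.astronaut())

# HOG: histograms of gradient orientations over overlapping spatial blocks
hog_vec = hog(image, orientations=9, pixels_per_cell=(16, 16),
              cells_per_block=(2, 2), block_norm='L2-Hys')

# LBP: 8-neighbor binary-threshold codes, then a normalized 59-bin histogram
lbp = local_binary_pattern(image, P=8, R=1, method='nri_uniform')
lbp_hist, _ = np.histogram(lbp, bins=np.arange(60), density=True)

# Gabor: magnitude of the response of one filter (one orientation/frequency)
real, imag = gabor(image, frequency=0.25, theta=np.pi / 4)
gabor_mag = np.sqrt(real**2 + imag**2).ravel()
```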

**Figure 2.** Generalized Architecture for Intensity of Emotion Recognition.

The ML techniques we used in this work include kNN, SVM, and RF. SVM is known for its generalization ability: it maps inputs non-linearly to higher-dimensional feature spaces, in which it separates the training data with a hyperplane. kNN, a type of instance-based learning, assigns a data point to a class by a vote among its k nearest neighbors, which are found using popular distance measures such as Euclidean or Hamming distance. RF is a collection of several decision trees and does not need linear features or even features that interact linearly. These three classification algorithms are known to perform well for high-dimensional spaces as well as large numbers of training samples. Each works well under specific circumstances: kNN for noisy data, SVM for linearly inseparable data, and RF for categorical features. For these reasons, we chose these methods. All three techniques have been widely used in the literature, as discussed in Section 2. This section further discusses the essential details of the experiment performed.
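
For concreteness, a minimal scikit-learn sketch of the three classifiers is shown below; the feature matrix `X`, labels `y`, and all hyperparameters are illustrative assumptions, not the settings of our experiments.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def compare_classifiers(X, y):
    """Train and score the three classifiers compared in this work."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {
        'kNN': KNeighborsClassifier(n_neighbors=5, metric='euclidean'),
        'SVM': SVC(kernel='rbf', C=1.0),   # non-linear mapping via the RBF kernel
        'RF':  RandomForestClassifier(n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, accuracy_score(y_te, model.predict(X_te)))
```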

#### *3.1. AU Intensity Feature Extraction and Correlation Analysis*

This section describes the observed AU intensity feature extraction model, which consists of facial image registration and representation, dimensionality reduction, feature extraction, and classification, as shown in Figure 3. In this paper, we capture and represent the semantic relations between AUs, as well as the correlation between AU intensities, in order to measure the intensities of facial emotions more robustly.


**Figure 3.** Relation between scale of evidence and intensities of facial action units.

Due to the variety, dynamics, and ambiguity of facial actions, it is challenging to measure the intensities of AUs in a single frame. Databases are mostly created with posed and spontaneous expressions, and measuring the intensity of spontaneous expressions is the greater challenge because they occur more randomly. AUs frequently occur in combinations, and these combinations are not always additive: the appearance of an AU in combination can differ from its standalone form. A clear example is shown in Figure 4, where AU12 occurs alone in Case A and the lip corners point straight or slightly upwards. In Case B, with AU15, the lip corners appear angled slightly towards the ground; in Case C the two co-occur non-additively, and hence recognizing the emotion and its intensity becomes more difficult.

**Figure 4.** AU Combination: (**A**) AU12 occurs alone; (**B**) AU15 occurs alone; (**C**) AU12 and AU15 occur together—non-additive.

The FACS manual gives insight into the inherent relationships between AUs, which can provide the information required for measuring and analyzing emotional intensity. The manual divides these inherent relationships into two classes: mutual exclusions and co-occurrences. The co-occurrence class consists of groups of AUs that most frequently appear together to give meaning to the depicted facial emotions. For example, AU6 + AU12 + AU25 suggest "happy", while AU4 + AU15 + AU17 depict "sad". For mutually exclusive AUs, the FACS manual provides alternative rules. Mutually exclusive AUs rarely occur together in spontaneous day-to-day emotions; FACS notes that it is difficult to demonstrate two AUs such as AU25 (lips apart) and AU24 (pressed lips) together at all. This suggests that mutually exclusive cases are possible, but with very low probability. The co-occurrence class still has a few limitations in terms of AU intensity levels. For example, when AU6 (raised cheeks) occurs with AU12 (lip corner puller), as shown in Figure 5, the intensity level of each AU tends to track that of the other.
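
As a toy illustration of these two classes, the snippet below encodes the two co-occurrence groups named above and one mutually exclusive pair as lookup rules; both tables are deliberately simplified examples, not a full FACS rule set.

```python
# Simplified co-occurrence groups from the text (not an exhaustive FACS table)
CO_OCCURRENCE = {
    frozenset({6, 12, 25}): 'happy',   # AU6 + AU12 + AU25
    frozenset({4, 15, 17}): 'sad',     # AU4 + AU15 + AU17
}
# One mutually exclusive pair: AU24 (pressed lips) vs. AU25 (lips apart)
MUTUALLY_EXCLUSIVE = [frozenset({24, 25})]

def interpret(active_aus):
    """Map a set of active AU numbers to an emotion label, if the rules match."""
    aus = frozenset(active_aus)
    for pair in MUTUALLY_EXCLUSIVE:
        if pair <= aus:
            return 'improbable combination'  # possible, but very low probability
    for group, emotion in CO_OCCURRENCE.items():
        if group <= aus:
            return emotion
    return 'unknown'
```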

**Figure 5.** AU Combination: (**Case A**) AU6 + AU12 + AU25, (**Case B**) AU4 + AU15 + AU17.

#### 3.1.1. Face Registration and Representation

This step aligns data of a similar kind, i.e., input facial images with reference facial images. Landmark points were used to mark the locations of important facial components such as the eyes, nose, and lips. To obtain the landmark points, an averaging solution was used, with the averaging done over the entire training data set. The images are finally masked to extract the important facial regions and re-sized to 128 × 108 pixels. After this step, three well-known algorithms, Histogram of Oriented Gradients, Gabor features, and LBP, were used for feature extraction, because they are highly capable of accurately representing appearance-based information.
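
A hedged sketch of this registration step is given below: it assumes eye-center landmarks are already available and applies a similarity transform (rotation, scaling, translation) with OpenCV before resizing to 128 × 108 pixels. The reference eye positions and the interpretation of 128 as the image height are assumptions for illustration.

```python
import cv2
import numpy as np

def register_face(image, left_eye, right_eye,
                  ref_left=(30, 40), ref_right=(78, 40),
                  out_size=(108, 128)):  # (width, height); assumed orientation
    """Align a face by mapping detected eye centers onto fixed reference points."""
    src = np.float32([left_eye, right_eye])
    dst = np.float32([ref_left, ref_right])
    # 4-DOF similarity transform (rotation + uniform scale + translation)
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(image, M, out_size)
```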

#### 3.1.2. Feature Extraction through Gabor Features

Gabor features are comparable to the human visual system because of their frequency and orientation representations. A 2D Gabor feature, in the spatial domain, is a Gaussian kernel function modulated by a sinusoidal plane wave. The filters can be generated from a single mother wavelet by rotation and scaling. They outperform other relevant image features such as edge orientation histograms and box filters. In our experiments, we extracted magnitudes on 96 × 96 images using eight wavelet orientations and nine scales, so that the Gabor wavelengths range from 2 to 32 pixels in half-octave intervals. Although the resulting feature vector has 9 × 8 × 96 × 96 = 663,552 components, not all of them are useful; in fact, in our experiment, only a very small number of informative components are selected. To perform Gabor analysis, the eye centers are located first, and the images are then aligned accordingly by translation, rotation, and scaling. This is a typical registration procedure for 2D images. Normalization is done using manually determined landmarks, to preclude any misalignment effects from the registration schemes.
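
A sketch of such a filter bank, assuming scikit-image's `gabor_kernel` and the parameters stated above (eight orientations, nine half-octave scales from 2 to 32 pixels, 96 × 96 input), follows.

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import gabor_kernel

def gabor_magnitudes(image):
    """image: 96 x 96 grayscale array; returns the 9 * 8 * 96 * 96 magnitude vector."""
    feats = []
    # Wavelengths 2 ... 32 pixels in half-octave steps: 2 * 2^(k/2), k = 0..8
    wavelengths = 2.0 * 2.0 ** (np.arange(9) / 2.0)
    for lam in wavelengths:
        for theta in np.arange(8) * np.pi / 8:      # eight orientations
            kernel = gabor_kernel(frequency=1.0 / lam, theta=theta)
            real = convolve(image, np.real(kernel), mode='wrap')
            imag = convolve(image, np.imag(kernel), mode='wrap')
            feats.append(np.sqrt(real**2 + imag**2).ravel())
    return np.concatenate(feats)                    # length 663,552
```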

#### 3.1.3. Local Binary Pattern Method

The LBP method is based on a texture descriptor that is useful for extracting features from any textured image. We used LBP to extract facial features for estimating the intensity of the emotion depicted in the image. When applied to an image, the LBP blocks are non-overlapping and uniform in size: initially, the image is segmented into a user-specified number of uniform blocks. For each patch of the image, the LBP compares the center pixel to its surrounding neighboring pixels to generate an *LBP* value. Equations (1) and (2) below are used to compute the *LBP*, where *N* represents the adjacent pixels, *k* indexes the neighbors, and *C* is the center pixel. For this research, we considered a neighborhood of size 8.

$$LBP(N, C) = \sum_{k=0}^{7} P(N_k - C)\,2^k \tag{1}$$

$$P(y) = \begin{cases} 1 & \text{for } y \ge 0 \\ 0 & \text{for } y < 0 \end{cases} \tag{2}$$

The function *LBP*(*N*, *C*) from Equation (1) uses *P*(*Nk* − *C*), defined in Equation (2), which generates a 1 or a 0 depending on the difference between the center pixel and the neighbor. Consider a 3 × 3 neighborhood of pixels with their intensity values: the differences from the center pixel are computed, and Equation (1) transforms this difference matrix into a bit string matrix (a sequence of 0s and 1s). An important detail of LBP is that the starting position for reading the bits may be chosen arbitrarily, but must be kept fixed; the bit string is then unwrapped and decoded into an integer.
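
A direct transcription of Equations (1) and (2) for a single 3 × 3 patch might look as follows; the circular reading order of the neighbors is one arbitrary but fixed choice.

```python
def lbp_value(patch3x3):
    """patch3x3: 3 x 3 array of pixel intensities; returns the LBP code (0-255)."""
    c = patch3x3[1][1]
    # Neighbors N_0 ... N_7 read in a fixed circular order around the center
    neighbors = [patch3x3[0][0], patch3x3[0][1], patch3x3[0][2],
                 patch3x3[1][2], patch3x3[2][2], patch3x3[2][1],
                 patch3x3[2][0], patch3x3[1][0]]
    p = lambda y: 1 if y >= 0 else 0                          # Equation (2)
    return sum(p(n - c) << k for k, n in enumerate(neighbors))  # Equation (1)
```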


The number of bit string patterns within a patch is counted to create a feature vector that is used in a distance measure. For an 8-bit string, there are a total of 256 possible bit strings. For simplification, each string is classified as either uniform or non-uniform. A string is uniform when its bits, parsed in a circular sequential manner, change value two or fewer times; it is non-uniform when they change more than twice. For example, consider the string 00011110: only two shifts occur, one between the third and fourth positions and one between the seventh and eighth positions. Of the 256 possible patterns, only 58 are uniform. For every patch of an image, a histogram composed of 59 bins is created. The 58 uniform patterns are assigned to 58 of these bins, each bin storing the frequency of its pattern, while the remaining (59th) bin accounts for all non-uniform patterns found in the patch. Finally, the histogram vectors of all patches are concatenated into a single histogram representing the features extracted by LBP.
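
The uniform/non-uniform bookkeeping described above can be sketched as follows; the code enumerates the 58 uniform 8-bit codes and bins everything else into the 59th bin.

```python
import numpy as np

def is_uniform(code, bits=8):
    """A code is uniform if its circular bit string changes value at most twice."""
    transitions = sum(((code >> k) & 1) != ((code >> ((k + 1) % bits)) & 1)
                      for k in range(bits))
    return transitions <= 2

UNIFORM_CODES = [c for c in range(256) if is_uniform(c)]   # exactly 58 codes
BIN_INDEX = {c: i for i, c in enumerate(UNIFORM_CODES)}

def patch_histogram(lbp_codes):
    """Build the 59-bin histogram for one patch of LBP codes."""
    hist = np.zeros(59)
    for code in np.asarray(lbp_codes).ravel():
        hist[BIN_INDEX.get(int(code), 58)] += 1            # 59th bin: non-uniform
    return hist
```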

#### 3.1.4. Histogram of Oriented Gradient Features

This method was initially used for human detection, was later adopted for general object detection, and is now used for analyzing and representing facial emotions. The HOG descriptor can quickly and efficiently describe the local shape and appearance of objects by counting the occurrences of gradient orientations in localized portions of the image. In this study, the images are divided into small cells, and the histogram of gradients is calculated for every single cell in order to represent the spatial information of the face image. For every image in our study, 48 cells are constructed by building each cell with 18 × 16 pixels. A horizontal gradient filter [−1 0 1] was applied, with 59 orientation bins. The final step is the concatenation of the HOG representations of all cells to form a HOG feature vector of size 2832 (48 × 59).
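
A hedged reconstruction of this configuration with scikit-image's `hog` is shown below; treating 128 as the image height (hence 16 × 18-pixel cells in row × column order) is an assumption, and scikit-image's internal gradient filter stands in for an explicit [−1 0 1] kernel.

```python
from skimage.feature import hog

def hog_features(image):
    """image: 128 x 108 grayscale array; returns a 48 * 59 = 2832-dim vector."""
    return hog(image,
               orientations=59,            # 59 orientation bins, as in the text
               pixels_per_cell=(16, 18),   # 8 x 6 = 48 cells over 128 x 108
               cells_per_block=(1, 1))     # one cell per block, no grouping
```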

#### 3.1.5. Dimensionality Reduction

High-dimensional image features make the analysis of samples more complicated in real-world applications of ML and pattern recognition algorithms. When extracting and selecting features, several of the extracted features are redundant and should be removed; for example, in ML, univariate feature selection is used to avoid training on redundant features. The literature reviewed above has shown that facial expressions and their intensities are embedded along a low-dimensional manifold in a high-dimensional space. In our study, we implemented nonlinear techniques that preserve local information, which is further useful in classifying the intensity of facial emotions and their representation. Manifold learning is a technique that presumes the sample data points are collected from a low-dimensional manifold embedded in a high-dimensional space. Quantitatively: given a set of points (3), find a set of points (4) such that *yi* represents *xi* efficiently.

$$\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^D \tag{3}$$

$$\mathbf{y}_1, \dots, \mathbf{y}_n \in \mathbb{R}^d \quad (d \ll D) \tag{4}$$

The Laplacian eigenmap algorithm was used in our study to reduce the dimensionality of the data: the high-dimensional data was mapped to a 29-dimensional space. The basic idea of the algorithm is to map points that are close in the high-dimensional space to points that are close in the low-dimensional space. To solve this, the generalized eigenvector problem is applied, and the first d eigenvectors, corresponding to the first d eigenvalues, are used to describe the embedded d-dimensional Euclidean space. A spectral regression algorithm was then used to find a projection function that maps the high-dimensional data, in our study the HOG, Gabor, and LBP features, into the low-dimensional space.
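
A minimal sketch of this reduction step is given below, using scikit-learn's `SpectralEmbedding`, which implements Laplacian eigenmaps; the neighborhood size is an illustrative choice, and the spectral-regression projection for unseen samples is not shown.

```python
from sklearn.manifold import SpectralEmbedding

def reduce_features(X):
    """X: (n_samples, D) matrix of concatenated HOG / Gabor / LBP features."""
    embedder = SpectralEmbedding(n_components=29,            # 29-dim space, as in the text
                                 affinity='nearest_neighbors',
                                 n_neighbors=10)             # illustrative choice
    return embedder.fit_transform(X)                         # (n_samples, 29)
```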
