Due to its genuineness and diversified use, research on micro expressions has gained momentum in recent years. The fields of computer vision and pattern recognition have attracted many researchers to this topic because of its sparse usage in commercial and psychological spaces.
Pattern recognition of micro expressions has mainly been analyzed on the basis of the six major emotions. Micro expression testing was first performed on the database presented by Polikovsky [4], the York Deception Detection Test (York DDT) [5], and USF-HD [6]. These datasets, being insufficient, were soon overtaken by SMIC [3], CASME II [7], CASME [7], and CAS(ME)2 [8]. The main reason the former did not gain popularity is that they were created by asking participants to mimic or pose emotions, which, as explained before, does not generate micro expressions. These were mainly artificial emotions rather than genuine ones; hence, no fruitful conclusions could be drawn from the former datasets. The York DDT contained very few expressions, which were clearly insufficient for research. The SAMM dataset, which stands for Spontaneous Actions and Micro-Movements, consisted of 32 participants from nearly 13 different cultures. This dataset focused on micro movement identification rather than emotion recognition.
2.2. Deep Learning Approaches
Deep learning-based approaches have recently gained attention in face forensics, particularly in the detection field. High-level representations of micro expressions can be extracted by Convolutional Neural Network (CNN)-based algorithms. Patel et al. [15] were the first to introduce a CNN model for facial micro expression detection. Due to the scarcity of usable datasets, the researchers used pre-trained ImageNet weights with the Visual Geometry Group (VGG) architecture. Mayya et al. [16] introduced another method, combining temporal interpolation with a deep CNN (DCNN) for recognition. The extracted features were then fed to a support vector machine (SVM) for classification; for faster performance, the Caffe [17] library was used for feature extraction along with a Graphics Processing Unit (GPU). Advantages of image classification using transfer learning with feedforward convolutional networks include the use of very deep structures [15,18,19] and the decoder functionality of autoencoders, which is derived from the feedforward mechanism. Further, several methods have been proposed to improve the discriminative ability of deep convolutions, such as VGG [15], Inception [19], and residual learning [18]. To avoid overfitting and to exploit regularization for convergence, techniques such as stochastic depth [20], batch normalization [21], and dropout [22] have been introduced. However, none of the above models could capture the critical micro-scale movements in micro expression datasets.
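The two most widely used of these regularizers can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanics only; all array shapes and parameter values here are illustrative, not taken from any of the cited models.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch, then scale and shift [21].
    mean, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout(x, p=0.5, rng=None):
    # Randomly zero activations during training; the inverted scaling
    # 1/(1-p) keeps the expected activation magnitude unchanged [22].
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

acts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # toy batch of activations
normed = batch_norm(acts)       # zero mean, unit variance per column
thinned = dropout(acts, p=0.5)  # roughly half the activations zeroed
```

In practice these operations sit between the convolutional layers of the networks discussed above and behave differently at training versus inference time; the sketch shows only the training-time computation.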
Deep learning-based approaches have thus gained traction in face forensics in the recent past. The first framework in the field of face detection was introduced by Viola and Jones [23]. Their framework detected faces in an image in real time using a machine learning approach. Since then, a large number of CNN-based face detection methods have been developed, including Normalized Pixel Difference (NPD) face detection [5]. Among them was the method proposed by Ranjan et al. [24], which used a selective search algorithm for face detection; however, it was not able to localize the actual face region well.
In recent times, region proposal networks [25,26,27,28,29] have been successfully adopted in object detection applications. In image classification, an additional region proposal stage [30] is added before the feedforward mechanism. The proposed regions contain useful information and are therefore used for feature learning in the subsequent stages. Unlike object detection, in which region proposals rely on ground-truth bounding boxes or detailed segmentation masks [31], unsupervised learning [32] is usually used to generate region proposals for image classification, since producing segmentation masks and bounding boxes is too costly to be justified for image classification tasks.
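The proposal-then-classify idea can be sketched as follows: each proposed region is cropped from the image and pooled to a fixed size, so that every proposal yields a feature map of equal shape for the downstream classifier. The image, boxes, and pooling size below are illustrative stand-ins, not values from the cited works.

```python
import numpy as np

def crop_and_pool(img, box, out=2):
    # Crop a proposed region (x0, y0, x1, y1) and average-pool it to a
    # fixed out x out grid, so every proposal gives equal-sized features.
    x0, y0, x1, y1 = box
    region = img[y0:y1, x0:x1]
    h, w = region.shape
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = region[i*h//out:(i+1)*h//out,
                                  j*w//out:(j+1)*w//out].mean()
    return pooled

img = np.arange(36.0).reshape(6, 6)       # toy single-channel image
proposals = [(0, 0, 4, 4), (2, 2, 6, 6)]  # stand-ins for unsupervised proposals
feats = [crop_and_pool(img, b) for b in proposals]
```

Each entry of `feats` can then be fed to the same classification head, which is what makes the fixed-size pooling step necessary.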
Peng et al. [33] proposed a dual temporal scale CNN for recognizing spontaneous micro expressions. This network operates in two streams, which process a micro expression dataset at multiple frame rates. Each stream contains an independent shallow network to reduce overfitting. The inputs are optical flow sequences, from which features are produced by the shallow networks. After learning, a linear SVM classifier is used to classify the output. The model has been shown to perform decently compared with the conventional naive SVM and LBP methods, but it suffers from the same lag in extracting critical micro-scale features, because of which its accuracy is not high enough.
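The two-stream design can be sketched as follows: each stream maps its own frame-rate view of the sequence to a feature vector, the two feature vectors are concatenated, and a linear decision function (standing in for the trained linear SVM) produces the class. All dimensions and weights below are illustrative, not those of the cited model.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_stream(x, w):
    # One shallow layer per stream: linear map followed by ReLU.
    return np.maximum(x @ w, 0.0)

# Illustrative dimensions: the two optical-flow inputs are sampled at
# different frame rates, hence they have different input sizes.
w_fast, w_slow = rng.normal(size=(40, 8)), rng.normal(size=(10, 8))
x_fast, x_slow = rng.normal(size=40), rng.normal(size=10)

features = np.concatenate([shallow_stream(x_fast, w_fast),
                           shallow_stream(x_slow, w_slow)])

# Linear decision, standing in for the learned linear SVM classifier.
w_clf, b = rng.normal(size=16), 0.0
label = int(features @ w_clf + b > 0)
```

Keeping each stream shallow is what limits overfitting on the small micro expression datasets, at the cost of the limited micro-scale feature extraction noted above.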
Kim et al. [34] proposed a model consisting of a CNN and long short-term memory (LSTM) to manage spatial and temporal information. Instead of using the full movement intensity, each expression stage is learned by the network in the spatial domain. The variation in expression classes, states, and state continuity makes the features resistant to variation in illumination. The LSTM learns the temporal characteristics of the CNN's spatial information and can extract temporal information from video datasets with distinct frame rates. The developed model obtained better accuracy than the older LBP techniques and their variant models; however, the imbalance in the dataset samples affected the confusion matrix results. Control gates have been used extensively in LSTM networks. During feedforward training, the control gates of neurons are updated using the helpful information, and the control gates directly influence this process [25,26]. Choi et al. [35] proposed an LFM-based CNN-LSTM hybrid method to recognize facial micro expressions from video frames. Landmark feature maps (LFMs) extract landmarks from all parts of the face and are then fed to the CNN-LSTM hybrid architecture to compute and classify facial micro expressions. Although the architecture is computationally strong enough to dig deep into the frames, its major drawback is that it focuses equally on all parts of the face instead of the parts that change with emotion from frame to frame.
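The CNN-LSTM pipeline can be sketched in NumPy: per-frame feature vectors (random stand-ins here for CNN outputs) are folded through a single LSTM cell, whose gates accumulate the temporal evolution of the expression across frames. The dimensions are illustrative, not those of the cited architectures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # One LSTM step: input (i), forget (f), and output (o) control gates
    # plus the candidate cell value (g), all from one stacked projection.
    z = W @ x + U @ h + b                     # shape (4H,)
    H = h.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g                         # gated cell-state update
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
D, H, T = 12, 6, 5                            # feature dim, hidden dim, frames
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)

h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    frame_feat = rng.normal(size=D)           # stand-in for per-frame CNN features
    h, c = lstm_step(frame_feat, h, c, W, U, b)
# h now summarizes the temporal evolution across the T frames
```

The final hidden state `h` is what a classification head would consume; the gates are the "control gates" referred to above, as they decide per step how much of the past cell state survives and how much of the new frame is written in.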
Recently, Yu et al. [36] introduced a deep cascaded peak-piloted network to learn and recognize weak expressions. Apex (peak) expressions were used to supervise the onset/offset non-peak expressions. The addition of backpropagation and a cascaded fine-tuning algorithm alleviated the overfitting problem and improved performance simultaneously. However, the authors tested macro expressions, which resulted in a best performance of approximately 90%.
Soft attention networks [37,38] have been developed in recent times [39,40]; their soft attention modules employ residual attention networks to build a feedforward neural network, and this approach has been adopted in the present work. The spatial transformer modules recently proposed by Jaderberg et al. [40] achieved strong results on almost all visual recognition tasks. An affine transformation is produced by a residual network that captures the useful information available in the encoder section. The input image patch is then processed with this affine transformation to determine the attended region, which is fed to the residual network for feature extraction.
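The core of the spatial transformer idea can be sketched as follows: an affine matrix warps a sampling grid over the input, and the attended region is read off by sampling the input at the warped coordinates. In the actual module the matrix is regressed by a localization network and differentiable bilinear sampling is used; here the matrix is hand-chosen and nearest-neighbour sampling stands in, purely for illustration.

```python
import numpy as np

def affine_grid(theta, H, W):
    # Map output pixel coordinates (normalized to [-1, 1]) through the
    # 2x3 affine matrix theta to source sampling coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    return (theta @ coords).T.reshape(H, W, 2)                   # (x, y) per pixel

def sample_nearest(img, grid):
    # Nearest-neighbour sampling (a stand-in for bilinear sampling).
    H, W = img.shape
    x = np.clip(np.round((grid[..., 0] + 1) * (W - 1) / 2), 0, W - 1).astype(int)
    y = np.clip(np.round((grid[..., 1] + 1) * (H - 1) / 2), 0, H - 1).astype(int)
    return img[y, x]

img = np.arange(16.0).reshape(4, 4)           # toy single-channel patch
identity = np.array([[1.0, 0.0, 0.0],         # hand-chosen theta; the real
                     [0.0, 1.0, 0.0]])        # module regresses this matrix
out = sample_nearest(img, affine_grid(identity, 4, 4))
# With the identity transform, the attended region is the input itself.
```

Scaling or translating `theta` instead of using the identity would zoom the sampling grid onto a sub-region of the face, which is precisely how the module attends to informative areas before feature extraction.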
This process is performed in an end-to-end residual attention framework that performs spatial transformations. This work is inspired by Wang et al. [41] regarding the design of soft attention networks with encoders and decoders as the pipeline for extracting top feature maps from both global and local information. Long et al. [42] used skip connections between the top and bottom features and reached state-of-the-art image segmentation results. Although this approach works satisfactorily, image classification does not require heavyweight structures that consume high computation power. Hence, rather than delving as deeply into local information as image segmentation does, this work focuses on both global and local information insofar as micro-scale facial features are concerned. The dataset consists of several videos, each only a few seconds long, i.e., a video is recorded when a specific expression is seen. This temporal information is considered for model training. Hence, the dataset is well refined, as micro expressions cannot easily be identified by simply cropping a video to the particular segment that contains the expression.