1. Introduction
A micro-expression (ME) is an involuntary facial movement lasting less than one fifteenth of a second, which can betray “hidden” or secretive emotions that the expresser attempts to conceal [1]. ME analysis has become an effective means of detecting deception and has many potential applications in the security field. Unlike macro-expressions, MEs have been shown by physiological studies to occur over a remarkably brief duration with only minor muscle changes, which makes them very challenging to detect and recognize [2]. To date, most facial micro-expression research has focused on the six emotions typically seen in most cultures: happiness, surprise, anger, sadness, fear, and disgust [3,4,5,6].
Psychologists agree that understanding different types of facial expressions is essential for developing human emotional cognition and generating human–computer interaction models [7]. Though the majority of individuals can recognize many different kinds of emotions, research has been limited to the six basic emotional categories [8,9]. Individuals create an abundance of facial expressions throughout their daily lives as they express their inner states and psychological needs [10]. For example, people regularly produce a happily surprised expression, and observers have no problem distinguishing it from an angrily surprised one.
Martinez et al. [11] created a group of compound expressions in an effort to expand human emotion categories based on computer vision. For example, as shown in
Figure 1, “happily surprised” and “angrily surprised” are two typical complex emotions: they share the emotion “surprise” but differ sufficiently in their muscle activations. Martinez and his team defined 12 compound emotions most typically expressed by humans and generated a database called Compound Facial Expressions of Emotion (CFEE) [11]. Here, compound means that the emotion category is constructed as a combination of two basic emotion categories. Compound expressions are as important to human communication as basic ones. Individuals experience, and are acutely affected by, multiple emotions in their daily routines; therefore, the usage frequency of compound expressions may be much higher than the research community has previously believed. For example, someone who receives an unexpected gift will show a mixture of surprise and happiness corresponding to the complex facial expression “happily surprised”.
In general, most ME recognition focuses on the six basic emotions common across cultures [1]. Unlike ordinary (i.e., macro) expressions, the psychological state reflected by MEs is often the closest to a person's real emotion. Consequently, MEs should not only be associated with single basic emotional categories but should also expand the diversity of emotional classes, which calls for a ME database built on complex emotions. That is to say, “sadly angry” is a discrete category with a unique facial representation distinct from “angry” or “sad”. Compound MEs are generally composed of two basic MEs. Furthermore, each externalized emotion has a corresponding unique facial expression, composed of the same muscle movements in individuals of different cultures and backgrounds [11]. In this paper, we generated a compound ME database (CMED) on the basis of five basic ME databases [12,13,14,15,16], which are cross-culturally applicable. The CMED database will aid researchers in training and evaluating their algorithms.
The extant research on ME recognition has centered on feature extraction [
17,
18]. Existing feature extraction methods can be roughly divided into two categories: hand-crafted feature-based methods [
3,
4,
17] and deep-learning-based methods [
19,
20,
21]. The former have been widely used in ME recognition systems, especially the LBP family of feature extraction methods. The LBP family includes Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [
3], Local Binary Pattern with Six Intersection Points (LBP-SIP) [
17], Spatiotemporal Local Binary Pattern with Integral Projection (STLBP-IP) [
22], and Spatiotemporal Completed Local Quantization Patterns (STCLQP) [
23]. Researchers have also applied optical flow methods to the study of MEs [
5,
6]. Optical flow involves the use of a strain mode to process continuous and changing videos, ultimately resulting in segmentation and extraction of ME sequences [
24]. This method is robust to non-uniform illumination and large amounts of movement, but the optical flow estimation is strongly dependent on precise quantities of facial surface deformation [
25]. Most manual feature extraction methods, such as the LBP family and optical flow, are designed around the characteristics of ME images; as a result, they are highly task-specific, generalize poorly, and cannot adequately capture the very small muscle changes associated with MEs.
In recent years, the emergence of deep learning methods has brought new vitality to the field of computer vision. Deep learning has been applied in research on face recognition [
26], emotion recognition [
27], and semantic segmentation [
28] with remarkable results. There are two main methods for applying a Convolutional Neural Network (CNN) to macro-expression recognition: one constructs a complete recognition framework with the CNN [29], while the other uses the CNN only for feature extraction [30]. Deep learning may be a very valuable tool for resolving the problems that ME recognition poses for traditional methodologies. However, most deep learning algorithms require large amounts of training data to obtain representative features without over-fitting, and existing ME databases do not yet provide enough samples. To this effect, most CNN-based ME recognition methods follow the second approach: the CNN is used to extract deep ME features, as proposed by Patel et al. [31], who first applied a CNN model to “transfer” information from objects to facial expressions and then selected useful deep features for recognition. Kim et al. [32] also proposed a multi-objective spatial-temporal feature representation method for MEs which performs well on the macro-expression database MMI and the ME database CASME II.
An automatic ME recognition system operates in roughly three steps: (1) image preprocessing, which eliminates useless information from ME images and simplifies the data; (2) feature extraction, which extracts information useful for machine learning from the original image; and (3) facial expression classification, which recognizes emotional categories based on the extracted features. Each step plays an important role in the ME recognition process. Many researchers have established three-step ME recognition systems, although the steps may differ according to different requirements. Automatic ME recognition systems based on CNNs are still in the early stages of exploration.
This paper first proposes and synthesizes compound MEs based on computer vision methods and psychological studies. Moreover, a novel and robust feature extraction method for basic and compound MEs is presented. This algorithm not only effectively reveals very slight muscle contractions, but also extracts the high-level features of MEs. An overview of the recognition system is provided in
Figure 2. The main contributions of this study can be summarized as follows:
- (1)
The EVM algorithm is applied to the basic ME sequences to make the muscle movements more visible. In this way, we can clearly observe the elicitation mechanism of MEs and use the corresponding action units (AUs) to synthesize compound MEs.
- (2)
The apex frames, which contain the main information of ME sequences, are extracted by a 3D-FFT algorithm (see the sketch following this list). The optical flow feature map between the apex frame and the onset frame is used to express the motion characteristics.
- (3)
A shallow CNN is used to extract high-level features from the optical flow feature map and to classify the emotions.
- (4)
The proposed method was compared with several state-of-the-art methods to test the consistency and validity of the algorithm.
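As a rough illustration of contribution (2), below is a minimal sketch of 3D-FFT apex spotting under stated assumptions: a short window slides over the sequence, each window is transformed with a 3D Fourier transform, and the window with the most high-frequency energy is taken to contain the apex. The window length, the low-frequency cut-off, and the function name are illustrative, not the paper's exact settings.

```python
# Minimal 3D-FFT apex-spotting sketch (illustrative parameters).
import numpy as np

def spot_apex(frames: np.ndarray, win: int = 5, cut: int = 2) -> int:
    """frames: (T, H, W) grayscale ME sequence; returns an apex frame index."""
    scores = np.zeros(len(frames) - win)
    for t in range(len(scores)):
        spectrum = np.fft.fftshift(np.fft.fftn(frames[t:t + win]))  # 3D FFT over (time, H, W)
        c = np.array(spectrum.shape) // 2
        # Zero out the low-frequency centre so the score reflects rapid motion only.
        spectrum[c[0]-cut:c[0]+cut, c[1]-cut:c[1]+cut, c[2]-cut:c[2]+cut] = 0
        scores[t] = np.abs(spectrum).sum()
    return int(np.argmax(scores) + win // 2)  # centre of the highest-energy window
```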
The remainder of the paper is organized as follows:
Section 2 discusses related work, including the generation of the compound ME database, EVM technology, and the extraction of optical flow and deep features.
Section 3 discusses our experimental settings and results.
Section 4 contains a brief summary and conclusions.
3. Results and Discussion
3.1. Basic ME Databases
The development of automatic ME recognition systems depends largely on spontaneous and representative databases. In this study, we investigated five representative databases (CASME I, CASME II, CAS(ME)2, SMIC, and SAMM) in terms of frame rate, emotional categories, and resolution (Table 3). We also tested the CMED on these databases and introduce the CMED database generated in this paper in detail.
3.2. CMED
According to the spontaneous ME databases described above, we generated a novel compound ME database. The CMED is composed of four spontaneous ME databases: CASME I, CASME II, CAS(ME)2, and SAMM (the SMIC database lacks the corresponding AU coding, so we could not use it to generate compound expressions).
Firstly, we standardized the images by detection, alignment, and cropping [42], finally adjusting their resolution to 150 × 190. Then, the EVM algorithm was applied to magnify the muscle movements of these ME images for AU analysis and synthesis of compound MEs. However, the sample sizes and emotional distributions in these databases are very uneven, which makes recognition difficult. For example, in CASME II, 64 samples express the emotion of “disgust” while only 32 samples express “happiness”. To solve this problem, it is necessary to divide the traditional ME databases into new classes according to emotional state. For the basic MEs, we use three emotional classes for recognition: Negative (Neg), Positive (Pos), and Surprise (Sur). As shown in
Table 4, we divide compound emotions into four categories: Positively Surprised (PS), Negatively Surprised (NS), Positively Negative (PN), and Negatively Negative (NN). In summary, seven emotional classes in the CMED are supplied for recognition and data analysis.
In this paper, we experiment on two databases: the basic ME database (comprising CASME I, CASME II, CAS(ME)2, SMIC, and SAMM) and the CMED (generated from CASME I, CASME II, CAS(ME)2, and SAMM). The former has three classes (Neg, Pos, Sur) and the latter has seven (Neg, Pos, Sur, PS, NS, PN, NN).
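As a small illustration, the regrouping just described can be expressed as lookup tables; the raw label strings below are hypothetical examples, since each source database uses its own label vocabulary.

```python
# Hypothetical label names; each source database has its own vocabulary.
BASIC_CLASS = {
    "happiness": "Pos",
    "surprise": "Sur",
    "disgust": "Neg", "repression": "Neg",
    "anger": "Neg", "sadness": "Neg", "fear": "Neg",
}

# Compound classes are combinations of two basic classes, as in Table 4.
COMPOUND_CLASS = {
    frozenset(["Pos", "Sur"]): "PS",  # positively surprised
    frozenset(["Neg", "Sur"]): "NS",  # negatively surprised
    frozenset(["Pos", "Neg"]): "PN",  # positively negative
    frozenset(["Neg"]): "NN",         # negatively negative (two negative components)
}
```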
3.3. Experiment Settings
Firstly, we adjusted the resolution of basic and compound ME images to 150 × 190 as described above. Secondly, we set the parameters of the EVM method for enlarging the subtle motions. Either a Laplacian pyramid or a Gaussian pyramid can be used for spatial filtering; since our goal was to enlarge the muscle movements corresponding to MEs, we chose the Laplacian pyramid to construct basebands of different spatial frequencies. Similarly, temporal filtering can use different bandpass filters according to different requirements. We chose the Butterworth band-pass filter because time-frequency analysis of the amplified results is not required after motion magnification. The motion range of MEs is very subtle, so too small an amplification factor α does not reflect the contraction of the muscles. However, if α is too large, image distortion and camera noise may degrade the experimental accuracy. As shown in
Figure 9, we compared the effects of different amplification coefficients on the experimental accuracy for the basic ME databases and the CMED. In this experiment, we input only the enlarged ME images into the CNN; the results in Figure 9 indicate the appropriate magnification factor.
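A minimal sketch of the EVM configuration described above follows, assuming a grayscale sequence; the pyramid depth, the Butterworth pass band, the frame rate, and α are illustrative values, not the paper's exact implementation.

```python
# Minimal EVM sketch: Laplacian pyramid in space, first-order Butterworth
# band-pass in time, amplification by alpha. All parameter values here are
# illustrative assumptions.
import cv2
import numpy as np
from scipy.signal import butter, lfilter

def laplacian_pyramid(img: np.ndarray, levels: int) -> list:
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        pyr.append(cur - cv2.pyrUp(down, dstsize=cur.shape[1::-1]))
        cur = down
    pyr.append(cur)  # low-pass residual
    return pyr

def magnify(frames: list, alpha: float = 10.0, fps: float = 30.0,
            lo: float = 0.4, hi: float = 3.0, levels: int = 4) -> list:
    """frames: list of (H, W) grayscale frames sampled at fps."""
    pyramids = [laplacian_pyramid(f, levels) for f in frames]
    b, a = butter(1, [lo / (fps / 2), hi / (fps / 2)], btype="band")
    bands = []
    for lvl in range(levels + 1):
        stack = np.stack([p[lvl] for p in pyramids])       # (T, h, w)
        if lvl < levels:                                   # leave the residual unamplified
            stack = stack + alpha * lfilter(b, a, stack, axis=0)
        bands.append(stack)
    out = []
    for t in range(len(frames)):                           # collapse each frame's pyramid
        img = bands[levels][t]
        for lvl in range(levels - 1, -1, -1):
            img = cv2.pyrUp(img, dstsize=bands[lvl][t].shape[1::-1]) + bands[lvl][t]
        out.append(img)
    return out
```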
To demonstrate the validity and discriminability of the EVM algorithm, video magnification technologies based on the Lagrangian and Eulerian perspectives were compared in
Figure 10. The results show that the effect of Eulerian amplification is more pronounced. The muscle movement of a micro-expression is very subtle and hard to detect with the naked eye, yet the basis of synthesizing compound micro-expressions is revealing the facial muscle contraction of each emotion; this paper uses EVM to judge the emotion category and action units (AUs). Although Lagrangian technology can reduce the influence of noise and blur, its amplification of micro-motion is poor, which deviates from the original intention of compound micro-expression synthesis. Furthermore, an appropriate filter can also limit the amplification of blur and noise. Since the main requirement here is to enlarge the subtle changes in micro-expression sequences, we chose the EVM algorithm.
Next, we used the TV-L1 algorithm to extract optical flow features from the onset frame and the apex frame of the enlarged ME images, with the algorithm's parameters taken from the literature [41]. Among them, the weight λ and the pyramid scale parameter are the two most influential. λ balances the regularization term and the data term: the regularization term of the optical flow model hinders changes in the optical flow field to ensure a smooth flow field, so the larger λ is, the greater the influence of the optical flow constraint term, and even a small deviation from the optical flow constraint leads to functional divergence. Conversely, a smaller λ reduces the influence of the flow constraints. The pyramid scale parameter is used to create the pyramid of images; a small value detects only pixel-level movement, while a large value detects larger displacements.
Figure 11 shows the optical flow feature maps corresponding to different values of λ and the pyramid scale parameter for the emotion “disgust” in the CASME II database. When λ = 0.3, the influence of the optical flow constraint is large and the TV-L1 algorithm becomes very sensitive to noise, which destabilizes the optical flow map: the small displacements and noise produced by the camera make these unwanted movements particularly obvious. Conversely, the motion amplitude of MEs is so slight that an unsuitable pyramid scale setting cannot detect the muscle contractions, resulting in inaccurate motion representation on the optical flow map.
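For reference, a minimal sketch of this step using the TV-L1 implementation in opencv-contrib-python is given below. The parameter values and the three-channel packing of the flow map are illustrative assumptions, not the exact settings of [41].

```python
# TV-L1 flow between onset and apex frames (requires opencv-contrib-python).
import cv2
import numpy as np

def flow_feature_map(onset: np.ndarray, apex: np.ndarray) -> np.ndarray:
    """onset/apex: (H, W) uint8 grayscale frames; returns an (H, W, 3) map."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    tvl1.setLambda(0.15)     # data/regularization trade-off (illustrative value)
    tvl1.setScalesNumber(5)  # pyramid scales: more scales detect larger motion
    flow = tvl1.calc(onset, apex, None)                # (H, W, 2): u and v
    mag = np.linalg.norm(flow, axis=2, keepdims=True)  # flow magnitude
    # Pack u, v, and magnitude into one image-like input for the CNN.
    return np.concatenate([flow, mag], axis=2).astype(np.float32)
```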
For the CNN, we implemented our model on Anaconda3 with Python 3.6 and TensorFlow as the backend. The model was trained and tested on a 7th-generation Core i5 CPU and a GPU server with a GTX 950M (2 GB). Before training the network, we need to process the optical flow feature maps. Firstly, we resize the data to the network's input resolution. Then, we randomly shuffle the data and adopt a mini-batch training mechanism to improve convergence speed and test efficiency. Generally speaking, the larger the batch size, the more accurate the descent direction and the smaller the training oscillation; however, a large batch size easily makes the network fall into a local optimum. Hence, both networks were trained for 1500 epochs with a batch size of 10. Finally, we divided the data into training, validation, and test sets and converted them into the TFRecord format to save memory. The parameter settings for each layer of the network are shown in Table 5.
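The data preparation just described (TFRecord serialization, random shuffling, mini-batches of 10) could be sketched as follows; this is written against a recent TensorFlow (tf.io/tf.data), and API names differ slightly in older 1.x releases.

```python
import numpy as np
import tensorflow as tf

def write_tfrecord(maps: np.ndarray, labels: np.ndarray, path: str) -> None:
    """Serialize float32 flow maps and integer labels to a TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for x, y in zip(maps, labels):
            ex = tf.train.Example(features=tf.train.Features(feature={
                "map": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[x.tobytes()])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(y)])),
            }))
            writer.write(ex.SerializeToString())

def make_dataset(path: str, shape, batch_size: int = 10) -> tf.data.Dataset:
    def parse(record):
        feats = tf.io.parse_single_example(record, {
            "map": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
        x = tf.reshape(tf.io.decode_raw(feats["map"], tf.float32), shape)
        return x, feats["label"]
    return (tf.data.TFRecordDataset(path)
            .map(parse)
            .shuffle(1000)        # random shuffling, as described above
            .batch(batch_size))   # mini-batches of 10, as described above
```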
As shown in Table 5, a 1 × 1 convolution layer was added after the input layer to increase the non-linearity of the input and improve the expressive ability of the model. Small convolution kernels are used to enhance the network capacity and model complexity. Zero padding makes full use of edge information and avoids a sharp reduction of the input size. To avoid losing input responses, the pooling layers are set to small sizes. The initial learning rate should not be too large; in this paper it is set to 0.01 and is reduced by a factor of 10 every 10 epochs. At the same time, Batch Normalization (BN) is used to regularize the layer inputs by fixing the mean and variance of each layer, which not only speeds up convergence but also alleviates the problem of “gradient dispersion”. To reduce over-fitting, we added a dropout operation after the two FC layers with a ratio of 70%. In the softmax classification layer, we set up two classification tasks: basic MEs and compound MEs. In the basic ME recognition stage, the output is split into three classes; in the CMED stage, into seven classes. Finally, Leave-One-Subject-Out (LOSO) cross-validation is used in our evaluation to test the robustness of the proposed algorithm and to prevent subject-dependent bias during learning.
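A minimal tf.keras sketch consistent with the description above is given below. The exact layer sizes of Table 5 are not reproduced; the kernel counts, 3 × 3 kernel size, and FC widths are illustrative assumptions, and the dropout argument is read here as the fraction of units dropped.

```python
import tensorflow as tf

def build_model(num_classes: int, input_shape=(190, 150, 3)) -> tf.keras.Model:
    L = tf.keras.layers
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        L.Conv2D(16, 1, padding="same", activation="relu"),  # 1x1 conv after input
        L.Conv2D(32, 3, padding="same"),                     # small kernel, zero padding
        L.BatchNormalization(),                              # BN fixes mean and variance
        L.ReLU(),
        L.MaxPooling2D(2),                                   # small pooling size
        L.Conv2D(64, 3, padding="same"),
        L.BatchNormalization(),
        L.ReLU(),
        L.MaxPooling2D(2),
        L.Flatten(),
        L.Dense(256, activation="relu"),
        L.Dropout(0.7),                                      # dropout after FC layer
        L.Dense(128, activation="relu"),
        L.Dropout(0.7),
        L.Dense(num_classes, activation="softmax"),          # 3 or 7 classes
    ])

def lr_schedule(epoch: int, lr: float) -> float:
    # Initial rate 0.01, reduced by 10x every 10 epochs, as stated above.
    return 0.01 * (0.1 ** (epoch // 10))

model = build_model(num_classes=7)  # 7 for the CMED stage, 3 for basic MEs
model.compile(optimizer=tf.keras.optimizers.SGD(0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=1500,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```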
3.4. Analysis and Discussion
In this section, we analyze the performance of the proposed algorithm, including comparisons against other state-of-the-art algorithms, and also compare the accuracy of different epoch values, as shown in
Table 6. At the same time, the F1-measure was used to evaluate the performance of the proposed model. The results show that accuracy and F1-measure increased with the number of training rounds, but when the epoch value reached 1000, both indicators decreased. This is because the amount of data in the basic/compound ME databases is very small: the number of weight-update iterations grows with the epoch number, and the model enters an over-fitted state. The appropriate epoch count is related to the diversity of the dataset. The numbers of emotional categories and samples in the two ME databases discussed in this paper are both small, so we were careful not to over-train the network.
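One common guard against the over-training effect observed around epoch 1000 is early stopping on the validation split; a minimal sketch with tf.keras callbacks is given below, assuming the train_ds/val_ds datasets from the pipeline above (the paper itself does not describe using early stopping).

```python
import tensorflow as tf

# Stop when validation accuracy has not improved for 50 epochs (illustrative
# patience) and restore the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=50, restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=1500, callbacks=[early_stop])
```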
To demonstrate the effectiveness of the proposed algorithm, we input four different feature images (original image, magnified image, optical flow map, and the proposed method) into the CNN and obtained the histogram shown in
Figure 12. The experiment indicates that this algorithm not only enlarges the subtle muscle movements of MEs at the image level, but also extracts the optical flow features, providing a more accurate and robust image basis for subsequent deep feature extraction and recognition.
As shown in
Table 7, the proposed algorithm was tested on the CASME II, SMIC, and SAMM databases and compared with other state-of-the-art algorithms. The tested algorithms can be roughly divided into two categories: manual and deep-learning feature extraction methods. The deep learning algorithms outperform the manual extraction algorithms on all three databases; however, their accuracy degrades as network depth increases. This is due to the inherent attributes of ME databases.
The sample sizes and class counts of the databases we used are very small, and their sample distributions are very uneven, which prevents a deep network from extracting features as effectively as possible and readily leads to over-fitting. The proposed algorithm has some notable advantages in computational performance. Firstly, we did not directly use ME images to extract deep features, but rather fed the optical flow feature map generated from the enlarged onset and apex frames into the network, which avoids redundancy. The optical flow map strengthens muscle contraction recognition during ME movement and is more discriminative. We also used a shallow neural network to extract and recognize optical flow map features, which not only extracts higher-level features effectively but also alleviates over-fitting.
Table 8 summarizes the properties of all the neural networks mentioned in
Table 7, including depth, input image size, and execution time. The execution time is the testing time for a single subject and is related to the depth and complexity of a network. The network proposed in this paper is deeper and more complex than OFF-ApexNet and STSTNet, so it spends more time testing a sample. Nevertheless, because our network is designed around the characteristics of micro-expressions, it outperforms the traditional deep networks in both execution time and accuracy.
We calculated the confusion matrices on two databases, the basic ME database and CMED. As shown in
Figure 13, the recognition accuracy of negative emotions is higher than that of the other two emotional categories on the basic ME database, because negative emotions have the largest number of samples in the mixed dataset (Negative = 329, Positive = 134, Surprise = 113). On the other hand, there is a certain degree of misjudgment because surprise generally triggers the initial state of all other emotions.
Figure 13b shows the accuracy of the various emotional categories in the CMED. The emotional categories with larger sample sizes achieved higher recognition rates; the algorithm was particularly accurate on the “negative” and “negatively surprised” classes because of their large number of samples compared with other categories. Compound emotions depend on the AUs of basic emotions, so there is, to a certain extent, a subordinate relationship between them, and it is difficult for machines to distinguish compound MEs from their subsets. For example, “negative” is the dominant component of “negatively surprised”, so these two emotional states share information and AUs. The experimental results indicate that PN had a 10.27% misclassification rate on negative emotions and a 7.94% misclassification rate on surprise emotions; this is because these compound MEs are composed of the two basic emotions and share facial movement information. The compound MEs were misrecognized at a higher rate on their constituent basic MEs, but not on unrelated emotional states. From
Figure 12, we can also see that there are differences between compound MEs and basic MEs. The generation and recognition of the CMED broaden human understanding and recognition of emotion. Overall, the proposed method achieved satisfactory recognition performance on the basic ME databases and the CMED.
In the experiments, this paper adopts the protocol of training and testing within the same database. Although good recognition results were achieved, the experiments do not test cross-database performance; subsequent experiments will therefore focus on training on one dataset and testing on another, although the results of such a protocol may not be satisfactory. Although deep network technology improves the accuracy of micro-expression recognition, the current accuracy is not yet sufficient for future application requirements. We think there may be two reasons. Firstly, the accuracy in this paper is limited by the lack of sufficient data to tune the parameters of the deep model. Training and debugging a deep learning network requires a large number of samples, and the sample counts of our databases are too small. Although samples can be augmented, the characteristics of micro-expressions make it difficult for non-professionals to label the emotions, so new samples are often unsatisfactory. Secondly, using only the optical flow feature map to extract deep features does not fully conform to the characteristics of micro-expressions; the network input should be adjusted to the features of the data source. A multi-level input mode could be used to train the deep network, refining each level to meet the needs of micro-expression recognition.
4. Conclusions
The experimental results and analysis presented above suggest that compound MEs and their corresponding components (i.e., basic MEs) have a high degree of connectivity and unity. Each compound expression has a corresponding dominant emotional category (for instance, “positive” is the dominant category in the emotion “positively surprised”) and each basic ME is a constituent element of a compound emotion. We performed image processing on basic and compound MEs in this study: EVM technology magnifies subtle muscle contractions, while the optical flow features represent the emotional production process well. We used a shallow network to extract and classify the optical flow features and obtained a series of experimental results. We also expanded the emotional categories of MEs and used computer vision and deep learning algorithms to validate the proposed method.
In this study, mainstream databases of spontaneous MEs were investigated and a corresponding compound ME database was generated. However, this database was not created by inducing emotions with videos in a laboratory setting and recruiting professional psychological researchers to label the emotional categories; in the future, we hope to generate a database of spontaneous MEs based on complex emotions for further analysis. We also plan to improve the algorithm for practical micro-expression recognition applications. First of all, micro-expressions are mainly video data, so follow-up work will focus on extracting the temporal and spatial information of micro-expressions with deep networks; we will combine long short-term memory networks and convolutional networks to retain the characteristics of micro-expression sequences and verify them on the databases. Next, according to psychological research, micro-expressions have necessary morphological regions corresponding to each emotion. These regions are more discriminative in micro-expression recognition, so we aim to design a deep selection network to screen out the necessary morphological regions with discriminative ability and use them to recognize basic and compound micro-expressions.