2.1. Analysis of Emotions in Facial Images
There are eight baseline emotions: neutral, contempt, happiness, fear, surprise, anger, disgust, and sadness, as shown in
Figure 1. When the lower part of the face (nose, mouth, cheeks) is missing, “Surprise” may still be differentiated from the other emotions according to the shape and pattern of the eyes and eyebrows. However, “Anger” and “Disgust” are difficult to differentiate from one another: the upper parts of the face are similar, and without the lower part of the face, these two emotions can be misclassified. More importantly, distinguishing “Happiness”, “Fear”, and “Sadness” from one another is particularly challenging because these three emotions crucially rely on the lower facial regions.
As shown in
Figure 1b, each emotion has different characteristics according to the connections between the action units (AUs) of the facial image. When the lower part is not occluded, all of the facial features and the relationships between each and every landmark are clearly visible. However, when the lower part of the face is completely covered by a facial mask, only the upper part of the face is available for analysis. The challenge is that sadness, contempt, anger, disgust, and fear all appear very similar in the upper part of the face and are very hard to distinguish from one another.
A number of the existing approaches use a combination of specific AUs that include the upper and lower parts of the face to classify the emotions. However, as shown in
Figure 1c, surprise, disgust, anger, and sadness are all proper subsets of fear, which is the reason why they can be misclassified. Moreover, contempt fully relies on the lower part of the face and will otherwise be misclassified as neutral. For this reason, the existing approaches that only rely on AUs are disrupted by the lack of information from the lower part of the face, which eventually leads to low accuracy. A comparison of the existing approaches is shown in
Table 1.
Existing emotion recognition methods, such as those in [13,14,16,17], that use 68-landmark detection require the entire image to be searched in order to find the face and then label the locations of 68 landmarks on the frontal face. Those landmarks and their relationships are then used for analysis. However, those methods mainly rely on relationships with the lower part of the face; their performance therefore degrades once the lower part is occluded, since only approximately 38% of the information remains available to them.
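For reference, a minimal sketch of such a 68-landmark pipeline, assuming the widely used Dlib detector and its standard pre-trained 68-point model (the model path shown is illustrative):

```python
import dlib

detector = dlib.get_frontal_face_detector()
# The standard pre-trained 68-point model; the path is illustrative and
# must point to a locally downloaded copy of the file.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_68(image):
    """Search the whole image for a face, then label 68 frontal landmarks."""
    faces = detector(image, 1)
    if not faces:
        return None
    shape = predictor(image, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```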
For those face-based methods, all of the pixels in the detected faces are used to learn and classify emotions, resulting in high complexity and long computational times. The problem with face-based methods is that all of the pixels are used, even though many of them are irrelevant to the process. Moreover, those unrelated pixels disrupt the machine learning process and consequently lead to low accuracy and high complexity.
As shown in
Figure 2, different emotional landmarks are detected separately. Because each CK+ participant performed fewer than the eight total emotions, the participant in
Figure 2 is shown presenting seven of the emotions, excluding contempt. The existing approaches that use the entire set of frontal facial landmarks to classify emotions using Equation (2) suffer from a number of problems. Because the number of landmarks is fixed, the same number of landmarks is produced for the different emotions of a single person. However, what makes each emotion distinct is not the number of landmarks but rather the location of, and the other parameters at, each landmark.
As shown in
Figure 3, a number of samples were randomly selected from CK+ to analyze the density and distribution of the landmarks in each sample. We observed that the histograms of distance for the landmarks across all of the emotions in Samples 1–6 are right-skewed distributions for both the lower and upper parts of the face. Samples 7–9, on the other hand, are not right-skewed; however, the distances between the landmarks in the lower parts of Samples 1–9 are still significantly higher than those in the upper parts. More importantly, the histograms of distance between the landmarks across all of the emotions in the CK+ samples depicted in
Figure 4a show that, between emotions, the distances between the landmarks in the lower part of the face are significantly higher than those in the upper part of the face. Additionally, the distances of all of the landmarks across all emotions were sorted and are shown in
Figure 4b. The figure shows that the distances between the landmarks for the lower part of the face are higher than those for the upper part.
Once all of the landmarks of a person’s emotions are combined, the lower part of the face demonstrates either a high density or a high distribution among the landmarks. To be more precise, by only using the landmarks from the lower part of the frontal face, emotions can be easily classified because the locations of the landmarks differ significantly between emotions. However, these characteristics of the lower-face landmarks do not exist in the upper part of the face, where the landmarks have a low density and distribution. This causes the locations of the upper facial landmarks to be similar across the landmarks labeled for different emotions, as seen in Equation (1), which determines the Euclidean distance. As such, with a fixed number of landmarks and low density and distribution, it is difficult to distinguish between emotions using the upper part of the face alone. Moreover, this scenario only compares different emotions from the same person, which means that the problem of distinguishing emotions from different people is even more challenging.
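To make this comparison concrete, here is a minimal sketch of the per-landmark Euclidean distance described around Equation (1); the exact equation is not reproduced in this excerpt, so the standard form is assumed:

```python
import numpy as np

def landmark_distances(landmarks_a, landmarks_b):
    """Per-landmark Euclidean distance (assumed form of Equation (1)):
    compare same-labeled landmark sets from two emotions.
    Inputs are (68, 2) arrays of (x, y) coordinates."""
    a = np.asarray(landmarks_a, dtype=float)
    b = np.asarray(landmarks_b, dtype=float)
    # Small distances (as in the upper face) make emotions hard to separate.
    return np.linalg.norm(a - b, axis=1)
```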
As shown in Figure 5, in our scenario, the entire lower part of the face is completely occluded by a facial mask; this is represented by Equation (4). More importantly, in the lower part of the face, where the landmarks have a higher density and distribution, as represented by Equation (6), 42 of the original 68 landmarks in Equation (4) have completely vanished, leaving only 26 landmarks, or 38%. As depicted in the combined upper facial landmarks for all of the emotions, the landmarks on the eyebrows and nose, as per Equation (3), are all close to one another across all emotions. This means that the distances between the same labeled landmarks are significantly smaller for both the x and y coordinates, as shown in Equation (5). Only the 12 landmarks in the eye area, or 18% of the total traditional landmarks, differ slightly between emotions, which means that very little significant information is left for emotion recognition.
To eliminate the above-mentioned problems with traditional landmarks, in which a fixed number of landmarks produces insignificant characteristics, we propose an approach that can flexibly detect a number of significant upper facial landmarks without a fixed number or location of landmarks, as shown in
Figure 6. As per Equation (7), our detected landmarks do not have to be located around the eyebrows, eyes, or nose. Our process does not require all of the pixels from the upper face; instead, the upper face is simply the region that our generated paths travel through. Without a fixed number of landmarks, the detected number of landmarks representing the different emotions of a single person, or even of multiple people, can be used as a significant feature in the emotion recognition process. More importantly, rather than focusing on the information around a fixed landmark area, large amounts of information from the areas of the flexibly detected landmarks can be studied.
Our approach is flexible, has low complexity as well as a simple implementation, and, most importantly, can achieve high accuracy. We aim to determine the important patterns across the upper part of the face that remain available when a mask is worn. We thus propose an idea in which a generated path detects the intensity pattern in a flexible way, reducing the complexity of the existing emotion recognition methods that use all of the pixels from both the upper and lower parts of the face. The generated path creates a shape that resembles an infinity symbol and needs to travel across the eye and eyebrow regions multiple times in slightly different neighborhoods over different periods. A single round is not enough to gather the essential information, so the path needs to be traversed multiple times. We propose a set of equations, Equations (8) and (9), and their properties in order to achieve this. A small set of points generated from the equations runs across the eye and eyebrow areas, detecting significant amounts of information as well as patterns along the way; the generated points are then able to successfully recognize the emotions with a high degree of accuracy and low complexity.
where β is the number of circular-like regions produced by the equations, α defines the height of the shape, and n is the number of generated points used to create the shape.
As shown in Figure 7, Equations (8) and (9) are plotted separately; the blue and red graphs represent Equations (8) and (9), respectively. We observe that in order to complete a single period (T1) of Equation (8), the graph of Equation (9) takes two complete periods (T2) to cover the same length. In this scenario, once we combine Equations (8) and (9) together and define α = 2 and β = 2, we obtain a graph with an infinity shape, as shown in Figure 7. The infinity shape requires 50 points to complete a single period (T1) of Equation (8), in which n = 50, and two periods (T2) of Equation (9), which means that the graph needs to travel through the x coordinate five times and through the y coordinate two times.
From the above-mentioned scenarios, our purpose is to generate multiple periods (rounds) in which the points of each round land on a different path from one another in order to create an ideal area covering a number of significant regions of the upper face in the next step. As such, the idea is to increase the number of periods so that the graph travels over multiple rounds. However, if the number of points (n) is increased in the same ratio as the number of periods (c), the result will look like a single graph. From our experiments, the ideal setting for our infinity shape requires the number of periods to be increased to c = 20, which means that the graph depicts 10 periods (rounds). Moreover, the number of points needs to be kept stable, in this case at n = 50. Hence, in each round, there are five points that land on a completely different path from the others.
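As an illustration, here is a minimal sketch of the point generator under the stated settings (c = 20, n = 50, α = β = 2). Since Equations (8) and (9) are not reproduced in this excerpt, a Lissajous-style parametrization with a 1:2 frequency ratio is assumed:

```python
import numpy as np

def infinity_points(n=50, c=20, alpha=2.0, beta=2):
    """Generate n points along a figure-eight path traversed over c/2 rounds.

    Assumed stand-in for Equations (8) and (9): x completes one period
    while y completes `beta` periods (two circular-like lobes); `alpha`
    scales the height of the shape.
    """
    # The endpoint spacing c*pi/(n-1) is incommensurate with 2*pi, so the
    # five points of each round land on slightly shifted positions.
    t = np.linspace(0.0, c * np.pi, n)
    x = np.sin(t)                    # assumed form of Equation (8)
    y = np.sin(beta * t) / alpha     # assumed form of Equation (9)
    return x, y

x, y = infinity_points()  # 50 points, 10 rounds, 5 points per round
```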
2.2. Proposed Method
When it comes to recognizing emotions without being able to see the lower part of the face, significant information can only be extracted from the upper part of the face. We focused on the eye and eyebrow regions since the upper part of the face receives the information spreading from the lower-face region. We observed that both of the eyes form a shape that looks similar to an infinity symbol (∞), as shown in
Figure 7. As such, even if we do not have the lower facial features available, with the right technique, the upper area of the face alone can provide enough information for emotion recognition.
The illustration of the proposed method is shown in
Figure 8, and our proposed feature extraction algorithm is shown in Algorithm 1. Each step is described in detail in the following sections.
Algorithm 1: Landmark Detection and Feature Extraction
Input: Upper facial image
Output: Landmark coordinates and HOG features of landmarks

t, x, y ← generated parameter values and point coordinates (Equations (8) and (9))
points ← (x, y)

for i = 0 to (length of t) − 3 do
  startPoint ← points[i]
  midPoint ← points[i+1]
  endPoint ← points[i+2]
  connectedLineIntensity[i] ← connect(startPoint, midPoint, endPoint)
end for

height ← height of peak
fwhm ← full width at half maximum

for each lineIntensity in connectedLineIntensity do
  (indices, properties) ← findPeaks(lineIntensity)
end for

for i = 0 to length of properties do
  if prominence of properties[i] ≥ height and width of properties[i] ≥ fwhm then
    property ← properties[i]
    landmarks[i] ← property[leftBase]
  end if
end for

for i = 0 to length of landmarks do
  landmarkHOG[i] ← histogramOfOrientedGradient(landmarks[i])
  features[i] ← (landmarks[i], landmarkHOG[i])
end for
2.2.1. Synthetic Masked Face and Upper Face Detection
In this step, we wanted to generate a facial image with a synthetic mask using the original facial image from the CK+ [
29] and RAF-DB [
30] data sets. The Dlib face detector [
31] was employed to detect the frontal face area from the entire image. We then obtained the different-sized facial regions according to the facial size of each image. After that, the facial mask was placed over the lower region of the detected face, including the nose, mouth, and cheeks, to generate a synthetically masked facial image that was as similar to a real-life scenario as possible, as shown in
Figure 9. Finally, only the upper part of the masked facial image was detected and used in the next steps. For the CK+ data set, the detected upper parts varied in dimension, with average dimensions of approximately 236 × 96 pixels. For the RAF-DB data set, the dimensions of the detected upper part were 100 × 46 pixels.
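A rough sketch of this step, assuming RGB input images and the Dlib frontal face detector; the rectangle painted over the lower half is a simplification of the mask graphic placed over the face in the paper:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def synthetic_mask_and_upper_crop(image):
    """Detect the face, paint a synthetic 'mask' over its lower half
    (nose, mouth, cheeks), and return the upper-face crop."""
    faces = detector(image, 1)
    if not faces:
        return None
    f = faces[0]
    top = max(f.top(), 0); bottom = min(f.bottom(), image.shape[0])
    left = max(f.left(), 0); right = min(f.right(), image.shape[1])
    mid = top + (bottom - top) // 2
    masked = image.copy()
    masked[mid:bottom, left:right] = (255, 255, 255)  # stand-in mask
    return masked[top:mid, left:right]                # upper part only
```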
2.2.2. Generated Points Creating Infinity Shape
We wanted to develop a rapid facial landmark detector that is free from issues related to occluded lower facial features. From our observations, the occluded parts of the face (the mouth, cheek, and nose areas) spread naturally in relation to the unobstructed parts (the eye and eyebrow areas) when an emotion is expressed. We also wanted to develop a new landmark detector and extract facial feature vectors that represent the important relationships between those regions in order to classify emotions. Both of the eyes form a shape that resembles an infinity symbol (∞), so a combination of simple yet effective trigonometric equations, Equations (8) and (9), was adopted to fulfill our objectives. In this step, we wanted to make sure that the generated points covered the area of the eyes and eyebrows. The reason for this is that the important features occurring along the lines connecting neighboring points need to be discovered, instead of the entire pixel area being used to train and classify different emotions. By doing this, the computational complexity is significantly reduced.
For our infinity shape, the parameters need to be α = 2, β = 2, and c = 20, thus creating 10 graphing periods (rounds). Once β = 2, two circular regions are generated. Moreover, the number of points needs to be kept stable at n = 50 so that there are five points in each round landing on a completely different path from the others, as shown in
Figure 7.
2.2.3. Infinity Shape Normalization
The original generated set of points of the infinity shape differs in size from the upper face image. In this step, we need to normalize the size of the original infinity shape so that it is the appropriate size before moving it to a location where it essentially covers the upper face region, including the eyes and eyebrows.
As shown in Equations (10) and (11), in order to calculate the new coordinates of each and every point, we take the original coordinate value of either x or y and transform it into a different range or size according to each individual upper face area. In this case, the new range is simply the width or height of the upper face. By doing this, we obtain normalized x and y coordinates in which the size of the infinity shape is in the same range as the upper face area. This is also one of the reasons why our proposed method is very flexible: the original set of points is normalized to the size of each different upper face region independently.
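A minimal sketch of this normalization, assuming Equations (10) and (11) are the usual min-max rescaling onto the upper-face width and height:

```python
import numpy as np

def rescale(values, new_min, new_max):
    """Map coordinates onto a new range (assumed form of Eqs. (10)/(11))."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# e.g., fit the generated shape to an average CK+ upper-face crop:
# x_norm = rescale(x, 0, 236); y_norm = rescale(y, 0, 96)
```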
2.2.4. Initial Seed Point of Upper Face
At this point, our challenge was to find the initial seed point and to move the entire normalized infinity shape to its new position so that the connected lines would cover the important areas along the way, including the eyes and eyebrows. In searching for the initial seed point, we discovered that this point is where all of the forces created by the connected lines gather together. Because all of the vectors point in the same direction, the initial seed point accumulates a huge number of forces from the infinity shape.
Furthermore, when we considered vertical and horizontal projections, we discovered that they lead to the same conclusion regarding where the initial seed point is. The vertical projection is the summation of all of the intensity values in each column, while the horizontal projection is the summation of all of the intensity values in each row. We observed that the nose area is significantly brighter than the other areas in the vertical projection, while the eye and eyebrow areas have a lower intensity than the other areas in the horizontal projection.
We needed to find the coordinates of the seed point, as per Equations (12) and (13). This worked well for the x coordinate because the nose area is always the brightest area compared to the other areas. However, three different cases can be observed from the experiment in the horizontal scenario: when the eye area is darker than the eyebrow area, the minimum value of the horizontal projection lands on the eye area; it lands on the eyebrow area when the eyebrow area is darker than the eye area; and the last and best case happens when the eyes and eyebrows are blended into one region, and the minimum intensity value lands on the area between the eyes and the eyebrows. In order to overcome this problem, we smoothed the horizontal projection graph using the Savitzky–Golay filter [32], with the window size set to 31 and the order of the polynomial used during filtering set to 3. By smoothing the horizontal projection graph, the three cases are generalized into a single case in which Equations (12) and (13) can be applied; we can then successfully determine the coordinates of the seed point.
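A sketch of this seed-point estimate under the assumptions above, using SciPy's Savitzky–Golay filter with the stated window of 31 and polynomial order 3; Equations (12) and (13) are taken to be the arg-max/arg-min of the projections:

```python
import numpy as np
from scipy.signal import savgol_filter

def seed_point(upper_face):
    """upper_face: 2-D grayscale array. Returns (x_seed, y_seed)."""
    vertical = upper_face.sum(axis=0)    # per-column sums: nose is brightest
    horizontal = upper_face.sum(axis=1)  # per-row sums: eye band is darkest
    smoothed = savgol_filter(horizontal, window_length=31, polyorder=3)
    return int(np.argmax(vertical)), int(np.argmin(smoothed))
```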
2.2.5. Landmark Detection (Eye and Eyebrow Boundary Detection)
In this step, once the generated points have been normalized and moved to the correct seed point, they automatically cover the edges of the eyes and eyebrows, and the intensity values along all of the connected lines are then extracted so that the characteristics of each line can be analyzed, as shown in
Figure 10.
Once the intensity values had all been extracted and concatenated, we observed a certain pattern inside the graph. We determined that patterns with a high prominence value (L1) as well as a wide prominence width (L2) correspond to our original assumption perfectly: a high prominence value (L1) means that the intensity changes from low to high to low, or that the peak value is significantly high, which indicates that two connected lines of our infinity shape are travelling over a significant edge. A high prominence value alone is not enough, so the prominence width (L2) is also taken into consideration. A wide prominence width indicates that a large number of pixels travel across a significant edge.
We then focused on all of the three-neighborhood points and generated two subsequence graphs, one from point A to point B and one from point B to point C. After the graphs had been generated, a search process was adopted in order to search for the pattern. The detected and undetected landmark patterns are shown in
Figure 10. Once the pattern was detected from each connected graph, the index of point A was considered to be the point where the intensity changes from low to high. This means that A is the point on the edge, or the boundary, of the eye area. We then plotted all of the detected points onto the image, and they landed perfectly around the eyes and other significant areas, as shown in
Figure 10.
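This peak search maps naturally onto SciPy's find_peaks; a minimal sketch, with the prominence and width thresholds left as free parameters:

```python
from scipy.signal import find_peaks

def boundary_indices(line_intensity, min_prominence, min_width):
    """Return the left bases of peaks that are both prominent and wide:
    the positions where intensity starts rising, i.e., the eye/eyebrow
    boundary points (point A in the text)."""
    _, props = find_peaks(line_intensity,
                          prominence=min_prominence,
                          width=min_width)
    return props["left_bases"]
```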
2.2.6. Feature Extraction (Histograms of the Oriented Gradients)
As shown in
Figure 11, most of the detected upper face landmarks have coordinates that correspond to the boundaries of the eye and eyebrow areas. Besides the coordinates of each and every landmark, we believe that the relationships between the detected landmarks are also significant features for the classification process. The detected points are considered to be candidate landmarks, as there are still a few outliers. The purpose of this step is not only to perform the feature extraction process but also to eliminate those outliers. As such, we decided to determine the relationship between the candidate landmarks by applying Histograms of Oriented Gradients (HOG), as per Equations (14)–(17), to each and every upper face landmark in order to obtain the directions and the magnitude value of each direction, as shown in Algorithm 2. The data obtained from each landmark are the coordinates of the landmark followed by the 72 values of the direction and the magnitude of the 20 × 20 blob around the landmark. This means that a 74-value feature vector can be extracted from each landmark. By doing this, the effective information extracted from each landmark also contains information on the relationships that exist between neighborhood landmarks.
Algorithm 2: Histogram of Oriented Gradients (HOG)
Input: Upper face image, boundary coordinates, size of blob
Output: HOG feature of each boundary coordinate

image ← upper face image
x_landmark ← x coordinates of boundary
y_landmark ← y coordinates of boundary
distance ← size_of_blob / 2

for i = 0 to length of boundary coordinates do
  if x_landmark[i] > distance then
    x_start_blob ← x_landmark[i] − distance
  else
    x_start_blob ← 0
  end if
  if y_landmark[i] > distance then
    y_start_blob ← y_landmark[i] − distance
  else
    y_start_blob ← 0
  end if
  x_end_blob ← x_start_blob + size_of_blob
  y_end_blob ← y_start_blob + size_of_blob
  blob ← image[x_start_blob : x_end_blob][y_start_blob : y_end_blob]
  hog_feature[i] ← HOG(blob)
end for
HOG is a quality descriptor that aims to generalize the features of an object such that the same object produces the same characteristic descriptor in different scenarios. HOG counts occurrences of local gradient directions, weighted by their magnitudes. It utilizes the gradient of an image, histograms of gradient directions at a number of locations, and the normalization of the local histograms, where the local histograms are localized to blocks. Even without explicit knowledge of the edge positions or the equivalent gradient, the key idea is that the distribution of local intensity gradients characterizes local object appearance and shape, as shown in
Figure 11.
The detected HOG features of each landmark are concatenated into a single vector. The combination of the coordinates of each landmark and the HOG feature vector is considered to be the representative data for each landmark. All of the landmark data from every image with different emotions are then labeled before entering the classification step.
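A sketch of the per-landmark descriptor using scikit-image's hog; the bin and cell settings are assumptions, chosen so that a 20 × 20 blob yields 72 values and thus the 74-value vector described above:

```python
import numpy as np
from skimage.feature import hog

def landmark_feature(upper_face, x, y, blob=20):
    """Concatenate the landmark coordinate with the HOG of its blob.
    upper_face is assumed to be a 2-D grayscale array."""
    h, w = upper_face.shape[:2]
    x0 = int(np.clip(x - blob // 2, 0, max(w - blob, 0)))
    y0 = int(np.clip(y - blob // 2, 0, max(h - blob, 0)))
    patch = upper_face[y0:y0 + blob, x0:x0 + blob]
    # 2 x 2 cells of 10 x 10 px, one 2 x 2 block, 18 bins -> 72 values
    desc = hog(patch, orientations=18, pixels_per_cell=(10, 10),
               cells_per_block=(2, 2), feature_vector=True)
    return np.concatenate(([x, y], desc))  # 74-value feature vector
```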
2.2.7. Classification
For the classification step, the features from each landmark are combined together and entered into the training process. As shown in
Figure 12b, we adopted CNN, LSTM, and ANN architectures to train on and classify the data. As each facial image has a number of landmarks, and each landmark has multiple features, the data are considered within one training window. To handle multiple training windows, Long Short-Term Memory (LSTM) is applied after the flattening layer of our architecture in order to determine the relationship between the HOG features and each detected landmark, since the LSTM has the ability to remember the knowledge obtained from the previous sliding window.
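A hypothetical Keras sketch of the LSTM branch of such a classifier; the layer sizes and the padded-sequence input format are assumptions, not the paper's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(max_landmarks=40, feature_dim=74, num_classes=8):
    """Sequences of 74-value landmark features -> one of 8 emotions."""
    inputs = tf.keras.Input(shape=(max_landmarks, feature_dim))
    x = layers.Masking(mask_value=0.0)(inputs)  # landmark counts vary
    x = layers.LSTM(64)(x)                      # remembers earlier landmarks
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_classifier()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```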
In
Figure 12a, the σ symbols are the gates of the LSTM components; they are neural networks that indicate the amount of information obtained from the predictors and the previous cell. These gates pass the information to the tanh functions to update the weights of the nodes. After that, the long-term memory (cell state) and the short-term memory (hidden state) are produced. To be more precise, the forget gate decides how much long-term information should be forgotten; the information that is kept becomes the input to the cell state. The input gate specifies how much new short-term information should be added; this information is then pushed through the tanh function for training. Finally, the output gate uses both the short-term and long-term information to determine how much information to transfer into the hidden state, as per Equation (18).
where $f_t$ is the forget gate, $C_{t-1}$ is the prior cell state, $i_t$ is the input gate, and $\tilde{C}_t$ is the intermediary cell state.
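For reference, assuming Equation (18) follows the standard LSTM cell-state update, it combines these terms as $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$; that is, the forget gate scales how much of the prior state survives, while the input gate scales how much of the intermediary (candidate) state is written.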
The forget gate in Equation (19) determines how far back to look in the long-term memory, where $t$ is the index of the current timestep, $t_0$ is the index of the first timestep of interest, and the forget gate $f_t$ controls the rate of information decay.
As shown in Equation (20), $r_t$ is the reset gate, which uses a sigmoid activation function. To be more precise, this means that the gate takes a value between 0 and 1. When the value is 1, the prior term is canceled out, and all prior information is removed. When its value is 0, the current term is canceled out, and the historical information is kept.