**1. Introduction**

Facial expressions carry rich non-verbal information. Machines with the ability to understand facial expressions can better serve humans and fundamentally change the relationship between humans and machines. Therefore, automatic facial expression recognition has attracted attention from many fields, such as virtual reality [1,2], public security [3,4], and data-driven animation [5,6].

The effectiveness of facial expression recognition is easily affected by environmental changes, such as changes in lighting, angle, and distance. Among these, the change of illumination conditions under visible light (VIS) (380–750 nm) has the largest influence [7,8]. To overcome this influence, an active near-infrared (NIR) illumination source (780–1100 nm) is used for recognition. In this study, an NIR camera, together with the NIR illumination sources, was placed in front of the subjects. The intensity of the NIR illumination source was much higher than that of the ambient NIR light in indoor environments. Therefore, the ambient illumination problem can be solved as long as the active NIR illumination source remains constant. The NIR recognition system is resistant to ambient illumination variations and has been successfully applied to face recognition [9]; it can perform well even in dark environments [10], in which normal imaging systems fail to perform recognition.

Facial expressions manifest themselves as movements of one or several discrete parts of the face, such as tightening the lips to express anger and raising the corners of the mouth to express happiness [11]. Some researchers use features extracted from the entire face, which are called global features [12,13], for recognition, while other researchers use features extracted from specific parts, which are called local features [14–17]. Many researchers have demonstrated that local features improve the performance of facial expression recognition compared with global features [18,19]. The main reason for this improvement is that specific local regions provide more precise information about facial changes, which helps to distinguish the expressions, while the global region contains more identity information. Some researchers [20,21] have pointed out that the eyes, eyebrows, and mouth are the most expressive facial parts. However, it remains unknown which parts of the face should carry more weight in expression recognition, or how the correct weights can be allocated to different parts of the face.

In earlier studies, many facial expression recognition systems used static images [22–24], which contain only spatial information, as the input. However, facial expression is a dynamic process, and the dynamic information of the face better reflects the change of expression. Therefore, it is necessary to extract spatial and temporal information from image sequences to facilitate recognition.

In the work reported in this paper, we designed a convolutional neural network (CNN) for NIR facial expression recognition. The network is a three-stream three-dimensional (3D) CNN, which can learn spatio-temporal information from image sequences. In addition, the three inputs to the CNN are all local features, which not only reduces computational complexity, but also removes information unrelated to the expressions (such as identity information). A squeeze-and-excitation (SE) block is appended after the 3D CNN, which automatically assigns more weight to the local features that carry more expression information. To overcome the over-fitting problem caused by the small dataset, features are extracted through three identical shallow networks. Finally, we add a global face stream to the local network, further increasing the recognition rate.
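To make the overall data flow concrete, the following is a minimal PyTorch sketch of a multi-stream architecture of this kind. The stream depth, channel counts, crop sizes, region choices, and six-class output are illustrative assumptions rather than the paper's exact configuration, and the SE reweighting described in Section 3.2 is omitted here for brevity.

```python
import torch
import torch.nn as nn

class Stream3D(nn.Module):
    """One shallow 3D CNN stream; depths and channel counts are
    hypothetical placeholders, not the paper's configuration."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # one value per feature channel
        )

    def forward(self, x):
        return self.net(x).flatten(1)  # (batch, 32)

# Three local streams (e.g., eyes, eyebrows, mouth -- assumed regions)
# plus a global face stream; features are concatenated and classified
# into six expression classes (an assumed class count).
streams = nn.ModuleList(Stream3D() for _ in range(4))
classifier = nn.Linear(4 * 32, 6)

# Four clips per sample: (batch, channels, frames, height, width).
clips = [torch.randn(2, 1, 16, 32, 32) for _ in range(4)]
features = torch.cat([s(c) for s, c in zip(streams, clips)], dim=1)
logits = classifier(features)  # shape (2, 6)
```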

The main contributions of this paper are the following: (1) Three local regions of the face are used as the input to the network for NIR expression recognition, which not only accurately captures the facial expression information, but also reduces the computational complexity and dimensionality; and (2) an SE block is added to model the dependencies between feature channels and adaptively learn the channel weights, so as to obtain effective expression information and attenuate useless information.

## **2. Related Work**

Facial expressions can be decomposed into movements of one or more discrete facial action units (AUs). Inspired by this theory, Liu et al. [25] located common patches and unique patches of different expressions for recognition. However, this method could cause the located patches to overlap. Liu et al. [26] extended this work and proposed a framework called FDM to select the active features of each expression without overlapping. Later, Liu et al. [27] proposed a 3D CNN with deformable action part constraints that can locate and encode action units.

To extract temporal features while acquiring spatial features, Ji et al. [28] extended the CNN to a 3D CNN, which can extract spatio-temporal information from image sequences. Szegedy et al. [29] utilized the 3D CNN to extract temporal information for video-based expression recognition. Chen et al. [30] proposed a new descriptor, the histogram of oriented gradients from three orthogonal planes (HOG-TOP), to extract dynamic texture features from image sequences, which are fused with geometric features to identify expressions. Fonnegra et al. [31] proposed a deep learning model, and Yan et al. [32] presented collaborative discriminative multi-metric learning (CDMML), both operating on image sequences for emotion recognition. To make the system more precise, Zia et al. [33] proposed a dynamic weight majority voting mechanism for the construction of ensemble systems. However, since these methods are all based on visible light, the impact of external illumination changes is not considered.

NIR facial images and videos are hardly influenced by changes in ambient visible light. Farokhi et al. [34] proposed a method of extracting global and local features by using Zernike moments (ZMs) and Hermite kernels (HKs), respectively, and then used the fused features to identify NIR faces. Taini et al. [35] assembled a near-infrared facial expression database and completed the first study of NIR facial expression recognition. Zhao et al. [18] developed a database of NIR facial expressions, called the Oulu-CASIA NIR facial expression database, and used local binary patterns from three orthogonal planes (LBP-TOP) to extract dynamic local features. This work demonstrated that NIR imaging can overcome the influence of visible-light illumination changes on expression recognition. However, these methods rely on manually extracted facial expression features. Jeni et al. [36] proposed a 3D-shape-information-based recognition technique and further demonstrated that an NIR camera configuration is suitable for capturing facial expressions under light-changing conditions. Wu et al. [37] proposed a three-stream 3D convolutional network for NIR facial expression recognition, using a combination of global and local features, but did not consider assigning different weights to the local features.

## **3. Materials and Methods**

### *3.1. 3D CNN*

A 3D CNN is well suited for spatio-temporal feature extraction. In [28], a 3D CNN approach was proposed to process image sequences more efficiently and address action recognition problems. Through 3D convolution and pooling operations, a 3D CNN has the ability to learn temporal features.

A 3D CNN consists of an input layer, 3D convolution layers, 3D pooling layers (usually, each convolution layer is followed by a pooling layer), and a fully connected (FC) layer. The dimension of the input image sequences to the 3D CNN is represented as d × l × h × w, where d is the number of channels, l the number of frames in a video clip, and h and w the height and width, respectively, of each frame. In addition, 3D convolution and pooling kernels have a size of t × k × k, where t is the temporal depth and k the spatial size.
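As a concrete illustration of these shapes, the following is a minimal PyTorch sketch; the channel count, clip length, and spatial size are assumed values for the example, not settings from the paper.

```python
import torch
import torch.nn as nn

# One input clip of dimension d x l x h x w = 1 x 16 x 64 x 64
# (assumed values), with a batch dimension prepended.
clip = torch.randn(1, 1, 16, 64, 64)

# 3D convolution with a t x k x k = 3 x 3 x 3 kernel; padding=1
# preserves the temporal and spatial dimensions.
conv3d = nn.Conv3d(in_channels=1, out_channels=32,
                   kernel_size=(3, 3, 3), padding=1)

# 3D pooling halves the temporal and spatial dimensions together.
pool3d = nn.MaxPool3d(kernel_size=(2, 2, 2))

out = pool3d(conv3d(clip))
print(out.shape)  # torch.Size([1, 32, 8, 32, 32])
```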

### *3.2. Squeeze-and-Excitation Networks (SENets)*

Hu et al. [38] proposed squeeze-and-excitation networks (SENets). The basic architectural unit of SENets is the SE building block, which is shown in Figure 1.

**Figure 1.** Squeeze-and-excitation (SE) block structure.

Before the SE block operation, the input data X is transformed into the features U through a series of convolution operations, i.e., $F_{tr}: \mathbf{X} \to \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$, $\mathbf{U} \in \mathbb{R}^{W \times H \times C}$, where $F_{tr}$ represents the transformation from X to U, $H'$ ($H$) and $W'$ ($W$) are the input (output) frame height and width, respectively, and $C'$ ($C$) is the number of input (output) channels.

The SE block mainly consists of two operations: squeeze and excitation. Because the filter learned by each channel in the CNN operates on a local receptive field, each feature map in U cannot utilize the context information of the other feature maps. The purpose of the squeeze operation is to provide a global receptive field, so that the lower layers of the network can also use global information. A global average pooling operation is used to compress U (multiple feature maps) into Z, so that the *C* feature maps are reduced to a real-valued descriptor of size 1 × 1 × *C*. The squeeze operation is performed by

$$z_m = F_{sq}(\mathbf{u}_m) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_m(i,j) \tag{1}$$

where $z_m$ represents the $m$th element of Z and $\mathbf{u}_m$ the $m$th feature map of U.
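In code, the squeeze of Equation (1) is simply a per-channel mean over the spatial positions. A minimal sketch, with assumed tensor sizes:

```python
import torch

C, H, W = 32, 16, 16        # assumed sizes for illustration
U = torch.randn(C, H, W)    # C feature maps of size H x W

# z_m = (1 / (W * H)) * sum over i, j of u_m(i, j), per channel m
Z = U.mean(dim=(1, 2))      # shape (C,): the 1 x 1 x C descriptor
```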

The excitation operation is a simple gating mechanism with a sigmoid activation. Its purpose is to model the interdependence between feature channels by learning parameters that generate a weight for each channel. To meet these requirements while limiting model complexity and aiding generalization, two FC layers (implemented as 1 × 1 convolutions) are introduced: a dimension-reduction layer with parameters $\mathbf{W}_1$ and reduction ratio $r$, followed by a rectified linear unit (ReLU), and a dimension-increase layer with parameters $\mathbf{W}_2$, where $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. The excitation is performed by:

$$\mathbf{S} = F_{ex}(\mathbf{Z}, \mathbf{W}) = \sigma(g(\mathbf{Z}, \mathbf{W})) = \sigma(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{Z})) \tag{2}$$

where S is the vector produced by the excitation operation, and δ and σ refer to the ReLU function and the sigmoid function, respectively.
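A minimal sketch of Equation (2), using two fully connected layers with an assumed channel count and reduction ratio:

```python
import torch
import torch.nn as nn

C, r = 32, 8                           # assumed values for illustration

W1 = nn.Linear(C, C // r, bias=False)  # dimension-reduction layer
W2 = nn.Linear(C // r, C, bias=False)  # dimension-increase layer

Z = torch.randn(C)                     # squeezed descriptor from Eq. (1)
S = torch.sigmoid(W2(torch.relu(W1(Z))))  # channel weights in (0, 1)
```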

Finally, S is combined with U to obtain the final output $\tilde{\mathbf{X}}$ by:

$$\tilde{\mathbf{x}}_m = F_{scale}(\mathbf{u}_m, s_m) = s_m \cdot \mathbf{u}_m \tag{3}$$

where $s_m$ is the $m$th element of S and $\tilde{\mathbf{x}}_m$ the $m$th feature map of the final output $\tilde{\mathbf{X}}$; $F_{scale}$ refers to channel-wise multiplication.

The goal of the SE block is to greatly improve the expressiveness of the network; it adaptively recalibrates the feature weights by modeling the interdependencies between the channels. In more detail, it allows the network to use global information to selectively enhance the beneficial feature channels and suppress the less useful ones.
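Putting Equations (1)–(3) together, a complete SE block could be sketched in PyTorch as follows; the 2D feature-map layout, channel count, and reduction ratio are assumptions for illustration, not the configuration used in this paper:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal SE block: squeeze (Eq. 1), excitation (Eq. 2),
    and channel-wise rescaling (Eq. 3)."""
    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r, bias=False)  # W1
        self.fc2 = nn.Linear(channels // r, channels, bias=False)  # W2

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, C, H, W)
        z = u.mean(dim=(2, 3))                                 # Eq. (1)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # Eq. (2)
        return u * s[:, :, None, None]                         # Eq. (3)

x = torch.randn(4, 32, 16, 16)
print(SEBlock(32)(x).shape)  # torch.Size([4, 32, 16, 16])
```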
