#### *3.4.2. Data Pre-Processing*

In our experiment, a video sequence was pre-processed in three steps: (1) frame-by-frame face detection; (2) locating the eyes, nose, and mouth; and (3) cropping out the eye, nose, and mouth areas. We found that step 2 had a significant effect on the performance of the network, so accurate localization of these areas is crucial. To ensure this accuracy, the local areas were cropped based on the locations of landmark points annotated by a robust landmark detector, discriminative response map fitting (DRMF) [41]. DRMF not only performs well among landmark-detection methods [30], but also consumes very little computation time.
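The three-step pipeline can be outlined as follows. This is an illustrative Python sketch, not the authors' implementation: OpenCV's bundled Haar cascade stands in for the face detector, while `detect_landmarks` and `crop_regions` are hypothetical placeholders for DRMF (a MATLAB tool with no assumed Python API) and the landmark-based cropping described below.

```python
import cv2

# Step 1 uses OpenCV's bundled Haar cascade as a generic face detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(frames, detect_landmarks, crop_regions):
    """Run steps 1-3 on each frame of a video sequence.

    `detect_landmarks` is a placeholder for DRMF [41]; `crop_regions` is the
    landmark-based cropping defined in the next paragraph. Both are
    stand-ins, not real APIs.
    """
    out = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)       # step 1
        if len(faces) == 0:
            continue                      # skip frames with no detected face
        landmarks = detect_landmarks(gray, faces[0])                 # step 2
        out.append(crop_regions(gray, landmarks))                    # step 3
    return out
```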

The local areas were cropped automatically; because some of the automatic crops were inaccurate, they were corrected by manual cropping. Using the facial landmark points annotated earlier, the three regions were identified with rectangular bounding boxes determined from the eye, nose, and mouth landmark points. We segmented the three local regions according to the following eleven points: E1(x1, y1), E2(x2, y2), E3(x3, y3), E4(x4, y4), E5(x5, y5), N1(x6, y6), N2(x7, y7), M1(x8, y8), M2(x9, y9), M3(x10, y10), and M4(x11, y11) (shown in Figure 3). The center of the rectangular bounding box of the eye region is L1 = E5(x5, y5), and the length and width of the rectangle are (5/3)|x2 − x1| and (4/3)|y4 − y1|, respectively. The center of the bounding box of the nose region is L2 = (x5, (y6 + y7)/2), and the length and width of the rectangle are |y7 − y6| and |x3 − x4|, respectively. The center of the bounding box of the mouth region is L3 = (x5, (y9 + y11)/2), and the length and width of the rectangle are (5/3)|x10 − x8| and (4/3)|y11 − y9|, respectively.

**Figure 3.** Positions of the 11 points used for segmenting the three regions.
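The cropping geometry above translates directly into code. The following is a minimal sketch of our own (the helper names are ours, and the text's "length and width" are read as horizontal and vertical extents, with the nose and mouth centers taken as vertical midpoints, per the reconstruction above):

```python
def rect_from_center(cx, cy, w, h):
    """(left, top, right, bottom) of a w-wide, h-tall box centered at (cx, cy)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def region_boxes(p):
    """p maps 'E1'..'E5', 'N1', 'N2', 'M1'..'M4' to (x, y) pixel coordinates."""
    x1, y1 = p['E1']
    x2 = p['E2'][0]
    x3 = p['E3'][0]
    x4, y4 = p['E4']
    x5, y5 = p['E5']
    y6, y7 = p['N1'][1], p['N2'][1]
    x8 = p['M1'][0]
    y9 = p['M2'][1]
    x10 = p['M3'][0]
    y11 = p['M4'][1]

    # Eye region: centered on E5; (5/3)|x2 - x1| across, (4/3)|y4 - y1| tall.
    eye = rect_from_center(x5, y5, 5 / 3 * abs(x2 - x1), 4 / 3 * abs(y4 - y1))
    # Nose region: centered at (x5, midpoint of N1/N2); |x3 - x4| across,
    # |y7 - y6| tall.
    nose = rect_from_center(x5, (y6 + y7) / 2, abs(x3 - x4), abs(y7 - y6))
    # Mouth region: centered at (x5, midpoint of M2/M4); (5/3)|x10 - x8| across,
    # (4/3)|y11 - y9| tall.
    mouth = rect_from_center(x5, (y9 + y11) / 2,
                             5 / 3 * abs(x10 - x8), 4 / 3 * abs(y11 - y9))
    return eye, nose, mouth
```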

For the network input, each video sequence was normalized to 32 frames using linear interpolation [42]. Each frame of the global face (whole face) and of the local areas was resized to 88 × 108 and 36 × 64 pixels, respectively. To reduce the amount of computation, all input images were converted to 8-bit grayscale.
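As a concrete reading of this step, the sketch below (ours, using NumPy and OpenCV; the paper does not publish code) interpolates a sequence to 32 frames, converts each frame to 8-bit grayscale, and resizes it:

```python
import cv2
import numpy as np

def normalize_sequence(frames, target_len=32, size=(88, 108)):
    """frames: list of HxWx3 uint8 BGR images. size is (width, height) for
    cv2.resize -- 88 x 108 for the whole face, 36 x 64 for the local crops
    (orientation as given in the text)."""
    n = len(frames)
    out = []
    for i in range(target_len):
        # Map target index i onto the source time axis.
        t = i * (n - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo, hi = int(np.floor(t)), int(np.ceil(t))
        a = t - lo
        # Linear blend of the two neighboring source frames.
        blend = ((1 - a) * frames[lo].astype(np.float32)
                 + a * frames[hi].astype(np.float32)).astype(np.uint8)
        gray = cv2.cvtColor(blend, cv2.COLOR_BGR2GRAY)   # 8-bit grayscale
        out.append(cv2.resize(gray, size))
    return np.stack(out)                                 # (target_len, H, W)
```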

#### **4. Results and Discussion**

#### *4.1. Comparisons of Different Streams and Their Fusion*

Table 2 shows the average results of tenfold cross-validation for each local region using a single sub-network (one stream) and for the fused network. The feature information of the eye (including eyebrow), nose, and mouth regions is extracted by one stream each, and the recognition rates are 35.37%, 42.76%, and 68.35%, respectively. The mouth region has the highest recognition rate, which may indicate that it is the most expressive part in the database. The eye region has the lowest recognition rate of the three, possibly because some of the participants wore glasses: in an NIR face image, the NIR light reflected by the glasses obscures the eye features, so frames with glasses have a great influence on recognition. At the same time, the recognition rate of the three-local-stream fused network (TFNet) reaches 78.68%, much higher than that of any single-stream network (eye, 35.37%; nose, 42.76%; mouth, 68.35%). This indicates that our fusion is very effective in improving the recognition rate. After the network was fused, we added the SE block, which automatically allocates weights to the different streams. Since the SE block lets the entire network adaptively learn the weight of each feature channel, the SETFNet further improves the recognition rate, reaching 80.34%.


**Table 2.** Comparison of different local and fused networks.
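Conceptually, the fusion amounts to concatenating the three streams' feature maps along the channel axis, so that later layers (and the SE block) see all three regions at once. Below is a minimal PyTorch-style sketch of this idea, not the authors' Caffe implementation:

```python
import torch

def fuse_streams(eye, nose, mouth):
    """Each argument: a (batch, C, H, W) feature map of matching spatial size."""
    # Channel-wise concatenation; an SE block applied afterwards can learn
    # per-channel weights, effectively re-weighting each stream's contribution.
    return torch.cat([eye, nose, mouth], dim=1)   # -> (batch, 3 * C, H, W)
```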

To investigate whether the SETFNet had extracted most of the expression features, we added one more stream to the SETFNet, which takes the frames of the global face as input. Because each frame of the global face has a larger spatial size than each local area, we added one more convolution pair to this stream; a possible form of this pair is sketched below. The network structure is shown in Figure 4, with the fourth stream being the global face stream. When it is added to the SETFNet, the recognition rate becomes 81.67%, whereas the SETFNet alone achieves 80.34%. That is to say, after adding the entire face as input, the improvement in recognition rate is still limited, which may indicate that the SETFNet had already extracted most of the expression features.
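One possible reading of the added convolution pair, sketched in PyTorch; the channel counts and kernel size are our assumptions, not the paper's configuration:

```python
import torch.nn as nn

# Hypothetical extra convolution + pooling pair for the global-face stream.
# Its only purpose is the one stated above: to downsample the larger
# 88 x 108 face frames so their feature maps align with the local streams'.
extra_pair = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel: grayscale frames
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                             # halve the spatial resolution
)
```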

Table 2 also shows the time consumption of the single sub-networks and fused networks. The time for a single sub-network to process an image sequence is 0.515 s, and the times for the TFNet and SETFNet to process a sequence are 1.158 and 1.237 s, respectively. Considering the large improvement in recognition rate made by the TFNet and SETFNet, the increase in computation time is acceptable. However, when a global face stream is added to the SETFNet, the time for the network to process a sequence rises to 2.142 s; the slight increase in recognition rate from the global stream (80.34% versus 81.67%) comes at the expense of processing time (1.237 s versus 2.142 s). Even so, all of the computation times may be within acceptable limits, since the input is 32 frames. Under the hardware settings used (an NVIDIA GeForce GTX 1080 GPU (8 GB) for deep-learning acceleration), the SETFNet can process 32/1.237 = 25.87 frames per second. The frame rate of a normal imaging system is 25–30 fps, and 25.87 fps is within this range, which means that the SETFNet can deliver the recognition result with just 1 s of lag in real-time imaging if the computation is performed in parallel with the imaging. With better hardware, the computation time could be further decreased to 1 s or less, making the processing real-time. Therefore, this network could be used in real applications.
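The throughput figure follows from the Table 2 timings with a line of arithmetic:

```python
seq_frames = 32                 # frames per normalized input sequence
seq_time_s = 1.237              # SETFNet time per sequence (Table 2)
fps = seq_frames / seq_time_s   # ~25.87 frames per second
print(f"throughput: {fps:.2f} fps, time per sequence: {seq_time_s:.3f} s")
```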

**Figure 4.** Structure of the SETFNet with the added global face stream.

The recognition rate of the eye region is the lowest among the three regions. One reason may be that the eyes carry fewer features than the other parts; another may be that some of the subjects wore glasses. To verify the effect of glasses on the recognition rate, we input the eye regions with and without glasses into the sub-network separately. The recognition results are shown in Table 3. The recognition rate without glasses is higher than that with glasses, which indicates that the glasses obscure some features of the eyes. Since this split divides the dataset into two parts, the recognition rates for both the with-glasses and without-glasses subsets are lower than that of the single sub-network with all data as input.

**Table 3.** Comparison of recognition rates with and without glasses.


#### *4.2. Comparison of the Embedded SE Block*

The SE block was added to the network after the fusion so that it could receive information from the entire network and have a global receptive field. In the SE block, the reduction ratio *r* is an important parameter that controls the capacity and computational cost. We compared different reduction ratios *r* in our network model; the results are shown in Table 4. When *r* = 16, the accuracy is the highest; therefore, *r* was set to 16.


**Table 4.** Comparison of different reduction ratios.
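To make the role of *r* concrete, here is a minimal PyTorch sketch of a standard squeeze-and-excitation block (illustrative only; the layer sizes are not taken from the authors' Caffe model). The bottleneck width `channels // r` is what the reduction ratio controls:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block with reduction ratio r."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # bottleneck width set by r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                         # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # excitation: re-weight channels
```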

#### *4.3. Comparisons with Other Methods*

Table 5 shows the expression recognition rates of different methods on the Oulu-CASIA NIR facial expression database under dark-lighting conditions. For all of the methods, we used tenfold cross-validation to obtain an average recognition rate. The results of the deep temporal appearance-geometry network (DTAGN), 3D CNN with deformable facial action parts (3D CNN DAP), and NIRExpNet were obtained from [37], and the result of LBP-TOP was obtained by implementing the algorithm in MATLAB (MathWorks, Natick, MA, USA). SETFNet and SETFNet + global were implemented using Caffe. LBP-TOP and 3D CNN DAP achieve recognition rates of 69.32% and 72.12%, respectively, which are higher than that of DTAGN. NIRExpNet uses the fused information of local and global features, and therefore achieves an even higher recognition rate than LBP-TOP and 3D CNN DAP. SETFNet uses only the local information of three regions, yet it achieves a still higher recognition rate (higher even than NIRExpNet, which uses both local and global features). When a global face stream is added to SETFNet, the recognition rate further improves to 81.67%. This indicates that the automatic allocation of feature-channel weights helps improve recognition performance, which could make this a promising method for NIR facial expression recognition.


**Table 5.** Comparison of total recognition rates of different methods.
