*3.3. Proposed System*

In this paper, we propose a three-stream 3D CNN with an SE block, called the SE three-stream fusion network (SETFNet). Three local regions of the facial expression image sequence, namely the eyes (including the eyebrows), the nose, and the mouth, serve as the inputs to the three streams. After the three streams are fused, an SE block is added to the network to adaptively learn the weight of each feature channel.
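The channel reweighting performed by an SE block can be illustrated with a short sketch. It assumes the standard squeeze-and-excitation formulation (global average pooling, a two-layer bottleneck, and a sigmoid gate). The function name `se_recalibrate`, the weight matrices, and the omission of biases are illustrative assumptions and are not taken from the paper.

```python
import math


def se_recalibrate(features, w1, w2):
    """Apply a squeeze-and-excitation gate to a list of channels.

    features: list of C channels, each a flat list holding the activations
              of one feature map over all spatio-temporal positions.
    w1:       (C/r) x C weights of the bottleneck layer (hypothetical values).
    w2:       C x (C/r) weights of the expansion layer (hypothetical values).
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid, so each
    # channel receives a gate in the open interval (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: every activation of a channel is multiplied by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates
```

In a trained network `w1` and `w2` are learned, so channels that carry more discriminative information can receive gates closer to 1 and the others gates closer to 0.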

To avoid over-fitting, a deep CNN requires a large amount of training data. However, the available NIR expression database is small. To train a CNN model on a small database, researchers have used medium-sized CNNs [39,40]. Therefore, the SETFNet in this paper is also a medium-sized CNN, with four convolutional layers in each stream.

The structure of the proposed SETFNet is shown in Figure 2. It is a three-stream 3D CNN consisting of three identical sub-networks. Each sub-network consists of four convolutional layers and uses the same parameter settings. The first through fourth convolutional layers have 16, 32, 64, and 128 convolution kernels, respectively. The kernel size of the first convolutional layer is 3×3×8, and a large temporal stride is used here to discard some uninformative data. The kernel size of the other three convolutional layers is 3×3×3. The outputs of the three streams are fused and passed to an SE block, which recalibrates the channel weights of the fused features and thus the contribution of each stream. The details of each sub-network are shown in Table 1.
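To give a sense of how small each stream is, the following sketch counts the convolutional weights implied by the kernel numbers and sizes stated above. It rests on several assumptions that are not stated in this section: the input is single-channel (grayscale NIR), every convolution has a bias term, and any normalization or fully connected layers are ignored. The names `LAYERS`, `conv3d_params`, and `stream_params` are hypothetical.

```python
# (number of kernels, kernel size) for the four convolutional layers of one
# stream, as described in the text.
LAYERS = [
    (16, (3, 3, 8)),
    (32, (3, 3, 3)),
    (64, (3, 3, 3)),
    (128, (3, 3, 3)),
]


def conv3d_params(in_ch, out_ch, kernel, bias=True):
    """Number of learnable parameters of one 3D convolutional layer."""
    k1, k2, k3 = kernel
    weights = k1 * k2 * k3 * in_ch * out_ch
    return weights + (out_ch if bias else 0)


def stream_params(in_channels=1):
    """Per-layer parameter counts for one stream.

    in_channels=1 is an assumption (grayscale NIR input).
    """
    counts = []
    ch = in_channels
    for out_ch, kernel in LAYERS:
        counts.append(conv3d_params(ch, out_ch, kernel))
        ch = out_ch  # the output channels feed the next layer
    return counts
```

Under these assumptions, `stream_params()` returns `[1168, 13856, 55360, 221312]`, that is 291,696 convolutional parameters per stream, or 875,088 for the three streams if their weights are not shared.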


**Table 1.** Configuration of each stream.

**Figure 2.** Overall structure of the proposed SE three-stream fusion network (SETFNet). The SE block is displayed in the dotted box.
