**5. Conclusions**

In this paper, we proposed SETFNet, a three-stream 3D CNN architecture with an SE block that automatically learns spatial and temporal features simultaneously. Only three local regions of the face were used as input to the network. Using local regions as input has two advantages: it removes some information unrelated to recognition, and it reduces the amount of computation. To enable the network to adaptively learn the weight of each feature channel, an SE block was added after the fusion of the three single-stream sub-networks. Experimental results show that SETFNet achieves an average recognition rate of 80.34%; when a global face stream is added to SETFNet, the recognition rate increases further to 81.67%, which is higher than that of some state-of-the-art methods.
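As an illustration of the channel re-weighting summarized above, the following is a minimal pure-Python sketch of a squeeze-and-excitation step applied to fused 3D feature maps. It is not the SETFNet implementation: fusion by channel concatenation, the reduction ratio, the tensor sizes, and all function names (`fuse`, `squeeze`, `excite`, `se_block`) are assumptions made for this example, and the weights are random rather than learned.

```python
import math
import random


def fuse(streams):
    """Concatenate per-stream feature maps along the channel axis (assumed fusion)."""
    return [channel for stream in streams for channel in stream]


def squeeze(features):
    """Global average pooling over (T, H, W) for each channel of a (C, T, H, W) list."""
    pooled = []
    for channel in features:
        values = [v for frame in channel for row in frame for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def dense(vector, weights, bias):
    """Fully connected layer; weights has shape (out_features, in_features)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def excite(pooled, w1, b1, w2, b2):
    """Bottleneck FC -> ReLU -> FC -> sigmoid, giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]


def se_block(features, w1, b1, w2, b2):
    """Scale every channel of the fused features by its gate."""
    gates = excite(squeeze(features), w1, b1, w2, b2)
    scaled = [[[[v * g for v in row] for row in frame] for frame in channel]
              for channel, g in zip(features, gates)]
    return scaled, gates


def random_stream(channels, t, h, w, rng):
    """Hypothetical stand-in for the output of one local-region sub-network."""
    return [[[[rng.random() for _ in range(w)] for _ in range(h)]
             for _ in range(t)] for _ in range(channels)]


rng = random.Random(0)
# Three streams with 2 channels each (sizes are illustrative only).
streams = [random_stream(2, 2, 3, 3, rng) for _ in range(3)]
fused = fuse(streams)                      # 6 channels after concatenation
c, r = len(fused), 2                       # r is an assumed reduction ratio
w1 = [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(c // r)]
b1 = [0.0] * (c // r)
w2 = [[rng.uniform(-1, 1) for _ in range(c // r)] for _ in range(c)]
b2 = [0.0] * c
scaled, gates = se_block(fused, w1, b1, w2, b2)
print([round(g, 3) for g in gates])        # one weight per fused channel
```

In a trained network the two fully connected layers would be learned jointly with the convolutional streams, so the gates would reflect how informative each fused channel is for recognition.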

**Author Contributions:** Data curation, L.Z., J.C., and Y.Y.; Formal analysis, Y.C.; Methodology, Z.Z.; Supervision, T.C.

**Funding:** This research was funded by the National Natural Science Foundation of China (Grant Nos. 61301297 and 61502398) and the Southwest University Undergraduate Science and Technology Innovation Fund (No. 20600901).

**Conflicts of Interest:** The authors declare no conflict of interest.
