**Xinyu Zhang** \* **and Xiaoqiang Li**

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China; xqli@i.shu.edu.cn

**\*** Correspondence: 646599367@shu.edu.cn

Received: 12 March 2019; Accepted: 1 April 2019; Published: 3 April 2019

**Abstract:** In recent years, gesture recognition has been applied in many fields, such as games, robotics and sign language recognition. Advances in gesture recognition have significantly improved human-computer interaction (HCI), and gesture recognition in video is now an important research direction. Because each type of neural network architecture has its own limitations, we propose a network that alternately fuses 3D CNN and ConvLSTM, which we call the Multiple extraction and Multiple prediction (MEMP) network. The main feature of the MEMP network is that it extracts and predicts the temporal and spatial feature information of a gesture video multiple times, which enables it to reach a high accuracy rate. In the experiments, three data sets (LSA64, SKIG and Chalearn 2016) are used to verify the performance of the network, and our approach achieves high accuracy on all of them. On LSA64, the network achieves a recognition rate of 99.063%. On SKIG, it obtains recognition rates of 97.01% and 99.02% on the RGB part and the RGB-depth part, respectively. On Chalearn 2016, it achieves recognition rates of 74.57% and 78.85% on the RGB part and the RGB-depth part, respectively.

**Keywords:** gesture recognition; human-computer interaction; alternate fusion neural network
