3.1.3. Training of Motion Map Network (MMN)

Since the network operates on individual video frames, each video must first be converted into a sequence of frames. Some methods split videos directly, ignoring differences in frame rate, so the dynamic information may be temporally inconsistent. We therefore extract a constant number of frames per second, which improves the generalization performance of the network. To let the network better capture changes in the action, we extract two frames per second. For short videos, we loop the extracted frames to fill up 16 frames per video.
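As a concrete illustration, the following minimal Python sketch extracts frames at a constant rate with OpenCV and loops short videos to a fixed clip length. The helper name and the use of OpenCV are our own choices; only the 2 fps rate and the 16-frame clip length come from the text above, and the 128 × 171 resize follows Section 4.2.

```python
import cv2
import numpy as np

def extract_frames(video_path, frames_per_second=2, target_len=16):
    """Sample a fixed number of frames per second, then loop to a fixed clip length.

    Illustrative sketch: the sampling rate and clip length follow the paper
    (2 fps, 16 frames); the helper name and resize size are assumptions.
    """
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or frames_per_second
    step = max(int(round(native_fps / frames_per_second)), 1)  # keep every step-th frame

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                       # constant number of frames per second
            frames.append(cv2.resize(frame, (171, 128)))
        idx += 1
    cap.release()

    # Short videos: loop the extracted frames until we reach the target length.
    while 0 < len(frames) < target_len:
        frames.extend(frames[:target_len - len(frames)])
    return np.stack(frames[:target_len])          # shape (16, 128, 171, 3)
```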

We introduce an iterative method to train the Motion Map Network (MMN). We train the MMN on videos $V$ with $N$ frames and video labels $L$. Let $S$ denote the maximum training iteration length; we train the MMN with training lengths $s$ from 2 to $S$, where the training length $s$ indexes the round of iteration. In each round, we cut the training video $V_i$ into $s$-length clips $C^i_j$ ($j \in 1, \ldots, N_i/s$) with an overlap of 0.7 and assign the video label $L_i$ to each clip $C^i_j$. We define the MMN as a function $Z_{\theta_s}(I_a, I_b)$, where $\theta_s$ denotes the parameters of the MMN after the iteration with training length $s$; the initial parameters are $\theta_1$. For each $s$, we generate the motion maps $F^{C^i_j}_{1 \sim s-1}$ using $Z_{\theta_{s-1}}$, and train the MMN using the motion map $F^{C^i_j}_{s-1}$, the video frame $f^{C^i_j}_{s}$, and the clip label $L_i$. Finally, we obtain $\theta_s$, the parameters of the trained motion map network. The training steps are summarized in Algorithm 1.

**Algorithm 1.** Training of Our Motion Map Network

**Input:** Video dataset *V*; frame counts of the video dataset, *N*; video labels, *L*; maximum training iteration length, *S*; initial parameters of our model, $\theta_1$;
**Output:** Final parameters of the network, $\theta_S$;
1: Initialize the parameters $\theta_1$ of our model;
2: **for** each $s \in 2, 3, \ldots, S$ **do**
3: cut $V_i$ into $s$-length clips $C^i_j$ ($j \in 1, \ldots, N_i/s$) with overlap 0.7;
4: extract the video frames from $C^i_j$ as $f^{C^i_j}$;
5: **for** each $j \in 1, 2, \ldots, N_i/s$ **do**
6: **for** $k \in 1, 2, \ldots, s-1$ **do**
7: generate the motion map $F^{C^i_j}_{k}$ using $Z_{\theta_{s-1}}(F^{C^i_j}_{k-1}, f^{C^i_j}_{k})$; **end for**
8: train the MMN using $F^{C^i_j}_{s-1}$, $f^{C^i_j}_{s}$ and $L_i$; **end for**
9: obtain the MMN parameters $\theta_s$; **end for**
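For clarity, the sketch below restates Algorithm 1 in Python. The `mmn.forward` and `mmn.train_step` methods are hypothetical placeholders for the network $Z_\theta$ and one optimization step in whatever framework implements it; only the loop structure, the 0.7 overlap, and the use of the final frame and clip label follow the algorithm.

```python
import numpy as np

def train_mmn(videos, labels, mmn, max_len_S, overlap=0.7):
    """Iterative MMN training following Algorithm 1.

    `mmn` is assumed to expose two hypothetical methods:
      mmn.forward(motion_map, frame) -> new motion map   (Z_theta)
      mmn.train_step(motion_map, frame, label)           (one optimization step)
    """
    for s in range(2, max_len_S + 1):                     # training length s = 2 .. S
        stride = max(int(round(s * (1.0 - overlap))), 1)  # 0.7 overlap between clips
        for frames, label in zip(videos, labels):         # frames: (N_i, H, W, 3)
            for start in range(0, len(frames) - s + 1, stride):
                clip = frames[start:start + s]            # s-length clip C^i_j
                motion_map = np.zeros_like(clip[0], dtype=np.float32)
                # Roll the motion map forward through the first s-1 frames (theta_{s-1}).
                for k in range(s - 1):
                    motion_map = mmn.forward(motion_map, clip[k])
                # Train on the accumulated motion map, the last frame, and the clip label.
                mmn.train_step(motion_map, clip[s - 1], label)
    return mmn
```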

3.1.4. Fusion Method

The motion of an object can be observed through changes in both appearance and semantics. Based on this, we follow a feature fusion strategy to combine spatial and temporal information. Let $X_s \in \mathbb{R}^{H \times W \times T}$ and $X_t \in \mathbb{R}^{H \times W \times T}$ be the extracted frame-level spatial and temporal features, where $H$ and $W$ are the height and width of the feature maps and $T$ is the number of frames. Before the fusion operation, both feature maps (spatial and temporal) are reshaped into vectors, which can be written as:

$$X = [X\_{s}, X\_{t}] \tag{2}$$

We then perform a pixel-wise weighted addition, known as linear weighted fusion, between $X_s$ and $X_t$ to compute a single feature map $F$:

$$F = w_s X_s \oplus w_t X_t \tag{3}$$

where $F \in \mathbb{R}^{H \times W \times T}$, $\oplus$ denotes element-wise matrix addition, and $w_s$ and $w_t$ are the appearance and motion weights for the spatial and temporal feature maps, respectively. The weights measure the relative significance of the spatial and temporal features. After the fusion operation, we obtain a new representative feature $x_{f,t}$ for each video clip, so that for an input video a set of fused features $(x_{f,1}, \ldots, x_{f,t}, \ldots, x_{f,N})$ is generated. Finally, we apply an LSTM to these features to perform temporal encoding for human activity prediction.
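A minimal sketch of the fusion step, assuming the spatial and temporal feature maps have already been extracted and share the same shape; the equal default weights are an illustrative assumption, since the values of $w_s$ and $w_t$ are not fixed in this section.

```python
import numpy as np

def fuse_features(x_s, x_t, w_s=0.5, w_t=0.5):
    """Linear weighted fusion of spatial and temporal feature maps (Equation (3)).

    x_s, x_t: arrays of shape (H, W, T). The weights are illustrative defaults.
    """
    assert x_s.shape == x_t.shape
    fused = w_s * x_s + w_t * x_t          # element-wise (pixel-wise) weighted addition
    return fused.reshape(-1)               # flatten to a vector x_{f,t} for the LSTM

# Example: one fused descriptor per clip, stacked over the whole video.
# x_spatial, x_temporal = ...  (lists of (H, W, T) feature maps, one pair per clip)
# x_fused = np.stack([fuse_features(s, t) for s, t in zip(x_spatial, x_temporal)])
```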

#### *3.2. Encoding and Activity Classification*

This is the second and final part of our approach. We first give a detailed discussion of the LSTM and its architecture, and then present the encoding and activity classification method.

#### 3.2.1. Long Short-Term Memory (LSTM)

To analyze hidden sequential patterns, it is a natural choice to use an RNN to encode the temporal structure of the extracted sequential features. In video, visual information is spread over many frames, which helps in understanding the context of an action. An RNN can interpret such sequences, but for long sequences it tends to forget the earlier inputs. The LSTM was designed to mitigate the vanishing gradient problem and to learn long-term contextual information from temporal sequences. The LSTM is a kind of recurrent network that can capture long-term dynamics and preserve sequence information over time. In addition, the LSTM gradient does not tend to vanish when trained with backpropagation through time. Its special structure, with input, output, and forget gates, controls the identification of long-term sequence patterns. The gates are adjusted by sigmoid units that learn during training when to open and close. We adopt the LSTM for encoding and decoding to recognize human actions.

The architecture of an LSTM unit is depicted in Figure 5. $x_t$, $c_t$, $h_t$ and $y_t$ denote the input vector, cell state, hidden state, and output at the $t$-th step, respectively. The output $y_t$ depends on the hidden state $h_t$, while $h_t$ depends not only on the cell state $c_t$ but also on its previous state. Intuitively, the LSTM has the capacity to read from and write to its internal memory, and hence to maintain and process information over time. The LSTM neuron contains an input gate $i_t$, a memory cell $c_t$, a forget gate $f_t$, and an output gate $o_t$. At each time step $t$, it can choose to write, read, or reset the memory cell through these three gates. This strategy enables the LSTM to access and memorize information over many steps. Equations (4)–(9) describe the temporal modelling performed in an LSTM unit.

$W$ and $b$ are the parameters of the LSTM, i.e., the weights of the input vectors and the bias terms. $S$ denotes the sigmoid function, $\tanh$ is the hyperbolic tangent activation function, and $\otimes$ is element-wise multiplication. The cell state and output are computed step by step to extract long-term dependencies. The input to the LSTM at step $t$ is the feature vector $x_t$. The forget gate clears information from the memory cell, and the output gate passes information to the upcoming step. The candidate value $g_t$ is computed from the input of the current frame and the previous hidden state $h_{t-1}$. The hidden state of the LSTM step is computed by applying a $\tanh$ activation to the memory cell $c_t$.

$$i_t = S(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \tag{4}$$

$$f_t = S(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \tag{5}$$

$$o_t = S(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \tag{6}$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{7}$$

$$c_t = f_t \otimes c_{t-1} + i_t \otimes g_t \tag{8}$$

$$z_t = h_t = o_t \otimes \tanh(c_t) \tag{9}$$

**Figure 5.** The architecture of LSTM Unit.
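A minimal NumPy sketch of one LSTM step following Equations (4)–(9); the parameter dictionary layout and naming are our own, and in practice a standard framework LSTM layer would be used instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations (4)-(9).

    p is a dict of parameters, e.g. p["W_xi"] (input -> input gate),
    p["W_hi"] (hidden -> input gate), p["b_i"], and likewise for f, o, c.
    Shapes: x_t (d_in,), h_prev and c_prev (d_hidden,).
    """
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate   (4)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate  (5)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate  (6)
    g_t = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # candidate    (7)
    c_t = f_t * c_prev + i_t * g_t                                   # cell state   (8)
    h_t = o_t * np.tanh(c_t)                                         # hidden/output (9)
    return h_t, c_t
```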

#### 3.2.2. Encoding and Classification Process by LSTM

The generated fused features $x_{f,t}$ are fed into the LSTM as inputs for encoding and decoding to predict the activity. The LSTM can be trained jointly with the rest of the model, so our proposed model provides a trainable platform suited to large-scale cognitive intelligence tasks. A useful property of the LSTM is that it processes variable-length inputs and produces high-level variable-length predictions (outputs). As shown in Figure 6, the LSTM consists of an encoder and a decoder: the encoder transforms the input data $x_t$ into a corresponding activation $h$, and the decoder in the output layer is trained to reconstruct an approximation $y$ of the input from the activation $h$.

**Figure 6.** The framework of LSTM (Encoder-Decoder).

In general, the LSTM model has parameters $W$ and $b$, which denote the weights and biases of the input layer and the hidden layer, respectively. Given an input $x_t$ and the previous hidden state $h_{t-1}$ at time step $t-1$, it generates an output $z_t$ and updates the current hidden state $h_t$.

$$z_t = S(W_1 x_t + b_1) \tag{10}$$

The next step is decoding, which is similar to the encoding step given in Equation (10), where $W_2$ and $b_2$ denote the weights and biases between the hidden layer and the output layer.

$$y_t = W_2 z_t + b_2 \tag{11}$$

The final single-label prediction for a video is produced with a softmax classifier. A softmax layer is used to obtain the $M$-way class scores for a given video sequence. The single prediction is obtained by averaging the label probabilities output by the decoder, as given in Equation (12).

$$P\left(y_t^q = 1\right) = \mathrm{softmax}(y_t) = \mathrm{softmax}(W z_t + b_t) \tag{12}$$

where $W$ and $b_t$ are the trained parameters of the LSTM model, $q \in Q$ is the predicted class, and $t$ is the current time step.
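The following simplified sketch follows Equations (10)–(12): it encodes each fused feature, decodes class scores, and averages the label probabilities over time. For brevity the recurrent state of the LSTM is omitted, so the encoder here is a plain sigmoid layer; the function and parameter names are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def predict_video(x_fused, W1, b1, W2, b2):
    """Encode fused clip features, decode class scores, and average over time.

    x_fused: (N, d) fused features; W1: (h, d), b1: (h,); W2: (M, h), b2: (M,).
    Returns the index of the predicted class.
    """
    probs = []
    for x_t in x_fused:
        z_t = 1.0 / (1.0 + np.exp(-(W1 @ x_t + b1)))   # encoding, Eq. (10)
        y_t = W2 @ z_t + b2                            # decoding, Eq. (11)
        probs.append(softmax(y_t))                     # class scores, Eq. (12)
    return int(np.argmax(np.mean(probs, axis=0)))      # average the label probabilities
```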

#### **4. Experiments**

We conduct several experiments to validate the effectiveness of our system. Three well-known benchmark human action datasets, UCF101 [32], HMDB51 [33], and UCF Sports [34], are used. The datasets, their validation schemes, the experimental setup, the results, and a comparative analysis are presented in the following sections.

#### *4.1. Datasets*

The UCF101 dataset is an extension of UCF50 and contains 101 action categories, with at least 100 videos per category and 13,320 video clips in total. Most of the clips are realistic, user-uploaded videos with cluttered backgrounds, illumination changes, and camera motion. The dataset is divided into a training set of 9.5 K videos and a testing set of 3.8 K videos. We adopt the evaluation scheme of the THUMOS13 challenge [35] and follow the three training/testing splits, reporting the average recognition accuracy over these three splits.

The HMDB51 dataset comprises a variety of realistic videos collected from YouTube and Google Video. It contains 6766 manually annotated video clips from 51 action classes, each with about 100 clips. For the experimental setting, we follow the original evaluation protocol with three training/testing splits; in each split, every action class has 70 clips for training and 30 clips for testing. The average accuracy over these three splits is used to measure the final performance.

The UCF Sports dataset encompasses 150 videos from 10 action classes, recorded in real sports environments and taken from different television channels. The occlusions, illumination conditions, and background variations make it a complex and challenging dataset. The average accuracy is used to measure the final performance.

Some sample frames from three datasets are given in Figure 7.

**Figure 7.** Sample frames from UCF101 (**first row**), HMDB51 (**second row**) and UCF Sports (**third row**).

## *4.2. Experimental Setup and Implementation Details*

As UCF101 is the largest of the three datasets, we use it to train the C3D model first and then transfer the learnt model to HMDB51 and UCF Sports for feature extraction. RGB frames are resized to 128 × 171. During training, we randomly crop input clips to 16 × 112 × 112 and horizontally flip them with a probability of 55%. We fine-tune the model parameters on the UCF101 dataset with an initial learning rate of 0.003, which is divided by 2 every 150 K iterations; optimization is stopped at 1.9 M iterations.
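A minimal sketch of the training-time augmentation described above; the helper itself is illustrative, while the 16 × 112 × 112 crop and the 55% flip probability follow the text.

```python
import numpy as np

def augment_clip(clip, rng=np.random):
    """Random 16 x 112 x 112 crop and horizontal flip used during C3D fine-tuning.

    clip: (T, 128, 171, 3) array of resized RGB frames, T >= 16.
    """
    t0 = rng.randint(0, clip.shape[0] - 16 + 1)        # temporal crop of 16 frames
    y0 = rng.randint(0, clip.shape[1] - 112 + 1)       # spatial crop offsets
    x0 = rng.randint(0, clip.shape[2] - 112 + 1)
    crop = clip[t0:t0 + 16, y0:y0 + 112, x0:x0 + 112]
    if rng.rand() < 0.55:                              # horizontal flip with 55% probability
        crop = crop[:, :, ::-1]
    return crop
```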

As described in Section 3.1.3, each video is processed into frames: we extract two frames per second (fps), and for short videos we loop the extracted frames to fill up to 16 frames per video.
