4.2. Experimental Results
The proposed model is constructed based on the settings in Section 3.4 and then trained on the four datasets, each of which contains two types of videos (fight and non-fight). The model generally converges quickly to a stable loss value, although the loss curves for the Crowd Violence and RWF2000 datasets fluctuate more than those of the other two. In this work, four evaluation measures, namely accuracy, precision, recall, and F1-score, are introduced [47] and defined as follows:
If a video is labeled as a fight and the model also predicts it as a fight, the predicted result is considered a true positive (TP). Similarly, if a video is labeled as a non-fight and the model classifies it as a non-fight, the predicted result is a true negative (TN). Both true positives and true negatives indicate agreement between the predicted value and the ground truth. If a video with the fight label is predicted as a non-fight by the model, the prediction is a false negative (FN). Similarly, if a video with the non-fight label is predicted as a fight by the model, the prediction is a false positive (FP).
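With these quantities, and assuming the standard definitions of the four measures, they are computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$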
To evaluate the performance, we compare the proposed model with seven existing models, namely 3D ConvNet [48], ConvLSTM [12], C3D [8], I3D (Fusion) [49], Flow Gated Network [46], Context-LSTM, and STS-ResNet [37].
Table 2 shows the prediction accuracy on the testing data and the number of parameters for SpikeConvFlowNet and the other methods. The testing performances of the first five models (from 3D ConvNet to Flow Gated Network) are those reported in their original papers. Note that the methods marked with an asterisk in Table 2 were implemented and trained by ourselves to obtain their test accuracy values.
The articles of the other seven models do not report the mean and variance of their accuracy. To verify the stability of the proposed model, we trained and evaluated it five times, randomly re-splitting the training and testing data for each experiment. The mean and variance of the testing accuracy are as follows: Movies (mean: 100%, variance: 0.00), Hockey (mean: 98.39%, variance: 0.26%), Crowd Violence (mean: 90.14%, variance: 1.83%), RWF2000 (mean: 88.35%, variance: 1.77%).
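A minimal sketch of this evaluation protocol is given below; the 80/20 split ratio and the `train_model` and `evaluate` helpers are hypothetical stand-ins for the actual training pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def run_experiments(videos, labels, n_runs=5):
    """Train and evaluate n_runs times with a fresh random split each run,
    then report the mean and variance of the testing accuracy."""
    accuracies = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            videos, labels, test_size=0.2, random_state=seed)  # assumed 80/20 split
        model = train_model(X_tr, y_tr)                 # hypothetical training routine
        accuracies.append(evaluate(model, X_te, y_te))  # hypothetical accuracy metric
    return np.mean(accuracies), np.var(accuracies)
```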
To further test the performance of the proposed method on datasets with multiple classes, two public human action recognition datasets are used, namely HMDB51 and UCF101 [29]. The HMDB51 dataset consists of 6766 video clips annotated with 51 action classes extracted from movies and YouTube. The UCF101 dataset contains 13,320 video clips categorized into 101 action classes collected from YouTube. Our model is compared with two methods for recognizing human actions: Context-LSTM and STS-ResNet. All prediction results on the testing data for each method are shown in Table 3. Note that we implemented Context-LSTM ourselves to obtain its testing accuracy on HMDB51, which is marked with an asterisk in the table; the other results are those reported in the respective articles.
The results in Table 2 reveal that the methods over-fit the Movies Fight dataset. The main reason is that this dataset contains only 200 clips, with 100 positive and 100 negative samples. It is balanced, but it is small, and the scenes in its clips are relatively simple. For example, fierce fighting between two or more people is considered violent, while walking or playing football is considered non-violent. As a result, it is easier to identify violent actions on the Movies dataset than on other datasets containing many complex real-life scenes. Hence, we conclude that this dataset is no longer suitable for effectively evaluating deep learning-based models for violence detection. The testing accuracy on the Crowd Violence and RWF2000 datasets is not as high as on the first two datasets. The main reason is that these two datasets consist of many real-world scenes, such as crowd activities, which makes it harder to distinguish fight from non-fight behavior. Among these methods, the RNN-based Context-LSTM performs best, partly because it employs a ResNet50 backbone well pre-trained on large-scale datasets; however, it is not an ideal model for small embedded devices and neuromorphic chips. Compared to STS-ResNet, the proposed model achieves comparable performance with only around one percent of the parameters. A possible reason is that the optical flow data enhance our method's ability to extract critical motion information.

Table 3 shows that the performance of the two SNN-based models drops significantly on the large-scale multi-class action recognition datasets. The main reason is that the HMDB51 and UCF101 datasets contain many complex scenes, and these two convolutional spiking neural networks currently lack the capacity to learn the features and recognize multi-class actions effectively on such challenging datasets. Generally, compared with the existing CNN/RNN-based methods and convolutional SNNs, SpikeConvFlowNet not only achieves comparable performance across all the violence detection datasets but also dramatically reduces the number of parameters. Moreover, the results verify that a shallow SNN-based model can perform well while avoiding the vanishing spike phenomenon.
The training time and inference efficiency are measured experimentally to further verify the proposed model's computational efficiency. All experiments are conducted on RWF2000, which includes more violent videos and more complex scenes than the other three datasets. The hardware configuration is described in Section 3.4. As shown in Table 2, Context-LSTM (RNN-based) and STS-ResNet (SNN-based) are selected as benchmarks against the proposed method. To speed up training, all videos are pre-processed and transformed into ndarrays, the array data type of the NumPy package. The models generally converge after being trained iteratively for 100 epochs. The GPU server is used to train the models, whereas the inference efficiency is measured on the CPU alone, mimicking devices with low computational resources. Training efficiency is measured as running time in hours, while inference efficiency is measured in frames per second (fps). The training time and inference efficiency are shown in Table 4.
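A rough sketch of this pre-processing step follows; the frame size, file names, and use of OpenCV for decoding are assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np

def video_to_ndarray(path: str, size=(224, 224)) -> np.ndarray:
    """Decode a video once and stack its frames into a single NumPy ndarray,
    so that training epochs read arrays instead of repeatedly decoding video."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)

# Cache the decoded clip to disk for fast loading during training.
np.save("clip_0001.npy", video_to_ndarray("clip_0001.avi"))
```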
The results demonstrate that the proposed model achieves higher training and inference speed than the benchmarks. Note that the training batch size is set to one because the video lengths are inconsistent, which increases the training time. Compared with Context-LSTM, the proposed model contains only one-tenth as many parameters, yet their training times are close; one reason is that computing optical flow slows down SpikeConvFlowNet's training.
Based on the results in Table 2, Table 3, and Table 4, it can be concluded that the proposed method is more efficient than existing CNN/RNN-based models and is better suited to low-power embedded devices. Compared with the best convolutional spiking neural network, this method significantly reduces the number of parameters with limited accuracy loss. This suggests that our method offers a potential solution for neuromorphic chips to detect violent behaviors efficiently.
Figure 6 shows the confusion matrices of the experimental results on the four datasets using the proposed model. They also demonstrate that the complex scenes in the videos of the last two datasets degrade the prediction accuracy relative to the other datasets. Based on the confusion matrices, the precision and recall are calculated and shown in Table 5. Generally, the recall is higher than the precision, which indicates that the proposed model is more inclined to classify a video as the fight type. The model thus avoids omitting fight samples in the testing data, although some non-fight videos may be falsely predicted as fight videos, which is acceptable for practical applications.
Furthermore, to experimentally explore the mechanisms of SpikeConvFlowNet, Figure 3 shows the average spiking activity in each block on the four testing datasets. It reveals that spikes generally account for a low percentage of each block's output (less than 10% in the first three SpikeConvBlocks); in other words, the output of each block is a relatively sparse binary tensor.
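The firing rate reported above can be measured as the fraction of ones in a block's binary output, as in the short sketch below; the block shape and the ~8% rate are hypothetical values chosen only for illustration.

```python
import numpy as np

def firing_rate(spikes: np.ndarray) -> float:
    """Fraction of neurons that fired; spikes is a binary (0/1) tensor."""
    return float(spikes.mean())

# Hypothetical block output: (time steps, channels, height, width).
spikes = (np.random.rand(16, 32, 56, 56) < 0.08).astype(np.float32)
print(f"firing rate: {firing_rate(spikes):.2%}")  # ~8%, i.e., a sparse binary tensor
```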
Figure 7 shows several samples of the feature maps in the hidden layers, i.e., SpikeConvBlock1, SpikeConvBlock2, and SpikeConvBlock3. For each layer, the heatmaps before and after the IF neuron are displayed. They suggest that the proposed model is capable of extracting spatial features through spiking neurons. The heatmaps also visually confirm the sparsity of the spikes fired in each layer, a property that makes the proposed architecture energy-efficient for hardware implementations of SNNs.
To sum up, a spiking neuron fires a pulse when its accumulated membrane potential exceeds a threshold. The membrane potential is an accumulation of historical inputs, so the neuron effectively remembers historical information. Consequently, during training or inference, the model only needs to receive the next frame sequentially and can output a prediction at each step, whereas CNN-based models need many historical frames to produce a result, which consumes more computing time and hardware resources. It is worth noting, however, that LSTM-based models can also remember historical information in a manner similar to spiking neurons.

Moreover, the output of each spiking neuron is a binary value. Training and using deep neural networks involve many matrix multiplication operations, and binary matrices make these multiplications substantially cheaper, saving time and computing resources; CNN/RNN-based models do not share this advantage. Compared with existing CNN/RNN-based models and convolutional SNNs, this method improves computing efficiency and saves computing time and hardware resources, as verified by the experimental results. Meanwhile, by combining the advantages of optical flow and convolutional spiking networks, it dramatically reduces the number of parameters and achieves higher inference efficiency with limited accuracy loss. In addition, since the output of spiking neurons is a discontinuous spike train, combining SNNs with neuromorphic chips can further improve computational efficiency and reduce power consumption.
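A minimal sketch of this frame-by-frame behavior is given below, assuming a basic integrate-and-fire (IF) neuron with a hard reset; the threshold value and per-frame input currents are hypothetical and do not reproduce SpikeConvFlowNet's actual neuron parameters.

```python
class IFNeuron:
    """Integrate-and-fire neuron: accumulates input current in a membrane
    potential and emits a binary spike when the potential crosses a threshold."""
    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold
        self.v = 0.0  # membrane potential persists across frames (memory)

    def step(self, current: float) -> int:
        self.v += current          # accumulate historical inputs
        if self.v >= self.threshold:
            self.v = 0.0           # hard reset after firing
            return 1               # binary spike
        return 0

# Process a video frame-by-frame: one input per step, one output per step.
neuron = IFNeuron(threshold=1.0)
frame_currents = [0.3, 0.4, 0.5, 0.1, 0.9]  # hypothetical per-frame inputs
spikes = [neuron.step(c) for c in frame_currents]
print(spikes)  # [0, 0, 1, 0, 1] -- a sparse binary spike train
```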