Article

Combining Keyframes and Image Classification for Violent Behavior Recognition

College of Computer, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(16), 8014; https://doi.org/10.3390/app12168014
Submission received: 24 June 2022 / Revised: 3 August 2022 / Accepted: 9 August 2022 / Published: 10 August 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Surveillance cameras are increasingly prevalent in public places, and security services urgently need to monitor violence in real time. However, current violent-behavior-recognition models focus on spatiotemporal feature extraction, which imposes high hardware resource requirements and can be affected by numerous interference factors, such as background information and camera movement. Our experiments found that violent and non-violent video frames can be classified by deep-learning models. Therefore, this paper proposes a keyframe-based violent-behavior-recognition scheme. Our scheme considers video frames as independent events and judges violent events based on whether the number of keyframes classified as violent exceeds a given threshold, which reduces hardware requirements. Moreover, to overcome interference factors, we propose a new training method in which background-removed and original image pairs facilitate feature extraction by deep-learning models without adding any complexity to the networks. Comprehensive experiments demonstrate that our scheme achieves state-of-the-art performance for the RLVS, Violent Flow, and Hockey Fights datasets, outperforming existing methods.

1. Introduction

Surveillance cameras are increasingly prevalent in public places, and are used to collect evidence and deter potential criminals. Security services urgently need to monitor violence in real time. The main violence detection models include “handcrafted” feature methods [1,2], CNN + LSTM [3,4], 3D convolutional networks [5,6], two-stream convolutional networks [7,8], and human pose estimation models [9,10], all of which mainly use CNN, 3DCNN, and LSTM to extract spatiotemporal features. However, these methods have high hardware resource requirements and can be impacted by various interference factors, such as background information, camera movement, and occlusion.
To overcome these problems, S3-Net [11] locates and segments target sub-scenes while extracting structured time-series semantic features, which are used as inputs by an LSTM-based spatiotemporal model. Motivated by S3-Net, we propose that segmenting the human body can enable the model to learn better features and develop stronger generalization ability. Semantic segmentation [12,13,14,15] can remove the background from video frames, and we use Deeplab-V3plus [14] to obtain 100,000 background-removed (segmented) images. However, on a test set of original images, the recognition rate of a model trained on segmented images is much lower than that of one trained on original images; conversely, on a test set of segmented images, the recognition rate of a model trained on original images is much lower than that of one trained on segmented images. This means that the deep-learning model does not learn the essential image features: the mapping from inputs to outputs performed by deep networks quickly ceases to make sense if new inputs differ even slightly from those observed during training. A similar phenomenon is common in automatic speech recognition, where noise-free datasets are readily available but models trained only on clean data are difficult to apply to noisy scenes; researchers therefore add noise to the training data to increase model robustness. For example, wav2vec-Switch [16] encodes noise robustness into contextualized representations of speech via contrastive learning, feeding original–noisy speech pairs simultaneously into the network. Accordingly, we use pairs of segmented and original images to train the model, forcing the network to produce consistent predictions for the original and segmented image. Specifically, if the model wants to minimize the loss, its best choice is to match the prediction of the segmented image with that of the original image. This enables the development of a robust model whose predictions for the original image are in line with those for the segmented image.
However, current violent-behavior-recognition models have high hardware resource requirements, making them difficult to deploy on a large scale. Serrano et al. [17] propose a hybrid “handcrafted/learned” feature framework, which provides better accuracy than previous feature-learning methods. Inspired by this, we combine keyframes and deep learning to develop an approach that extracts the keyframes of a video and judges violent behavior from the keyframe recognition results. Specifically, the proposed method treats video frames as independent events and judges whether a violent event is occurring based on whether the number of keyframes classified as violent exceeds a given threshold. The number of extracted keyframes controls the recognition efficiency of the system, while the judgment threshold controls the system’s sensitivity.
In summary, we propose a new keyframe-based violent-behavior-recognition scheme with low hardware requirements and a high recognition rate. Furthermore, we build a dataset of violent images with the backgrounds removed and propose a new training method.

2. Related Works

2.1. Datasets

Existing violent behavior datasets include the Real Life Violence Situations Dataset (RLVS) [18], the Hockey Fights Dataset [19], RWF-2000 [20], the CCTV Fights Dataset [21], and Violent Flow [22]. The RLVS contains 2000 short videos divided into 1000 violent and 1000 non-violent videos. RWF-2000 contains 1000 violent and 1000 non-violent videos captured by surveillance cameras in real-world scenes. The CCTV Fights dataset contains 1000 videos of real fights, with more than 8 h of annotated CCTV footage. The Hockey Fights dataset contains 1000 sequences divided into two groups: fights and non-fights. Existing semantic segmentation datasets include PASCAL Context [23], Cityscapes [24], Microsoft Common Objects in Context (MS COCO) [25], and VSPW [26], all of which contain human annotations. However, the human poses in semantic segmentation datasets are typical, whereas the poses in violence datasets are often abnormal. In addition, semantic segmentation datasets mostly consist of high-quality images (for example, the pictures in the COCO dataset are high resolution and taken with professional equipment; the latest VSPW dataset is the first large-scale dataset for video scene parsing in the wild, but more than 96% of its video frames are high quality and lack abnormal behavior), while images in violent behavior datasets are usually collected by security cameras and mobile phones, meaning that they are often low resolution and unevenly illuminated. Therefore, models trained on semantic segmentation datasets will experience widespread feature loss on violent behavior datasets. To overcome these problems, we need to annotate video frames containing violence to supplement the semantic segmentation dataset, enabling the model to better adapt to violent scenes.

2.2. Semantic Segmentation Algorithms

Traditional Gaussian [27] and KNN [28] approaches are suitable for static surveillance cameras, but not for videos with moving camera angles. Compared with traditional methods, deep-learning-based semantic segmentation models adapt better to complex scenes. SegNet [13] aims to solve image semantic segmentation for autonomous driving or robotics and is suited to road scenes. Deeplab-V3plus [14] can capture multiscale information and recover the edge information of objects, producing an excellent semantic segmentation effect when the targets in an image are of different sizes. Background Matting V2 [29] proposes a real-time, high-resolution background-replacement technique that operates at 30 fps at 4K resolution. Therefore, we need to choose an image segmentation algorithm that is suitable for violent scenes.

2.3. Keyframe Extraction Methods

Detecting keyframes reduces the number of redundant frames processed by the system, thereby improving its efficiency. Keyframe extraction methods include traditional methods [30,31,32] and deep-learning methods [33,34,35]. Traditional methods are fast and require no training, while deep-learning methods can extract keyframes more accurately but have higher hardware requirements and need large amounts of data. Because the judgment of violent events does not depend on specific features, traditional methods can be used to extract keyframes in our approach.

2.4. Violence-Detection Models

Depending on the deep-learning network structure employed, the main violence-detection models include CNN + LSTM, VGG + LSTM, 3D convolutional networks (3DCNN), two-stream convolutional networks, and human-pose estimation. These models mainly use CNN, 3DCNN, and LSTM to extract spatiotemporal features from image or optical flow data. Therefore, they can be impacted by various interference factors, such as camera movement, occlusion, and complex scenes. Specifically, CNN-based models struggle with interference from the background and occlusion. The background and camera movement greatly impact 3DCNN-based and LSTM-based models, making the currently available datasets unsuitable for training them. Although the two-stream convolutional network adds optical flow data to make the model pay attention to the moving human body, it is affected by camera movement and illumination variation. The recognition rate of human-pose estimation decreases when the number of people increases and body postures are occluded. Finally, adding attention, motion detection, and other mechanisms makes the model pay attention to behavioral features but reduces its speed.

3. Proposed Method

We propose a violence-detection system that combines keyframe extraction, image classification, and probability. In addition, we use segmented–original image pairs to improve the recognition rate of the deep-learning models.

3.1. The Overall Framework

Our model incorporates keyframe extraction, image classification, and violent-behavior recognition (Figure 1). The keyframe extraction algorithm uses the frame-difference method to calculate the local maximum value within a sliding window (N continuous frames) and selects the local-maximum frame in every K continuous frames as a keyframe. Compared with the current mainstream violence-recognition models, this design reduces the number of processed frames to 1/K of the total video frames. ResNet18 is used as the keyframe-feature-recognition network, which outputs the probability that a given keyframe contains violent behavior. Compared with 3DCNN-based and VGG-based models, ours has fewer parameters, which reduces the hardware requirements of the system. Furthermore, our model is easier to train than 3DCNN-based and LSTM-based models. The violent-behavior-recognition module determines the video category from the number of keyframes classified as containing violence. We control the recognition efficiency of the system by setting the number of keyframes and control the recognition sensitivity by setting the threshold. Therefore, our model is robust and can be flexibly adjusted: in applications, the threshold and processing efficiency can be selected according to the requirements of the actual situation.

3.2. Dataset

We use the frame-difference method to extract keyframes from the RLVS, RWF-2000, and Hockey Fights datasets and use Labelme to annotate 5000 human poses in order to train a semantic segmentation model (Figure 2). In our experiments, we found that the labeling of the data can significantly impact the model. If the fight participants are accurately labeled, the model will split the human figures; moreover, if a dense group of people is labeled as a single person, the model will not effectively segment the crowd. Therefore, we label individual fight participants as human poses, label dense groups of people as a human body, and also label them separately. These annotated data are combined with the MS COCO dataset to train a semantic segmentation model. We set up a dataset to test whether the model is suitable for scenes of violence and, based on the segmentation effect (Figure 3), select Deeplab-V3plus as the segmentation model. The establishment of our dataset is shown in Figure 4. The segmented images need to maintain a mutual index relationship with the original images so that each background-removed and original pair can be loaded together during training (Figure 5); the final set includes 100,000 background-removed pictures.
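To make the pairing concrete, the sketch below shows one way such a mutual index could be realized as a paired PyTorch dataset; the directory layout, class names, and helper structure are our own assumptions for illustration, not the authors' released code.

import os
from PIL import Image
from torch.utils.data import Dataset

class PairedViolenceDataset(Dataset):
    # Loads (original, segmented, label) triples via a shared file-name index.
    # Assumed layout: root/original/<label>/<name>.jpg and root/segmented/<label>/<name>.jpg,
    # where <label> is 'nonviolence' or 'violence'.
    def __init__(self, root, transform=None):
        self.samples = []          # list of (original_path, segmented_path, label)
        self.transform = transform
        for label_idx, label in enumerate(["nonviolence", "violence"]):
            orig_dir = os.path.join(root, "original", label)
            seg_dir = os.path.join(root, "segmented", label)
            for name in sorted(os.listdir(orig_dir)):
                seg_path = os.path.join(seg_dir, name)
                if os.path.exists(seg_path):   # keep only frames that have both versions
                    self.samples.append((os.path.join(orig_dir, name), seg_path, label_idx))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        orig_path, seg_path, label = self.samples[idx]
        orig = Image.open(orig_path).convert("RGB")
        seg = Image.open(seg_path).convert("RGB")
        if self.transform is not None:
            orig, seg = self.transform(orig), self.transform(seg)
        return orig, seg, label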

3.3. Keyframe Extraction

While deep-learning methods can accurately extract keyframes, they require high-performance hardware and reduce the system’s processing speed. The characteristics of violent video frames are apparent; thus, traditional methods can be used to extract keyframes. The available traditional methods have roughly equivalent probabilities of extracting frames containing violence, with all of them able to reach or exceed a success rate of 98.90%. In practical applications, the frame difference method can better control the number of extracted keyframes. Therefore, we propose to use this method in our approach, as it can reduce the number of false-positive (non-violent) frames extracted by controlling the length of the window. For example, some nonviolent behaviors (such as hugging, shaking hands, etc.) may briefly appear similar to violent behaviors, and an appropriate window length can reduce the chances of these being flagged as violent. Moreover, because violent events are uncertain in the real world, keyframes can also be extracted at fixed intervals.
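As an illustration, the following minimal sketch (our own, using OpenCV) implements the frame-difference selection described above: the mean absolute difference between consecutive grayscale frames is computed, and the frame with the locally maximal difference in every K consecutive frames of an N-frame window is kept as a keyframe. Function and parameter names are placeholders.

import cv2
import numpy as np

def extract_keyframes(video_path, window_n=80, step_k=16):
    # Select the frame with the largest frame difference in every K consecutive frames
    # of an N-frame window, yielding roughly N/K keyframes.
    cap = cv2.VideoCapture(video_path)
    frames, diffs = [], []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # mean absolute difference to the previous frame (0 for the first frame)
        diffs.append(0.0 if prev_gray is None else float(np.mean(cv2.absdiff(gray, prev_gray))))
        frames.append(frame)
        prev_gray = gray
    cap.release()

    keyframes = []
    # Only the first window_n frames are considered here for simplicity;
    # in practice the window can slide over the whole video or be sampled at fixed intervals.
    for start in range(0, min(len(frames), window_n), step_k):
        chunk = diffs[start:start + step_k]
        if chunk:
            keyframes.append(frames[start + int(np.argmax(chunk))])  # local maximum of the chunk
    return keyframes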

3.4. Our Training Method

Current violence-detection systems mainly extract the spatiotemporal features of the video, leading to the accumulation of errors across consecutive frames. For example, the 3DCNN-based method takes 16 consecutive frames as input, which introduces many uncertainties: camera movement and changes in illumination generate additional errors. In addition, these structures increase the complexity of the model and its hardware requirements. The human pose in violent video frames is significantly different from that in normal video frames and can thus be classified by a deep-learning model, which reduces the structural complexity of the system. We tested the recognition performance of ResNet18, ResNet34, ResNet101, DenseNet121, VGG19, and GoogLeNet on frames containing violence and confirmed that this is feasible. These models are trained on segmented–original image pairs. The segmented and original images guide each other, forcing the network to make consistent predictions for both. Specifically, if the model is robust to background noise, the representation of an original image should also predict the target of its segmented version, and vice versa. This is equivalent to creating a feature space between the segmented image and the original image; theoretically, such a robust feature space can be found. We provide an interpretable theoretical proof below.
Proof.
$A_{n \times m}$ is the perfect feature space of the segmented images, and $B_{n \times m}$ is the perfect feature space of the original images, where $n$ is the number of images and $m$ is the dimension of the image feature vector. $a_i$ is the feature representation of $segment\_image_i$, $b_i$ is the feature representation of $original\_image_i$, $l_i$ is the model prediction feature of $segment\_image_i$, and $l_i'$ is the model prediction feature of $original\_image_i$:

$$A_{n \times m} = [a_1, a_2, a_3, \dots, a_n]^T, \quad a_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{im}), \quad l_i = (x_{i1}, x_{i2}, \dots, x_{im})$$

$$B_{n \times m} = [b_1, b_2, b_3, \dots, b_n]^T, \quad b_i = (\beta_{i1}, \beta_{i2}, \dots, \beta_{im}), \quad l_i' = (y_{i1}, y_{i2}, \dots, y_{im})$$

$$Loss = \sum_{i=1}^{n} \left( \| l_i - a_i \|^2 + \| l_i' - b_i \|^2 \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ (x_{ij} - \alpha_{ij})^2 + (y_{ij} - \beta_{ij})^2 \right]$$

When each term $(x_{ij} - \alpha_{ij})^2 + (y_{ij} - \beta_{ij})^2$ is at its smallest, $Loss$ is also at its smallest, so we consider

$$f(x_{ij}, y_{ij}) = (x_{ij} - \alpha_{ij})^2 + (y_{ij} - \beta_{ij})^2 \qquad (1)$$
We define the feature representation difference between the original image and its background-removed image as the distance. The segmented image is the foreground information of the original image, so the distance between them is small. Therefore, we add Function (2). Constraint (3) is obtained from Function (2).
$$l_i - l_i' = \Delta l_i, \quad \Delta l_i = (\Delta_{i1}, \Delta_{i2}, \Delta_{i3}, \dots, \Delta_{im}), \quad x_{ij} - y_{ij} = \Delta_{ij} \qquad (2)$$

$$g(x_{ij}, y_{ij}) = y_{ij} - x_{ij} + \Delta_{ij} = 0 \qquad (3)$$
We aim to find the minimum value of $f(x_{ij}, y_{ij})$ under Constraint (3). Using the Lagrange multiplier method, we solve $L(x_{ij}, y_{ij}, \lambda) = f(x_{ij}, y_{ij}) + \lambda g(x_{ij}, y_{ij})$ for its stationary point: $\partial L / \partial x_{ij} = 2x_{ij} - 2\alpha_{ij} - \lambda = 0$, $\partial L / \partial y_{ij} = 2y_{ij} - 2\beta_{ij} + \lambda = 0$, and $\partial L / \partial \lambda = y_{ij} - x_{ij} + \Delta_{ij} = 0$, which yield $x_{ij} = \alpha_{ij} + \frac{1}{2}(\beta_{ij} - \alpha_{ij} + \Delta_{ij})$ and $y_{ij} = x_{ij} - \Delta_{ij}$. Since $L_{xx} = 2$, $L_{yy} = 2$, and $L_{xy} = 0$, this stationary point is a minimum.
Segmented pictures are the foreground information of the original pictures, and they express the same information. When the model converges, the distance between $l_i$ and $l_i'$ tends towards $0$:

$$\lim_{\Delta_{ij} \to 0} x_{ij} = \lim_{\Delta_{ij} \to 0} y_{ij} = \alpha_{ij} + \frac{1}{2}(\beta_{ij} - \alpha_{ij})$$
The feature space learned by the model is therefore $C_{n \times m}$: the features predicted by the model lie between $A_{n \times m}$ and $B_{n \times m}$, which is equivalent to $C_{n \times m} = A_{n \times m} + \frac{1}{2}(B_{n \times m} - A_{n \times m})$. □

In classification models, the loss consists of three parts (4): the loss between the segmented image and the label ($loss\_seg$), the loss between the original image and the label ($loss\_unseg$), and the distance between the fully connected layer features of the two images ($D$), where $\beta$ is a hyperparameter. As shown in Figure 1, during training, the loss between the segmented–original images and their labels is reduced, which is equivalent to finding the minimum value of Formula (1); at the same time, the fully connected layer features of the segmented and original images gradually approach each other, which is equivalent to reducing the value of $\Delta l_i$ in Formula (2). The optimization problems of Formulas (1) and (2) can be effectively solved by means of stochastic gradient descent and backpropagation. The segmented–original image pair thus makes the model create a feature space between the segmented image and the original image; theoretically, such a robust feature space can be found, thereby reducing the interference of background information.

$$Loss = loss\_seg + loss\_unseg + \beta D \qquad (4)$$
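As an illustration, one training step of Formula (4) could be implemented in PyTorch as follows; the wrapper class, the use of the penultimate (pre-fully-connected) features as the representation whose distance defines D, and the cross-entropy losses are our assumptions, with β = 0.001 following Section 4.2.

import torch
import torch.nn as nn
import torchvision.models as models

class ResNet18WithFeatures(nn.Module):
    # ResNet18 that returns both class logits and the penultimate feature vector.
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights=None)  # torchvision >= 0.13; use pretrained=False on older versions
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # everything up to global average pooling
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)

    def forward(self, x):
        feat = torch.flatten(self.features(x), 1)
        return self.fc(feat), feat

def training_step(model, optimizer, orig, seg, labels, beta=0.001):
    # One step of Formula (4): Loss = loss_seg + loss_unseg + beta * D
    criterion = nn.CrossEntropyLoss()
    logits_orig, feat_orig = model(orig)
    logits_seg, feat_seg = model(seg)
    loss_unseg = criterion(logits_orig, labels)          # original image vs. label
    loss_seg = criterion(logits_seg, labels)             # segmented image vs. label
    D = torch.norm(feat_orig - feat_seg, dim=1).mean()   # distance between the pair's features
    loss = loss_seg + loss_unseg + beta * D
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()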

3.5. Judgment of Violence

In social dynamics, a simple approach consists of substituting the continuous distribution over the micro-states with a discrete one, so that each node of the discrete micro-scale variable represents the number of particles in a certain domain of the space of microscopic states [36]. Analogously, we extract the keyframes of a video as the basis for judging its behavior. We use the proportion of violent frames among the keyframes to measure whether violence is present in the video, which reduces the error accumulation of the spatiotemporal features of consecutive frames. Here, $P$ is the recognition rate of the image classification model, $Keyframe\_no$ is the number of keyframes extracted from a violent video, $Threshold\_no$ is the threshold for judging that the video contains violent behavior, and $P_{Recognition}$ is the probability that the model recognizes violent behavior. We define the evaluation function of $P_{Recog}$ in Function (6). The smallest $Keyframe\_no / Totalframe$ means that the model processes the fewest frames, which improves the model's processing speed; the largest $Threshold\_no / Keyframe\_no$ means that the model has the best stability when identifying violent behaviors. Therefore, we can adjust the system according to the needs of the actual situation. Specifically, the extraction of keyframes controls the number of frames processed by the model: according to the evaluation Function (6), choosing $\min(Keyframe\_no / Totalframe)$ means that the smallest number of frames needs to be handled. When the number of keyframes is determined (Figure 6 shows the recognition with 16 keyframes), we can adjust the system's sensitivity by adjusting the threshold. When $P = 0.9$, if $P_{Recognition}$ is required to exceed 95%, we can select 8, 9, 10, 11, or 12 as $Threshold\_no$; according to $\max(Threshold\_no / Keyframe\_no)$, $Threshold\_no = 12$ produces the most reliable judgment.
$$P_{Recognition} = \sum_{k > Threshold\_no}^{Keyframe\_no} C_{Keyframe\_no}^{k} \, P^{k} (1 - P)^{Keyframe\_no - k} \qquad (5)$$

$$Threshold\_no \in \left[ \frac{Keyframe\_no}{2}, \ Keyframe\_no \right]$$

$$Evaluate(P_{Recog}) = \left\{ \min\left( \frac{Keyframe\_no}{Totalframe} \right), \ \max\left( \frac{Threshold\_no}{Keyframe\_no} \right) \right\} \qquad (6)$$
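The binomial sum in Formula (5) and the resulting threshold choice can be reproduced with a few lines of Python. Treating the threshold as inclusive (summing from k = Threshold_no) matches the worked example above; the helper names are ours.

from math import comb

def p_recognition(p, keyframe_no, threshold_no):
    # Formula (5): probability that at least threshold_no of keyframe_no independent
    # keyframes are classified as violent, given per-frame recognition rate p.
    return sum(comb(keyframe_no, k) * p ** k * (1 - p) ** (keyframe_no - k)
               for k in range(threshold_no, keyframe_no + 1))

def pick_threshold(p=0.9, keyframe_no=16, target=0.95):
    # Largest threshold in [keyframe_no / 2, keyframe_no] that keeps P_Recognition above the target.
    candidates = [t for t in range(keyframe_no // 2, keyframe_no + 1)
                  if p_recognition(p, keyframe_no, t) > target]
    return max(candidates) if candidates else None

# pick_threshold() returns 12 for p = 0.9 and 16 keyframes, matching Threshold_no = 12 above.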

4. Experiments

4.1. Deeplab-V3plus

We combine the annotated violent frames with the MS COCO dataset to train Deeplab-V3plus; the combined training data include 12,000 human poses. The hyperparameter settings are shown in Table 1, and these settings enable better adaptation to violent scenes. The backbone network uses Xception to adapt to the changes in human poses and the complex backgrounds of frames containing violence. The downsampling value is 8, which enables better segmentation of small objects. The input image size is set to 224 × 224 to facilitate better handling of low-quality frames containing violence. The training of the model consists of a freezing stage and an unfreezing stage. In the freezing stage, the backbone network is frozen and the feature extraction weights do not change. In the unfreezing stage, the backbone network is not frozen and the feature extraction weights are updated, which consumes a large amount of GPU memory. We recommend training for 60 epochs (we used a pretrained model; see Figure 7). The model achieves performance of 90.1% and 86.20% on the training and testing sets, respectively.
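A minimal sketch of this freeze/unfreeze schedule with the learning rates of Table 1 is given below. It assumes a generic PyTorch segmentation model that exposes its feature extractor as model.backbone (as torchvision's DeepLabV3 implementations do); the optimizer choice, epoch counts, and data handling are illustrative only.

import torch

def set_backbone_trainable(model, trainable):
    # Freeze or unfreeze the backbone (feature-extraction) parameters.
    for param in model.backbone.parameters():
        param.requires_grad = trainable

def two_stage_training(model, train_loader, loss_fn, device="cuda",
                       freeze_epochs=10, freeze_lr=5e-4,
                       unfreeze_epochs=70, unfreeze_lr=5e-5):
    # Table 1: freeze for 10 epochs, then continue training with the backbone unfrozen.
    model.to(device)
    for epochs, lr, trainable in [(freeze_epochs, freeze_lr, False),
                                  (unfreeze_epochs, unfreeze_lr, True)]:
        set_backbone_trainable(model, trainable)
        optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
        for _ in range(epochs):
            for images, masks in train_loader:
                images, masks = images.to(device), masks.to(device)
                outputs = model(images)["out"]  # torchvision DeepLabV3 returns a dict; adapt to other models
                loss = loss_fn(outputs, masks)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()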

4.2. Facilitation of Model Learning by Segmented–Original Images

To explore the effect of segmentation on the distribution of the dataset features, we visualized the dimensionality-reduced segmented dataset and the original dataset. We set up two control groups. The first group contains 10,000 violent pictures and 10,000 non-violent pictures randomly selected from each original dataset. The second group contains 10,000 violent and 10,000 non-violent images, which are the background-removed versions of the first group. We used t-SNE to reduce the dimensionality of these pictures and then visualized them. As shown in Figure 8b,d, the two sets of dimensionality-reduction results show that segmenting the human pose makes the feature space of the violent behavior dataset more clustered.
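A possible implementation of this visualization with scikit-learn's t-SNE is sketched below; how the images are converted to feature vectors (flattened, resized pixels or CNN embeddings) is left open, since the text does not specify it.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_tsne(features, labels, title):
    # features: (n_samples, n_features) array, e.g. flattened images or CNN embeddings
    # labels: 0 for non-violence, 1 for violence
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    labels = np.asarray(labels)
    for cls, name in [(0, "non-violence"), (1, "violence")]:
        pts = embedded[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=2, label=name)
    plt.legend()
    plt.title(title)
    plt.show()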
We designed six comparative experiments (Table 2) to verify the promotion effect of the segmented–original image pair on the model, performing experiments on VGG19, GoogLeNet, ResNet18, and DenseNet121. On ResNet18 and DenseNet121, we adopt two methods: making the fully connected layer features of the segmented–original image pair close to each other (with the hyperparameter $\beta$ set to 0.001), and directly using the segmented–original image pair for training. Experiments show that both methods are effective. For VGG19 and GoogLeNet, we adopt the second method. Figure 9 shows the change in the recognition rate for each group during the training process. The significant difference between the recognition rates of the first and second groups indicates that the model trained on the original images has not learned the essential features, which is mainly due to the interference of background information. The significant difference between the recognition rates of the third and fourth groups indicates that the model trained on the segmented images has not learned the essential features either, mainly because a model trained on pure data is not robust. Moreover, the recognition rate of the third group is higher than that of the fourth group, indicating that segmented images can reduce the influence of the background on the model. The fifth and sixth groups have the highest recognition rates for both original and segmented images, indicating that the segmented–original pairs promote the model's learning of image features without increasing the model complexity. In summary, the segmented–original image pair aids the model in learning essential features, thereby improving the recognition rate (Table 3).
We designed an encoder–decoder network to visualize the interaction between the segmented image and the original image (Table 4). The loss consists of three parts (7): the reconstruction loss of the segmented image, the reconstruction loss of the original image, and the distance between the hidden layers of the two images, where $\beta$ is a hyperparameter. Reducing the reconstruction losses of the segmented and original images is equivalent to finding the minimum value of Formula (1), while the distance between the hidden layers of the two images gradually shrinks, which is equivalent to reducing the value of $\Delta l_i$ in Formula (2). The optimization problems of Formulas (1) and (2) can be effectively solved by means of stochastic gradient descent and backpropagation. Although this design increases the reconstruction error, it forces the model to pay attention to the foreground information, thereby decoupling the reconstruction from the background information (Figure 10). Our experiments show that the segmented and original images make the model create a feature space between them; theoretically, such a robust feature space can be found, thereby reducing the interference of background information.
$$Loss = recon\_loss\_seg + recon\_loss\_unseg + \beta \times D \qquad (7)$$

4.3. Keyframes

We counted the distribution of frame counts in violent videos and found that the number of video frames varies greatly between datasets (Table 5). Moreover, the frame-difference distributions of different violent videos are different (Figure 11); as a result, the important violent frames (such as those in Figure 11a) cannot always be extracted from a fixed run of consecutive frames. Therefore, using a fixed number of frames is an unreliable approach to identifying violent behavior. We instead recommend dynamically adjusting the number of frames to be identified based on the total number of frames in the video. We select the keyframes within a given window length that are most different from the other frames, which allows our system to control the number of keyframes. Therefore, the number of keyframes can be used to control the recognition efficiency and robustness of the system. Specifically, a sliding window containing N frames is established, and the local-maximum frame in every K continuous frames is selected as a keyframe, yielding N/K keyframes for judgment. The setting of N prevents short-duration behaviors that are superficially similar to violent behaviors (such as friendly play-fighting, hugs, and handshakes) from causing false alarms. Moreover, K enforces uniform sampling, which prevents camera movement and lighting changes from causing many redundant frames to be extracted. In summary, extracting keyframes more effectively accumulates the recognition results of single frames and improves the overall recognition accuracy.

4.4. Our Violent-Behavior-Recognition System

Our proposed training method enables ResNet18, ResNet34, ResNet101, DenseNet121, and GoogLeNet to recognize frames containing violence with over 90% accuracy, showing that the method is applicable to almost all convolutional networks. We choose ResNet18 as the recognition model because of its low hardware requirements. The training and test set ratios on the RLVS and Hockey Fights datasets are 0.8 and 0.2, respectively; on the Violent Flow dataset, they are 0.9 and 0.1. On the Violent Flow dataset, it is difficult for the semantic segmentation algorithm to obtain enough segmented images, so we only use the original images for training. ResNet18 achieves a recognition rate of more than 91.39% on the three datasets (Table 6) and can theoretically achieve a recognition rate of more than 95% for violent behaviors (Figure 6). Setting the value of K controls the number of keyframes extracted. When K is fixed, the threshold setting affects the recognition rate (Table 7), and changes to both K and the threshold affect the recognition rate of the system (Table 8). The recognition rate of our system reaches state-of-the-art levels on RLVS and Violent Flow and is close to the state of the art on Hockey Fights (Table 9). Compared with other models that extract spatiotemporal features, our approach has a more straightforward structure, making it easier to train and reducing the hardware performance requirements.
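Putting the pieces together, inference on a single video could look like the following sketch, which reuses the hypothetical extract_keyframes helper (Section 3.3 sketch) and the classifier wrapper (Section 3.4 sketch) and applies the proportional thresholds explored in Tables 7 and 8.

import torch
import torchvision.transforms as T

# Assumes extract_keyframes() from the Section 3.3 sketch and a trained classifier whose
# forward pass returns (logits, features), already moved to the target device.
preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

@torch.no_grad()
def is_violent(video_path, model, threshold=0.5, device="cuda"):
    # Flag a video as violent if the fraction of keyframes classified as violent
    # exceeds the threshold (0.5 and 0.7 are the settings in Tables 7 and 8).
    model.eval()
    keyframes = extract_keyframes(video_path)
    if not keyframes:
        return False
    violent = 0
    for frame in keyframes:
        x = preprocess(frame[:, :, ::-1].copy()).unsqueeze(0).to(device)  # OpenCV BGR -> RGB
        logits, _ = model(x)
        violent += int(logits.argmax(dim=1).item() == 1)  # class index 1 = violence
    return violent / len(keyframes) > threshold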

5. Conclusions

We combine keyframes and image classification to implement a robust and adjustable violent behavior recognition system. The number of keyframes can control the recognition efficiency of the model, while probabilistic thresholds can control the robustness and sensitivity of the system. In addition, we propose a new training method to improve the model’s recognition rate. Our system performance can surpass or approach that of current violent behavior recognition systems without increasing the complexity of the model structure.

Author Contributions

Data curation, D.L.; Validation, D.L.; Writing—original draft, Y.B.; Writing—review & editing, D.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the College of Computer, National University of Defense Technology. The APC was also funded by the College of Computer, National University of Defense Technology.

Institutional Review Board Statement

The study did not require ethical approval.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Please contact [email protected] for experimental data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Giannakopoulos, T.; Kosmopoulos, D.; Aristidou, A.; Theodoridis, S. Violence content classification using audio features. In Artificial Intelligence; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3955, pp. 502–507. [Google Scholar]
  2. Chen, L.-H.; Su, C.-W.; Hsu, H.-W. Violent scene detection in movies. Int. J. Pattern Recognit. Artif. Intell. 2011, 25, 1161–1172. [Google Scholar] [CrossRef]
  3. Sudhakaran, S.; Lanz, O. Learning to detect violent videos using convolutional long short-term memory. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017. [Google Scholar]
  4. Rendón-Segador, F.J.; Álvarez-García, J.A.; Enríquez, F.; Deniz, O. ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence. Electronics 2021, 10, 1601. [Google Scholar] [CrossRef]
  5. Gkountakos, K.; Ioannidis, K.; Tsikrika, T.; Vrochidis, S.; Kompatsiaris, I. Crowd Violence Detection from Video Footage. In Proceedings of the 2021 International Conference on Content-Based Multimedia Indexing (CBMI), Lille, France, 28–30 June 2021.
  6. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  7. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2014; pp. 568–576. [Google Scholar]
  8. Zhou, P.; Ding, Q.; Luo, H.; Hou, X. Violent interaction detection in video based on deep learning. J. Phys. Conf. Ser. 2017, 844, 012044. [Google Scholar] [CrossRef]
  9. Yasin, H.; Hussain, M.; Weber, A. Keys for Action: An Efficient Keyframe-Based Approach for 3D Action Recognition Using a Deep Neural Network. Sensors 2020, 20, 2226. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Morais, R.; Le, V.; Tran, T.; Saha, B.; Mansour, M.; Venkatesh, S. Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  11. Cheng, Y.; Yang, Y.; Chen, H.B.; Wong, N.; Yu, H. S3-Net: A Fast and Lightweight Video Scene Understanding Network by Single-shot Segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Online, 5–9 January 2021; pp. 3328–3336. [Google Scholar]
  12. Zhang, J.; Yang, K.; Ma, C.; Reiß, S.; Peng, K.; Stiefelhagen, R. Bending reality: Distortion-Aware transformers for adapting to panoramic semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  13. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  15. Deng, J.; Zhong, Z.; Huang, H.; Lan, Y.; Han, Y.; Zhang, Y. Lightweight semantic segmentation network for real-time weed mapping using unmanned aerial vehicles. Appl. Sci. 2020, 10, 7132. [Google Scholar] [CrossRef]
  16. Sadhu, S.; He, D.; Huang, C.-W.; Mallidi, S.H.; Wu, M.; Rastrow, A.; Stolcke, A.; Droppo, J.; Maas, R. wav2vec-c: A self-supervised model for speech representation learning. Proc. Interspeech 2021, 2021, 711–715. [Google Scholar]
  17. Serrano, I.; Deniz, O.; Espinosa-Aranda, J.L.; Bueno, G. Fight Recognition in Video Using Hough Forests and 2D Convolutional Neural Network. IEEE Trans. Image Process. 2018, 27, 4787–4797. [Google Scholar] [CrossRef] [PubMed]
  18. Soliman, M.M.; Kamal, M.H.; El-Massih, N.M.A.; Mostafa, Y.M.; Chawky, B.S.; Khattab, D. Violence Recognition from Videos using Deep Learning Techniques. In Proceedings of the Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), Chongqing, China, 8–10 December 2019. [Google Scholar]
  19. Nievas, E.B.; Suarez, O.D.; García, G.B.; Sukthankar, R. Violence detection in video using computer vision techniques. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Seville, Spain, 29–31 August 2011. [Google Scholar]
  20. Cheng, M.; Cai, K.; Li, M. RWF-2000: An Open Large Scale Video Database for Violence Detection. In Proceedings of the International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021. [Google Scholar]
  21. Perez, M.; Kot, A.C.; Rocha, A. Detection of Real-world Fights in Surveillance Videos. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
  22. Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  23. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vision 2010, 88, 303–338. [Google Scholar]
  24. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  25. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D. Microsoft coco: Common objects in context. In Proceedings of the European Conference On Computer Vision, Zürich, Switzerland, 6–12 September 2014. [Google Scholar]
  26. Miao, J.; Wei, Y.; Wu, Y.; Liang, C.; Li, G.; Yang, Y. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  27. Zivkovic, Z. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the International Conference on Pattern Recognition, Cambridge, UK, 23–26 August 2004. [Google Scholar]
  28. Zivkovic, Z.; van der Heijden, F. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognit. Lett. 2006, 27, 773–780. [Google Scholar] [CrossRef]
  29. Lin, S.; Ryabtsev, A.; Sengupta, S.; Curless, B.; Seitz, S.; Kemelmacher-Shlizerman, I. Real-Time High-Resolution Background Matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  30. Sun, Z.; Jia, K.; Chen, H. Video Key Frame Extraction Based on Spatial-Temporal Color Distribution. In Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Haerbin, China, 15–17 August 2008. [Google Scholar]
  31. Hannane, R.; Elboushaki, A.; Afdel, K.; Naghabhushan, P.; Javed, M. An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram. Int. J. Multimedia Inf. Retr. 2016, 5, 89–104. [Google Scholar] [CrossRef]
  32. Guan, G.; Wang, Z.; Lu, S.; Deng, J.D.; Feng, D.D. Keypoint-Based Keyframe Selection. IEEE Trans. Circuits Syst. Video Technol. 2013, 23, 729–734. [Google Scholar] [CrossRef]
  33. Kar, A.; Rai, N.; Sikka, K.; Sharma, G. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  34. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  35. Man, G.; Sun, X. Interested Keyframe Extraction of Commodity Video Based on Adaptive Clustering Annotation. Appl. Sci. 2022, 12, 1502. [Google Scholar] [CrossRef]
  36. Bellomo, N.; Burini, D.; Dosi, G.; Gibelli, L.; Knopoff, D.; Outada, N.; Terna, P.; Virgillito, M.E. What is life? A perspective of the mathematical kinetic theory of active particles. Math. Model. Methods Appl. Sci. 2021, 31, 1821–1866. [Google Scholar] [CrossRef]
  37. Song, W.; Zhang, D.; Zhao, X.; Yu, J.; Zheng, R.; Wang, A. A Novel Violent Video Detection Scheme Based on Modified 3D Convolutional Neural Networks. IEEE Access 2019, 7, 39172–39179. [Google Scholar] [CrossRef]
  38. Carreira, J.; Zisserman, A. Quo Vadis. Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Figure 1. Our violent-behavior-recognition structure. (a) In training, the loss between the segmented–original images and their labels is reduced, which is equivalent to finding the minimum value of Formula (1). At the same time, the fully connected layer features of the segmented and original images gradually approach each other, which is equivalent to reducing the value of $\Delta l_i$ in Formula (2). The segmented and original images make the model pay more attention to the foreground information, thereby reducing the interference of background information. (b) The keyframe extraction algorithm uses the frame-difference method to extract keyframes from a sliding window (N continuous frames) and selects the local-maximum frame in every K continuous frames as the keyframe.
Figure 2. Annotation of violent video frames. The red curve is the annotation.
Figure 3. Comparison of four image segmentation algorithms. (a) Gaussian, (b) ENet, (c) MediaPipe, (d) Deeplab-V3plus.
Figure 4. The process of removing the background of the violent-behavior dataset.
Figure 5. Our dataset. (a) Original image, (b) Segmented image.
Figure 6. The relationship between the threshold and $P_{Recog}$.
Figure 7. The training process of Deeplab-V3plus.
Figure 8. Dimensionality-reduction visualization results for the datasets. (a) Original nonviolence, (b) Segmented nonviolence, (c) Original violence, (d) Segmented violence.
Figure 9. Changes in recognition rate during training. (a) VGG19, (b) GoogLeNet, (c) ResNet18, (d) DenseNet121.
Figure 10. Reconstruction effect of the encoder–decoder network.
Figure 11. Frame differences for violent videos. (a) Camera movement, (b) Camera fixation.
Table 1. Hyperparameters for Deeplab-V3plus.

Hyperparameter | Value
backbone | xception
downsample | 16 or 8
input_shape | 224 × 224
Freeze_Epoch | 10
Freeze_lr | 5.00 × 10−4
UnFreeze_Epoch | 80
Unfreeze_lr | 5.00 × 10−5
Table 2. Comparative experiment grouping.

No. | Training Dataset | Test Dataset
1 | Original Image | Original Image
2 | Original Image | Segmented Image
3 | Segmented Image | Segmented Image
4 | Segmented Image | Original Image
5 | Segmented + Original Image | Original Image
6 | Segmented + Original Image | Segmented Image
Table 3. The effect of our method on different models.

Model | RLVS (Original Image) | RLVS (Segmented Image) | Hockey Fights (Original Image) | Hockey Fights (Segmented Image)
VGG19 | 85.70% | 70.79% | 97.62% | 90.38%
Ours (VGG19) | 88.34% | 89.34% | 97.03% | 96.52%
ResNet18 | 86.59% | 81.98% | 98.59% | 91.81%
Ours (ResNet18) | 91.39% | 91.78% | 98.27% | 96.35%
DenseNet121 | 87.96% | 83.41% | 98.60% | 91.13%
Ours (DenseNet121) | 93.01% | 92.34% | 98.40% | 97.58%
GoogLeNet | 90.49% | 84.42% | 98.48% | 91.23%
Ours (GoogLeNet) | 92.80% | 93.45% | 98.24% | 97.02%
Table 4. Implementation of the model.

# Encoder, Decoder, optimizer, and trainloader (yielding segmented-original pairs) are defined elsewhere.
import torch

# Define the reconstruction loss function
def recon_loss(recon_x, x):
    return torch.sum((recon_x - x) ** 2, dim=[1, 2, 3]).mean()

# Define the segmented-original image pair hidden-layer distance
def distance(laten_seg, laten_unseg):
    return torch.sqrt((laten_seg - laten_unseg) ** 2).mean()

# Model training
for seg, unseg in trainloader:
    laten_seg = Encoder(seg)
    recon_seg = Decoder(laten_seg)
    # compute the segmented-image reconstruction loss
    recon_loss_seg = recon_loss(recon_seg, seg)

    laten_unseg = Encoder(unseg)
    recon_unseg = Decoder(laten_unseg)
    # compute the original-image reconstruction loss
    recon_loss_unseg = recon_loss(recon_unseg, unseg)

    # compute the distance of the segmented-original image pair in the hidden layer
    D = distance(laten_seg, laten_unseg)

    # Formula (7): Loss = recon_loss_seg + recon_loss_unseg + beta * D, with beta = 0.001
    loss = recon_loss_seg + recon_loss_unseg + 0.001 * D
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Table 5. Statistics on the number of video frames in the violence datasets.

Dataset | Average | Median | Var | Std | Min | Max
Hockey Fights | 41.056 | 41 | 0.4528 | 0.6733 | 40 | 49
RLVS | 143.69 | 132 | 84,490 | 290.74 | 29 | 11,272
RWF-2000 | 150 | 150 | 0 | 0 | 150 | 150
CCTV | 8797.55 | 1295 | 1,301,301,216.97 | 36,091.60 | 954 | 72,304
Table 6. Video frame recognition rate.

Dataset | Average | Violence | Nonviolence
RLVS | 91.39% | 87.88% | 94.90%
Hockey Fights | 98.59% | 99.13% | 98.05%
Violent Flow | 92.06% | 87.12% | 98.73%
Table 7. The relationship between recognition rate and threshold.

Dataset | Threshold = 0.5 (A) | Threshold = 0.5 (V) | Threshold = 0.5 (N) | Threshold = 0.7 (A) | Threshold = 0.7 (V) | Threshold = 0.7 (N)
RLVS | 94.6% | 93.2% | 96.00% | 91.50% | 89.50% | 93.5%
Hockey | 98.5% | 98.0% | 99.0% | 97.5% | 96.0% | 99.0%
Violent Flow | 95.0% | 90.0% | 100% | 93.75% | 87.50% | 100%
A: average, V: violence, N: nonviolence. Window Length = 5.
Table 8. The relationship between window length and recognition rate.

Threshold | K | RLVS | Hockey Fights | Violent Flow
0.5 | 5 | 94.6% | 98.50% | 95.0%
0.5 | 10 | 96.75% | 97.5% | 95.0%
0.5 | 15 | 96.25% | 94.0% | 95.0%
0.5 | 20 | 96.0% | 92.0% | 95.0%
0.7 | 5 | 91.50% | 91.5% | 91.5%
0.7 | 10 | 93.0% | 98.0% | 93.5%
0.7 | 15 | 93.75% | 95.5% | 95.0%
0.7 | 20 | 92.5% | 91.5% | 93.5%
Table 9. Comparisons between the proposed method and others on previous datasets.

Method | Hockey Fights | RLVS | Violent Flow
FightNet [10] | 97.00% | — | —
3D CNN [37] | 99.62% | — | 94.30%
CNN + LSTM [3] | 97.10% | — | 94.57%
C3D [6] | 96.50% | — | 84.44%
I3D [38] | 97.50% | — | 86.89%
VGG + LSTM [17] | 95.10% | 88.20% | 90.01%
Ours | 98.50% | 94.60% | 95.00%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
