Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention

Wang, Jingwen; Zhao, Daqi; Li, Haoming; Wang, Deqiang

doi:10.3390/app14114895

Open AccessArticle

Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention

School of Information Science and Engineering, Shandong University, Qingdao 266237, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(11), 4895; https://doi.org/10.3390/app14114895

Submission received: 6 May 2024 / Revised: 31 May 2024 / Accepted: 4 June 2024 / Published: 5 June 2024

(This article belongs to the Special Issue Intelligent Detection and Identification System Based on Computer Vision Technology)

Download

Browse Figures

Versions Notes

Abstract

:

With the widespread deployment of surveillance cameras, automatic violence detection has attracted extensive attention from industry and academia. Though researchers have made great progress in video-based violence detection, it is still a challenging task to realize accurate violence detection in real time, especially with limited computing resources. In this paper, we propose a lightweight 2D CNN-based violence detection scheme, which takes advantage of frame-grouping to reduce data redundancy greatly and, meanwhile, enable short-term temporal modeling. In particular, a lightweight 2D CNN, named improved EfficientNet-B0, is constructed by integrating our proposed bi-directional long-term motion attention (Bi-LTMA) module and a temporal shift module (TSM) into the original EfficientNet-B0. The Bi-LTMA takes both spatial and channel dimensions into consideration and captures motion features in both forward and backward directions. The TSM is adopted to realize temporal feature interaction. Moreover, an auxiliary classifier is designed and employed to improve the classification capability and generalization performance of the proposed model. Experiment results demonstrate that the computational cost of the proposed model is 1.21 GFLOPS. Moreover, the proposed scheme achieves accuracies of 100%, 98.5%, 91.67%, and 90.25% on the Movie Fight dataset, the Hockey Fight dataset, the Surveillance Camera dataset, and the RWF-2000 dataset, respectively.

Keywords:

violence detection; lightweight; temporal modeling; attention mechanism

1. Introduction

As a serious threat to public security, violence has attracted broad attention. For the purpose of public security, numerous surveillance cameras have been installed in important places such as banks, airports, train stations, and campuses. In surveillance centers, special personnel are employed to monitor the videos in polling mode and thus discover possible abnormal events. From the view of public security, it is desirable to stop violence from taking place and thus protect the victims. Hence, it is essential to recognize violent behaviors from surveillance videos reliably in real time. In reality, manual monitoring is not efficient enough to efficiently discover violent behaviors in a great number of videos. Therefore, most surveillance videos are used to trace back acts of violence when needed. Motivated by these facts, researchers have proposed a variety of automatic violence detection algorithms. In recent years, deep learning (DL) has been adopted as a mainstream methodology in the area of violence detection. Nowadays, it is still a challenging problem to achieve high violence detection accuracy with low computational cost.

Two-dimensional convolutional neural networks (CNNs) are usually used to realize violence detection in combination with some kinds of recurrent neural networks (RNNs). As is well-known, 2D CNNs have achieved great success in image classification tasks. However, due to the lack of temporal modeling capability, it is difficult for 2D CNNs to extract motion information in videos, which is crucial for behavior recognition tasks. To overcome this difficulty, schemes combining 2D CNNs and RNNs have been proposed [1,2,3]. In these schemes, 2D CNNs are responsible for extracting spatial features, and RNNs, e.g., long short-term memory (LSTM) [1,3] and bi-directional long short-term memory (BiLSTM) networks [2], are employed to extract temporal features. However, these approaches cannot capture and encode long-range temporal information sufficiently and, hence, the violence detection accuracy is limited.

Three-dimensional CNNs can extract both spatial and temporal features in videos effectively and thus lead to high performance in tasks of behavior recognition. In comparison with 2D convolution kernels, 3D convolution kernels are capable of extracting motion features lying in the temporal dimension of videos. The C3D [4], R3D [5], and I3D [6] are typical 3D CNNs proposed for action recognition tasks and exhibit significant improvement in recognition accuracy. It is worth noting that 3D CNN models are computationally expensive for real-world applications.

Two-stream networks combining RGB frames and optical flow have also been proposed for video detection [7,8,9,10,11,12]. The idea is to make use of the spatial information lying in RGB frames and the motion information lying in optical flow. The spatial features extracted by CNNs from RGB images and the optical flow are fused to realize video detection tasks. Though good performance can be achieved by using these schemes, the complicated calculation of optical flow turns out to be an obstructive factor for real-world applications.

Skeleton-based action recognition schemes rely on precise estimates of the sequence of skeleton points [13,14]. The sequence of skeleton points is firstly extracted from the video and then transformed to match the input format of the following CNN model [13] or graph convolutional network (GCN) model [14]. Evidently, it is computationally demanding to extract skeleton points from videos, and extra errors in skeleton points may lead to misjudgment of actions.

Regarding the design of automatic violence detection systems, most researchers focus on the architecture of the Internet of Things (IoT). For real-time applications, the latencies required are usually on the scale of sub-second [15,16], and therefore it is more promising to implement violence detection on cameras or edge devices than on clouds. The authors of [15] embedded their algorithm on a camera module to realize on-camera violence crime detection. NVIDIA RTX 3090 and Jetson TX2 were employed to test the latencies of their models. The authors of [16] focused on violence detection on edge devices, each of which serves multiple surveillance cameras. NVIDIA RTX 3070 TI and NVIDIA Jetson Xavier NX were employed to test the inference speed of their model. The author of [17] proposed an edge vision-based surveillance system for violence detection in public and private spaces. For inference tasks, Jetson AGX Orin 64 GB was employed in their edge devices. Due to the fact that surveillance cameras and edge devices are usually resource-limited platforms, a violence detection model with low memory complexity and low computational cost is desirable.

In this paper, we focus on edge device-based violence detection as in [16,17]. It is expected that an edge device with limited resources, e.g., a desktop computer with a GPU card, can detect violence reliably from videos acquired by a number of surveillance cameras. Specifically, we propose a lightweight 2D CNN-based violence detection scheme. The key idea is to extract both short-term and long-term motion features effectively from videos using an improved EfficientNet-B0 [18], a lightweight 2D CNN. As shown in Figure 1, the pipeline of the proposed scheme consists of three stages: data preprocessing, 2D CNN-based feature extraction, and classification. In the data preprocessing stage, sparse sampling [8] and RGB2Gray operation are performed to reduce data redundancy, and then frame-grouping [15] is conducted to assemble three-channel images fit for the following 2D CNN. In the following feature extraction stage, an improved EfficientNet-B0 is employed to extract short-term and long-term motion features. In the classification stage, cross-entropy loss (the main loss) is applied together with an auxiliary loss [19] to supervise the training process of the model so as to improve the violence detection accuracy effectively.

The contributions of this work include the following:

A lightweight violence detection scheme is proposed for real-time applications. The proposed scheme leverages the merits of frame-grouping [15] and EfficientNet-B0 [18] to remove data redundancy and lower the computational cost of the model effectively. In comparison with state-of-the-art violence detection models, our model has a much lower computational cost.
An improved EfficientNet-B0 is built to boost the feature extraction capability. Specifically, we put forward a bi-directional long-term motion attention (Bi-LTMA) mechanism and integrate it into the original EfficientNet-B0. Moreover, the temporal shift module (TSM) [20] is also adopted to realize temporal feature interaction. Both Bi-LTMA and TSM are computationally lightweight and beneficial to violence detection accuracy.
An auxiliary loss is introduced and applied jointly with the main loss in the phase of model training to minimize the classification error. The auxiliary head module is inserted in the middle stage of the improved EfficientNet-B0. Extensive experiments have been conducted to tune the balance between the auxiliary loss and main loss and illustrate its effectiveness.
Experiments have been carried out on four recognized datasets, i.e., the Movie Fight dataset [21], RWF-2000 dataset [9], Hockey Fight dataset [21], and Surveillance Camera dataset [22], to evaluate the performance of our proposed model. Numerical results reveal the merits of the proposed model in terms of accuracy, computational cost, and parameters.

2. Related Works

2.1. Violence Detection

Violence detection is an important branch of behavior recognition. Up to date, violence detection algorithms can be categorized into three categories: algorithms based on hand-crafted features, DL algorithms based on 3D CNNs, and DL algorithms based on 2D CNNs.

Traditional violence detection algorithms are based on hand-crafted features. The most classical hand-crafted features include a histogram of oriented gradient (HOG) [23], space–time interest points (STIP) [24], and speeded-up robust features (SURF) [25]. Optical flow and its variants were proposed later. In [26], Hassner et al. introduced the violent flows descriptor (ViF) and combined it with a support vector machine (SVM) classifier to detect violent behaviors, where the statistics of flow-vector magnitudes over time were taken into consideration. In [27], oriented violent flow (OViF) was proposed to take advantage of the motion magnitude change information in statistical motion orientations to exhibit better accuracy. Moreover, improved dense trajectory (iDT) was adopted in [28] to mitigate the adverse effect of camera motion on action recognition. The scheme proposed in [29] combined video frames into dynamic images using approximate rank pooling. Afterward, hand-crafted features were extracted and summarized by using a bag of visual words (BOVW). Based on the hybrid handcrafted/learned features, a support vector machine (SVM) classifier was trained to recognize violence. In [30], the video sequence was encoded into a result image leveraging handcrafted descriptors, i.e., binary robust invariant scalable key points (BRISK) [31] and features from an accelerated segment test (FAST) [32]. Then, the resulting image was fed to a 2D CNN to learn high-level features and recognize fight actions. Differently from these schemes, this work employs DL techniques to learn violence features from videos automatically.

DL algorithms based on 3D CNNs [4,5,6,7,17,33] extract spatiotemporal features in videos automatically. In [4], the authors applied 3D CNNs to extract spatiotemporal features in videos and demonstrated that the proposed C3D features outperformed hand-crafted features. Subsequently, two-stream networks such as Inflated 3D ConvNet (I3D) [6] and Slowfast [7] emerged and achieved good performance in the field of behavior detection. The I3D was constructed by using 2D ConvNet inflation, where RGB frames and optical flow were taken as input. As the ResNet [34] gained widespread popularity in image classification, several 2D ResNet variants were adapted to the 3D domain for behavior recognition [5,33]. Feichtenhofer et al. proposed a residual 3D network (R3D) [5] based on the original 2D ResNet architecture. The R(2 + 1)D [33] was proposed to split the 3D convolutional filters into separate spatial and temporal components. In comparison with R3D, the R(2 + 1)D is easier to optimize and demonstrates better action recognition accuracy. Though these methods have made great progress in accuracy, they are computationally expensive. We designed our violence detection model based on a lightweight 2D CNN to meet the requirements of real-time applications.

DL algorithms based on 2D CNNs take 2D CNNs as their backbone and adopt temporal modeling strategies to enable the extraction of motion features. The ConvLSTM network proposed in [1] employed an LSTM network to aggregate the features extracted from video frames by a 2D CNN. The CNN-BiLSTM Model proposed in [2] employed BiLSTM instead of LSTM to improve the performance further. The temporal segment networks (TSN) proposed in [8] utilized a sparse sampling strategy, where sampled frames were fed into a 2D CNN for feature extraction and action classification purposes. Following the TSN architecture, variant schemes [12,15,20,35,36] were proposed to boost the extraction of motion features by inserting temporal modeling modules into 2D CNN backbones. The temporal shift module (TSM) proposed in [20] shifted part of the channels of the feature map along the temporal dimension to improve the temporal modeling capability. The temporal difference network (TDN) [35] introduced a temporal difference module to improve both short-term and long-term temporal modeling capabilities. The authors of [15] proposed a frame-grouping method, based on which 2D CNNs could be used to extract spatiotemporal features in videos. Moreover, spatial and temporal attention modules were also applied to improve the violence recognition accuracy. The goal of this work is to build a lightweight 2D CNN-based violence detection model. To this end, we propose an improved EfficientNet-B0 that can be used to extract spatiotemporal features effectively with low computational cost.

2.2. Attention Mechanism on Temporal Information

It is critical to capture motion features for violence detection. Attention mechanisms are widely applied to capture motion features effectively in violence detection models.

Self-attention mechanisms have been proposed for use in the area of computer vision. The authors of [37] presented non-local operations to capture long-range dependencies with deep neural networks. A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. Vision transformer (ViT) [38] and StNet [39] are typical self-attention modules modeling the long-range spatiotemporal features in behavior recognition. Though these schemes are effective, they come at the cost of significant computational overhead.

Some researchers use temporal attention mechanisms to enhance the long-range motion modeling ability of 2D CNNs. In [40], the authors proposed a feature-level motion attention mechanism and performed the channel-wise spatiotemporal fusion so as to improve the action recognition accuracy. In [41] the authors proposed a motion excitation (ME) attention mechanism, which enabled 2D CNNs to capture long-range motion information. The ME primarily focuses on motion-sensitive channels. Compared with [40], the scheme in [41] uses a Multiple Temporal Aggregation structure to recalibrate the features after ME and achieves higher accuracy.

We propose a bi-directional motion attention mechanism (Bi-LTMA) to boost the violence detection performance. Differently from [41], the Bi-LTMA computes an attention map based on bi-directional motion information in both channel and spatial dimensions. In addition to Bi-LTMA, TSM [20] is also employed in our model to realize temporal feature interaction such that the motion feature modeling capability of the model is enhanced further.

2.3. Auxiliary Loss

The concept of auxiliary head supervision was initially introduced in [42] to tackle the issues of gradient disappearance and gradient explosion in the training of DL models. Later on, variant schemes were proposed and applied in tasks such as image classification, object recognition, image segmentation, and action recognition [19,43,44,45,46,47]. The authors of [43] put two auxiliary classifiers on top of the output of two intermediate inception modules (4a and 4d) and demonstrated competitive accuracy in the ILSVRC detection task. In [44], the authors pointed out that auxiliary classifiers attached to early layers might yield high error rates, and hence introduced horizontal and vertical connections to keep the coarse-level information in the earlier layers. The authors of [46] point out that adding a highly distinctive classifier to the intermediate layer is overly aggressive. In [19], the side branches from the intermediate layers were designed delicately so that multiple auxiliary classifiers were consistent in their optimization directions. Inspired by [19], we designed an auxiliary classifier tailored to our model to enhance the training process.

3. Proposed Violence Detection Model

In this section, we present the details of the proposed lightweight violence detection model shown in Figure 1.

3.1. Data Preprocessing

In the data preprocessing stage, sparse sampling, RGB2Gray, and frame-grouping operations are performed successively so that the target video clip is put into the format suitable for the following 2D CNN network.

The purpose of sparse sampling is to reduce redundancy in videos at the frame level. In reality, there is a high degree of redundancy in videos especially in cases of high frame rates. Dense sampling inevitably captures too many highly similar consecutive frames. For violence detection in videos, dense temporal sampling is costly in both time and computing resources. In contrast, sparse sampling preserving relevant information can be used to model the long-range temporal structure over the video with a much lower cost [8,20]. In this work, we adopt the sparse sampling strategy proposed in [8] to remove redundancy. As shown in Figure 1, a video clip of a fixed duration is divided into 3T segments, and one frame is sampled from each segment. Thus, all 3T RGB frames, F = [F₁, F₂, …, F_3T], are captured from the video clip for use in the following procedure.

Based on the sparsely sampled RGB frames, the RGB2Gray operation is performed to reduce redundancy further at the channel level. Traditionally, 2D/3D CNNs are employed to learn features from RGB images/frames so as to realize tasks such as image classification and object detection and tracking. However, though color information may be useful for violence detection, motion information has been found to be the most crucial [27,28,29,30,39]. Moreover, compacting RGB images into single-channel images is beneficial to lowering computational load. Based on this observation, RGB images were squeezed into single-channel ones by using an averaging operation along channel dimension [15]. In this work, we convert the sparsely sampled RGB images into grayscale images using the standard weighted averaging method instead of the method in [15]. Mathematically, the obtained grayscale pixel intensity

I (x, y)

can be given by

I (x, y) = W_{R} I_{R} (x, y) + W_{G} I_{G} (x, y) + W_{B} I_{B} (x, y),

(1)

where,

I_{R}

,

I_{G}

, and

I_{B}

represent intensities of the red, green, and blue channels of the pixel, separately;

W_{R} = 0.30

,

W_{G} = 0.59

, and

W_{B} = 0.11

are standard weighting coefficients obtained from the human luminance perception system. As shown in Figure 1, 3T grayscale frames, X = [X₁, X₂,…, X_3T], are obtained from the video clip.

The function of frame-grouping is to reshape the obtained grayscale frames into three-channel images. According to Figure 1, the obtained 3T grayscale frames are divided into T groups, each of which contains three consecutive frames. For each group, a three-channel image is constructed by putting the three grayscale frames into RGB channels [15]. Therefore, T three-channel images, I = [I₁, I₂,…, I_T], are obtained from a video clip and fed to the following 2D CNN as input. The shape of I is [T, C, H, W], where T is a temporal dimension, C denotes the number of channels, and H and W represent the height and width of the image, respectively. As reported in [15], the merit of frame-grouping is twofold. Firstly, the synthesized three-channel images are fit for 2D CNNs in format. Secondly, the synthesized three-channel images enable 2D CNNs to learn short-term motion features. In this work, we take the advantages of frame-grouping and propose a new lightweight 2D CNN, i.e., improved EfficientNet-B0, to extract both short-term and long-term motion features effectively.

3.2. Improved EfficientNet-B0

The improved EfficientNet-B0 is a modified version of EfficientNet-B0 [18]. Specifically, we propose a new attention module, namely, Bi-LTMA, and integrate it into EfficientNet-B0 to enable long-term motion feature extraction. In addition, the TSM module [20] is adopted to realize temporal feature interaction. In what follows, we briefly review the EfficientNet-B0 [18] and then present details of the improved EfficientNet-B0.

EfficientNet-B0 is a state-of-the-art lightweight 2D CNN obtained by using a multi-objective neural architecture search that optimizes both accuracy and FLOPS [18]. As shown in Figure 2, it consists of nine stages. The specific structure of each stage can be found in Table 1. Notably, the second to eighth stages are composed of inverted residual blocks, named as mobile inverted bottleneck (MBConv) in Table 1. In fact, MBConvs resemble the basic block of MobileNetV3 [48]. The structure of the inverted residual block is shown in Figure 3. Within the block, depthwise separable convolution (DW Conv) is employed to decrease the computational cost-effectively. For further detailed information, please refer to [18,48] and references therein. To take advantage of EfficientNet-B0 in both accuracy and FLOPs, we select it as the backbone of the proposed model.

The goal of the improved EfficientNet-B0 is to gain capabilities of long-term motion feature extraction and temporal feature interaction. To this end, we integrate the Bi-LTMA module and TSM module into inverted residual blocks in Figure 4, i.e., the second to eighth stages, of the EfficientNet-B0. The modified inverted residual block is shown in Figure 4. In what follows, we present the details of our proposed Bi-LTMA and that of TSM.

The Bi-LTMA module is proposed to extract long-term motion information effectively. As reported in the literature [1,2,15,41], long-term motion information plays a key role in violence detection in videos. In [1,2], LSTM and BiLSTM were used to aggregate temporal features. In [15], a temporal attention module, named T-SE, was employed jointly with frame-grouping to recalibrate and aggregate global temporal features. In [41], motion features were defined as a channel-wise difference between adjacent frames, and the motion excitation (ME) module was proposed to excite the motion-sensitive channels. The ME focuses on channel attention, where detailed spatial features are omitted by using the pooling operation. Our idea of Bi-LTMA is partly inspired by ME. Differently from ME, Bi-LTMA takes both spatial and channel dimensions into consideration and captures motion features in both forward and backward directions.

The architecture of the Bi-LTMA module is shown in Figure 5. To meet the lightweight design requirements, the input feature F is downscaled by using a 1 × 1 2D convolution (

{Conv}_{1}

). The downscaled feature is given by

F^{r} = {Conv}_{1} * F, F^{r} \in ℝ^{T \times C / r \times H \times W},

(2)

where

r = 2

is the reduction ratio,

*

indicates the convolution calculation. Based on the downscaled feature, motion information is then derived. Similar to ME [44], motion information is obtained by computing the feature-level difference between two adjacent frames. A 3 × 3 2D convolution (

{Conv}_{2}

) is used as the mapping function to deal with the problem of spatial misalignment. Specifically, given the features of two adjacent frames

F^{r} (t)

and

F^{r} (t + 1)

, we compute the bi-directional motion information as per the following:

\begin{array}{l} M_{f} (t) = {Conv}_{2} * F^{r} (t + 1) - F^{r} (t), t \in [1, T - 1], \\ M_{b} (t) = {Conv}_{2} * F^{r} (t) - F^{r} (t + 1), t \in [1, T - 1], \end{array}

(3)

where

M_{f} (t)

is the forward difference and

M_{b} (t)

is the backward difference. For the last time step

t = T

, we set the motion information as

M_{f} (T) = 0

and

M_{b} (T) = 0

. Then, another 1 × 1 2D convolution (

{Conv}_{3}

) is employed together with a concatenation (concat) layer in each branch to restore the original shape of the input feature F. After that, forward and backward attention weights,

W_{f}^{​}

and

W_{b}^{​}

, are generated separately by using sigmoid functions as per the following:

\begin{array}{l} W_{f}^{​} = δ ({concat (Conv}_{3} (M_{f} (t)))) \\ W_{b}^{​} = δ (c o n c a t ({Conv}_{3} (M_{b} (t)))) \end{array},

(4)

where

δ (•)

is the sigmoid activation function,

concat (•)

stands for concatenation function. The attention map

A

is computed by averaging the forward and backward attention weights as follows:

A = 0.5 W_{f}^{​} + 0.5 W_{b}^{​} .

(5)

As shown in Figure 5, a weighted feature can be obtained by performing Hadamard product operation

A ⊙ F

. DropConnect is an effective method to improve the generalization ability of neural network models [49]. It randomly sets the weights to zero according to a certain probability. To improve the generalization ability of the model, we put a DropConnect structure after the

A ⊙ F

operation, where

𝒢 (•)

represents the DropConnect operation. Moreover, to preserve the original feature and, meanwhile, make use of the Bi-LTMA attention mechanism, we employ a shortcut connection to fuse the original feature with the weighted. Mathematically, the output feature of the Bi-LTMA module can be expressed as

F^{'} = F + 𝒢 (A ⊙ F) .

(6)

The TSM module [20] is adopted in the modified inverted residual block to enhance the temporal modeling of the improved EfficientNet-B0 further. As reported in [20], TSM module enables 2D CNNs joint spatial–temporal modeling via shifting part of the channels along the temporal dimension. Moreover, inserting the TSM module into 2D CNNs does not impose an additional computational cost. These merits of TSM are attractive to real-time video understanding. It is also worth noting that in-place TSM may lower the spatial feature learning capability of the backbone model to some extent [20]. Residual TSM is a promising alternative to tackle this problem. Noting that most stages of EfficientNet-B0 are inverted residual blocks, we insert bi-directional TSM modules into residual branches of stages of EfficientNet-B0 to realize temporal feature interaction and meanwhile keep good spatial feature learning capability of the backbone, as shown in Figure 4.

Given a video clip, the principle of the bi-directional TSM module is demonstrated in Figure 6. As described above, the T three-channel images are obtained from a video clip after frame-grouping. For a given three-channel image, the feature extracted by the 2D CNN layer right before the TSM module is denoted with a specific color. Assume that the feature extracted from each three-channel image consists of C feature maps of height H and width W, where C is the number of channels. As shown in Figure 6, the TSM module realizes temporal feature interaction in three steps. Firstly, the features of all T three-channel images are reshaped into a tensor of dimensions H × W, T, and C. Then, a small number of the channels are shifted along the temporal dimension of the tensor. Some of them are shifted backward by one step, the others are shifted forward by one step. The upper and lower vacancies are padded with zeros. Finally, the resultant tensor is reshaped to recover T output features, the format of which is the same as the input. Evidently, the feature at arbitrary time step t is fused with those at time steps t − 1 and t + 1. It means that the temporal perceptional field is expanded effectively by using the TSM module with no additional computation cost.

3.3. Auxiliary Loss

We introduce an auxiliary loss to improve the classification capability of the proposed model. As shown in Figure 1, the auxiliary head is inserted in an intermediate stage of the improved EfficientNet-B0. Softmax classifier is employed in both the main branch and the auxiliary branch. In reality, auxiliary heads are used only in the phase of model training and discarded in the phase of inference [42]. Therefore, the auxiliary head does not impose any additional computational cost to the model during inference. In what follows, we introduce the auxiliary loss in detail.

An important issue is to determine the intermediate layer to which the auxiliary head is attached. In principle, early layers of CNNs learn low-level features while later layers learn high-level features. It is the high-level features in the final layers that are important for classification [45]. As reported in [44], if a single auxiliary head is inserted at a too-early layer of typical CNNs, the accuracy of the final classifier can be harmed badly. The underlying reason is that the auxiliary classifier attached to an early layer benefits short-term early features but may collapse information required to generate high-quality features in later layers. On the other hand, if the auxiliary head is inserted at a too-deep layer, the auxiliary loss will lose its advantage in combatting the gradient disappearance problem [19]. Based on these observations, we attach the auxiliary head right after stage 7 of the improved EfficientNet-B0, as shown in Figure 7.

Another issue is to design the auxiliary branch reasonably so as to gain better final classification accuracy. In some past approaches, auxiliary classifiers were connected to hidden layers directly. Experiment results suggested that the final accuracies achieved were hardly improved [43]. As analyzed in [19], discrepancy in optimization directions of the main classifier and the auxiliary ones could degrade the model accuracy. In order to relieve the optimization inconsistency between the auxiliary and main classifiers, the auxiliary branch is designed to resemble the structure of stage 9 of the EfficientNet-B0 as shown in Figure 7. Intuitively, discriminative high-level features can be learned by the auxiliary branch in a way similar to the main branch.

For a given video clip, the total loss is obtained as shown in Figure 1. The main classifier branch and the auxiliary classifier branch operate in parallel. For each of the T three-channel images I = [I₁, I₂, …, I_T], the two Softmax classifiers independently give instant scores, based on which the instant cross-entropy losses in both two branches can be calculated. A unique main loss

ℒ_{m a i n}

and an auxiliary loss

ℒ_{a u x}

are obtained by performing average pooling over the instant cross-entropy losses. The total loss is formulated as

ℒ_{t o t a l} = ℒ_{m a i n} + λ ℒ_{a u x},

(7)

where

λ

is a tunable balancing coefficient.

4. Experiments

In this section, we show the experimental results of our proposed scheme on four recognized datasets, Movie Fight dataset [21], Hockey Fight dataset [21], RWF-2000 dataset [9], and Surveillance Camera dataset [22], comprehensively. First, we introduce the datasets briefly and provide information on basic settings relevant to model building, training, and testing. Second, ablation studies are presented and discussed to verify the effectiveness of our proposed building blocks. Third, activation maps are visualized and compared with those of the baseline to illustrate the effectiveness of our model. Fourth, we compare our scheme with the state-of-the-art 2D CNN-based and 3D CNN-based ones in terms of accuracy, GFLOPs, and parameters.

4.1. Datasets

4.1.1. Movie Fight Dataset

The Movie Fight dataset is a collection of 1 to 2 s video clips from movies. It consists of a total of 200 video clips, 100 violent samples, and 100 non-violent samples. All the video clips are of the same resolution of 360 × 240 pixels. We used 80% of the dataset for training and 20% for testing, employing five-fold cross-validation for evaluation.

4.1.2. Hockey Fight Dataset

The Hockey Fight dataset is a video database dedicated to the identification of violence, containing 1000 action clips collected from National Hockey League (NHL) hockey games. The dataset is split evenly, containing 500 videos with normal actions and 500 videos with violent actions. Each clip consists of 40 frames of 360 × 288 pixels approximately, and the authors manually labeled each clip. It is worth noting that the Hockey Fight dataset is limited to violence in a single scenario and thus lacks diversity of background. Due to the limitations in the dataset size and the absence of the dataset partitioning, we employed a five-cross-validation on this dataset.

4.1.3. Surveillance Camera Dataset

The Surveillance Camera dataset has a total of 300 2-s video clips collected from YouTube or acquired by surveillance cameras. There are 150 violent videos and 150 non-violent videos with different sizes and different numbers of frames in this dataset. Due to the relatively small sample size of this dataset, we also adopted a five-fold cross-validation method in our experiments.

4.1.4. RWF-2000 Dataset

The RWF-2000 dataset is the largest dataset for violence detection at present, covering a wide range of violent and non-violent behaviors from different scenes. It contains a collection of 2000 video clips captured by surveillance cameras in real scenes. Each video clip contains 150 frames. The dataset has been well split by the authors. Specifically, 80% of the dataset is the training set and 20% is the testing set. In comparison with the Movie Fight dataset and the Hockey Fight dataset, this dataset is closer to real-world violence and more challenging. Based on this observation, we mainly focused on the RWF-2000 dataset in our experiments.

4.2. Basic Settings

Our proposed scheme is implemented based on PyTorch. As for sparse sampling, 15 frames are sampled from each clip of the Hockey Fight dataset and Surveillance Camera dataset, 18 frames are sampled from each clip of the Movie Fight dataset, while 24 frames are sampled from each clip of the RWF-2000 dataset. The input images are cropped to 224 × 224 and normalized. In order to mitigate overfitting, random horizontal flipping and random cropping are employed for data augmentation. During model training, Adam optimizer is employed with a learning rate of 0.00005, weight decay of 0.01, and batch size of 8. The training process spans 120 epochs. An Nvidia GeForce RTX 3080Ti GPU (NVIDIA, Santa Clara, CA, USA) is used for model training and testing.

4.3. Ablation Studies

4.3.1. Ablation Studies on The Proposed Modules

We conduct experiments for all possible combinations of the Bi-LTMA module, the TSM module, and the auxiliary classifier with an optimized coefficient

λ = 0.5

on the RWF-2000 dataset. Here, the case without any of the three proposed modules is taken as a baseline. For each combination, the model is trained and the corresponding accuracy, FLOPs, and parameters are assessed carefully. The experiment results are given in Table 2.

We analyzed the effectiveness of the proposed modules and their combinations in terms of accuracy according to Table 2. Compared with the baseline, the Bi-LTMA module, the TSM module, and the auxiliary classifier bring accuracy increases of 1%, 0.75%, and 0.5%, respectively, when employed alone. It means that all the proposed modules are beneficial to violence detection accuracy and Bi-LTMA contributes the best. The accuracy can be improved further when two of the proposed modules are arbitrarily employed in combination. For instance, the accuracy is improved from 88.5% to 89.25%, when Bi-LTMA and TSM are combined instead of Bi-LTMA alone. Moreover, the best accuracy of 90.25% is achieved, when all three proposed modules are employed. These observations confirm that the proposed modules can be used to improve the violence detection accuracy effectively.

Referring to Table 2, we analyze the computational costs and parameters of models with different combinations of the proposed modules. The baseline model is lightweight with 0.23 GFLOPs and 4.26 M parameters. One can find that models employing TSM or auxiliary classifiers, or even both, consume the same computing power and have the same number of parameters as the baseline model. This is due to the fact that TSM has no additional parameters and the shift operations involved are almost computationally negligible. Meanwhile, the auxiliary classifier is discarded in the phase of inference and thus the relevant computations and parameters can be neglected. In comparison with the above-mentioned ones, models employing Bi-LTMA consume more computing power (1.21 GFLOPs vs. 0.23 GFLOPs) and have slightly more parameters (4.48 M vs. 4.26 M). This means that the computational cost of the proposed model is dominated by the Bi-LTMA module.

4.3.2. Ablation Studies on The Bi-LTMA Module

We conduct ablation studies to evaluate the effectiveness of the architecture design and the fusion method proposed for the Bi-LTMA module on the RWF-2000 dataset. The TSM module and the auxiliary classifier with optimized coefficient

λ = 0.5

are employed in our experiments.

Compared with the ME module [41], which focuses on the channel dimension of the input feature, the novelty of Bi-LTMA is twofold: (1) a bidirectional architecture is adopted to capture motion information in both forward and backward directions; and (2) both the spatial dimension and channel dimension of the input feature are used to capture motion information. Experiment results are given in Table 3 to illustrate the contributions of the bidirectional architecture, channel dimension, and spatial dimension of the input feature. As can be seen, the accuracy is 87.75% when the channel dimension of the input feature is used alone, as in the ME module [41]. If the bidirectional architecture is adopted together with the channel dimension of the input feature, the accuracy is improved to 89.75%. This means that the bidirectional architecture brings an improvement of 2% in accuracy. Moreover, if the bidirectional architecture is adopted and both the spatial and channel dimensions of the input feature are used as in Bi-LTMA, the best accuracy of 90.25% is achieved. It reveals that motion information lying in the spatial dimension of the input feature can be utilized to improve the accuracy effectively. These observations confirm the advantages of the proposed Bi-LTMA.

As shown in Figure 5, a DropConnect layer and a shortcut connection are employed to realize feature fusion in the Bi-LTMA module. Experiment results are given in Table 4 to highlight the contributions of the DropConnect layer and the shortcut connection. One can find that the accuracy is 87% if neither the DropConnect layer nor the shortcut connection is employed. Notably, the accuracy is improved to 87.75% when the DropConnect layer is employed. Moreover, if both the DropConnect layer and the shortcut connection are employed, as in the Bi-LTMA module, the best accuracy of 90.25% is achieved. This suggests that the shortcut connection can be used to improve the accuracy further. These observations confirm the effectiveness of the proposed fusion method.

4.3.3. Ablation Studies on the Auxiliary Head Module

We conducted experiments to evaluate the performance of our proposed auxiliary classifier with the auxiliary head in Figure 7 and figured out the optimal coefficient

λ

in Equation (7) on the RWF-2000 dataset. To highlight the effectiveness of the designed auxiliary head, a baseline scheme, which directly inserts the soft-max classifier at stage 7 of the improved EfficientNet-B0, was also considered. Experiment results are given in Table 5.

As shown in Table 5, our proposed auxiliary classifier outperforms the baseline in accuracy for all values of

λ

. This suggests that the designed auxiliary head is effective in relieving the optimization inconsistency between the auxiliary and main classifiers. Moreover, for each of the schemes considered, one can find that the best accuracy can be achieved when

λ = 0.5

is the case. This tells that

λ = 0.5

can be a good choice for the combined loss in Equation (7) during model training.

4.4. Model Performance Evaluation

4.4.1. Accuracy Evaluation

In order to evaluate the performance of our proposed violence detection model comprehensively, we conducted experiments on the four datasets mentioned above. Recall, precision, F1 score, and accuracy were taken as metrics. The numerical results are illustrated in Table 6. Figure 8 shows the normalized confusion matrices on the test sets of the four datasets used.

As can be seen from Table 6, the scores for recall, precision, F1 score, and accuracy are no less than 90% on all four datasets. This confirms that our proposed model can be used to detect violence in videos reliably in a variety of scenarios. It is also observed that the scores on the RWF-2000 dataset are comparatively lower than those on the other three datasets. Intuitively, this is due to the fact that the RWF-2000 dataset covers more diverse and complicated scenarios than its counterparts. Though the recall and precision on the Surveillance Camera dataset show notable advantages of 6% and 8% compared with that on the RWF-2000 dataset, the corresponding accuracy of 91.67% is not as high as expected. The reason lies in the risk of false detection, as will be analyzed in the following.

Based on Figure 8, we conducted a brief risk analysis of our proposed model. As can be seen, the missed detection rate and false detection rate on the Movie Fight and Hockey Fight datasets are notably lower than those on the other two datasets. This observation is consistent with the numerical results in Table 6. It is also observed that the false detection rate is generally higher than the missed detection rate on the datasets used. On the Surveillance Camera dataset, the false detection rate is up to 13%, notably higher than the corresponding missed detection rate of 3%. This reveals that the loss of accuracy on the Surveillance Camera dataset mainly comes from false detection events.

4.4.2. Inference Time

We measured the inference time of the proposed model on three different devices, a desktop computer with Intel Core i7-8700 CPU @3.2GHz and 16G RAM, an RTX 2080Ti GPU, and an RTX 3080Ti GPU (Intel, Santa Clara, CA, USA). These devices simulated edge servers with limited resources to implement violence detection in videos from multiple cameras. The average inference times on the test sets of the four datasets are shown in Table 7.

We focus on the most challenging case of RWF-2000 in the following analysis. As can be seen from Table 7, the inference time is 323.3 ms on the RWF-2000 dataset when the desktop computer with Intel Core i7-8700 CPU @3.2GHz and 16G RAM is employed. It means that an inference speed of three times per second can be achieved. For 25 fps videos, this computer can serve three cameras if a latency of 960 ms is acceptable. In contrast, the inference time is lower than 23 ms if an RTX 2080Ti GPU is employed instead. In this case, the inference speed is higher than 40 times per second. For the same 25 fps videos and the latency constraint of 960 ms, one RTX 2080Ti GPU can serve 40 cameras. Evidently, even more cameras can be served if an RTX 3080Ti GPU is employed. These observations confirm the applicability of the proposed model in real-time applications.

4.5. Visualization of Activation Maps

We demonstrate the effectiveness of our model in terms of motion feature extraction using the Grad-CAM visualization method [50]. A baseline model, which employs the original EfficientNet-B0 instead of the improved version, is trained for comparison. Given a violent video clip, Grad-CAM visualizations are obtained for both our model and the baseline model.

Experiment results on four violent video clips taken from the RWF-2000 dataset are shown in Figure 9. One can find that, for each video clip, our model exhibits a more distinct focus on the motion area where the violence occurs than the baseline model. Meanwhile, our model shows far fewer misjudgments of motion areas. These observations confirm that our model is effective in motion feature extraction. In principle, the advantage of our model mainly comes from the proposed Bi-LTMA and the TSM.

4.6. Comparison with the State-of-the-Art Methods

We compare our model with the state-of-the-art models in terms of accuracy, computational cost, and parameters on the Movie Fight dataset, the Hockey Fight dataset, the Surveillance Camera dataset, and the RWF-2000 dataset. For a comprehensive comparison, models based on 2D CNNs [1,2,11,15,16,36], (2 + 1)D CNN [33], and 3D CNNs [4,6,7,9,17,34] are taken into consideration. Experiment results of all models are given in Table 8.

As observed from Table 8, each of the models achieves a comparatively lower accuracy on the RWF-2000 dataset than on the Hockey Fight dataset, the Movie Fight dataset, and the Surveillance Camera dataset. The reason lies in the fact that the RWF-2000 dataset covers more complicated scenarios than the other three datasets and, thus, is more challenging for violence detection.

As can be seen from Table 8, our model achieves very competitive accuracies on the four datasets. On the Movie Fight dataset, our model achieved the best accuracy of 100%. On the Hockey Fight dataset, our model achieves an accuracy of 98.5%, which is 0.5% lower than the best and ranked third among all models considered. On the Surveillance Camera dataset, our model achieves an accuracy of 91.67%, which is 0.83% lower than the best and ranked third among all models as well. However, on the RWF-2000 dataset, one can find that our model achieves the best accuracy of 90.25% among all models considered. Compared with the second-placed model, our model shows an advantage of 1% in accuracy. It reveals that our model is more robust than its counterparts in scenarios with complicated backgrounds.

Table 8 also shows that our model is lightweight in both computational cost and parameters. Among all the models considered, our model requires the lowest computational power of 1.21 GFLOPs, which is only 28% of that of the second-placed model and far less than those of others. Meanwhile, our model has only 4.48 M parameters, and ranked third among all models considered. These merits can be attractive, especially for real-time violence detection tasks on resource-limited platforms.

5. Conclusions

In this paper, we propose a lightweight 2D CNN-based violence detection scheme. Frame-grouping is adopted at the preprocessing stage to reduce the data redundancy of videos and realize short-term temporal modeling. An improved EfficientNet-B0 is constructed and employed as a backbone to extract motion information from preprocessed videos. Particularly, we integrate the newly proposed motion attention module Bi-LTMA and the TSM module into the EfficientNet-B0 to enable long-term temporal modeling and temporal feature interaction. Furthermore, an auxiliary classifier tailored for the improved EfficientNet-B0 is proposed for use in the phase of model training to improve violence detection accuracy. Experiment results suggest that our proposed model is lightweight in both computational cost and parameters. Compared with state-of-the-art 2D CNN-based and 3D CNN-based models, the proposed model requires much lower computing power and achieves competitive accuracy, especially on the RWF-2000 dataset. For violence detection applications on platforms with limited computing power, our proposed model can be a promising candidate.

Author Contributions

Conceptualization, J.W., H.L. and D.Z.; methodology, J.W.; software, J.W. and D.Z.; validation, J.W. and H.L.; writing—original draft preparation, J.W.; writing—review and editing, J.W. and D.W.; visualization, J.W.; supervision, D.W.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant 2020YFC0833203.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank those colleagues who have ever provided constructive and valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sudhakaran, S.; Lanz, O. Learning to Detect Violent Videos Using Convolutional Long Short-Term Memory. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
Halder, R.; Chatterjee, R. CNN-BiLSTM Model for Violence Detection in Smart Surveillance. SN Comput. Sci. 2020, 1, 201. [Google Scholar] [CrossRef]
Abdullah, M.S.N.B.; Karim, H.A.; AlDahoul, N. A Combination of Light Pre-Trained Convolutional Neural Networks and Long Short-Term Memory for Real-Time Violence Detection in Videos. IJTech 2023, 14, 1228. [Google Scholar] [CrossRef]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal Residual Networks for Video Action Recognition. arXiv 2016, arXiv:1611.02155. [Google Scholar]
Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar]
Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar]
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9912, pp. 20–36. [Google Scholar]
Cheng, M.; Cai, K.; Li, M. RWF-2000: An Open Large Scale Video Database for Violence Detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
Pu, Y.; Wu, X.; Wang, S.; Huang, Y.; Liu, Z.; Gu, C. Semantic Multimodal Violence Detection Based on Local-to-Global Embedding. Neurocomputing 2022, 514, 148–161. [Google Scholar] [CrossRef]
Ciampi, L.; Santiago, C.; Costeira, J.; Falchi, F.; Gennaro, C.; Amato, G. Unsupervised Domain Adaptation for Video Violence Detection in the Wild. In Proceedings of the 3rd International Conference on Image Processing and Vision Engineering, Prague, Czech Republic, 21–23 April 2023; SCITEPRESS—Science and Technology Publications: Setúbal, Portugal, 2023; pp. 37–46. [Google Scholar]
Lopez, D.J.D.; Lien, C.-C. Two-Stage Complex Action Recognition Framework for Real-Time Surveillance Automatic Violence Detection. J Ambient Intell. Humaniz. Comput. 2023, 14, 15983–15996. [Google Scholar] [CrossRef]
Su, Y.; Lin, G.; Zhu, J.; Wu, Q. Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12349, pp. 74–90. [Google Scholar]
Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3590–3598. [Google Scholar]
Kang, M.-S.; Park, R.-H.; Park, H.-M. Efficient Spatio-Temporal Modeling Methods for Real-Time Violence Recognition. IEEE Access 2021, 9, 76270–76285. [Google Scholar] [CrossRef]
Vijeikis, R.; Raudonis, V.; Dervinis, G. Efficient Violence Detection in Surveillance. Sensors 2022, 22, 2216. [Google Scholar] [CrossRef] [PubMed]
Khan, M.; Saddik, A.E.; Gueaieb, W.; De Masi, G.; Karray, F. VD-Net: An Edge Vision-Based Surveillance System for Violence Detection. IEEE Access 2024, 12, 43796–43808. [Google Scholar] [CrossRef]
Tan, M.; Le, Q. Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Stanford, CA, USA, 2019; Volume 97, pp. 6105–6114. [Google Scholar]
Li, D.; Chen, Q. Dynamic Hierarchical Mimicking towards Consistent Optimization Objectives. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7639–7648. [Google Scholar]
Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7082–7092. [Google Scholar]
Bermejo Nievas, E.; Deniz Suarez, O.; Bueno García, G.; Sukthankar, R. Violence Detection in Video Using Computer Vision Techniques. In Computer Analysis of Images and Patterns; Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A., Kropatsch, W., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6855, pp. 332–339. [Google Scholar]
Akti, S.; Tataroglu, G.A.; Ekenel, H.K. Vision-Based Fight Detection from Surveillance Cameras. In Proceedings of the 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), Istanbul, Turkey, 6–9 November 2019; pp. 1–6. [Google Scholar]
Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
Laptev, I. On Space-Time Interest Points. Int. J. Comput. Vis. 2005, 64, 107–123. [Google Scholar] [CrossRef]
Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up Robust Features. In Computer Vision—ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3951, pp. 404–417. [Google Scholar]
Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent Flows: Real-Time Detection of Violent Crowd Behavior. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 1–6. [Google Scholar]
Gao, Y.; Liu, H.; Sun, X.; Wang, C.; Liu, Y. Violence Detection Using Oriented Violent Flows. Image Vis. Comput. 2016, 48–49, 37–41. [Google Scholar] [CrossRef]
Wang, H.; Schmid, C. Action Recognition with Improved Trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
Guedes, A.R.M.; Chavez, G.C. Real-Time Violence Detection in Videos Using Dynamic Images. In Proceedings of the 2020 XLVI Latin American Computing Conference (CLEI), Loja, Ecuador, 19–23 October 2020; pp. 503–511. [Google Scholar]
Serrano, I.; Deniz, O.; Espinosa-Aranda, J.L.; Bueno, G. Fight Recognition in Video Using Hough Forests and 2d Convolutional Neural Network. IEEE Trans. Image Process. 2018, 27, 4787–4797. [Google Scholar] [CrossRef] [PubMed]
Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary Robust Invariant Scalable Keypoints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [Google Scholar]
Rosten, E.; Drummond, T. Fusing Points and Lines for High Performance Tracking. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05), Beijing, China, 17–21 October 2005; Volume 1, pp. 1508–1515. [Google Scholar]
Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Wang, L.; Tong, Z.; Ji, B.; Wu, G. TDN: Temporal Difference Networks for Efficient Action Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904. [Google Scholar]
Liang, Q.; Li, Y.; Chen, B.; Yang, K. Violence Behavior Recognition of Two-Cascade Temporal Shift Module with Attention Mechanism. J. Electron. Imag. 2021, 30, 043009. [Google Scholar] [CrossRef]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
He, D.; Zhou, Z.; Gan, C.; Li, F.; Liu, X.; Li, Y.; Wang, L.; Wen, S. StNet: Local and Global Spatial-Temporal Modeling for Action Recognition. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8401–8408. [Google Scholar]
Jiang, B.; Wang, M.; Gan, W.; Wu, W.; Yan, J. STM: Spatiotemporal and Motion Encoding for Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2000–2009. [Google Scholar]
Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. TEA: Temporal Excitation and Aggregation for Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 906–915. [Google Scholar]
Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics; Lebanon, G., Vishwanathan, S.V.N., Eds.; PMLR: San Diego, CA, USA, 2015; Volume 38, pp. 562–570. [Google Scholar]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; Weinberger, K.Q. Multi-Scale Dense Networks for Resource Efficient Image Classification. arXiv 2018, arXiv:1703.09844. [Google Scholar]
Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal Pyramid Network for Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 588–597. [Google Scholar]
Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. arXiv 2015, arXiv:1412.6550. [Google Scholar]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for Mobilenetv3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; Fergus, R. Regularization of Neural Networks Using Dropconnect. In Proceedings of the 30th International Conference on Machine Learning; Dasgupta, S., McAllester, D., Eds.; PMLR: Atlanta, GA, USA, 2013; Volume 28, pp. 1058–1066. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]

Figure 1. The pipeline of the proposed lightweight violence detection scheme. F₁–F_3T stand for RGB frames sparsely sampled from the video clip. X₁–X_3T denote grayscale frames converted from F₁–F_3T by RGB2Gray operation. I₁–I_T are 3-channel images obtained by frame-grouping.

Figure 2. The architecture of the EfficientNet-B0.

Figure 3. The architecture of the Inverted Residual block. BN stands for batch normalization, Swish is the activation function used, and SE denotes the Squeeze-and-Excitation attention module.

Figure 4. The architecture of the modified Inverted Residual Block. We insert the Bi-LTMA module before the first convolution block, and the TSM module after the first convolution block.

Figure 5. The structure of the bi-directional long-term motion attention module.

Figure 6. The principle of TSM module.

Figure 7. The architecture of the auxiliary branch.

Figure 8. Normalized confusion matrices on the four datasets used.

Figure 9. Grad-CAM results of our model and the baseline model.

Table 1. The structure of the EfficientNet-B0.

Stage	Operator	Resolution	Channels	Layers
1	Conv3 × 3	224 × 224	32	1
2	MBConv1, k1 × 1	112 × 112	16	1
3	MBConv6, k3 × 3	112 × 112	24	2
4	MBConv6, k5 × 5	56 × 56	40	2
5	MBConv6, k3 × 3	28 × 28	80	3
6	MBConv6, k5 × 5	14 × 14	112	3
7	MBConv6, k5 × 5	14 × 14	192	4
8	MBConv6, k3 × 3	7 × 7	320	1
9	Conv1 × 1 and Pooling & FC	7 × 7	1280	1

Table 2. Ablation studies on the proposed modules on the RWF-2000 dataset.

Bi-LTMA	TSM	Auxiliary Classifier	Accuracy	GFLOPs	Params
			87.5%	0.23	4.26 M
√			88.5%	1.21	4.48 M
	√		88.25%	0.23	4.26 M
		√	88.0%	0.23	4.26 M
√	√		89.25%	1.21	4.48 M
√		√	88.75%	1.21	4.48 M
	√	√	89.25%	0.23	4.26 M
√	√	√	90.25%	1.21	4.48 M

Table 3. Ablation studies on the architecture design of the Bi-LTMA module.

Bidirectional Architecture	Spatial Dimension	Channel Dimension	Accuracy
		√	87.75%
√		√	89.75%
√	√	√	90.25%

Table 4. Ablation studies on the fusion method of the Bi-LTMA module on RWF-2000 dataset.

DropConnect Layer	Shortcut Connection	Fusion Expression	Accuracy
		$A ⊙ F$	87.0%
√		$𝒢 (A ⊙ F)$	87.75%
√	√	$F + 𝒢 (A ⊙ F)$	90.25%

Table 5. Accuracies achieved by using the baseline and our proposed auxiliary classifiers with different values of

λ

on RWF-2000 dataset.

Table 5. Accuracies achieved by using the baseline and our proposed auxiliary classifiers with different values of

λ

on RWF-2000 dataset.

$λ$	Accuracy
$λ$	Baseline	Our Proposed
0.1	88.75%	89.25%
0.2	88.75%	89.5%
0.3	89.0%	90.0%
0.4	89.25%	90.0%
0.5	89.5%	90.25%
0.6	89.25%	89.75%
0.7	88.75%	89.75%
0.8	88.75%	89.25%
0.9	88.5%	89.0%

Table 6. Recall, precision, F1 score, and accuracy of our model on the four datasets used.

Dataset	Recall (%)	Precision (%)	F1 Score (%)	Accuracy (%)
Movie Fight	100	100	100	100
Hockey Fight	100	98	99	98.5
Surveillance	97	98	95	91.67
RWF-2000	91	90	90	90.25

Table 7. Average inference times on the test sets of the four datasets used.

Device	Inference Time
Device	Movie Fight	Hockey Fight	Surveillance	RWF-2000
Inter Core i7-8700 CPU @3.2GHz, 16G RAM	240.8 ms	198.9 ms	199.1 ms	323.3 ms
RTX 2080Ti GPU	21.5 ms	21.0 ms	20.7 ms	22.6 ms
RTX 3080Ti GPU	19.9 ms	19.3 ms	18.9 ms	20.1 ms

Table 8. Comparison with the state-of-the-art methods on Hockey Fight and RWF-2000.

Method	CNN Type	Datasets				GFLOPs	Parameters
Method	CNN Type	Movie Fight	Hockey Fight	Surveillance	RWF-2000	GFLOPs	Parameters
ConvLSTM [1]	2D	100%	97.1%	--	77.00%	15.69	9.76 M
ConvBiLSTM [2]	2D	99.66%	98.78%	--	79.1%	24.52	11.20 M
Efficient Spatiotemporal [15]	2D	100%	99.0%	92.0%	89.25%	4.26	5.29 M
FightCNN-based Attention [16]	2D	99.5%	96.1%	71.00%	82.00%	--	4.074 M
ViT Large-16 [11]	2D	99.5%	98.0%	84.60%	--	--	--
Two-cascade TSM [36]	2D	--	97.5%	--	87.75%	--	8.49 M
Ours	2D	100%	98.5%	91.67%	90.25%	1.21	4.48 M
R(2 + 1)D-18 [33]	(2 + 1)D	100%	95.0%	--	81.25%	44.42	33.20 M
R3D-50 [34]	3D	100%	96.0%	85.21%	80.90%	39.98	46.20 M
VD-Net [17]	3D	99.0%	98.5%	92.50%	--	15.30	49.28 M
Flow Gated Network(RGB) [9]	3D	100%	98.0%	--	84.50%	17.2	0.248 M
SlowFast [7]	3D	100%	97.5%	--	85.00%	23.89	33.79 M
I3D(RGB) [6]	3D	100%	98.5%	84.67%	85.75%	111.33	12.29 M
C3D [4]	3D	100%	96.0%	--	82.75%	154.19	78.00 M

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, J.; Zhao, D.; Li, H.; Wang, D. Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention. Appl. Sci. 2024, 14, 4895. https://doi.org/10.3390/app14114895

AMA Style

Wang J, Zhao D, Li H, Wang D. Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention. Applied Sciences. 2024; 14(11):4895. https://doi.org/10.3390/app14114895

Chicago/Turabian Style

Wang, Jingwen, Daqi Zhao, Haoming Li, and Deqiang Wang. 2024. "Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention" Applied Sciences 14, no. 11: 4895. https://doi.org/10.3390/app14114895

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention

Abstract

1. Introduction

2. Related Works

2.1. Violence Detection

2.2. Attention Mechanism on Temporal Information

2.3. Auxiliary Loss

3. Proposed Violence Detection Model

3.1. Data Preprocessing

3.2. Improved EfficientNet-B0

3.3. Auxiliary Loss

4. Experiments

4.1. Datasets

4.1.1. Movie Fight Dataset

4.1.2. Hockey Fight Dataset

4.1.3. Surveillance Camera Dataset

4.1.4. RWF-2000 Dataset

4.2. Basic Settings

4.3. Ablation Studies

4.3.1. Ablation Studies on The Proposed Modules

4.3.2. Ablation Studies on The Bi-LTMA Module

4.3.3. Ablation Studies on the Auxiliary Head Module

4.4. Model Performance Evaluation

4.4.1. Accuracy Evaluation

4.4.2. Inference Time

4.5. Visualization of Activation Maps

4.6. Comparison with the State-of-the-Art Methods

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI