Article

Violence-YOLO: Enhanced GELAN Algorithm for Violence Detection

Wenbin Xu, Dingju Zhu, Renfeng Deng, KaiLeung Yung and Andrew W. H. Ip
1 School of Software, South China Normal University, Foshan 528000, China
2 School of Computer Science, South China Normal University, Guangzhou 510000, China
3 Department of Industrial and Systems Engineering, Hong Kong Polytechnic University, Hong Kong 999077, China
4 Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A2, Canada
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6712; https://doi.org/10.3390/app14156712
Submission received: 10 June 2024 / Revised: 23 July 2024 / Accepted: 29 July 2024 / Published: 1 August 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Violence is a serious threat to societal health; preventing violence in airports, airplanes, and spacecraft is crucial. This study proposes the Violence-YOLO model to detect violence accurately and in real time in complex environments, enhancing public safety. The model is based on YOLOv9's Generalized Efficient Layer Aggregation Network (GELAN-C). A multilayer SimAM is incorporated into GELAN's neck to identify attention regions in the scene. YOLOv9 modules are combined with RepGhostNet and GhostNet, and two new modules, RepNCSPELAN4_GB and RepNCSPELAN4_RGB, are proposed and introduced. The shallow convolution in the backbone is replaced with GhostConv, reducing computational complexity. Additionally, an ultra-lightweight upsampler, Dysample, is introduced to enhance performance and reduce overhead. Finally, Focaler-IoU addresses the neglect of simple and difficult samples, improving training accuracy. The datasets are derived from RWF-2000 and Hockey. Experimental results show that Violence-YOLO outperforms GELAN-C: mAP@0.5 increases by 0.9%, the computational load decreases by 12.3%, and the model size is reduced by 12.4%, which is significant for embedded hardware such as the Raspberry Pi. Violence-YOLO can be deployed to monitor public places such as airports, handling complex backgrounds while ensuring accurate and fast detection of violent behavior. In addition, we achieved 84.4% mAP on the Pascal VOC dataset with significantly fewer model parameters than previously refined detectors. This study offers insights for the real-time detection of violent behaviors in public environments.

1. Introduction

With the continuous development of modern society, the frequency and scope of violent behavior have gradually increased, posing serious challenges to social security and public safety. An effective violence detection system is of great significance in the field of surveillance, as it can help detect and respond to potential violent events promptly, thereby reducing unnecessary injuries and property losses. Traditional violent behavior detection methods mainly rely on manual surveillance or simple rule-based determination, suffering from low recognition accuracy and efficiency. With the development of computer vision and artificial intelligence, intelligent video surveillance is gradually replacing traditional video surveillance. Using computer vision technology to automatically identify and locate human violence in images or videos, it can effectively monitor and control violent scenes [1,2]. This technology can be applied to streets, factories, schools, shopping centers, and other public places.
Behavior recognition is an important research direction in the field of computer vision, enabling the automatic understanding of specific behavior semantics by analyzing body movements and motions of individuals in a scene [3,4]. Violence detection generally relies on deep learning models for training and inference, which effectively reduces the labor costs associated with violence detection and improves both accuracy and efficiency. Traditional methods, such as motion detection, background subtraction, and target tracking, have been widely used for violence detection in surveillance systems. However, these methods have limitations in terms of accuracy and robustness, especially in complex and chaotic environments [5,6].
Recent studies have shown that deep learning-based models, such as recurrent neural networks (RNN) and convolutional neural networks (CNN), can significantly improve the accuracy and efficiency of vision-based violence detection systems [7,8]. These models can process and analyze image and video data, extract complex features, and identify violent events in real time. Despite recent advances in violence detection, the use of deep learning-based methods shows great promise in improving efficiency and accuracy. However, these methods still face challenges such as high computational requirements, restricted viewpoints, varying illumination conditions, complex crowd scenes, and intensity variations. Additionally, the weaker feature representation capability of the models and poor model generalization remain significant limitations [3,9,10].
Considering the above challenges and limitations, we improve YOLOv9, which offers outstanding advantages in lightweight design, captures richer gradient flow information, and provides better feature representation, and apply it to violent-behavior target detection; such a system would be invaluable for real-world surveillance. The main contributions are as follows:
  • The simple, parameter-free attention module (SimAM) is integrated into the neck structure of GELAN to identify attention regions in the scene.
  • By integrating GhostNet and RepGhostNet modules into the YOLOv9 network model, we introduce two new modules, RepNCSPELAN4_GB and RepNCSPELAN4_RGB, in the backbone and neck networks. Furthermore, the shallow ordinary convolution in the backbone network is replaced with GhostConv, reducing computational complexity.
  • A lightweight universal upsampling operator, DySample, is used to replace the traditional nearest neighbor interpolation upsampling module, minimizing the loss of feature information during the upsampling process.
  • Combined with Focaler-IoU loss, it mitigates the neglect of simple and difficult samples, focusing on different regression samples, thus improving training accuracy.
  • Violence-YOLO, a violent behavior detection algorithm based on YOLOv9, is proposed. On our customized dataset, the average precision of detection during training (mAP@0.5) reaches 92.6%, reflecting an improvement of 0.9%. Additionally, the computational load and the model size are reduced by 12.3% and 12.4%, respectively.

2. Related Work

Currently, methods for detecting violent behavior primarily include traditional machine learning techniques and deep learning-based computer vision approaches. In previous studies, traditional machine learning methods were mainly relied upon for violence detection. These methods typically depend on manually designed features and selected classifiers to determine violence, which poses several limitations. For example, hand-designed features often require prior knowledge of the input data, and some complex violent features are challenging to extract manually. Additionally, obtaining sufficient labeled violence data is difficult, resulting in trained classifiers that fail to achieve expected performance. Moreover, traditional methods tend to be inefficient when dealing with large-scale video data, making them impractical for real-world applications [11].
Recently, deep learning techniques have gained popularity for detecting violence. Earlier work in this area [12] proposed a deep learning-based approach to recognize instances of violence in surveillance videos using a two-stream convolutional neural network (CNN) architecture to extract spatio-temporal features, achieving 89% classification accuracy on a surveillance dataset. Similarly, in [13], a deep learning-based model for social media violence detection was developed using long short-term memory (LSTM) to extract visual features and temporal dependencies, obtaining 94.9% accuracy on the same dataset. Additionally, in [14,15], other researchers have presented deep learning approaches to detect violence in city surveillance videos.

2.1. 3D-CNN

Deep learning has proven effective at addressing challenges in computer vision. 3D-CNN, a deep learning architecture for video analysis tasks that require both spatial and temporal information, extends 2D-CNN to process video sequences as input. These 3D-CNN architectures typically consist of multiple convolutional and pooling layers that learn to extract spatio-temporal features from video sequences; the output of the convolutional layers is then passed through a fully connected layer and an activation function for the final prediction. For instance, the method in [16] utilizes key information provided by Hough features to represent a frame in a sequence. Liu et al. [17] used 3D-CNN to identify violent scenes in video applications, with sampling as a preprocessing step. Other researchers have developed deep learning-based models to detect violent scenes using transfer learning techniques, while [18] introduced a Spark framework to detect violent scenes via bidirectional LSTM. Similarly, ref. [19] introduced the idea of ensembles, and ref. [9] used a combination of 3D-CNN and Support Vector Machines (SVMs) to recognize violent behaviors in videos [20].

2.2. CNN-RNN

Many researchers believe that video data require more than feature extraction with a CNN; an RNN must also be included in the model so that the extracted features are considered over time. The CNN-RNN model was therefore proposed for anomaly detection in video datasets. Mai Magdy et al. [21] proposed the Violence 4D model based on the 3D-CNN model, using ResNet50 as a backbone and dense optical flow to obtain the region of interest of RGB frames; its 4D convolutional blocks capture the interactions between video segments to improve the 3D-CNN's representation at the segment level, and layering 4D residual blocks over the 3D-CNN enables long-term modeling. Kin-Choong Yow et al. [22] proposed a new model called KianNet, designed to improve on the ResNet50 and ConvLSTM architectures by combining them with a multi-head self-attention layer to effectively detect violence in recorded events, achieving 96.21% accuracy on the RWF dataset and surpassing Violence 4D.
However, a comprehensive literature analysis shows that many existing methods must overcome numerous limitations and challenges. These include inadequate integration with state-of-the-art IoT devices, heavy reliance on end-to-end pre-trained models, failure to integrate cloud-based concepts, and the use of hand-crafted features [23]. These algorithms are computationally intensive, making it difficult to achieve real-time detection and failing to meet the real-world needs of surveillance systems.

2.3. YOLO

YOLO is currently the most popular family of real-time target detectors, including YOLOv5, YOLOv7, YOLOv8, and YOLOv9. YOLOv9 is an evolution of the YOLO target detection method with admirable real-time detection capabilities. By combining PGI and GELAN, YOLOv9 introduces several architectural advances and training methods that improve accuracy and performance. YOLOv9 builds on the framework of YOLOv7 and dynamic YOLOv7, including RepConv with CSPNet blocks and GELAN [24]; a simplified downsampling module and an optimized anchor-free prediction head are implemented, and the auxiliary loss part of PGI follows the auxiliary head setup of YOLOv7 [25,26]. As a single-stage detection algorithm, GELAN is an anchor-free detection method with outstanding advantages in lightweight design, capturing richer gradient flow information, and providing better feature representation. Its network structure is shown in Figure 1. It is characterized by fast detection speed, high accuracy, and low computational cost [27].
Therefore, we propose Violence-YOLO, a unique YOLO model based on GELAN-C, which is a lighter and more accurate violence detection model.

3. Materials and Methods

3.1. Overview of Violence-YOLO

GELAN, the Generalized Efficient Layer Aggregation Network newly proposed in YOLOv9, the latest version of the YOLO family of detection models, integrates lightweight design, inference speed, and accuracy by combining two neural network architectures designed through gradient path planning, CSPNet [28] and ELAN [29]. We enhance this architecture, renowned for its joint detection and segmentation capabilities, and apply it to the domain of violence detection. The architecture of our Violence-YOLO detector is depicted in Figure 2 and comprises three components: backbone, neck, and head. GhostConv, proposed in GhostNet, generates more feature maps through cost-effective operations [30]. The Violence-YOLO backbone employs GhostConv to replace standard convolution at the shallow levels. The Ghost bottleneck, designed to leverage GhostConv, reduces the computational cost of the standard convolutional layer while maintaining comparable recognition performance [31]. The RepGhost module is a hardware-efficient module for feature reuse that employs structural re-parameterization techniques [32] for implicit feature reuse; the RepGhost bottleneck module is in turn built from the RepGhost module. The RepNCSPELAN4 module is the feature extraction and fusion module of GELAN, combining the CSP and ELAN modules, as shown in Figure 2. The Violence-YOLO network integrates RepNCSPELAN4 with the Ghost bottleneck and the RepGhost bottleneck, respectively, to form the new modules RepNCSPELAN4_GB and RepNCSPELAN4_RGB, which preserve rich gradient flow information while maintaining a lightweight design. These two new modules are integrated into the backbone network. In the final layer of the backbone network, we retain GELAN's SPPELAN module to extract features at different spatial resolutions, enhancing resilience to object size variations and occlusions.
In the neck component, we employ the PAN-FPN [33] for feature fusion, enhancing the integration and utilization of feature-layer information across different scales. The SimAM attention module is lightweight, practical, and simple to use, attending to both channel and spatial information; it computes 3D attention weights without additional parameters, thereby avoiding structural modifications that would increase the model size. We utilize two lightweight Dysample upsampling operators and multiple RepNCSPELAN4_GB modules, alongside a decoupled head structure, to form the neck. The resulting model, Violence-YOLO, is a detection model that balances accuracy and lightweight design, specifically engineered for violence detection.

3.2. Lightweight Modules

To facilitate the deployment of the model on embedded devices with limited memory and computational resources while efficiently utilizing the model’s detection accuracy, this study introduces GhostConv, a lightweight alternative to specific Conv operations in the GELAN algorithm, to reduce computational and parameter requirements. The distinctions between Conv and GhostConv are illustrated in Figure 3.
GhostConv comprises two parts: the first part performs a standard convolution to obtain the intrinsic feature maps, precisely controlling the number of convolutions, while the second part applies a series of cheap linear operations to these intrinsic feature maps to produce additional feature maps. Finally, the two sets of feature maps are concatenated to form the output. This technique significantly reduces the parameter count and computational load while maintaining model performance, providing a viable pathway for more efficient and lightweight neural network deployment. Suppose c, h, and w denote the number of channels, height, and width of the input features, respectively; the height and width of the output features are denoted by h′ and w′; the number of convolution kernels is denoted as n, the kernel size as k, the linear-transformation kernel size as d, and the number of transformations as s. Then r_s and r_c represent the ratios of the number of floating point operations and parameters between Conv and GhostConv, as shown in the following:
$$
r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{1}
$$
$$
r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{2}
$$
From the formulas, it can be concluded that r_s and r_c are governed by the number of transformations s: increasing s makes the model more effective at reducing the number of parameters and floating point operations. Therefore, introducing GhostConv into the model is an effective strategy to significantly reduce FLOPs and parameters, ultimately improving the model's speed and efficiency. Leveraging the Ghost module, the Ghost Bottleneck (G-bneck) is designed. As shown in Figure 4, the Ghost Bottleneck primarily consists of two stacked Ghost modules. The first Ghost module acts as an expansion layer, enhancing feature expression by increasing the number of channels; here, the expansion ratio is defined as the ratio of output channels to input channels. The second Ghost module then reduces the number of channels to match the shortcut path, and a shortcut connects the inputs and outputs of these two Ghost modules. Batch Normalization and the ReLU nonlinearity are applied after each layer, except that ReLU is not applied after the second Ghost module, since adding ReLU after depthwise convolution diminishes its effect. The Ghost Bottleneck designed in this way applies to the case of stride = 1. For stride = 2, the shortcut path is implemented with a downsampling layer, and a depthwise convolution with stride = 2 is used between the two Ghost modules. To improve computational efficiency, the initial convolution in the Ghost module is implemented via pointwise convolution.
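For illustration, a minimal PyTorch sketch of such a Ghost-style convolution is given below; the ratio s = 2, the SiLU activation, and the 5 × 5 cheap kernel are illustrative assumptions rather than the exact configuration used in Violence-YOLO.

```python
import torch
import torch.nn as nn

class GhostConvSketch(nn.Module):
    """Ghost-style convolution: a primary conv yields the intrinsic feature
    maps, a cheap depthwise conv generates the remaining maps (s = 2 here),
    and the two sets are concatenated along the channel dimension."""

    def __init__(self, c_in: int, c_out: int, k: int = 1, cheap_k: int = 5):
        super().__init__()
        c_mid = c_out // 2  # intrinsic channels (ratio s = 2)
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_mid, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(  # cheap depthwise "linear" operation
            nn.Conv2d(c_mid, c_mid, cheap_k, padding=cheap_k // 2,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)                      # intrinsic feature maps
        return torch.cat((y, self.cheap(y)), 1)  # + cheap feature maps

# Example: a 64 -> 128 channel Ghost convolution
# out = GhostConvSketch(64, 128)(torch.randn(1, 64, 640, 640))
```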
The structure of the RepGhost bottleneck is illustrated in Figure 5. This structure halves the number of input channels using a 1 × 1 convolutional layer and the ReLU activation function. After the first RepGhost module, the number of channels is preserved, and the model’s capacity to perceive channel features is further enhanced using the SE module and 1 × 1 convolution, ensuring that the output channels match the input channels. Finally, the feature and input feature mappings are proportionally weighted and summed post-RepGhost module to generate the output results. During the inference phase, the RepGhost bottleneck retains only two branches to conserve memory resources and enhance inference speed.
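The structural re-parameterization on which RepGhost relies ultimately amounts to folding BatchNorm statistics (and any parallel training-time branches) into the preceding convolution once training is finished. The snippet below is a generic sketch of that standard Conv + BN fusion step, not the authors' RepGhost code, and only illustrates why the extra training-time branches add no inference cost.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution so that the
    Conv + BN pair runs as a single convolution at inference time."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                               # per-channel gain
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
    return fused
```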
In this paper, the RepNBottleneck in RepNCSP, a submodule of RepNCSPELAN4, is replaced to improve it. RepNBottleneck is substituted with the Ghost Bottleneck and the RepGhost Bottleneck, illustrated in Figure 4 and Figure 5, resulting in the improved RepNCSPELAN4 modules, named RepNCSPELAN4_GB and RepNCSPELAN4_RGB, as shown in Figure 6.
The two modules were compared to the RepNCSPELAN4 module by replacing all instances of RepNCSPELAN4 in the backbone and neck networks. This enhancement reduces the computational cost of the network while maintaining similar identification performance.

3.3. Upsampling Module

In target detection tasks, upsampling is required to resize input feature mappings to match the size of the original image. This ensures the model can effectively detect targets of various sizes and distances. Traditional upsampling methods typically rely on bilinear interpolation. These methods have many limitations and can lead to the loss of critical image details. Additionally, traditional kernel-based upsamplers are not suitable for lightweight architectures due to time-consuming dynamic convolution and the need for additional sub-networks to generate dynamic kernels, leading to high computational effort and parameters [34]. In violent scenes captured by real cameras, images of violent behavior may be relatively small and suffer from pixel distortion, leading to the loss of fine-grained details and difficulties in feature learning. To address this issue, the model introduces Dysample, a lightweight and efficient dynamic upsampling operator that achieves efficient upsampling with low computational resources, enhancing the model’s capability to detect violent acts with low resolution or blurring.
Dysample completely avoids time-consuming dynamic convolution operations and additional sub-networks by employing a Differential Sampling approach, selecting only the most disparate parts of the data distribution for sampling. By retaining only the pixel values with significant differences, the data volume can be effectively reduced, computational complexity lowered, and upsampling speed and efficiency improved.
The network structure of Dysample is illustrated in Figure 7.
The sampling set S is the sum of the original sampling grid G and the generated offset O, which is produced using the "linear + pixel shuffling" method, with the offset range determined by a static or dynamic scope factor. Taking the static-factor variant as an example, given the upsampling scale factor s and a feature map X of size C × H × W, a linear layer with C input channels and 2s² output channels generates an offset O of size 2s² × H × W, which is then reshaped to 2 × sH × sW by pixel shuffling. The sampling set S is the sum of the offset O and the original sampling grid G, as defined by Equations (3) and (4). Consequently, an upsampled feature map of size C × sH × sW is generated.
$$
O = \mathrm{linear}(X) \tag{3}
$$
$$
S = G + O \tag{4}
$$
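As a rough illustration of this sampling-based upsampling, the following PyTorch sketch predicts offsets with a 1 × 1 "linear" layer, rearranges them by pixel shuffling, forms S = G + O, and resamples the feature map with grid_sample. The scope factor of 0.25 and the coordinate conventions are simplifying assumptions and do not reproduce the official DySample implementation exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Simplified point-sampling upsampler in the spirit of DySample."""

    def __init__(self, channels: int, scale: int = 2, scope: float = 0.25):
        super().__init__()
        self.scale, self.scope = scale, scope
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.scale
        # O: offsets of shape (B, 2, sH, sW), kept small by the scope factor
        o = F.pixel_shuffle(self.offset(x), s) * self.scope
        # G: original sampling grid in input-pixel coordinates
        ys, xs = torch.meshgrid(
            torch.arange(s * h, device=x.device, dtype=x.dtype),
            torch.arange(s * w, device=x.device, dtype=x.dtype),
            indexing="ij")
        grid = torch.stack(((xs + 0.5) / s - 0.5, (ys + 0.5) / s - 0.5)).unsqueeze(0)
        coords = grid + o                                   # S = G + O
        # Normalize to [-1, 1] and resample at the dynamic positions
        size = torch.tensor([w, h], device=x.device, dtype=x.dtype).view(1, 2, 1, 1)
        coords = 2.0 * (coords + 0.5) / size - 1.0
        return F.grid_sample(x, coords.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=False)
```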

3.4. Attention Mechanism

Attention mechanisms in convolutional neural networks (CNNs) help the network accurately focus on input-related information, proving to be crucial for enhancing the performance of deep neural networks. However, existing attention mechanisms often require additional parameters, increasing model complexity and computational cost. SimAM is a lightweight, parameter-free attention mechanism for convolutional neural networks that effectively improves CNN performance without introducing additional parameters.
Studies have shown that 3D weights are superior to traditional one- and two-dimensional weighted attention mechanisms. The connection pattern diagram for the SimAM network is shown in Figure 8. The algorithm can infer the 3D attention weights of a feature mapping layer without adding parameters to the original network. Based on well-known neuroscience theories, it determines the importance of each neuron by optimizing an energy function. The energy function is defined as shown in Equation (5):
$$
e_t(w_t, b_t, y, x_i) = \frac{1}{M-1} \sum_{i=1}^{M-1} \left( -1 - (w_t x_i + b_t) \right)^2 + \left( 1 - (w_t t + b_t) \right)^2 + \lambda w_t^2 \tag{5}
$$
In Equation (5), w_t and b_t are the parameters of the energy function, M is a constant term, λw_t² is a regularization term, x_i is the input feature mapping, and y is the output feature mapping. Most operators are chosen based on the closed-form solution of the defined energy function, minimizing the need for extensive structural tuning.
The introduction of SimAM in the violence detection network model enhances its capability to perform lightweight, high-accuracy detection tasks.
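A minimal sketch of such a parameter-free SimAM layer, following the widely used closed-form solution of the energy function, could look as follows; the regularization constant λ = 1e-4 is an illustrative choice.

```python
import torch
import torch.nn as nn

class SimAMSketch(nn.Module):
    """Parameter-free SimAM attention: per-neuron weights follow the
    closed-form minimum of the energy function in Equation (5)."""

    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared deviation of every neuron from its channel mean
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
        v = d.sum(dim=[2, 3], keepdim=True) / n          # channel variance
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5      # inverse energy
        return x * torch.sigmoid(e_inv)                  # 3D attention weights
```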

3.5. Loss Function

The loss function evaluates the difference between the model's predicted and actual values; the smaller the loss, the better the model usually performs, and different models generally use different loss functions. IoU (Intersection over Union) measures how accurately the corresponding object is detected in a given dataset and is widely used in the loss functions of target recognition algorithms. In target detection, bounding box regression plays a crucial role, and localization accuracy largely depends on the bounding box regression loss. Existing studies improve regression performance by exploiting the geometric relationship between bounding boxes while ignoring the effect of the distribution of difficult and easy samples on bounding box regression [35]. For example, in detection tasks dominated by simple samples, focusing on simple samples during bounding box regression helps to improve detection performance. We therefore introduce the Focaler-IoU method, which uses a linear interval mapping to reconstruct the IoU loss and improve bounding box regression; by focusing on different regression samples, it adapts to different detection tasks. It is defined as in Equation (6):
$$
IoU^{focaler} =
\begin{cases}
0, & IoU < d \\
\dfrac{IoU - d}{u - d}, & d \le IoU \le u \\
1, & IoU > u
\end{cases} \tag{6}
$$
where IoU^focaler is the reconstructed Focaler-IoU, IoU is the original IoU value, and [d, u] ⊂ [0, 1]. By adjusting the values of d and u, IoU^focaler can be made to focus on different regression samples. Its loss is defined as in Equation (7).
$$
L_{Focaler\text{-}IoU} = 1 - IoU^{focaler} \tag{7}
$$
Focaler-IoU adjusts the loss according to the Intersection over Union (IoU) value. This design makes the loss function sensitive to IoU values within a certain interval, so that it focuses more on samples where the predicted bounding box overlaps the ground-truth bounding box moderately, i.e., samples that are neither too difficult nor too easy. It helps the model learn how to better extract features from moderately difficult samples, rather than focusing only on the easiest or most difficult ones.
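A compact sketch of this linear interval mapping and the corresponding loss, with illustrative values of d and u, is shown below.

```python
import torch

def focaler_iou(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    # Equation (6): clamp the linear interval mapping to [0, 1];
    # d and u here are illustrative values, not the paper's tuned settings.
    return ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)

def focaler_iou_loss(iou: torch.Tensor) -> torch.Tensor:
    # Equation (7): L_Focaler-IoU = 1 - IoU_focaler
    return 1.0 - focaler_iou(iou)
```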

4. Experimental Design and Analysis of Results

4.1. Data Set

Our experiments are mainly based on the Violence data set; to evaluate and validate the generalization performance of our model, we also introduce the Pascal VOC dataset.

4.1.1. Violence Data Set

In this study, a dataset of violent acts was created based on the Hockey Dataset [36] and the RWF-2000 dataset [37]; we name this combined dataset the Violence data set. The Hockey Dataset contains videos of NHL games, covering scenes where people are fighting from different angles; it comprises 1000 videos, 500 involving violent acts and 500 containing nonviolent acts. The RWF-2000 dataset, in contrast, collects 2000 video clips from security cameras that have not been modified by multimedia techniques, ensuring practical applicability. From both datasets, we selected the clearer videos and used OpenCV for frame processing, extracting one image frame every five frames. We then further filtered images containing key features of violent behavior, obtaining more than 1600 samples in total. We used the LabelImg annotation tool (v.1.8.6) to label violent behaviors (Violence) and nonviolent behaviors (NonViolence) in these samples, generating the corresponding label files. To expand the number of samples and improve the model's generalization ability and robustness, we performed data augmentation: horizontal flips, brightness adjustments, and rotations were applied to individual samples. In addition, we used the MixUp augmentation method, reading two images at a time, processing each with augmentation techniques such as flipping, scaling, and color-space conversion, and finally merging the ground-truth boxes of the two images. After augmentation, we obtained a total of 5987 samples and divided the dataset into training, validation, and testing sets in an 8:1:1 ratio. This dataset, shown in Figure 9, represents a comprehensive and well-curated resource for experimental studies.
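The frame-sampling step described above can be sketched with OpenCV as follows; the file paths in the usage example are hypothetical.

```python
import cv2

def extract_frames(video_path: str, out_dir: str, step: int = 5) -> int:
    """Save one frame out of every `step` frames of a video, as in the
    dataset preparation described above. Returns the number of saved frames."""
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths): extract_frames("rwf2000/fight_001.avi", "frames/fight_001")
```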

4.1.2. Pascal VOC

The PASCAL Visual Object Classes (VOC) dataset offers a comprehensive and standardized benchmark for image recognition and classification tasks. This dataset was the basis for an annual image recognition challenge from 2005 to 2012. The primary datasets, VOC 2007 and VOC 2012, are organized into four major categories and twenty subcategories, establishing them as pivotal benchmarks for evaluating object detection algorithms.
The VOC 2007 dataset comprises 5011 annotated images in the training and validation (trainval) set, and 4952 annotated images in the test set, resulting in a total of 9963 annotated images. The VOC 2012 dataset contains 11,540 annotated images in the trainval set. In our study, we trained the detector using the combined trainval datasets from both VOC 2007 and VOC 2012, and we assessed its performance on the VOC 2007 test set.

4.2. Experimental Environment and Parameter Setting

The experiments were conducted on NVIDIA A30 GPUs (NVIDIA, Santa Clara, CA, USA); the software environment comprised the CentOS operating system (v.2), the Python 3.8 programming language, the PyTorch 1.10 deep learning framework, and CUDA 11.3. The training parameters included up to 300 iterations, a batch size of 8, a learning rate of 0.01, an SGD optimizer with a momentum of 0.9, and an input image size of 640 × 640 (images are resized to this size before detection).
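For reference, these settings can be summarized in a small configuration object; the key names below are purely illustrative and are not the options of any particular training framework.

```python
# Hypothetical configuration dictionary mirroring the training settings above.
train_cfg = dict(
    epochs=300,        # up to 300 training iterations
    batch_size=8,
    optimizer="SGD",
    lr0=0.01,          # initial learning rate
    momentum=0.9,
    imgsz=640,         # images resized to 640 x 640 before detection
    device="cuda",     # NVIDIA A30, CUDA 11.3, PyTorch 1.10, Python 3.8
)
```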

4.3. Evaluation Metrics

To better evaluate the performance of the model, the following metrics are used in this experiment: precision (P), mean average precision (mAP), model parameters (Parameters), giga floating point operations (GFLOPs), and frames per second (FPS). The specific formulas are as follows:
$$
P = \frac{TP}{TP + FP} \times 100\% \tag{8}
$$
$$
R = \frac{TP}{TP + FN} \times 100\% \tag{9}
$$
$$
AP = \int_0^1 P(R) \, dR \times 100\% \tag{10}
$$
$$
mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k \times 100\% \tag{11}
$$
$$
FPS = \frac{1}{\text{Processing time per frame}} \tag{12}
$$
In this experiment, n represents the number of detection types, and two categories of objects are detected. TP denotes the number of true positive detections, FP denotes the number of false positive detections, and FN denotes the number of false negative detections. Precision (P) is the ratio of the number of target boxes correctly predicted as positively classified by the model to the number of all target boxes predicted as positively classified by the model. AP represents the accuracy of a single category, while mAP represents the average accuracy across all categories. mAP@50 is the mean average precision (mAP) calculated at an Intersection over Union (IoU) threshold of 0.5, specifically evaluating the model’s performance when there is a high degree of overlap between the predicted and ground truth bounding boxes. For violence detection, where speed and agility are critical, mAP@50 is even more important. mAP@[0.5:0.95] is the average accuracy calculated over IoU thresholds ranging from 0.50 to 0.95 in increments of 0.05. FPS (frames per second) measures the speed of image processing or model inference.
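As a small worked example of these definitions, the following NumPy sketch computes precision and recall from detection counts and integrates the precision-recall curve to obtain AP; averaging AP over the n classes then yields mAP (Equation (11)).

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    # Equations (8) and (9)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    # Equation (10): area under the precision-recall curve, computed with the
    # usual monotonically decreasing (interpolated) precision envelope.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```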

4.4. Impact of Lightweight Modules on Algorithm Performance

To evaluate the impact of introducing lightweight modules at different locations within GELAN-C on algorithm performance, we conducted comparative experiments; the results are shown in Table 1. "One" denotes replacing the second of the only two Convs in GELAN-C with GhostConv. "All" denotes replacing all Convs in GELAN-C with lightweight GhostConv, RepNCSPELAN4_GB, or RepNCSPELAN4_RGB modules. The "BackBone" scenario replaces all Conv or RepNCSPELAN4 modules in the GELAN-C backbone with lightweight GhostConv, RepNCSPELAN4_GB, or RepNCSPELAN4_RGB modules. "BackBone_RRG2" denotes replacing two RepNCSPELAN4 modules in the backbone network with RepNCSPELAN4_RGB modules, and "BackBone_RRG3" denotes replacing three in the same manner. "Ours" replaces the second Conv in the backbone with GhostConv, the first two RepNCSPELAN4 modules with RepNCSPELAN4_RGB modules, and the remaining modules with RepNCSPELAN4_GB modules.
The results indicate that replacing the second Conv in the neck of GELAN-C with GhostConv improves F1 by 0.2%, mAP@0.5 by 0.6%, and mAP@[0.5:0.95] by 0.2%. Replacing all RepNCSPELAN4 modules in the GELAN-C network with RepNCSPELAN4_GB reduces the parameters to 88.7% and the FLOPs to 88.8% of the original values, while mAP@0.5 decreases by only 0.2%, which is more efficient than using RepNCSPELAN4_RGB. The experiments also demonstrate that the number of RepNCSPELAN4_RGB modules should be limited: when only two RepNCSPELAN4 modules in the backbone network are replaced with RepNCSPELAN4_RGB, mAP@0.5 reaches 91.8%, but FPS decreases by 10%. The improved scheme replaces the second Conv in the neck of GELAN-C with GhostConv, the first two RepNCSPELAN4 modules in the backbone with RepNCSPELAN4_RGB, and all other RepNCSPELAN4 modules with RepNCSPELAN4_GB. The number of parameters decreases to 87.6% of the original and the GFLOPs decrease to 87.3%, while FPS improves by 0.7 and mAP@0.5 drops by only 0.1%. These results highlight the effectiveness of the proposed lightweight optimization.

4.5. Impact of Different Attention Mechanisms on Algorithm Performance

To investigate the effects of various attention mechanisms on the neck network, this study presents comparative experiments on an enhanced GELAN model that incorporates different attention mechanisms into the neck. The enhanced GELAN model integrates GhostConv, the RepNCSPELAN4_GB module, the RepNCSPELAN4_RGB module, and the Dysample module (abbreviated as GD-GELAN). The experimental results are shown in Table 2. The data indicate that although mAP@0.5 is highest after adding SKA attention, the increase in the number of parameters is substantial. Despite a slight increase in mAP@[0.5:0.95] and FPS with SE attention, the all-important mAP@0.5 is still not as good as with SimAM. Overall, introducing SimAM attention in the neck network provides the best detection performance.

4.6. Ablation Experiments

The experimental data in Table 3 show that each enhancement improves the network's detection performance to varying degrees. The GhostConv, RepNCSPELAN4_GB, and RepNCSPELAN4_RGB lightweight modules were first introduced into the GELAN-C network; these Ghost lightweight modules reduce the parameters and FLOPs to 87.6% and 87.3% of the original, respectively, while mAP@50 decreases by only 0.1%. The results demonstrate the effectiveness of the lightweight module optimization. Introducing the Dysample module in the neck achieves efficient upsampling, resulting in a 0.2% increase in mAP@50. Adding the SimAM attention module after the layer-3 RepNCSPELAN4_GB module in the neck improves the network's average detection accuracy without increasing the computational load and slightly increases the detection speed. Finally, the introduction of Focaler-IoU results in a 0.7% increase in mAP@0.5, indicating that Focaler-IoU reduces the neglect of simple and difficult samples and improves training accuracy. Overall, the improved GELAN network increases mAP@0.5 by 0.9% while reducing the parameters and FLOPs by 12.3% and 12.4%, respectively.

4.7. Comparative Experiments

To further validate the effectiveness of the proposed model, we trained various target detection models of similar size using the same test set and configuration. Compared with classical algorithms such as YOLOv9-C and RT-DETR, our model demonstrates higher accuracy in detecting violent behaviors. As shown in Table 4, the mAP@0.5 and mAP@[0.5:0.95] of the proposed Violence-YOLO algorithm reach 92.6% and 75.0%, respectively. The number of parameters is higher than that of YOLOv3-tiny, but the GFLOPs are relatively low. Although the parameters and computational requirements of YOLOv3-tiny are lower than those of our model, its detection accuracy is significantly lower. In addition, although the computational requirements of the YOLOv9-C and RT-DETR algorithms are much higher, their average detection accuracies are still inferior to ours. Compared with all the aforementioned algorithms, the proposed Violence-YOLO model excels in detection performance while maintaining lower parameter counts and computational effort, demonstrating the feasibility and effectiveness of the improved algorithm.
To verify the generalizability of our proposed method, we conducted experiments on the Pascal VOC dataset and compared its performance with GELAN-C and several other state-of-the-art detectors. The experimental results are shown in Table 5, where the performance metrics of the compared detectors are taken from their respective original articles. The results show that although YOLOv8m-World achieves better precision and detection speed than our method, it is inferior in terms of model parameters, computational complexity, and mean average precision (mAP); YOLOv5m likewise surpasses our method only in detection speed and computational complexity. Compared with GELAN-C, at a cost of 0.3% in mean average precision (mAP50), the precision (P) improves by 1%, the computational load is reduced by 12.7%, the model size is reduced by 12%, and the detection speed is slightly improved. Taken together, our proposed model performs better overall, which demonstrates its generalizability on public datasets.

4.8. Visualization and Analysis

To visualize our model’s ability to predict violent behavior, we plotted the confusion matrices of the improved model and GELAN, as depicted in Figure 10. The horizontal axis represents real categories, and the vertical axis represents predicted categories, where the background is usually regarded as an implied category that does not affect the actual performance of the model; the model automatically handles the background region. The diagonal values represent the proportion of correct predictions for each category, while the off-diagonal values indicate misclassifications. As illustrated in Figure 10, the values on the diagonal of the confusion matrix for the improved model have increased compared to those for GELAN, with the proportion of correct predictions for violent behavior increasing from 0.90 to 0.92, and for nonviolent behavior from 0.84 to 0.85. This demonstrates that the improved model’s performance in distinguishing between violent and nonviolent behavior categories has significantly improved. We will continue to fine-tune the model to balance performance across all categories in future iterations.
To demonstrate the precision of the model in predicting positive classes at different recall levels, PR (precision-recall) curves for the improved model and GELAN were generated, as illustrated in Figure 11. The comparison shows that the improved model achieves higher accuracy in both categories, particularly in the "NonViolence" category, where the accuracy increases from 0.893 to 0.904. Additionally, the mean average precision (mAP@0.5) across all categories increases from 0.917 to 0.926, further confirming its superior accuracy compared to the GELAN model.
To demonstrate the improved detection performance of our model, several images depicting violent behavior were selected for comparative analysis, as illustrated in Figure 12. The visual comparison reveals that, compared to the GELAN model, our model accurately recognizes violent acts and exhibits higher detection accuracy in complex situations, such as those with dim illumination, lack of clarity, or large and complex crowds.

4.9. Misjudgment and Ethical Analysis

We recognize that in real video surveillance systems, violence is rare and atypical compared with ordinary nonviolent activity. It is therefore necessary to analyze how resistant our proposed method is to false alarms, in order to better understand its potential for reducing human intervention in real video surveillance systems.
Figure 13 shows the detection results of our proposed model on the RWF-2000 dataset. Each row in Figure 13 consists of three key frames from video clips with the corresponding prediction labels and probabilities. Some misjudgments can be observed: for example, in Figure 13(3), a passerby without obvious violent behavior is detected as violent; in Figure 13(5), the scuffle between the two individuals on the left is not detected; and in Figure 13(8), the violent behavior is ambiguous yet still marked. Despite these cases, their predicted probabilities are relatively low, and the overall results are quite good. Misjudgments may be caused by low resolution, object occlusion, ambiguous actions, and poor lighting, so we aim to improve the quality of surveillance images in these respects to reduce false positives.
Additionally, false positives are inevitable, and this issue is crucial as it affects the system’s reliability and credibility. When false alarms occur, different outcomes may arise depending on the system design. In most cases, the system will trigger an alarm to notify security personnel, who will then assess the situation to confirm whether violent behavior is present. Some highly secure environments may adopt automatic responses, such as locking doors or initiating recording protocols, but these usually still require human supervision to confirm the threat. In these processes, human involvement remains essential to ensure that false alarms do not lead to unnecessary actions against innocent individuals, thereby maintaining a balance between automatic detection and human intervention.

5. Conclusions

Preventing violence in airports, airplanes, and spacecraft is very important. In this study, we propose a violence detection model named Violence-YOLO, designed to achieve real-time and accurate violence detection in complex environments through surveillance systems, thereby enhancing public safety. Violence-YOLO is built upon the GELAN architecture of YOLOv9 and incorporates a multilayer SimAM attention module into its neck structure. Two new modules, RepNCSPELAN4_GB and RepNCSPELAN4_RGB, are proposed; they combine the Ghost bottleneck of GhostNet and the RepGhost bottleneck of RepGhostNet, respectively. GhostConv replaces the shallow convolution in the backbone network to reduce computational complexity, and an ultra-lightweight Dysample upsampler is introduced into the neck network. Training accuracy is further optimized with Focaler-IoU to mitigate the neglect of simple and difficult samples. Our study thus offers an effective method and novel insights for the real-time detection of violent behavior in unstructured public environments.
Moving forward, it is imperative to enhance the model’s generalization ability to improve its adaptability across diverse scenarios through training and validation on more varied datasets. Additionally, optimizing the model structure and algorithms is crucial to reduce computational complexity and resource consumption to meet real-time detection requirements.

Author Contributions

Conceptualization, W.X. and D.Z.; methodology, W.X. and D.Z.; software, W.X.; validation, W.X.; formal analysis, W.X.; investigation, W.X. and R.D.; data curation, W.X.; writing—original draft preparation, W.X.; writing—review and editing, W.X., D.Z. and A.W.H.I.; funding acquisition, K.Y. and D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the corresponding author on request.

Acknowledgments

This research is in part supported by the Research Centre of Deep Space Explorations, the Hong Kong Polytechnic University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yao, H.; Hu, X. A survey of video violence detection. Cyber-Phys. Syst. 2023, 9, 1–24. [Google Scholar] [CrossRef]
  2. Kumar, P.; Shih, G.L.; Guo, B.L.; Nagi, S.K.; Manie, Y.C.; Yao, C.K.; Arockiyadoss, M.A.; Peng, P.C. Enhancing Smart City Safety and Utilizing AI Expert Systems for Violence Detection. Future Internet 2024, 16, 50. [Google Scholar] [CrossRef]
  3. Wang, Z.; Lei, L.; Shi, P. Smoking behavior detection algorithm based on YOLOv8-MNC. Front. Comput. Neurosci. 2023, 17, 1243779. [Google Scholar] [CrossRef]
  4. Moshayedi, A.J.; Uddin, N.M.I.; Khan, A.S.; Zhu, J.; Emadi Andani, M. Designing and Developing a Vision-Based System to Investigate the Emotional Effects of News on Short Sleep at Noon: An Experimental Case Study. Sensors 2023, 23, 8422. [Google Scholar] [CrossRef]
  5. Singh, A.; Anand, T.; Sharma, S.; Singh, P. IoT based weapons detection system for surveillance and security using YOLOV4. In Proceedings of the 2021 6th International Conference on Communication and Electronics Systems (ICCES), Coimbatre, India, 8–10 July 2021; pp. 488–493. [Google Scholar]
  6. Li, J.; Liu, J.; Li, C.; Jiang, F.; Huang, J.; Ji, S.; Liu, Y. A hyperautomative human behaviour recognition algorithm based on improved residual network. Enterp. Inf. Syst. 2023, 17, 2180777. [Google Scholar] [CrossRef]
  7. Gao, H. A Yolo-based Violence Detection Method in IoT Surveillance Systems. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 143–149. [Google Scholar] [CrossRef]
  8. Moshayedi, A.J.; Roy, A.S.; Kolahdooz, A.; Shuxin, Y. Deep learning application pros and cons over algorithm deep learning application pros and cons over algorithm. EAI Endorsed Trans. AI Robot. 2022, 1. [Google Scholar] [CrossRef]
  9. Khan, M.; El Saddik, A.; Gueaieb, W.; De Masi, G.; Karray, F. VD-Net: An Edge Vision-Based Surveillance System for Violence Detection. IEEE Access 2024, 12, 43796–43808. [Google Scholar] [CrossRef]
  10. Luo, D.; Xue, Y.; Deng, X.; Yang, B.; Chen, H.; Mo, Z. Citrus Diseases and Pests Detection Model Based on Self-Attention YOLOV8. IEEE Access 2023, 11, 139872–139881. [Google Scholar] [CrossRef]
  11. Wang, P.; Wang, P.; Fan, E. Violence detection and face recognition based on deep learning. Pattern Recognit. Lett. 2021, 142, 20–24. [Google Scholar] [CrossRef]
  12. Zhou, X.; Chen, Y.; Zhang, Q. Trajectory Analysis Method Based on Video Surveillance Anomaly Detection. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 1141–1145. [Google Scholar]
  13. Guo, M.F.; Zeng, X.D.; Chen, D.Y.; Yang, N.C. Deep-learning-based earth fault detection using continuous wavelet transform and convolutional neural network in resonant grounding distribution systems. IEEE Sens. J. 2017, 18, 1291–1300. [Google Scholar] [CrossRef]
  14. Barros, F.; Aguiar, S.; Sousa, P.J.; Cachaço, A.; Tavares, P.J.; Moreira, P.M.; Ranzal, D.; Cardoso, N.; Fernandes, N.; Fernandes, R.; et al. Displacement monitoring of a pedestrian bridge using 3D digital image correlation. Procedia Struct. Integr. 2022, 37, 880–887. [Google Scholar] [CrossRef]
  15. Khan, M.; Gueaieb, W.; El Saddik, A.; De Masi, G.; Karray, F. An efficient violence detection approach for smart cities surveillance system. In Proceedings of the 2023 IEEE International Smart Cities Conference (ISC2), Bucharest, Romania, 24–27 September 2023; pp. 1–5. [Google Scholar]
  16. Ramzan, M.; Abid, A.; Khan, H.U.; Awan, S.M.; Ismail, A.; Ahmed, M.; Ilyas, M.; Mahmood, A. A review on state-of-the-art violence detection techniques. IEEE Access 2019, 7, 107560–107575. [Google Scholar] [CrossRef]
  17. Liu, G.; Wang, Z.; Zhang, H.; Guo, X.; Wang, Y.; Zhang, C. A novel violent video detection method based on improved C3D and transfer learning. In Proceedings of the CIBDA 2022; 3rd International Conference on Computer Information and Big Data Applications, Wuhan, China, 25–27 March 2022; pp. 1–7. [Google Scholar]
  18. Fenil, E.; Manogaran, G.; Vivekananda, G.; Thanjaivadivel, T.; Jeeva, S.; Ahilan, A. Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM. Comput. Netw. 2019, 151, 191–200. [Google Scholar]
  19. Singh, K.; Rajora, S.; Vishwakarma, D.K.; Tripathi, G.; Kumar, S.; Walia, G.S. Crowd anomaly detection using aggregation of ensembles of fine-tuned convnets. Neurocomputing 2020, 371, 188–198. [Google Scholar] [CrossRef]
  20. Accattoli, S.; Sernani, P.; Falcionelli, N.; Mekuria, D.N.; Dragoni, A.F. Violence detection in videos by combining 3D convolutional neural networks and support vector machines. Appl. Artif. Intell. 2020, 34, 329–344. [Google Scholar] [CrossRef]
  21. Magdy, M.; Fakhr, M.W.; Maghraby, F.A. Violence 4D: Violence detection in surveillance using 4D convolutional neural networks. IET Comput. Vis. 2023, 17, 282–294. [Google Scholar] [CrossRef]
  22. Waddenkery, N.; Soma, S. An efficient convolutional neural network for detecting the crime of stealing in videos. Entertain. Comput. 2024, 51, 100723. [Google Scholar] [CrossRef]
  23. Polverino, L.; Abbate, R.; Manco, P.; Perfetto, D.; Caputo, F.; Macchiaroli, R.; Caterino, M. Machine learning for prognostics and health management of industrial mechanical systems and equipment: A systematic literature review. Int. J. Eng. Bus. Manag. 2023, 15, 18479790231186848. [Google Scholar] [CrossRef]
  24. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  25. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  26. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  27. Balakrishnan, T.; Sengar, S.S. RepVGG-GELAN: Enhanced GELAN with VGG-STYLE ConvNets for Brain Tumour Detection. arXiv 2024, arXiv:2405.03541. [Google Scholar]
  28. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  29. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar]
  30. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  31. Firdiantika, I.M.; Lee, S.; Bhattacharyya, C.; Jang, Y.; Kim, S. EGCY-Net: An ELAN and GhostConv-Based YOLO Network for Stacked Packages in Logistic Systems. Appl. Sci. 2024, 14, 2763. [Google Scholar] [CrossRef]
  32. Chen, C.; Guo, Z.; Zeng, H.; Xiong, P.; Dong, J. Repghost: A hardware-efficient ghost module via re-parameterization. arXiv 2022, arXiv:2211.06088. [Google Scholar]
  33. Niu, K.; Yan, Y. A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Images. In Proceedings of the 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), Hangzhou, China, 27–29 October 2023; pp. 57–60. [Google Scholar]
  34. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 6027–6037. [Google Scholar]
  35. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  36. Cai, Z.; Neher, H.; Vats, K.; Clausi, D.A.; Zelek, J. Temporal hockey action recognition via pose and optical flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  37. Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
Figure 1. Network structure of GELAN-C.
Figure 2. Network structure of Violence-YOLO.
Figure 3. The processes of Conv (a) and GhostConv (b).
Figure 4. Ghost bottleneck. Left: Ghost bottleneck with stride = 1; Right: Ghost bottleneck with stride = 2.
Figure 5. The structure of RepGhost bottleneck in training (a) and in reasoning (b).
Figure 6. The structure of RepNCSPELAN4_GB and RepNCSPELAN4_RGB.
Figure 7. Dysample network structure: (a) sampling-based dynamic upsampling; (b) sampling point generator in DySample. The input feature, upsampled feature, generated offset, and original grid are denoted by χ, χ′, O, and G, respectively. σ denotes the sigmoid function, sH the sampled height, sW the sampled width, and gs² the number of channels after the feature map passes through the linear layer.
Figure 8. The connection pattern diagram of the SimAM network.
Figure 9. Violence detection data set: first row taken from the Hockey Dataset, second row taken from the RWF-2000 dataset, and third row taken from MixUp-augmented data.
Figure 10. Confusion matrix comparison on the Violence data set. Left: confusion matrix of GELAN-C; right: confusion matrix of Violence-YOLO.
Figure 11. Comparison of PR curves on the Violence data set. Left: PR curves of GELAN-C; right: PR curves of Violence-YOLO.
Figure 12. Comparison of detection performance on the Violence data set (upper: detection results of GELAN-C; lower: detection results of Violence-YOLO).
Figure 13. Detection results of the proposed violence detection model (Violence-YOLO) on the RWF-2000 dataset. The first two rows are taken from key frames in violent videos where our model correctly predicts the presence of violence. The third row, taken from key frames in nonviolent videos, suffers from mispredictions, where large crowds and low-quality surveillance footage may lead to incorrect predictions. Panels 1–3, 4–6, and 7–9 show three key frames from three different videos, respectively.
Table 1. Performance comparison of lightweight modules at different positions in the algorithm on Violence data set.

Module | Algorithms | mAP50/% | mAP50-95/% | Parameters/M | FLOPs/G | FPS
Baseline | GELAN-C | 91.7 | 74.9 | 25.2 | 101.8 | 47.6
GhostConv | One | 92.5 | 75.3 | 25.19 | 100 | 48.5
GhostConv | All | 92.3 | 75.2 | 25.19 | 99.8 | 46.7
RepNCSPELAN4_GB | BackBone | 91.7 | 74.6 | 23.9 | 96.1 | 49.2
RepNCSPELAN4_GB | All | 91.5 | 74.6 | 22.1 | 90.4 | 49.5
RepNCSPELAN4_RGB | BackBone | 91.8 | 74.4 | 24 | 96.6 | 47.3
RepNCSPELAN4_RGB | All | 91.5 | 73.9 | 22.3 | 91.3 | 44.4
RepNCSPELAN4_RGB | BackBone_RRG2 | 91.9 | 74.9 | 25 | 98.7 | 43.3
RepNCSPELAN4_RGB | BackBone_RRG3 | 91.5 | 74.1 | 24.6 | 97.0 | 42.9
GhostConv + RepNCSPELAN4_GB + RepNCSPELAN4_RGB | Ours | 91.6 | 74.2 | 22.1 | 88.9 | 48.3
Table 2. Comparative experiments with different attention mechanisms on Violence data set.

Models | mAP50/% | mAP50-95/% | Parameters/M | FLOPs/G | FPS
GD-GELAN + SimAM | 91.9 | 74.6 | 22.1 | 88.9 | 48.5
GD-GELAN + ECA | 91.5 | 73.9 | 22.1 | 88.9 | 49.7
GD-GELAN + SKA | 92.3 | 75.2 | 44.2 | 159.5 | 37.3
GD-GELAN + FocalModulation | 91.0 | 73.0 | 23.2 | 92.9 | 46.3
GD-GELAN + SE | 91.8 | 75 | 22.1 | 88.9 | 49
Table 3. Ablation experiments with the modules on Violence data set. The checkmarks (✓) represent substitutions or overlays with our improvement modules on the GELAN-C model.

Baseline | Ghost Lightweight Modules | Dysample | SimAM | Focaler-IoU | mAP50/% | mAP50-95/% | Parameters/M | FLOPs/G | FPS
GELAN-C |   |   |   |   | 91.7 | 74.9 | 25.2 | 101.8 | 47.6
GELAN-C | ✓ |   |   |   | 91.6 | 74.2 | 22.1 | 88.9 | 48.0
GELAN-C | ✓ | ✓ |   |   | 91.8 | 74.3 | 22.1 | 88.9 | 48.0
GELAN-C | ✓ | ✓ | ✓ |   | 91.9 | 74.6 | 22.1 | 88.9 | 48.5
GELAN-C | ✓ | ✓ | ✓ | ✓ | 92.6 | 75 | 22.1 | 88.9 | 47.8
Table 4. Performance comparison of different detection models on Violence data set.

Models | mAP50/% | mAP50-95/% | Parameters/M | FLOPs/G | FPS
YOLOv8l | 84.7 | 69.3 | 43.6 | 164.8 | 42.6
YOLOv8m-World | 83.5 | 66.3 | 29 | 89.9 | 57.8
YOLOv5m | 82.4 | 62.8 | 25 | 64 | 73.5
YOLOv3-tiny | 77.6 | 53.2 | 12.1 | 18.9 | 192.3
RT-DETR-L | 90.4 | 75.4 | 31.9 | 108 | 63.3
GELAN-C | 91.7 | 74.9 | 25.2 | 101.8 | 47.6
YOLOv9-C | 90.6 | 70.6 | 50.7 | 236.6 | 24.8
Ours | 92.6 | 75.0 | 22.1 | 88.9 | 47.8
Table 5. Performance comparison of different detection models on Pascal VOC dataset.

Models | P/% | mAP50/% | mAP50-95/% | Parameters/M | FLOPs/G | FPS
YOLOv8l | 76.2 | 79.4 | 61.7 | 43.6 | 164.9 | 55.8
YOLOv8m-World | 81.8 | 78.4 | 60.2 | 29 | 99.4 | 74.6
YOLOv5m | 75.1 | 77.5 | 58.1 | 25.1 | 64 | 95.2
GELAN-C | 77.3 | 80.7 | 62.6 | 25.2 | 101.9 | 62.5
Ours | 78.3 | 80.4 | 62.6 | 22.2 | 89.0 | 63.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


