Article

Effective Strategies for Enhancing Real-Time Weapons Detection in Industry

by Ángel Torregrosa-Domínguez *, Juan A. Álvarez-García, Jose L. Salazar-González and Luis M. Soria-Morillo
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, 41012 Sevilla, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8198; https://doi.org/10.3390/app14188198
Submission received: 25 June 2024 / Revised: 7 September 2024 / Accepted: 9 September 2024 / Published: 12 September 2024
(This article belongs to the Special Issue Applications of Artificial Intelligence in Industrial Engineering)

Abstract
Gun violence is a global problem that affects communities and individuals, posing challenges to safety and well-being. The use of autonomous weapons detection systems could significantly improve security worldwide. Despite notable progress in the field of closed-circuit television-based weapons detection systems, several challenges persist, including real-time detection, improved accuracy in detecting small objects, and reducing false positives. This paper, based on our extensive experience in this field and successful private company contracts, presents a detection scheme comprising two modules that enhance the performance of a renowned detector while having a low negative impact on the inference time. Additionally, a scale-matching technique is utilised to enhance the detection of weapons with a small aspect ratio. The experimental results demonstrate that the scale-matching method enhances the detection of small objects, with an improvement of +13.23 in average precision compared to not using this method. Furthermore, the proposed detection scheme reduces the total number of false positives of the baseline model by 71%, while maintaining a low inference time (34 frames per second on an NVIDIA GeForce RTX-3060 card at 720p resolution) compared to the baseline model (47 frames per second).

1. Introduction

Gun violence is a pervasive phenomenon in the daily lives of numerous people worldwide. This issue transcends borders and affects communities, families, and individuals, generating a significant impact on the security, well-being, and tranquillity of societies. The proliferation and easy access to firearms, coupled with social, economic, and cultural factors, contribute to the persistence of this global problem. Thanks to numerous technological advancements and the widespread adoption of deep learning, an increasing number of aspects of our daily lives are now being encompassed by artificial intelligence.
In the context of weapons detection, the majority of the collected data show weapons close to the camera (foreground). In the field of security, the widespread use of closed-circuit television (CCTV) systems makes the utilisation of such datasets inappropriate, as they fail to capture real-case scenarios. This is due to the fact that cameras within these systems are typically installed at high altitudes, resulting in smaller-scale objects in the video stream. Salazar et al. [1] mentioned that a handgun can occupy around 3% of the total image size. Additionally, environmental factors such as ambient lighting or the time of day can reduce image quality, causing images to become noisy and objects to blend with their surroundings.
Another issue is that CCTV surveillance requires at least one person to constantly monitor one or more environments simultaneously, leading to the possibility of missed threats and delayed responses to hazards. For example, Velastin et al. [2] noted that even trained professional operators often fail to detect objects in surveillance videos after 20 min of continuous monitoring. Therefore, combining human attention with a real-time detection system that identifies threatening objects can be highly effective and valuable.
While there is considerable research on weapons detection, only a few of these solutions are applied in practice. When deployed in real-world environments, multiple problems emerge, including challenges in achieving real-time inference, insufficient accuracy, and, most importantly, high false positive rates. Despite these challenges, several companies already offer commercially available deep learning-based weapons detection services (https://zeroeyes.com/, https://www.scylla.ai/, https://www.omnilert.com/, https://actuate.ai/).
This paper conducts a comparative analysis of various versions of real-time firearms detection, drawing on our experience with a real customer who utilised our solution. The analysis employs a method called Scale Match to address the scale mismatch between the datasets used. Additionally, a straightforward detection scheme is designed to decrease the number of false positives by introducing a larger secondary classifier and a temporal window. Importantly, these modifications have a minimal impact on the inference time of the system. The goal of this work is to offer guidance on enhancing the performance of models trained for weapons detection when deployed in production environments.
The main contributions presented in this paper are as follows:
  • A dataset has been released to test the effectiveness of models in real-world situations against CCTV systems (https://deepknowledge-us.github.io/DISARM-dataset/, accessed on 5 February 2024). The dataset comprises intricate images containing weapons that are difficult to identify, simpler images with detectable weapons, and a substantial number of images showing objects that are not weapons but could be classified as such in both complex and simple scenarios.
  • Both the detection of small-aspect-ratio weapons (with an area less than or equal to 100 px²) and the reduction in false positives are improved using a method called Scale Match. Specifically, it yields +27% average precision (AP) and −17% false positives (FP).
  • A new detection scheme is proposed to aid the base model in detecting weapons. With this scheme, false positives are reduced by 70% when relying solely on the second classifier and by 100% when using either the temporal window or both modules, with only a minor impact on inference time (−13 frames per second (FPS)). It is important to note that relying solely on the temporal window leads to a reduction in precision.
  • A comparative analysis is conducted on multiple versions of the You Only Look Once (YOLO) detectors, namely YOLOv5, YOLOv7, and YOLOv8, for real-time weapons detection. The study involves the execution of extensive experiments on both publicly available datasets, such as the Gun-Dataset [3], YouTube-GDD [4], HOD Dataset [5], and usdataset [1], as well as datasets generated specifically for this research.
The remainder of the paper is structured as follows. Section 2 reviews the existing works related to weapons detection, small object detection, and real-time detection. In Section 3, the methodology used is explained. Both the experiments and their results are shown in Section 4. In Section 5, some key points are derived from the results. Finally, conclusions are drawn in Section 6.

2. Literature Review

2.1. Weapons Detection in Security Monitoring Environments

Numerous methods for weapons detection have been proposed, with object detection algorithms recently gaining popularity among researchers. Olmos et al. [6] suggested an automatic handgun detector using a Convolutional Neural Network (CNN) classifier with the Region Proposal Network (RPN) approach, confirming its effectiveness by comparing it to the Sliding Windows method. In a subsequent work [7], they proposed the use of two symmetric IP cameras to eliminate background objects, reducing the number of false positives.
Romero and Salamea [8] designed a detection scheme divided into two stages. In the first stage, they detected people using YOLO [9], and in the second stage, they detected handguns on them using a CNN. Other methods utilising the YOLO detector have been developed. Pang et al. [10], using YOLOv3, created a detection system for millimetre-wave weapons. Wang et al. [11] modified the YOLOv4 backbone by implementing a spatial module to enable real-time autonomous weapons detection in CCTV. Ahmed et al. [12] proposed using Scaled-YOLOv4 for improving weapon detection and applied the TensorRT network optimiser for deployment on an edge computing device.
Other studies have employed different methods. For instance, Castillo et al. [13] developed a real-time detector for monitoring videos that focuses on cold weapons such as kitchen knives, daggers, and machetes, among others. Additionally, the authors improved the model’s robustness to varying lighting conditions by applying several pre-processing algorithms. Pérez-Hernández et al. [14] proposed a two-level detection methodology to minimise false positive rates in handgun detection. Similarly, Salido et al. [15] proposed including pose information in order to reduce the false positive rates on different models. Ashraf et al. [16] applied the Gaussian blur algorithm to remove the image background, focusing on weapons detection and minimising both false negative and false positive rates. Likewise, Goenka and Sitara [17] applied the Gaussian deblur technique to pre-process the images and improve the features of the weapons.
Furthermore, the work of Salazar et al. [1] played a pivotal role in this domain by providing the first publicly available database, to our knowledge, based on a CCTV environment. This contribution has significantly enhanced the performance of detectors in such scenarios, and has been used in multiple contexts even to check the computation needed in computer vision [18]. Recently, some research has followed a similar approach. On one hand, Hnoohom et al. [19] introduced a new CCTV-like outdoor dataset and utilised a tiling technique to detect small weapon objects. Similarly, Berardini et al. [20] presented another CCTV-like indoor dataset and deployed two models on an edge device using the same detection scheme as Romero and Salamea [8].
Like Romero and Salamea, we advocate for applying a detection scheme. However, our approach differs, as our detection scheme is employed to support the model in more complex cases, allowing for a reduction in the number of false positives without adversely affecting the inference time. Furthermore, as can be seen in Table 1, there are few models that utilise the same dataset, and most of them do not have real-case scenarios that reflect a CCTV system. Moreover, each model uses a different experimental setup, resulting in some discrepancies. Given the heterogeneity of datasets and evaluation methods, as well as the fact that many models are not publicly available, with several datasets being private or restricted, we have created a near-real-world CCTV image dataset, of which the test subset is publicly available. The full dataset is available on request. We also provide our model’s trained weights on request.

2.2. Small Object Detection

Objects in an image are typically categorised as large, medium, or small based on their physical size or spatial extent within the image. For instance, COCO defines these three categories in terms of object area [21]. Unlike large and medium-sized objects, small objects contain fewer pixels and less semantic information, and noise from compression, movement, and occlusion has a more pronounced impact on them than on larger objects. These challenges lead to poor detection results for small objects in existing models. Small object detection is currently applied in specific scenarios, such as traffic sign detection [22,23], person detection through aerial images [24], or weapons detection.
Many domain-specific proposals employ various object detection algorithms, such as Faster R-CNN, with different backbones or modifications. For example, in the traffic sign detection domain, Li et al. [25] applied a generative adversarial network, and in the aerial detection domain, Hong et al. [26] implemented various modules in Faster R-CNN for detecting people in aerial images. Yu et al. [27] presented a method called Scale Match to reduce the scale mismatch between the pre-training dataset (COCO) and the fine-tuning dataset (TinyPerson dataset), improving the detection of people in aerial images. Zhang et al. [28] suggested adding a deconvolution layer after the last convolution layer in Faster R-CNN to enhance the detection of small objects in remote sensing images.
Lin et al. [29] introduced a new feature extractor called Feature Pyramid Network (FPN), which has been adopted in various studies to extend other architectures. Li and Yang [30] modified YOLOv2 using FPN, and Liang et al. [31] proposed a similar modification for Faster R-CNN. Pérez-Hernández et al. [14] implemented a classifier using diverse binarisation techniques, such as One-Versus-All and One-Versus-One, to enhance the detection of small objects.
For weapons detection, Salazar et al. [1] applied FPN with Faster R-CNN to detect handguns in CCTV surveillance, while Wang et al. [11] implemented a receptive field enhancement module in YOLOv4 to improve the model’s perceptiveness in the region of interest.
In this paper, Yu et al.’s technique, along with other methods, is employed to address the scale mismatch between different datasets for weapons detection. Additionally, two datasets are generated based on two versions of this technique with the goal of improving the precision of weapons detection in CCTV systems.

2.3. Real-Time Detection

Detecting various types of objects is a crucial aspect in many real-world applications, especially in industries such as surveillance [20] or industrial production [32]. The two most renowned real-time inference-focused detectors are YOLO [9] and Single Shot Multibox Detector (SSD) [33]. YOLO is notable for the numerous updates it has received from the community through various versions [34,35,36,37,38,39,40]. Additionally, there are several modifications of specific versions that aim to improve efficiency [41,42,43], accuracy [44,45], or both [46,47,48]. As for SSD, Wang et al. [49] integrated a parallel network and a bi-directional FPN (BiFPN) to enhance SSD.
In alternative approaches, some researchers strive to build more efficient backbones on top of existing ones [50,51,52,53], while others prune the detector to enhance efficiency while maintaining accuracy [53,54]. Since the introduction of vision transformers [55], researchers have proposed efficient transformers, such as the work of Carion et al. [56] or the work of Liu et al. [57]. Wang et al. [58] suggested the use of partial dense blocks and partial transition layers, replacing the FPN with an Exact Fusion Model (EFM) to make existing methods more efficient. Similarly, Tan et al. [59] implemented a BiFPN and a compound scaling method to improve both the accuracy and efficiency of EfficientNet [60]. On the other hand, Sun et al. [61] proposed their detector, called Sparse R-CNN, which replaces the RPN with a small set of proposal boxes and removes Non-Maximum Suppression (NMS) from the post-processing stage.
Given that real-time detection is a key factor in our work, the well-known one-stage detector YOLO was utilised. YOLO has been widely employed in various areas and has undergone continuous improvements in recent years. Each new version of the architecture has been modified to strive for better performance while maintaining or reducing the inference time.

3. Materials and Methods

This section outlines the complete experimentation for real-time weapons detection using CCTV. Initially, multiple weapons detection datasets were acquired and processed, involving the creation of new datasets and modifications to existing ones (Section 3.1). To fulfil the objective of real-time detection, we opted for the one-stage detector, YOLO. Experimentation involved testing several models from various versions of the YOLO family to identify the optimal model that achieves high accuracy while minimising inference duration (Section 3.2).
To address the significant scale disparity issue in the primary dataset, where the scale difference between the weapon object images and the CCTV camera images was considerable, we employed the Scale Match method [27]. A modification was applied to the primary dataset using the Scale Match technique, facilitating seamless utilisation of the dataset with diverse models. This modification ensures consistent and precise weapons detection, regardless of scale deviations (Section 3.3).
To tackle the challenge of reducing the false positive rate and enhancing accuracy in complex scenarios where the YOLO detector encounters difficulties, a comprehensive pipeline is developed in Section 3.4. This pipeline incorporates a secondary cascaded classifier (Section 3.4.1) and a temporal window approach (Section 3.4.2).

3.1. Datasets

High-quality datasets are essential for obtaining an optimal model. Given the focus of this research on weapons detection, various datasets have been examined. Our objective is to obtain a comprehensive array of datasets and identify the most effective datasets or combinations of datasets that provide the most favourable outcomes in detecting weapons on CCTV systems. As a result of this search, the following weapons detection datasets were found (an overview can be seen in Table 2, along with some examples in Figure 1):
  • usdataset [1]. This dataset was created at the University of Seville, capturing a simulated attack through a CCTV system. It encompasses two different scenarios; both cam 1 and cam 7 are directed towards a corridor with a door on one side and two exits in the background, experiencing minimal lighting variations, as they are indoors. Cam 5 is focused on an entrance, introducing ambient lighting and resulting in some lighting variations. The dataset comprises a total of 5149 images at 1080p resolution, featuring 1520 instances of guns.
  • usdataset_Synth [1]. This dataset is generated through simulation using the Unity engine. Gun annotations were performed automatically by the engine, yielding a substantial number of annotated images without the need for human annotation. The dataset is presented in three versions based on the number of images included: U0.5, U1.0, and U2.5 (500, 1000, and 2500 images, respectively).
  • Gun-Dataset [3]. This dataset consists of over 51,889 labelled images containing weapons. For our purposes, we use 46,242 labelled images, with the majority of weapons positioned in the foreground.
  • YouTube-GDD [4]. This dataset comprises images extracted from YouTube videos. It includes two classes, gun and person, but our experiments focus solely on the gun class. The dataset contains 5000 images at a resolution of 720p, featuring 16,064 instances of weapons.
  • HOD-Dataset [5]. Utilised to provide the detector with background images of objects that may resemble weapons, this dataset includes over 12,800 images of handheld objects. In our case, only five classes from the original dataset are used: mobile phone, keyboard, calculator, cup, mobile HDD, resulting in a total of 4200 images.
Based on the insights from [11], a novel dataset named the Disarm-Dataset has been curated by amalgamating the majority of the previously described datasets. This dataset was further expanded with the inclusion of two new datasets, derived from videos recorded by a CCTV-like system. The first dataset encompasses both weapons and other objects that may resemble weapons, while the second dataset (FP-Dataset) intentionally excludes weapons to mitigate false positives. Consequently, the dataset incorporates over 62,000 images and nearly 65,000 instances of weapons (refer to Table 2 for an overview of this dataset). Additionally, a modification is applied by using usdataset_v2 instead of usdataset, where both cam 5 and cam 7 are utilised for training and validation, while cam 1 is reserved for testing. This modification aims to enhance the training process by incorporating the challenging scenarios from cam 5 into the detector's learning process. Furthermore, it can be seen that most of the images in each dataset, if not all, are captured in scenes under high ambient light, with Gun-Dataset and Disarm-Dataset being the datasets with the highest percentage of darker scenes. Note that in Section 4.3.2, we experiment with the performance of our model on darker scenes, as this type of scenario is one of the most common in the real world.

3.2. Weapon Detector

In this study, YOLOv5, YOLOv7, and YOLOv8 versions have been employed. These versions were chosen as they are the most up-to-date to our knowledge and exhibit a favourable balance between accuracy and inference time. Table 3 provides details on the number of parameters for each model, along with the average time required to complete an epoch during training. The objective is to identify the optimal YOLO model for weapons detection by evaluating the trade-off between the model’s weapons detection performance and its inference time.

3.3. Scale Match

When dealing with CCTV systems, where cameras are typically installed at a considerable height, weapons may exhibit a reduced aspect ratio. The utilisation of most public datasets might not be suitable, as the majority of images feature weapons in the foreground, deviating from real-world scenarios. Although Salazar et al. [1] contributed to the development of a CCTV-based dataset, it might not be adequate. Therefore, the public datasets outlined in Section 3.1 are employed.
A notable challenge arises from the fact that, since most images depict weapons in close proximity to the camera lens, the detector's ability to recognise smaller weapons may be compromised without appropriate adjustments. Yu et al. [27] introduced an approach called Scale Match to enhance the accuracy of detectors in detecting small objects by addressing the scale mismatch between the pre-training dataset and the TinyPerson dataset. This approach involves taking a target dataset D and an additional dataset E. By estimating the object size distributions of both datasets, denoted as P_size(D) and P_size(E), respectively, the method transforms the size distribution of the additional dataset to align with that of the target dataset, as illustrated below:
P_size(T(E)) ≈ P_size(D)
where T is the scale transform. Initially, one could scale solely the bounding boxes of the images, but this would compromise the image's composition. Therefore, the complete image is resized so that the average size of the objects in the picture equals a sampled size s. This study utilises a variation introduced by Yu et al. that applies a monotone function (Monotone Scale Match) to prevent extreme cases in which a very tiny object generates an exceedingly large image or vice versa. The method is as follows. First, P_size(D) and P_size(E) are estimated with a rectified histogram. Then, for each image in E, the mean size ŝ of its bounding boxes is calculated. Subsequently, a sample size s is drawn from P_size(D), and the image is resized using a monotonic mapping so that its mean object size becomes s. Furthermore, another method was devised in which, instead of up-scaling or down-scaling the images, the bounding box of the object is extracted, re-scaled, and pasted into another image containing no weapons. The experiments in Section 4.2.2 will determine whether the use of Scale Match is advantageous for this particular case and which variant fits best.
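For illustration, a simplified Python sketch of this resizing step is given below. It is an approximation under our own assumptions (the helper names, the sampling of target sizes, and the ratio clamping that stands in for the monotone constraint are illustrative), not the original Scale Match implementation.

```python
import random
import cv2
import numpy as np

def mean_bbox_size(bboxes):
    """Mean object size (square root of the mean box area) in one image.
    bboxes: list of (x, y, w, h) tuples in pixels."""
    areas = [w * h for (_, _, w, h) in bboxes]
    return float(np.sqrt(np.mean(areas)))

def scale_match_image(image, bboxes, target_sizes, min_ratio=0.25, max_ratio=4.0):
    """Resize a whole extra-dataset image so that its mean object size matches
    a size sampled from the target (CCTV) dataset; the ratio is clamped to
    avoid extreme up- or down-scaling (a stand-in for the monotone variant)."""
    s_hat = mean_bbox_size(bboxes)          # current mean object size in the image
    s = random.choice(target_sizes)         # size sampled from P_size(D)
    ratio = float(np.clip(s / s_hat, min_ratio, max_ratio))
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(w * ratio), int(h * ratio)))
    new_bboxes = [(x * ratio, y * ratio, bw * ratio, bh * ratio)
                  for (x, y, bw, bh) in bboxes]
    return resized, new_bboxes
```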

3.4. Detection Scheme

Detectors may face challenges in accurately identifying weapons in complex scenarios, primarily stemming from the detector’s diminished confidence in detecting specific weapons, despite its capability to identify a diverse range of weapons in various positions and sizes. Consequently, certain weapons may be erroneously filtered out. Another challenge arises from the prevalence of false positives in detecting weapons amidst everyday objects like mobile phones, keyboards, calculators, cups, and portable hard drives. Thus, a two-step detection scheme that integrates a Secondary Classifier (Section 3.4.1) module and a Temporal Window (Section 3.4.2) module into the baseline detector is implemented, as depicted in Figure 2.
The procedure unfolds as follows. First, a frame is read and passed to the detector, which produces predictions. For each prediction, the confidence of the prediction is scrutinised and compared to a threshold. If the prediction falls below this threshold, the image is cropped around the area of the prediction and forwarded to the secondary classifier. If not, it proceeds to a temporal window. The secondary classifier generates a new confidence score for the cropped image, which is compared to another threshold distinct from the first one. If the confidence does not surpass this threshold, the prediction is discarded. Conversely, if it does, the initial confidence is updated with the new value, and the prediction undergoes a temporal window that filters the predictions. Details about both the secondary classifier and the temporal window are explained in the subsequent sections.
It is important to note that this scheme is applicable to all types of detectors; as both modules are completely independent, one can be used without the other and vice versa, with the output of the previous component serving as input, but this study concentrates on its implementation with YOLO detectors. Section 4.3.1 will present objective experimentation to assess the effectiveness of these modules in the context of YOLO detectors and identify the optimal combination.
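A minimal per-frame sketch of this scheme is shown below. The thresholds and component interfaces (detector, secondary_classifier, temporal_window) are illustrative assumptions rather than the exact implementation.

```python
DETECTOR_THRESHOLD = 0.5      # illustrative confidence thresholds
CLASSIFIER_THRESHOLD = 0.7

def process_frame(frame, detector, secondary_classifier, temporal_window):
    """One pass of the two-module detection scheme (simplified sketch)."""
    confirmed = []
    for (x1, y1, x2, y2, conf) in detector(frame):
        if conf < DETECTOR_THRESHOLD:
            # Low confidence: let the larger secondary classifier re-score the crop.
            crop = frame[int(y1):int(y2), int(x1):int(x2)]
            new_conf = secondary_classifier(crop)   # probability of "weapon"
            if new_conf < CLASSIFIER_THRESHOLD:
                continue                            # discard the prediction
            conf = new_conf                         # keep it with the updated score
        # High-confidence (or rescued) predictions go through the temporal window,
        # which only confirms boxes that persist across consecutive frames.
        if temporal_window.update((x1, y1, x2, y2), conf):
            confirmed.append((x1, y1, x2, y2, conf))
    return confirmed
```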

3.4.1. Secondary Classifier

One of the major challenges encountered was the significant number of false positives obtained during detection, especially in the presence of objects that could be mistaken for weapons, such as mobile phones or keyboards. To enhance confidence in detecting small weapons and effectively filter out these false positives, a secondary classifier was implemented. This classifier comes into play when the first detector exhibits low confidence in its detection.
In this setup, the detection with low confidence from the primary detector is forwarded to the secondary, more robust classifier, which determines whether the identified object is a weapon or some other item. For this secondary classifier, state-of-the-art classifiers in the well-known ImageNet-1k dataset [62] were considered. This dataset encompasses diverse classes, including representations of weapons.
The secondary classifier was tested with state-of-the-art architectures such as CAFormer [63] and EVA02 [64], both renowned for their robust performance on the ImageNet-1k dataset. CAFormer employs a hybrid architecture that integrates convolutional and attention mechanisms without conventional activation functions, demonstrating proficiency in extracting both local and global features. This capability potentially enhances its ability to differentiate between weapons and non-weapons, particularly in challenging scenarios where the primary detector may encounter difficulties. In contrast, EVA02 employs advanced vision transformers, which facilitate superior contextual comprehension and granular classification capabilities. This results in the accurate identification of weapons, even in complex visual environments where visually similar objects could otherwise be misclassified. The integration of these architectures aims to reduce the number of false positives while maintaining an acceptable level of detection for real weapons. Moreover, both CAFormer and EVA02 have been designed with computational efficiency as a key consideration, thereby potentially limiting the impact on overall system performance.
Despite achieving high accuracy, ImageNet-1k has an extensive array of classes that are unnecessary. Consequently, a modified version of this dataset was created to fine-tune the model for improved accuracy in weapons detection and to reduce the occurrence of false positives. This dataset includes specific classes, each having an equal number of images: assault gun, mobile phone, computer keyboard, laptop, computer mouse, and water bottle.
It is noteworthy that attempts were made initially using both the full dataset and a dataset containing only weapons. However, both versions, while demonstrating reasonable confidence in detecting weapons, lacked the ability to effectively filter out false positives.
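As an indicative sketch of how such a classifier could be fine-tuned on the reduced six-class dataset, the snippet below uses the timm library; the model name and training loop are assumptions, and an EVA02 backbone could be substituted for CAFormer.

```python
import timm
import torch
from torch import nn

NUM_CLASSES = 6  # assault gun, mobile phone, keyboard, laptop, mouse, water bottle

# Assumption: a CAFormer variant such as "caformer_s18" is available in the
# installed timm version; an EVA02 model could be used instead.
model = timm.create_model("caformer_s18", pretrained=True, num_classes=NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """Single fine-tuning step on a batch of cropped detections."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```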

3.4.2. Temporal Window

To further mitigate the number of false positives and ensure confidence consistency across frames, a temporal window was implemented. The process of this temporal window is illustrated in Figure 3. The primary objective of this temporal window is to ensure that the majority of detections involve weapons that consistently appear, rather than objects that may coincidentally resemble weapons.
Each prediction passing through this temporal window is compared with the other stored bounding boxes using a distance metric (Euclidean distance). The bounding boxes are stored in a dictionary, where each key represents the coordinates of a bounding box, and the comparison is performed in two stages. First, the stored bounding boxes are filtered and ordered based on a predetermined maximum distance. Then, among the top five closest bounding boxes, the similarity of each bounding box is evaluated based on its area. If similarity is identified, the temporal window of that bounding box is updated by increasing the number of frames the object has appeared in. If no similarity is found between the bounding box and the top five closest bounding boxes, a new temporal window is generated for that bounding box, and the bounding box is not displayed.
Furthermore, the temporal window incorporates a countdown mechanism and a confidence history, which are utilised to oversee the life cycle of each bounding box. The countdown mechanism involves decrementing a counter associated with each bounding box, which discards the bounding box once the counter reaches zero if no further detections are made. Conversely, the confidence history stores the confidence of each bounding box in order to calculate the mean value among similar bounding boxes. It should be noted that an object must appear in more than a specified number of frames and have a mean confidence exceeding a defined threshold in order for it to be displayed. This guarantees that only consistent detections are retained, thereby excluding spontaneous detections.
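The sketch below illustrates this logic in simplified form; the distance, area-similarity, frame-count, and countdown parameters are placeholders rather than the values used in the experiments.

```python
import numpy as np

class TemporalWindow:
    """Confirms a detection only if a similar box persists across frames
    (simplified sketch of the module described above)."""

    def __init__(self, min_frames=3, min_conf=0.5, max_dist=50.0,
                 area_tol=0.5, countdown=10):
        self.min_frames = min_frames   # frames required before displaying a box
        self.min_conf = min_conf       # mean confidence required for display
        self.max_dist = max_dist       # maximum centre distance to match boxes
        self.area_tol = area_tol       # relative area difference allowed
        self.countdown = countdown     # frames a track survives without detections
        self.tracks = {}               # box tuple -> {"count", "ttl", "confs"}

    @staticmethod
    def _centre(box):
        x1, y1, x2, y2 = box
        return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

    @staticmethod
    def _area(box):
        x1, y1, x2, y2 = box
        return max(x2 - x1, 1e-6) * max(y2 - y1, 1e-6)

    def update(self, box, conf):
        """Return True if the box should be displayed in the current frame."""
        dist = lambda k: np.linalg.norm(self._centre(k) - self._centre(box))
        # Filter stored boxes by maximum distance and keep the five closest.
        candidates = sorted((k for k in self.tracks if dist(k) <= self.max_dist),
                            key=dist)[:5]
        for key in candidates:
            # Compare areas; if similar enough, extend the existing track.
            if abs(self._area(key) - self._area(box)) / self._area(key) <= self.area_tol:
                track = self.tracks.pop(key)
                track["count"] += 1
                track["ttl"] = self.countdown
                track["confs"].append(conf)
                self.tracks[box] = track
                mean_conf = sum(track["confs"]) / len(track["confs"])
                return track["count"] >= self.min_frames and mean_conf >= self.min_conf
        # No similar box found: open a new track and do not display it yet.
        self.tracks[box] = {"count": 1, "ttl": self.countdown, "confs": [conf]}
        return False

    def step(self):
        """Call once per frame to age out tracks that received no new detections."""
        for key in list(self.tracks):
            self.tracks[key]["ttl"] -= 1
            if self.tracks[key]["ttl"] <= 0:
                del self.tracks[key]
```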

4. Results

This section describes all the experiments performed and the results obtained. An experimental strategy consisting of three phases will be followed; these phases are as follows:
  • The performance of various models and their different sizes will be studied from numerous training datasets. Additionally, the improvement of adding diverse amounts of background images in the best dataset will be tested.
  • The impact on performance will be evaluated, in terms of accuracy, when using varying data augmentations and the two versions of Scale Match during training.
  • Finally, the study will evaluate the performance and efficacy in accuracy and inference time of using the two proposed modules in the detection scheme on the best model obtained.
Please note that all experiments were conducted on an NVIDIA A100 with 40 GB of memory. In all datasets, only the “gun” class was utilised for detecting the presence of weapons. The primary objective of this work is to detect weapons, rather than distinguishing between various types of weapons present in the images.
Throughout all experiments, unless explicitly mentioned in a specific section, the models were trained using the same optimal hyper-parameters as a baseline. Specifically, the models underwent training for 30 epochs with a learning rate of 1 × 10−3 and a momentum of 0.937, utilising the one-cycle optimiser policy [65]; the final learning rate was significantly lower than the initial learning rate. The batch size was set to 8, and the images were resized to 736 × 1280, incorporating basic transformations for data augmentation, such as rotation, randomHSV, and resizing (except in Section 4.2.1, where the augmentation settings are varied).
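For reference, a training run with these hyper-parameters could be launched as sketched below using the Ultralytics API; the dataset YAML path is hypothetical, and some argument names (e.g. the learning-rate schedule flag) may differ between YOLO versions.

```python
from ultralytics import YOLO

# Fine-tune the small YOLOv8 model on the weapons dataset (sketch).
model = YOLO("yolov8s.pt")
model.train(
    data="disarm_dataset.yaml",   # hypothetical dataset description file
    epochs=30,
    batch=8,
    imgsz=1280,                   # images are rescaled towards 736x1280
    lr0=1e-3,                     # initial learning rate
    lrf=0.01,                     # final LR as a fraction of lr0 (assumed value)
    momentum=0.937,
    cos_lr=True,                  # one-cycle/cosine-style schedule (assumed flag)
)
```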
Additionally, four videos were employed for testing, as depicted in Figure 4. Two of the videos were recorded via CCTV and represent real-case scenarios, manually annotated at a rate of 2 frames per second, resulting in 256 images per video. The third video was captured in a room simulating a CCTV camera environment, featuring guns and objects resembling weapons; manual annotation was performed at a rate of 30 frames per second, resulting in 3305 images. The final footage, also recorded at 30 frames per second, was shot in the same setting but excluded any depictions of weaponry, generating 2019 images. To compare different models with the test set, the following metrics were employed: average precision at IoU thresholds of 0.5 (AP50), 0.75 (AP75), and 0.5–0.95 (AP); average recall (AR); F1-score; false positives (FP); and mean frames per second (mFPS).

4.1. First Phase

4.1.1. Models

Most versions of YOLO come with models that exhibit varying precision and performance. Therefore, in this experiment, our aim is to identify the optimal model, considering both inference time and accuracy, for real-time object detection among the most suitable models within the selected YOLO versions. Specifically, we tested YOLOv5, including its small, medium, and large models. As for YOLOv7, we utilised the baseline (YOLOv7), YOLOv7x, YOLOv7-w6, and YOLOv7-e6 for this experiment. Finally, the small, medium, and large models of YOLOv8 were experimented with.
According to the results displayed in Figure 5, utilising the Disarm-Dataset for training yields superior average precision overall. Additionally, referring to Table 4, employing the Disarm-Dataset for training surpasses the usage of other datasets by a significant margin (+16.1 AP50, +7 AP75, +24.3 AP, +19.2 F1), despite generating the highest number of false positives. Using either the Gun-Dataset or Disarm-Dataset for training yields positive results, as they contain the highest number of high-quality images. To enhance performance on smaller datasets, decreasing the batch size may be beneficial, as it can mitigate the occurrence of over-fitting.
Moreover, the usdataset is a very complex dataset due to the use of CCTV-style images. This dataset comprises a large number of weapons with a very small size, and it also includes occluded or blurred weapon appearances, making it challenging to learn the characteristics of these objects. By changing the cameras used for training and incorporating synthetic images, better learning of the weapon’s features can be achieved. Upon closer examination, the models face considerable difficulty in identifying weapons in the first two videos due to the reduced size of the weapons and increased occlusion, compared to the third video. In general, the heightened complexity of the images makes detection more challenging.
Focusing on the Disarm-Dataset, YOLOv8 exhibits superior accuracy and inference time compared to the models of YOLOv5 and YOLOv7, as evidenced by the results shown in Figure 6 and Table 5. Although YOLOv5 and YOLOv8 outperform YOLOv7, they exhibit lower consistency in the first three videos. Notably, the first two videos demonstrate inferior performances, in terms of accuracy, compared to the third, which has the least complexity. On the contrary, the larger the backbone, the better the outcomes, but there is also a decrease in frames per second with each subsequent backbone, except for YOLOv7, which yields results that are less consistent than those expected. Furthermore, these models require the longest training time, as illustrated in Figure 7, while YOLOv8 demands the least time and delivers superior results. For the forthcoming experiments, we selected the small version of YOLOv8. This decision was based on two factors: firstly, balancing the trade-off between the model’s inference speed and its accuracy, and secondly, minimising the training time required for the model. The small variant of YOLOv8 necessitates the least duration of training but demonstrates comparable parameters to its larger alternatives, while notably improving the inference speed. This model is appealing for deployment on peripheral devices due to its compactness.

4.1.2. Background Images

The aim of this experiment is to determine the optimal percentage of background images that enables the model to perform optimally. Adding too few background images can result in a high false positive rate, while an excess of background images may lead to a deterioration in the model's accuracy in detecting weapons. According to the YOLOv5 documentation (https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results/, accessed on 24 January 2024), it is recommended that this percentage fall between 0 and 10%. In this case, the experiment involved using 0%, 2%, 5%, 7%, and 10% of background images for the Disarm-Dataset. Examples of the added background images can be seen in Figure 8, sourced from live streams of various streets on YouTube, selected images from usdataset, usdataset (Synthetic), and HOD-Dataset (refer to Section 3.1), along with some videos captured in a workspace.
As depicted in Table 6, adding more background images results in the model detecting fewer false positives. However, this also leads to a slight decrease in some metrics. Only the addition of 5% background images yields an increase in accuracy, although its false positive count is not as low as that of the other percentages. Nevertheless, as evident in both Table 6 and Figure 9, this percentage outperforms the other percentages and the base model, while also achieving a slightly better average precision than YOLOv8l but with fewer false positives.

4.2. Second Phase

4.2.1. Data-Augmentation

Initially, all the augmentations offered by the Albumentations library in the YOLOv8 repository were used, but we then searched for the augmentation settings that make the training images resemble real CCTV camera footage, thereby improving the model's precision in these scenarios. The augmentations used are as follows (a configuration sketch is given after this list):
  • Blur: applies a blur with a maximum size of 10 × 10 pixels to the image with a probability of 50%.
  • Cutout: applies three black squares, each 10% of the size of the image, with a probability of 50%.
  • Horizontal Flip: applies a horizontal flip with a probability of 50%.
  • Color Augmentation: applies different Hue, Saturation, and Value variations with a probability of 50% (0.015, 0.7, 0.4, respectively).
  • ISONoise: applies a camera noise to the image with a probability of 50%, a colour variance of between 0.05 and 0.4, and an intensity of between 0.1 and 0.7.
  • Random Brightness: applies random image lighting variations with a probability of 50%.
  • Rotate: applies rotation with a probability of 50% and a range from −15° to 15°.
  • Translate: applies translation of the image with a gain of ±10%.
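As referenced above, the sketch below expresses this augmentation set with Albumentations; the transform and argument names follow the list only approximately and may vary between library versions.

```python
import albumentations as A

# Approximate Albumentations equivalent of the augmentation list above
# (argument names and ranges are indicative and version-dependent).
train_transforms = A.Compose(
    [
        A.Blur(blur_limit=10, p=0.5),                    # blur up to ~10x10 px
        A.CoarseDropout(max_holes=3, p=0.5),             # "cutout"-style squares
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(hue_shift_limit=5,          # rough HSV variation
                             sat_shift_limit=70,
                             val_shift_limit=40, p=0.5),
        A.ISONoise(color_shift=(0.05, 0.4),
                   intensity=(0.1, 0.7), p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=0, p=0.5),       # about +/-10% translation
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```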
The initial configuration utilised was the combination of all augmentations, referred to as Default. Subsequently, various combinations were explored, starting by excluding one augmentation at a time to identify the most effective setup. Different combinations were then tested, incorporating the best-performing augmentation configurations from the previous step. The optimal results are presented in Table 7. Generally, omitting the blur augmentation contributes to an improved overall performance, while the removal of other augmentations does not yield enhancements. This could be attributed to certain images having insufficient resolution for such augmentations. When the representation of weapon features is already sub-optimal, the introduction of additional blur further hampers their detection. The addition of other augmentations does not enhance the Default configuration. For subsequent experiments, the model with all augmentations except blur is utilised as the baseline.

4.2.2. Scale Match

This experiment aims to assess the performance impact, in terms of precision, recall, and inference time, of using both versions of the Scale Match method for training a model. The distinction between the two versions lies in the way the extra dataset is transformed; in the first version, the images are re-scaled using the average size of their bounding boxes, whereas in the second version, the bounding boxes are copied, resized, and pasted onto another image (as illustrated in Figure 10). As observed in Table 8, utilising the Scale Match base version's dataset for training improves every metric compared to the current baseline. When examining objects with smaller aspect ratios (those whose area is less than or equal to 100 px²), it also enhances every metric, although it cannot reduce the false positives, which are most common among smaller sizes. The performance of the copy-and-paste version is inferior across all metrics. This outcome can be attributed to the technique of pasting the bounding box into another image, which can potentially alter the original structure of the image. Another issue might be the images used for the pasting process: most of the weapons appear in dark environments, and in some of the destination images the predominant colour is black, so using brighter and more colourful images might yield better precision and recall.

4.3. Third Phase

4.3.1. Detection Scheme’s Modules

These experiments aim to assess the impact on performance, in terms of accuracy and inference time, of using the second classifier, the temporal window, and both in conjunction with the model. As depicted in Table 9, utilising only the trained model achieves better accuracy than using any of the modules. However, using only the temporal window yields the worst results, as many detections are filtered out, since it requires three consecutive frames for a detection to be considered a true positive. Using both modules compensates for the poor performance of the temporal window, although it produces inferior results compared to using the model alone or the model with the second classifier. Reducing the confidence threshold of the model significantly affects the number of false positives, with an increase of over 90%. However, it has almost no effect when either the second classifier, the temporal window, or both are added, resulting in increases of 8%, 0%, and 0%, respectively.
Table 9 illustrates that using only the model yields a significant number of false positives. In contrast, incorporating the second classifier with the model reduces the number of false positives by 71%, but results in a decrease of 13 FPS in the number of frames per second. On the other hand, employing the temporal window in conjunction with the model results in zero false positives while maintaining the inference time. However, using only the temporal window leads to a substantial decrease in the F1-score of the model, as it requires waiting for a certain number of consecutive high-confidence detections of the same object to be confirmed as a true positive. Combining both methods—the temporal window and the second classifier—with the model results in zero false positives and, as expected, a corresponding reduction in frames per second, similar to when using the second classifier, as the temporal window has almost no effect on the inference time.
It is worth mentioning that most of the false positives that the baseline captures are everyday objects which have a low aspect ratio and can be seen as a weapon. In Figure 11, it can be seen, in the first row, that a mobile phone held like a gun is detected by the baseline detector. In this case, using the second classifier, the temporal window, or both can eliminate the false positive. The second row demonstrates how the second classifier module enhances the system’s performance by elevating the level of confidence in weapon detection for cases with initially low confidence scores (from 0.57 to 0.88). Furthermore, it is evident that relying solely on the temporal window can result in a greater number of true positives being overlooked, underscoring the superiority of a combined approach that incorporates both modules. Nevertheless, there are instances where the system may still malfunction, as illustrated in the third row, where even the second classifier is unable to accurately determine whether the object in question is a weapon.

4.3.2. Detection Scheme’s Effectiveness on Darker Scenes

A critical aspect of evaluating weapon detection systems under CCTV networks is their performance under various environmental conditions, particularly in low-light scenarios that may impede detection accuracy. As the entirety of the test subset comprises images featuring illuminated scenes, a 75% brightness reduction was applied to the testing subset through the use of gamma correction, which can be seen in Figure 12. This methodology enables the simulation of challenging low-light scenarios, thereby facilitating an evaluation of the system’s performance in such conditions.
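A minimal sketch of this darkening step is shown below; the gamma value is an illustrative assumption chosen to give roughly a 75% brightness reduction at mid-grey, not necessarily the exact value used in the experiments.

```python
import cv2
import numpy as np

def darken(image, gamma=3.0):
    """Darken an image via gamma correction (gamma > 1 darkens);
    the default value is only illustrative."""
    table = np.array([((i / 255.0) ** gamma) * 255 for i in range(256)],
                     dtype=np.uint8)
    return cv2.LUT(image, table)

# Example: darken a test frame before evaluation (hypothetical path).
dark_frame = darken(cv2.imread("test_frame.jpg"))
```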
As seen in Table 10, the system’s accuracy degrades significantly in low-light environments, likely due to lower image quality and increased difficulty in distinguishing objects. There is also a reduction in the number of false positives, but this might be due to the model’s conservative approach under low-light conditions, potentially leading to missed detections rather than false positives.

4.3.3. Comparison on Different Combinations

In a final experiment, we conducted several tests with different possible combinations of the techniques evaluated in the previous subsections. The results obtained can be seen in Table 11. This approach allows us to observe, in a more comprehensive manner, the impact of the different combinations on the model.
As can be seen, the temporal window module helps us to minimise the number of false positives at the cost of lowering the number of true positives and their metrics (F1, mAP50, mAP75, and mAP). The best combination using the temporal window is the marked middle row with an F1 score of 68.89. If we seek to maximise detections, even if we detect some false positives, we can see how the marked penultimate row, in which all techniques (background images, data augmentation, scale match, and the second classifier) except the temporal window are used, achieves good results in all metrics (F1 score: 75.72) and reduces false positives to 48.

5. Discussion

Some key points can be drawn from the results. Firstly, the choice of the model version depends on the system being used; for example, it is recommended to utilise either the YOLOv8 small model for less powerful systems or the medium model for more powerful systems, whereas the large model may not be suitable for real-time tasks, and the improvement compared to the medium version is marginal.
Regardless of the choice of version, it is crucial to apply augmentations that replicate the conditions of a CCTV system, keeping in mind that certain augmentations may negatively impact results due to lower image quality within the dataset. Additionally, using the Scale Match method is very useful for improving the detection of weapons with a low scale ratio.
Due to the use of the YOLOv8 library to carry out the inference, the current detection system processes video feeds from multiple cameras sequentially. Specifically, the system handles the first frame from the first video, followed by the first frame from the second video, and so forth. While this approach ensures consistent detection accuracy across different streams, it introduces potential latency as the number of cameras increases, potentially impacting real-time performance due to the cumulative delay in processing each video feed. To address this scalability issue, a potential solution is to implement parallel processing techniques that allow the system to handle frames from multiple video feeds simultaneously. By leveraging multi-threading or GPU parallelism, the system could process frames in parallel, significantly reducing the latency introduced by sequential processing. This approach would enhance the system's ability to maintain real-time performance even as the number of video feeds increases.
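As an illustration of this parallelisation idea (not a feature of the deployed system), the sketch below runs one worker thread per camera feed, each with its own detector instance; the stream URLs, model weights, and threading granularity are assumptions.

```python
import threading
import cv2
from ultralytics import YOLO

def camera_worker(source, results, key):
    """Process one video feed in its own thread (illustrative sketch)."""
    model = YOLO("yolov8s.pt")                      # one detector instance per feed
    cap = cv2.VideoCapture(source)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results[key] = model(frame, verbose=False)  # latest predictions per camera
    cap.release()

# Hypothetical camera sources (RTSP URLs or video files).
sources = ["rtsp://camera-1/stream", "rtsp://camera-2/stream"]
shared_results = {}
threads = [threading.Thread(target=camera_worker, args=(s, shared_results, i))
           for i, s in enumerate(sources)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```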
Finally, whether to use the modules depends on whether a preference for minimal false positives and lower precision outweighs acceptance of a higher number of false positives for better accuracy. In general, these modules are contingent on the base model, which performs the initial detections on which the modules are based. As a result, using these modules with a poor model will lead to ineffectual precision in detecting weapons.

6. Conclusions

In this paper, we present an analysis of various versions of YOLO trained on authentic CCTV imagery and weapons detection datasets. Additionally, we have developed a new dataset called Disarm-Dataset by combining existing datasets with our own created datasets. To test the effectiveness of models against CCTV systems in real-world situations, we have released the test subset of this dataset. Moreover, we utilised the Scale Match method to address the scale mismatch between datasets, subsequently improving the model’s accuracy for detecting small objects. Finally, we designed a detection pipeline that utilises a secondary classifier and a time window in conjunction with the main detector, resulting in a reduction in the number of false positives. This has led to the development of a real-time CCTV weapons detection system, which has been implemented in a production environment by reaching an agreement with a company, and the results were deemed satisfactory.
As future work, improvements can be made to optimise the system's performance; thus, we have some thoughts on how we can accomplish this in the future. Primarily, enhancing the model's architecture by adding attention modules may improve the accuracy of detecting smaller weapons or weapons in complex scenarios. Another improvement could involve using an architecture that takes previous detections into account to detect objects through a temporal flow, or modifying the YOLO architecture to consider that flow; this modification would enable better detection of blurry or occluded weapons. Moreover, the experimental setup utilised for this study involves a system with relatively high computational power, which has proven sufficient for maintaining real-time performance under the tested conditions. However, the scalability of the detection system on lower-power devices, such as edge devices, remains unexplored due to the lack of access to such hardware. To address this limitation and enhance the applicability of our method across a broader range of hardware configurations, future work will focus on optimising the detection system for deployment on edge devices. Potential optimisation techniques include model pruning and quantisation, which can reduce the model's size and computational requirements without significantly compromising accuracy. Additionally, exporting the model to TensorRT could further enhance inference speed and efficiency, making it feasible to deploy the system on devices with limited processing power. Finally, we aim to introduce body cams as the main image source to detect whether a suspect has a weapon in crime scenes or police interventions.

Author Contributions

Conceptualization, J.A.Á.-G., J.L.S.-G. and L.M.S.-M.; methodology, Á.T.-D.; software, Á.T.-D.; validation, Á.T.-D.; formal analysis, Á.T.-D.; investigation, Á.T.-D.; resources, J.A.Á.-G. and L.M.S.-M.; data curation, Á.T.-D. and J.L.S.-G.; writing—original draft preparation, Á.T.-D.; writing—review and editing, Á.T.-D., J.A.Á.-G., J.L.S.-G. and L.M.S.-M.; visualization, Á.T.-D.; supervision, J.A.Á.-G., J.L.S.-G. and L.M.S.-M.; project administration, J.A.Á.-G.; funding acquisition, J.A.Á.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the project PID2021-126359OB-I00 funded by MCIN/AEI/10.13039/501100011033.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

We acknowledge the donation of an A100 48GB GPU from the NVIDIA Hardware Grant granted to our colleague Miguel A. Martinez-del-Amor. We also acknowledge our colleague José Morera-Figueroa for helping us with dataset gathering and labelling and helping with the implementation of the temporal window module.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. González, J.L.S.; Zaccaro, C.; Álvarez García, J.A.; Morillo, L.M.S.; Caparrini, F.S. Real-time gun detection in CCTV: An open problem. Neural Netw. 2020, 132, 297–308. [Google Scholar] [CrossRef]
  2. Velastin, S.A.; Boghossian, B.A.; Vicencio-Silva, M.A. A motion-based image processing system for detecting potentially dangerous situations in underground railway stations. Transp. Res. Part C Emerg. Technol. 2006, 14, 96–113. [Google Scholar] [CrossRef]
  3. Qi, D.; Tan, W.; Liu, Z.; Yao, Q.; Liu, J. A Dataset and System for Real-Time Gun Detection in Surveillance Video Using Deep Learning. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; pp. 667–672. [Google Scholar] [CrossRef]
  4. Gu, Y.; Liao, X.; Qin, X. YouTube-GDD: A challenging gun detection dataset with rich contextual information. arXiv 2022, arXiv:2203.04129. [Google Scholar]
  5. Qiao, L.; Li, X.; Jiang, S. RGB-D Object Recognition from Hand-Held Object Teaching. In Proceedings of the International Conference on Internet Multimedia Computing and Service, ICIMCS’16, Xi’an, China, 19–21 August 2016; pp. 31–34. [Google Scholar] [CrossRef]
  6. Olmos, R.; Tabik, S.; Herrera, F. Automatic handgun detection alarm in videos using deep learning. Neurocomputing 2018, 275, 66–72. [Google Scholar] [CrossRef]
  7. Olmos, R.; Tabik, S.; Lamas, A.; Pérez-Hernández, F.; Herrera, F. A binocular image fusion approach for minimizing false positives in handgun detection with deep learning. Inf. Fusion 2019, 49, 271–280. [Google Scholar] [CrossRef]
  8. Romero, D.; Salamea, C. Convolutional models for the detection of firearms in surveillance videos. Appl. Sci. 2019, 9, 2965. [Google Scholar] [CrossRef]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  10. Pang, L.; Liu, H.; Chen, Y.; Miao, J. Real-time concealed object detection from passive millimeter wave images based on the YOLOv3 algorithm. Sensors 2020, 20, 1678. [Google Scholar] [CrossRef]
  11. Wang, G.; Ding, H.; Duan, M.; Pu, Y.; Yang, Z.; Li, H. Fighting against terrorism: A real-time CCTV autonomous weapons detection based on improved YOLO v4. Digit. Signal Process. 2023, 132, 103790. [Google Scholar] [CrossRef]
  12. Ahmed, S.; Bhatti, M.T.; Khan, M.G.; Lövström, B.; Shahid, M. Development and Optimization of Deep Learning Models for Weapon Detection in Surveillance Videos. Appl. Sci. 2022, 12, 5772. [Google Scholar] [CrossRef]
  13. Castillo, A.; Tabik, S.; Pérez, F.; Olmos, R.; Herrera, F. Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing 2019, 330, 151–161. [Google Scholar] [CrossRef]
  14. Pérez-Hernández, F.; Tabik, S.; Lamas, A.; Olmos, R.; Fujita, H.; Herrera, F. Object Detection Binary Classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowl.-Based Syst. 2020, 194, 105590. [Google Scholar] [CrossRef]
  15. Salido, J.; Lomas, V.; Ruiz-Santaquiteria, J.; Deniz, O. Automatic Handgun Detection with Deep Learning in Video Surveillance Images. Appl. Sci. 2021, 11, 6085. [Google Scholar] [CrossRef]
  16. Ashraf, A.H.; Imran, M.; Qahtani, A.M.; Alsufyani, A.; Almutiry, O.; Mahmood, A.; Attique, M.; Habib, M. Weapons detection for security and video surveillance using cnn and YOLO-v5s. CMC-Comput. Mater. Contin. 2022, 70, 2761–2775. [Google Scholar] [CrossRef]
  17. Goenka, A.; Sitara, K. Weapon Detection from Surveillance Images using Deep Learning. In Proceedings of the 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India, 27–29 May 2022; pp. 1–6. [Google Scholar] [CrossRef]
  18. Perea-Trigo, M.; López-Ortiz, E.J.; Salazar-González, J.L.; Álvarez-García, J.A.; Vegas Olmos, J.J. Data Processing Unit for Energy Saving in Computer Vision: Weapon Detection Use Case. Electronics 2022, 12, 146. [Google Scholar] [CrossRef]
  19. Hnoohom, N.; Chotivatunyu, P.; Jitpattanakul, A. ACF: An armed CCTV footage dataset for enhancing weapon detection. Sensors 2022, 22, 7158. [Google Scholar] [CrossRef]
  20. Berardini, D.; Migliorelli, L.; Galdelli, A.; Frontoni, E.; Mancini, A.; Moccia, S. A deep-learning framework running on edge devices for handgun and knife detection from indoor video-surveillance cameras. Multimed. Tools Appl. 2023, 83, 19109–19127. [Google Scholar] [CrossRef]
  21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  22. Arcos-García, Á.; Álvarez García, J.A.; Soria-Morillo, L.M. Deep neural network for traffic sign recognition systems: An analysis of spatial transformers and stochastic optimisation methods. Neural Netw. 2018, 99, 158–165. [Google Scholar] [CrossRef]
  23. Arcos-García, Á.; Álvarez García, J.A.; Soria-Morillo, L.M. Evaluation of deep neural networks for traffic sign detection systems. Neurocomputing 2018, 316, 332–344. [Google Scholar] [CrossRef]
  24. Yu, X.; Han, Z.; Gong, Y.; Jan, N.; Zhao, J.; Ye, Q.; Chen, J.; Feng, Y.; Zhang, B.; Wang, X.; et al. The 1st tiny object detection challenge: Methods and results. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 315–323. [Google Scholar] [CrossRef]
  25. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230. [Google Scholar] [CrossRef]
  26. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale Selection Pyramid Network for Tiny Person Detection From UAV Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  27. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1246–1254. [Google Scholar] [CrossRef]
  28. Zhang, W.; Wang, S.; Thachan, S.; Chen, J.; Qian, Y. Deconv R-CNN for Small Object Detection on Remote Sensing Images. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2483–2486. [Google Scholar] [CrossRef]
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  30. Li, R.; Yang, J. Improved YOLOv2 object detection model. In Proceedings of the 2018 6th International Conference on Multimedia Computing and Systems (ICMCS), Rabat, Morocco, 10–12 May 2018; pp. 1–6. [Google Scholar] [CrossRef]
  31. Liang, Z.; Shao, J.; Zhang, D.; Gao, L. Small object detection using deep feature pyramid networks. In Proceedings of the Advances in Multimedia Information Processing–PCM 2018: 19th Pacific-Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Proceedings, Part III 19. Springer: Berlin/Heidelberg, Germany, 2018; pp. 554–564. [Google Scholar] [CrossRef]
  32. Li, W.; Zhang, L.; Wu, C.; Cui, Z.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Technol. 2022, 123, 1999–2015. [Google Scholar] [CrossRef]
  33. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  34. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  35. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  36. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  37. Jocher, G. Ultralytics YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 7 March 2024).
  38. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar]
  39. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  40. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. Available online: https://github.com/ultralytics/ultralytics (accessed on 4 May 2024).
  41. Lu, S.; Wang, B.; Wang, H.; Chen, L.; Linjian, M.; Zhang, X. A real-time object detection algorithm for video. Comput. Electr. Eng. 2019, 77, 398–408. [Google Scholar] [CrossRef]
  42. Huang, R.; Pedoeem, J.; Chen, C. YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2503–2510. [Google Scholar] [CrossRef]
  43. Gupta, C.; Gill, N.S.; Gulia, P.; Chatterjee, J.M. A novel finetuned YOLOv6 transfer learning model for real-time object detection. J. Real-Time Image Process. 2023, 20, 42. [Google Scholar] [CrossRef]
  44. Xia, R.; Li, G.; Huang, Z.; Meng, H.; Pang, Y. Bi-path combination YOLO for real-time few-shot object detection. Pattern Recognit. Lett. 2023, 165, 91–97. [Google Scholar] [CrossRef]
  45. Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl. Intell. 2021, 52, 8448–8463. [Google Scholar] [CrossRef]
  46. Fang, W.; Wang, L.; Ren, P. Tinier-YOLO: A Real-Time Object Detection Method for Constrained Environments. IEEE Access 2020, 8, 1935–1944. [Google Scholar] [CrossRef]
  47. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13024–13033. [Google Scholar] [CrossRef]
  48. Ganesh, P.; Chen, Y.; Yang, Y.; Chen, D.; Winslett, M. YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1311–1321. [Google Scholar] [CrossRef]
  49. Wang, T.; Anwer, R.M.; Cholakkal, H.; Khan, F.S.; Pang, Y.; Shao, L. Learning Rich Features at High-Speed for Single-Shot Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1971–1980. [Google Scholar] [CrossRef]
  50. Wang, R.J.; Li, X.; Ling, C.X. Pelee: A Real-Time Object Detection System on Mobile Devices. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Montreal, QC, Canada, 3–8 December 2018; pp. 1967–1976. [Google Scholar]
  51. Lee, Y.; Hwang, J.w.; Lee, S.; Bae, Y.; Park, J. An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 752–760. [Google Scholar] [CrossRef]
  52. Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. Cornernet-lite: Efficient keypoint based object detection. arXiv 2019, arXiv:1904.08900. [Google Scholar]
  53. Qin, Z.; Li, Z.; Zhang, Z.; Bao, Y.; Yu, G.; Peng, Y.; Sun, J. ThunderNet: Towards real-time generic object detection on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6718–6727. [Google Scholar] [CrossRef]
  54. Shih, K.H.; Chiu, C.T.; Lin, J.A.; Bu, Y.Y. Real-Time Object Detection With Reduced Region Proposal Network via Multi-Feature Concatenation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 2164–2173. [Google Scholar] [CrossRef] [PubMed]
  55. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  56. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar] [CrossRef]
  57. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  58. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar] [CrossRef]
  59. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  60. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  61. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar] [CrossRef]
  62. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  63. Yu, W.; Si, C.; Zhou, P.; Luo, M.; Zhou, Y.; Feng, J.; Yan, S.; Wang, X. Metaformer baselines for vision. arXiv 2022, arXiv:2210.13452. [Google Scholar] [CrossRef]
  64. Fang, Y.; Sun, Q.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. Eva-02: A visual representation for neon genesis. arXiv 2023, arXiv:2303.11331. [Google Scholar] [CrossRef]
  65. Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Baltimore, MD, USA, 15–17 April 2019; Volume 11006, pp. 369–386. [Google Scholar] [CrossRef]
Figure 1. Examples of each dataset, including usdataset (a), usdataset_Synth (b), YouTube-GDD (c), Gun-Dataset (d), FP-Dataset (e), and HOD-Dataset (f).
Figure 2. General scheme of the detection pipeline process.
Figure 3. General scheme of the temporal window.
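As an aside for implementers, the temporal-window idea in Figure 3 can be approximated with a short Python sketch like the one below, which confirms a detection only if a sufficiently overlapping box has appeared in most of the last few frames. The window length, the IoU-based association, and the class name TemporalWindow are illustrative assumptions, not the exact module used in this work.

from collections import deque

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class TemporalWindow:
    """Keeps the detections of the last `size` frames and confirms a new box
    only if a similar box appeared in at least `min_hits` frames (including
    the current one)."""
    def __init__(self, size=5, min_hits=3, iou_thr=0.5):
        self.history = deque(maxlen=size)
        self.min_hits = min_hits
        self.iou_thr = iou_thr

    def filter(self, boxes):
        confirmed = []
        for box in boxes:
            hits = sum(
                any(iou(box, old) >= self.iou_thr for old in frame)
                for frame in self.history
            )
            if hits + 1 >= self.min_hits:  # +1 counts the current frame
                confirmed.append(box)
        self.history.append(boxes)
        return confirmed

With size=5 and min_hits=3, an isolated single-frame false positive is suppressed, while a weapon that stays visible for a few consecutive frames is confirmed.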
Figure 4. Test subset sample images.
Figure 5. Comparison between datasets according to AP50–95. Note that the results shown are the average obtained by the models trained on these datasets (YOLOv5 (s, m, l), YOLOv7 (base, v7x, w6, e6), and YOLOv8 (s, m, l)).
Figure 6. Comparison between models using the Disarm-Dataset according to AP and mFPS. Note that the mFPS was obtained using a 720p video on an RTX 3060.
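Mean FPS figures such as those in Figure 6 can be reproduced approximately with a simple timing loop. The sketch below uses the Ultralytics YOLO API as an example; the weights file yolov8s.pt and the video file video_720p.mp4 are placeholders rather than the authors' benchmarking setup.

import time
import cv2
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("yolov8s.pt")                 # any detector weights would do
cap = cv2.VideoCapture("video_720p.mp4")   # hypothetical 720p test video

frames, start = 0, time.perf_counter()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    model.predict(frame, verbose=False)    # one forward pass per frame
    frames += 1
cap.release()

elapsed = time.perf_counter() - start
print(f"mean FPS: {frames / elapsed:.1f}")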
Figure 7. Comparison between models using Disarm-Dataset according to training time in hours.
Figure 8. False positive sample images.
Figure 9. Comparison between models trained on the Disarm-Dataset with different percentages of background images, according to Average Precision (AP) and False Positives (FP).
Figure 10. Comparison between versions of the Scale Match method.
Figure 11. Examples of different detections made by the detection system. The first row shows how using the second classifier, the temporal window, or both can eliminate a false positive. The second row shows how the second classifier module raises the confidence of a weapon detection when the detector's initial confidence is low. The third row illustrates a potential instance of system malfunction. Note that (a) is the detection made by the baseline detector, (b) the detection made using the second classifier module, (c) the detection made using the temporal window module, and (d) the detection made using both modules.
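The behaviour described for the second classifier in Figure 11 (vetoing doubtful boxes and raising the confidence of weak but correct ones) could be implemented roughly as follows. The helper classifier(crop), the veto threshold, and the max-based fusion rule are assumptions made for illustration only, not the authors' module.

def rescore_with_second_classifier(image, detections, classifier, keep_thr=0.5):
    """Re-check every detector box with a binary weapon/no-weapon classifier.

    `detections` is a list of (x1, y1, x2, y2, conf); `classifier(crop)` is
    assumed to return the probability that the crop contains a weapon.
    """
    kept = []
    for x1, y1, x2, y2, conf in detections:
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        p = classifier(crop)
        if p < keep_thr:
            continue                  # the classifier vetoes the detection
        fused = max(conf, p)          # a confident classifier can raise a weak detection
        kept.append((x1, y1, x2, y2, fused))
    return kept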
Figure 12. Comparison between original images (first row) and images with a brightness reduction of 75% (second row).
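The darkened images in the second row of Figure 12 correspond to keeping 25% of the original pixel intensity; a minimal OpenCV sketch (with placeholder file names) would be:

import cv2

img = cv2.imread("frame.jpg")                        # placeholder input image
dark = cv2.convertScaleAbs(img, alpha=0.25, beta=0)  # keep 25% of the brightness (75% reduction)
cv2.imwrite("frame_dark.jpg", dark)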
Table 1. Comparison of various state-of-the-art weapons detection methods. AP is Average Precision and mFPS is mean frames per second. Note that the results were obtained on different datasets and tested on different hardware setups.
Models | Dataset | Input Size | AP | mFPS
Salazar et al. [1] | usdataset + usdataset_Synth + UGR | - | 67.7 | -
Olmos et al. [6] | UGR | 1000 × 1000 | - | 5.3
Wang et al. [11] | usdataset + usdataset_Synth + new images | 416 × 416 | 81.75 | 69.1
Ashraf et al. [16] | Modified UGR | 416 × 416 | - | 25
Hnoohom et al. [19] | ACF | 512 × 512 | 49.59 | -
Berardini et al. [20] | CCTV dataset | 416 × 416 | 79.30 | 5.10
Ours | Disarm-Dataset | 1280 × 720 | 51.36 | 34
Table 2. Overview of the datasets used. Note that the “Others” category pertains to images where no weapons are present.
Dataset | Class | No. of Images | No. of Instances | % of Low-Ambient-Light Images
usdataset | Gun | 1520 | 2511 | 0.0
usdataset | Others | 3629 | - | -
usdataset_Synth | Gun | 864 | 1110 | 0.0
usdataset_Synth | Others | 136 | - | -
usdataset_v2 | Gun | 1520 | 2511 | 0.0
usdataset_v2 | Others | 3629 | - | -
Gun-Dataset | Gun | 46,238 | 49,644 | 18.58
Gun-Dataset | Others | - | - | -
YouTube-GDD | Gun | 4500 | 7153 | 7.16
YouTube-GDD | Others | 500 | - | -
HOD-Dataset | Gun | - | - | 10.0
HOD-Dataset | Others | 1000 | - | -
FP-Dataset | Gun | - | - | 0.0
FP-Dataset | Others | 36,000 | - | -
Disarm-Dataset | Gun | 60,081 | 64,994 | 14.58
Disarm-Dataset | Others | 2650 | - | -
Table 3. Overview of YOLO models (retrieved from the official website of each version). It should be noted that, in both YOLOv5 and YOLOv7, the authors used an NVIDIA V100 for training, while in YOLOv8, they used an Amazon instance (EC2 P4d), which uses an NVIDIA A100.
Model | Input Size | Parameters (M) | Inference Time (ms)
YOLOv5s | 640 | 7.2 | 6.4
YOLOv5m | 640 | 21.2 | 8.2
YOLOv5l | 640 | 46.5 | 10.1
YOLOv5x | 640 | 86.7 | 12.1
YOLOv7 | 640 | 36.9 | 6.21
YOLOv7-X | 640 | 71.3 | 8.77
YOLOv7-W6 | 1280 | 70.4 | 11.9
YOLOv7-E6 | 1280 | 97.2 | 17.85
YOLOv7-D6 | 1280 | 154.7 | 22.72
YOLOv7-E6E | 1280 | 151.7 | 27.77
YOLOv8n | 640 | 3.2 | 0.99
YOLOv8s | 640 | 11.2 | 1.20
YOLOv8m | 640 | 25.9 | 1.83
YOLOv8l | 640 | 43.7 | 2.39
YOLOv8x | 640 | 68.2 | 3.53
Table 4. Median results (YOLOv5 (s, m, l), YOLOv7 (base, v7x, w6, e6), and YOLOv8 (s, m, l)) grouped by dataset. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), and False Positives (FP), calculated using an IoU of 0.7 and a confidence threshold of 0.5 (a sketch of this matching procedure is given after the table). Note that the results in red are the best and those in blue are the second best.
Dataset | AP50 | AP75 | AP | F1 | FP
Disarm-Dataset | 42.80 | 50.90 | 42.80 | 69.70 | 196
Gun-Dataset | 60.60 | 34.50 | 29.20 | 52.80 | 163
YouTube-GDD | 0.50 | 0.30 | 0.20 | 0.00 | 2
usdataset_v2 + synth | 0.70 | 0.70 | 0.60 | 0.40 | 0
usdataset_v2 | 0.00 | 0.00 | 0.00 | 0.00 | 0
usdataset + synth | 0.00 | 0.00 | 0.00 | 0.00 | 0
usdataset | 0.00 | 0.00 | 0.00 | 0.00 | 0
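The matching sketch referenced in the caption of Table 4 is given here: a generic greedy matching of detections to ground-truth boxes at an IoU threshold of 0.7 and a confidence threshold of 0.5, from which F1 and the false-positive count follow. It is a standard evaluation sketch written for illustration, not the authors' evaluation code.

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def f1_and_fp(detections, ground_truth, iou_thr=0.7, conf_thr=0.5):
    """`detections` is a list of (box, confidence); `ground_truth` is a list of boxes."""
    dets = sorted((d for d in detections if d[1] >= conf_thr),
                  key=lambda d: d[1], reverse=True)
    matched, tp, fp = set(), 0, 0
    for box, _ in dets:
        candidates = [(box_iou(box, gt), i)
                      for i, gt in enumerate(ground_truth) if i not in matched]
        best_iou, best_i = max(candidates, default=(0.0, None))
        if best_i is not None and best_iou >= iou_thr:
            matched.add(best_i)   # true positive: matched an unused ground-truth box
            tp += 1
        else:
            fp += 1               # false positive: no sufficiently overlapping box left
    fn = len(ground_truth) - len(matched)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9), fp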
Table 5. Results of each YOLO model trained with Disarm-Dataset. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), and False Positives (FP). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
Model | AP50 | AP75 | AP | F1 | FP
YOLOv8l | 75.77 | 71.35 | 62.34 | 71.10 | 401
YOLOv8m | 73.12 | 69.22 | 60.94 | 68.70 | 196
YOLOv8s | 77.12 | 68.10 | 59.15 | 72.50 | 348
YOLOv7-D6 | 59.00 | 47.00 | 41.40 | 72.00 | 106
YOLOv7-E6 | 59.40 | 49.50 | 40.80 | 72.90 | 139
YOLOv7-W6 | 59.80 | 50.90 | 42.80 | 72.80 | 256
YOLOv7-X | 19.90 | 18.90 | 14.60 | 58.00 | 60
YOLOv7 | 40.10 | 38.80 | 32.90 | 56.60 | 31
YOLOv5l | 66.50 | 58.80 | 48.40 | 61.40 | 356
YOLOv5m | 67.60 | 58.10 | 48.30 | 61.90 | 186
YOLOv5s | 59.70 | 48.60 | 40.30 | 49.00 | 214
Table 6. Results of models trained on the Disarm-Dataset with different percentages of background (BG) images. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), and False Positives (FP). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
% BG Images | AP50 | AP75 | AP | F1 | FP
10% | 76.35 | 71.90 | 61.99 | 71.60 | 171
7% | 74.22 | 67.51 | 58.37 | 68.30 | 180
5% | 77.21 | 73.18 | 62.81 | 72.50 | 293
2% | 74.30 | 69.83 | 60.24 | 68.90 | 261
Baseline | 77.12 | 68.10 | 59.15 | 72.50 | 348
Table 7. Results grouped by augmentation configuration. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), and False Positives (FP). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
Augment | AP50 | AP75 | AP | F1 | FP
no_blur_noise | 77.26 | 72.40 | 62.34 | 72.60 | 287
no_blur_cutout | 77.35 | 72.30 | 62.69 | 73.20 | 245
no_blur_bright | 76.13 | 71.80 | 61.26 | 71.60 | 117
no_flip | 76.89 | 70.55 | 59.15 | 73.00 | 223
no_blur | 77.23 | 73.06 | 62.77 | 73.10 | 202
Default | 75.90 | 71.21 | 59.89 | 71.60 | 157
Baseline | 77.21 | 73.18 | 62.81 | 72.50 | 293
Table 8. Comparison between training with different Scale Match versions and training without them. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), and False Positives (FP). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best. A simplified sketch of the Scale Match rescaling step follows the table.
Area | Dataset | AP50 | AP75 | AP | F1 | FP
All | Scale Match paste | 70.00 | 58.18 | 48.10 | 66.70 | 408
All | Scale Match | 77.68 | 73.88 | 63.41 | 73.60 | 164
All | Baseline | 77.23 | 73.06 | 62.77 | 73.10 | 202
≤100 px² | Scale Match paste | 41.26 | 30.46 | 26.09 | 47.01 | 407
≤100 px² | Scale Match | 62.33 | 53.46 | 46.30 | 65.15 | 164
≤100 px² | Baseline | 49.10 | 42.55 | 36.39 | 54.43 | 199
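For context, Scale Match [27] rescales training images from external sources so that the size distribution of their objects matches the object sizes seen in the target CCTV-style data. The following is a heavily simplified sketch of that rescaling step; target_sizes (square roots of target-object areas) and the uniform sampling are illustrative assumptions, and the “paste” variant compared above is not covered.

import random
import cv2

def scale_match(image, boxes, target_sizes):
    """Rescale an external image so its mean object size matches a size
    sampled from the target dataset's object-size distribution (simplified).

    `boxes` are (x1, y1, x2, y2); `target_sizes` is a list of sqrt(areas)
    collected beforehand from the target dataset (assumed to be available).
    """
    if not boxes:
        return image, boxes
    mean_size = sum(((x2 - x1) * (y2 - y1)) ** 0.5
                    for x1, y1, x2, y2 in boxes) / len(boxes)
    ratio = random.choice(target_sizes) / mean_size
    h, w = image.shape[:2]
    resized = cv2.resize(image, (max(1, int(w * ratio)), max(1, int(h * ratio))))
    new_boxes = [(x1 * ratio, y1 * ratio, x2 * ratio, y2 * ratio)
                 for x1, y1, x2, y2 in boxes]
    return resized, new_boxes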
Table 9. General comparison between combinations (temp. wind. refers to the temporal window and 2nd clas. refers to the second classifier). The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), False Positives (FP), and mean frames per second (mFPS). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
Combination | AP50 | AP75 | AP | F1 | FP | mFPS
All | 61.14 | 59.28 | 51.36 | 40.12 | 0 | 34
+ temp. wind. | 52.66 | 52.49 | 48.10 | 11.82 | 0 | 47
+ 2nd clas. | 75.72 | 71.31 | 61.09 | 71.71 | 48 | 34
Baseline | 77.68 | 73.88 | 63.41 | 73.60 | 164 | 48
Table 10. General comparison between combinations in bright and dark scenes (temp. wind. refers to the temporal window and 2nd clas. refers to the second classifier). The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), False Positives (FP), and mean frames per second (mFPS). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
Scenario | Combination | AP50 | AP75 | AP | F1 | FP | mFPS
Bright | All | 61.14 | 59.28 | 51.36 | 40.12 | 0 | 34
Bright | + temp. wind. | 52.66 | 52.49 | 48.10 | 11.82 | 0 | 47
Bright | + 2nd clas. | 75.72 | 71.31 | 61.09 | 71.71 | 48 | 34
Bright | Baseline | 77.68 | 73.88 | 63.41 | 73.60 | 164 | 48
Dark | All | 58.95 | 55.37 | 46.33 | 37.77 | 0 | 34
Dark | + temp. wind. | 50.75 | 50.19 | 44.36 | 10.04 | 0 | 47
Dark | + 2nd clas. | 73.10 | 66.08 | 54.90 | 67.16 | 32 | 34
Dark | Baseline | 74.11 | 67.99 | 56.73 | 65.45 | 84 | 48
Table 11. General comparison between each possible combination. The symbol ‘x’ indicates that the corresponding component is used. Note that BG Images means background images, Augs means augmentations, S.M. means Scale Match, Temp. Wind. means temporal window, and 2nd Clas. means second classifier. The metrics used were Average Precision (AP), AP50, AP75, F1 score (F1), False Positives (FP), and false positives with an area less than or equal to 100 px² (FP≤100). These were calculated using an IoU of 0.7 and a confidence threshold of 0.5. Note that the results in red are the best and those in blue are the second best.
BG Images | Augs | S.M. | Temp. Wind. | 2nd Clas. | AP50 | AP75 | AP | F1 | FP | FP≤100
- | - | - | - | - | 77.12 | 68.10 | 59.15 | 72.50 | 348 | 347
- | - | - | x | - | 53.29 | 52.05 | 46.48 | 17.47 | 0 | 0
- | - | - | - | x | 75.29 | 64.95 | 56.06 | 71.06 | 90 | 90
- | - | - | x | x | 62.67 | 55.90 | 48.80 | 45.81 | 0 | 0
- | x | - | - | - | 76.97 | 72.52 | 62.48 | 73.65 | 205 | 205
- | x | - | x | - | 53.11 | 52.94 | 46.84 | 11.71 | 0 | 0
- | x | - | - | x | 74.60 | 69.46 | 59.63 | 72.20 | 51 | 51
- | x | - | x | x | 58.74 | 55.06 | 47.73 | 38.55 | 0 | 0
- | - | x | - | - | 77.95 | 72.76 | 64.58 | 74.05 | 317 | 315
- | - | x | x | - | 55.65 | 54.01 | 49.46 | 24.54 | 0 | 0
- | - | x | - | x | 75.99 | 70.62 | 62.16 | 73.27 | 96 | 96
- | - | x | x | x | 62.92 | 59.51 | 52.99 | 46.33 | 1 | 1
- | x | x | - | - | 77.08 | 71.91 | 63.54 | 73.42 | 334 | 333
- | x | x | x | - | 58.78 | 56.86 | 51.37 | 34.90 | 0 | 0
- | x | x | - | x | 75.72 | 69.71 | 61.40 | 73.17 | 108 | 108
- | x | x | x | x | 68.89 | 65.56 | 57.83 | 58.05 | 0 | 0
x | - | - | - | - | 77.21 | 73.18 | 62.81 | 72.50 | 293 | 289
x | - | - | x | - | 54.61 | 53.87 | 47.50 | 19.28 | 0 | 0
x | - | - | - | x | 75.69 | 71.07 | 60.49 | 71.61 | 90 | 90
x | - | - | x | x | 65.14 | 61.99 | 53.02 | 50.02 | 0 | 0
x | x | - | - | - | 77.23 | 73.06 | 62.77 | 73.10 | 202 | 199
x | x | - | x | - | 54.62 | 53.35 | 47.81 | 19.10 | 0 | 0
x | x | - | - | x | 76.42 | 71.87 | 61.33 | 72.56 | 87 | 87
x | x | - | x | x | 58.52 | 56.90 | 49.53 | 34.07 | 0 | 0
x | - | x | - | - | 77.64 | 72.91 | 64.83 | 73.99 | 445 | 444
x | - | x | x | - | 56.42 | 54.85 | 50.19 | 28.85 | 0 | 0
x | - | x | - | x | 75.85 | 70.46 | 62.39 | 72.89 | 64 | 64
x | - | x | x | x | 65.93 | 62.31 | 55.89 | 54.81 | 0 | 0
x | x | x | - | - | 77.68 | 73.88 | 63.41 | 73.60 | 164 | 164
x | x | x | x | - | 52.66 | 52.49 | 48.10 | 11.82 | 0 | 0
x | x | x | - | x | 75.72 | 71.31 | 61.09 | 71.71 | 48 | 48
x | x | x | x | x | 61.14 | 59.28 | 51.36 | 40.12 | 0 | 0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
