1. Introduction
Over the last decade, and especially during the past two years, the COVID-19 pandemic has caused a drastic increase in multimedia traffic (video and audio) transmitted through networks. According to Cisco's 2021 Global Networking Trends Report [1], an average of 4.7 times more employees now work from home compared to before the pandemic, which has led 62% of companies to deploy video conferencing applications. As a result, Information Technology (IT) faces a new set of challenges in supporting remote workers, including security across a more distributed computing landscape, end-user behavior, application performance, and IT operations.
Moreover, the Cisco Annual Internet Report (2018–2023) [2] anticipates that the share of Machine-to-Machine (M2M) connections will increase from 33% in 2018 to 50% in 2023. This is due to a growing number of M2M applications, which drives the growth of devices and connections. Within the M2M connections category (also referred to as the Internet of Things (IoT)), connected home applications will represent nearly 50% of total M2M connections by 2023, which implies a significant bandwidth demand from future connected home applications.
According to a recent report published by Allied Market Research [3], the global video surveillance market was valued at $42.94 billion in 2019 and was expected to reach $144.85 billion by 2027, registering a Compound Annual Growth Rate (CAGR) of 14.6% from 2020 to 2027. This growth will result from the increasing demand for safety in high-risk areas, the integration of the IoT into surveillance cameras, and the growing need to monitor COVID-19 clusters in high-risk areas. Moreover, in the Internet of Multimedia Things (IoMT), smart surveillance systems play a significant role in smart cities due to their capabilities in automated human and object recognition; in tracking and accounting for risk factors; in enhancing Intelligent Transportation Systems (ITS); and in supporting smart health through real-time video monitoring of patients [4].
Several articles [5,6,7] have discussed the problem of weapon detection in surveillance videos and the importance of automating this process: automation reduces the huge effort needed to manually review video streams, limits the violation of privacy, and, most importantly, provides a very fast response that may lead to crime prevention. Finding important scenes in video surveillance is essential for triggering events according to the content of the frames. The severity of an action detected by a surveillance camera may require immediate intervention from one or more services, such as the police or a hospital.
Detecting hidden weapons was previously addressed in [8] through active electromagnetic signals detected by a walk-through metal detector and a sensor array. Prior to that research, weapon detection systems based on metal detectors mainly detected the presence of large metal objects, required an adjustable threshold to identify elements of threat, and were affected by the human body. The authors of [8] managed to reconstruct an image from the measured electromagnetic signals and use it for weapon detection. The advantage of that system over ours is its ability to detect concealed weapons, yet it is limited by its need for expensive hardware. Our proposed system builds on camera-based systems, which are now more commonly used, as noted in [3], and are a crucial component of smart surveillance.
Article [9] provided a comprehensive review of automatic pistol and knife detection in different computer vision-based systems. The article discussed the benefits of automatic weapon detection systems and mentioned some of their challenges. For instance, the apparent lengths of detected weapons differ according to the parameters of the imaging system, since the position of the camera relative to the target weapon is not fixed. Furthermore, the high variation in the types, colors, and shapes of weapons must be addressed by all systems and depends mainly on the variety of weapons available in the training datasets. Finally, the viewing angle of the weapon can greatly affect the recognition ability of any system.
Object detection and recognition is a domain that deep learning research has tackled intensively over the past few years [10]. Several challenges arise when detecting objects in an image, for instance, the large variation in the appearance of each target object class from one scene to another. Another challenge is the balance between the speed and accuracy of a detection algorithm, especially on edge devices. In surveillance applications, a fast detection rate is very important and, since the application is critical, so is the accuracy of the system.
YOLO [11] is an object detection algorithm that was initially presented in 2016; since then, YOLO's first version and its successors have been employed in a variety of applications, such as vehicle recognition [12] and face recognition [13], owing to their fast and accurate detection. The latest release of YOLO, namely YOLOv5, was proposed by [14] and is provided as several models of different sizes. The YOLOv5 architecture comprises three stages: backbone, neck, and head. The backbone is the feature extraction stage, which utilizes the CSPDarknet model; the neck fuses the extracted features using the PANet model; and finally, the head comprises a YOLO layer that generates the detection parameters [15].
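To make the head stage's output concrete, the sketch below illustrates, in pure Python, the standard post-processing that YOLO-family detectors apply to raw predictions: confidence thresholding followed by class-aware non-maximum suppression (NMS). The box format, threshold values, and sample predictions are illustrative assumptions, not values taken from our experiments.

```python
# Minimal sketch of YOLO-style post-processing: keep confident boxes,
# then suppress lower-scoring boxes that overlap a kept box of the
# same class. Boxes are (x1, y1, x2, y2); thresholds are illustrative.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def postprocess(predictions, conf_thres=0.25, iou_thres=0.45):
    """Filter by confidence, then apply class-aware NMS."""
    kept = []
    # Visit candidates in order of decreasing confidence.
    for box, conf, cls in sorted(
            (p for p in predictions if p[1] >= conf_thres),
            key=lambda p: p[1], reverse=True):
        if all(c != cls or iou(box, b) < iou_thres for b, _, c in kept):
            kept.append((box, conf, cls))
    return kept

# Two overlapping "pistol" candidates and one low-confidence box.
preds = [
    ((100, 100, 200, 200), 0.90, "pistol"),
    ((105, 105, 205, 205), 0.80, "pistol"),   # duplicate of the first
    ((300, 300, 350, 350), 0.10, "knife"),    # below the threshold
]
print(postprocess(preds))  # only the 0.90 pistol box survives
```

In a deployed detector, this step runs on every frame, so its cost matters on edge devices; the actual YOLOv5 implementation performs the equivalent computation with vectorized tensor operations.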
Article [6] compared different object detection techniques based on both sliding-window and region-based methods. Experiments were conducted on a dataset collected by the researchers and not released for public use. The analysis showed that YOLOv4 produced the highest mean average precision and F1-score. The authors noted that real-time detection analysis requires a real-time dataset, which is currently still not available.
Video surveillance, as explained above, plays a significant role in smart cities, whether for safety or for the tracking of COVID-19 cases. However, it is a bandwidth-hungry application. Video surveillance systems generate delay-sensitive multimedia traffic, especially when the early detection of a crime offers a chance to prevent it or to rescue victims quickly. Network support for this type of traffic is required to ensure the Quality of Service (QoS) requirements of the application in terms of bandwidth, delay, and jitter.
Once a weapon is detected in surveillance video, dynamic multimedia traffic management techniques are needed to support the delay and bandwidth requirements of this type of traffic. Moreover, to cope with the highly dynamic nature of networks, whether the Internet or the IoT, it is necessary to adopt architectures and paradigms that can be reprogrammed to match current network conditions. The Software-Defined Networking (SDN) paradigm has been highlighted as a promising approach in recent network management research.
SDN is a network paradigm that separates the network's control logic (the control plane) from the traffic-forwarding routers and switches (the data plane). As a result of this separation, network switches become simple forwarding devices, and the control logic is implemented in a logically centralized controller, simplifying policy enforcement, network (re)configuration, and evolution [16]. The centralized global network view, programmability, flexible management, and separation of the data plane and control plane are the key benefits of SDN [17]. Deploying SDN for decentralized IoT network provisioning and management is critical. The OpenFlow protocol is a standard Controller-Data Plane Interface (C-DPI) that allows controllers and data plane devices to communicate.
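The division of labor just described can be sketched as a toy model: the controller installs match-action rules into a switch's flow table, and the switch then forwards packets by matching them against those rules, punting unmatched packets back to the controller. The field names, priorities, and action strings below are deliberate simplifications of the real OpenFlow protocol, chosen only to illustrate the idea.

```python
# Toy model of OpenFlow match-action forwarding. A rule is a
# (priority, match-fields, action) triple; the switch applies the
# highest-priority rule whose fields all match the packet.

class FlowTable:
    def __init__(self):
        self.rules = []  # list of (priority, match_fields, action)

    def install(self, priority, match, action):
        """Controller side: push a rule over the control channel."""
        self.rules.append((priority, match, action))
        self.rules.sort(key=lambda r: r[0], reverse=True)  # highest first

    def forward(self, packet):
        """Switch side: the first matching rule decides the action."""
        for _, match, action in self.rules:
            if all(packet.get(k) == v for k, v in match.items()):
                return action
        return "send_to_controller"  # table miss: ask the controller

table = FlowTable()
# The controller steers the surveillance camera's video flow to a
# dedicated queue ...
table.install(10, {"ip_src": "10.0.0.5", "udp_dst": 5004}, "enqueue:q1")
# ... and gives everything else best-effort treatment.
table.install(1, {}, "output:normal")

print(table.forward({"ip_src": "10.0.0.5", "udp_dst": 5004}))  # enqueue:q1
print(table.forward({"ip_src": "10.0.0.9", "udp_dst": 80}))    # output:normal
```

The "table miss" fallback is what gives the controller its global view: every flow the switches cannot handle themselves is reported upward, where a single program decides the policy for the whole network.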
Several articles [4,18,19,20] present different proposals to deploy SDN in video surveillance systems; however, performance evaluation results for their proposed platforms were either not provided or insufficient.
Table 1 reviews SDN-based surveillance systems, highlighting the roles of artificial intelligence (AI) and SDN in these systems. In [21], filtering videos at the back-end server already consumes bandwidth, as the filtering module takes in n input video streams and outputs a subset of k streams to be displayed to the monitoring person. Moreover, the authors provided neither details about the computer vision techniques used to filter videos at the back-end server nor any performance evaluation results, especially regarding delays.
Moreover, edge computing solves resource-constrained problems by moving computation close to the edge of IoT devices. The distribution of edge nodes across the network overcomes the delay and centralized-computation challenges found in the IoT. New edge technologies classify and filter the IoT big data generated by an increasing number of connected devices before transmitting it to the central cloud data center, which alleviates traffic overload and privacy concerns [23]. Article [24] highlights the challenges of edge computing and proposes SDN as a solution to them. Since the SDN paradigm depends on a centralized software-based controller, it relieves simpler edge devices from executing complex networking activities. Moreover, Ref. [24] explains the reasons behind the emergence of edge computing, namely: real-time QoS, delay sensitivity, battery lifetime, the regulation of core network traffic, and scalability. Ref. [25] also reviews fog computing and SDN solutions that address the IoT's main challenges.
QoS is typically defined as the ability of a network to provide the required services for selected network traffic. The main aim of QoS is to assign priority with respect to QoS parameters including, but not limited to, bandwidth, delay, jitter, and loss [17]. The Integrated Services (IntServ) model and the Differentiated Services (DiffServ) model are considered the conventional methods for QoS. In SDN, QoS models are implemented through queues and meters in OpenFlow switches.
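To illustrate the role of per-port queues, the sketch below models, in plain Python, how traffic directed to a high-priority queue (via an OpenFlow enqueue/set-queue action) is served ahead of best-effort traffic. A strict-priority scheduler is assumed purely for illustration; real switches may instead use weighted or rate-limited queues, and the queue names are hypothetical.

```python
# Sketch of an output port with two queues, where q1 (high priority)
# is always drained before q0 (best effort). This approximates the
# effect of steering delay-sensitive surveillance video into q1.
from collections import deque

class PriorityPort:
    def __init__(self):
        self.queues = {"q0": deque(), "q1": deque()}  # q1 = high priority

    def enqueue(self, queue_id, packet):
        self.queues[queue_id].append(packet)

    def transmit(self):
        """Serve the high-priority queue first; fall back to best effort."""
        for qid in ("q1", "q0"):
            if self.queues[qid]:
                return self.queues[qid].popleft()
        return None  # nothing to send

port = PriorityPort()
port.enqueue("q0", "web-pkt-1")
port.enqueue("q1", "video-pkt-1")  # surveillance stream after a weapon alert
port.enqueue("q0", "web-pkt-2")

# The video packet jumps ahead of the earlier best-effort packet.
print([port.transmit() for _ in range(3)])
```

OpenFlow meters complement queues by rate-limiting a flow at the ingress side; together they let the controller express both "this flow goes first" and "this flow may not exceed a given rate".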
Article [17] provided a review of the QoS capabilities of the OpenFlow protocol across its different versions. Moreover, it introduced seven categories in which QoS can benefit from the SDN concept, namely: multimedia flow routing mechanisms, inter-domain routing mechanisms, resource reservation mechanisms, queue management and scheduling mechanisms, Quality of Experience (QoE)-aware mechanisms, network monitoring mechanisms, and other QoS-centric mechanisms. Furthermore, it highlighted the benefits of using the SDN paradigm to ensure QoS, which can be summarized as follows:
The SDN controller has a global view of the whole network.
The set of flow policies and classes is unrestricted, whereas in conventional networks it is limited by the many vendor-specific firmwares in use.
Through the SDN controller, network statistics can be monitored at different levels (per flow, per port, and per device), overcoming conventional networks' limited global view, restricted QoS possibilities, and per-hop decision making.
Motivated by the increase in IoT multimedia traffic generated by surveillance systems, our objective in this research is to develop an intelligent, adaptive architecture that ensures QoS over a best-effort network. It does so by deploying AI techniques at the edge to decrease the bandwidth requirements of these systems, and by leveraging the SDN paradigm to reprogram the allocation of available bandwidth among traffic flows based on the global view of network conditions (made available at the SDN controller through its communication with forwarding devices over the OpenFlow protocol).
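The interplay between the edge detector and the controller can be summarized as a small control loop: the edge node runs the detector on each frame, and only when a weapon class appears does it ask the controller to reallocate bandwidth in favor of that camera's flow. All class names, method names, and the bandwidth figures below are hypothetical simplifications of this loop, not a real controller API.

```python
# Hypothetical end-to-end control loop: edge detection triggers an
# SDN-side bandwidth reallocation for the alerting camera's flow.

class Controller:
    def __init__(self):
        self.allocations = {}  # flow id -> bandwidth share (Mbit/s), assumed

    def on_alert(self, camera_flow, total_bw=10.0, priority_share=0.8):
        """Give the alerting camera most of the link; the rest stays best effort."""
        self.allocations[camera_flow] = total_bw * priority_share
        self.allocations["best-effort"] = total_bw * (1 - priority_share)
        return self.allocations

class EdgeNode:
    def __init__(self, controller, flow_id):
        self.controller = controller
        self.flow_id = flow_id

    def process_frame(self, detected_classes):
        """Run the detector at the edge; escalate only when a weapon appears."""
        if any(cls in ("pistol", "knife") for cls in detected_classes):
            return self.controller.on_alert(self.flow_id)
        return None  # nothing detected: no reconfiguration needed

ctrl = Controller()
edge = EdgeNode(ctrl, "camera-3")
print(edge.process_frame(["person", "car"]))     # no alert, returns None
print(edge.process_frame(["person", "pistol"]))  # camera-3 gets the larger share
```

The point of the split is that the (cheap, frequent) inference stays at the edge, while the (rare, network-wide) bandwidth decision is taken once, centrally, with full knowledge of current conditions.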
The contributions of this article are as follows:
The article investigates the deployment of different paradigms to support real-time video surveillance, one of the key applications in smart cities. It proposes applying deep learning models at the edge, as an exploration of the edge computing paradigm. The proposed deep learning model employs the most recent lightweight version of YOLO to manage real-time surveillance and weapon detection. This YOLO version was chosen after extensive experiments on several YOLO versions with different parameters.
Moreover, network support for video surveillance applications is introduced by deploying the SDN paradigm as the network core to control bandwidth allocation among different traffic flows, thereby speeding up crime prevention upon weapon detection by the AI models implemented at the edge.
In addition, an investigation is carried out through simulation to verify that integrating AI techniques with the SDN paradigm fulfills the delay and bandwidth constraints of IoT-based multimedia applications.
Finally, the article provides recommendations and directions for future work on using AI and SDN in IoT-based surveillance systems in smart cities.
The rest of the article is organized as follows: Section 2 presents a detailed explanation of our intelligent and adaptive QoS framework after giving a brief background on the OpenFlow protocol. Section 3 highlights the simulation results of our proposed framework. We summarize the findings of the article in the concluding section.