1. Introduction
Fueled by technological advancements and a growing emphasis on security awareness, video surveillance has been widely adopted across smart cities [1,2]. Nevertheless, the sheer volume of surveillance images drives up labor costs, and manual review is both time-consuming and unreliable for pedestrian identification. Integrating deep-learning-based intelligent target detection models into smart city contexts holds promise for enhancing urban security systems [3], reducing the property damage and casualties caused by criminals at a lower cost and with higher efficiency. Thanks to a comprehensive dataset that improves its adaptability to various scenarios, the model can be applied to 24-hour pedestrian target detection in smart cities, and it performs distinctively well in nighttime recognition. With this approach, security personnel are alerted only when a relevant target is detected, greatly improving their efficiency while ensuring the area’s security.
In camera-rich smart city environments, video capture devices are typically connected to the city’s intelligent system through 4G or 5G networks. This integration enables the smart city’s security monitoring management module to remotely access real-time video feeds and vigilantly monitor day-to-day security threats within the urban environment. Nonetheless, several network factors require careful consideration. As an illustration, in our smart park experiment we deployed a total of 220 cameras. Owing to constraints on hardware performance and network bandwidth, running simultaneous experiments with all 220 cameras was not feasible. Consequently, we selected 50 cameras for real-time video stream retrieval and uploaded their streams to the cloud for experimentation. Notably, this operation alone demanded a network bandwidth of up to 183 Mbps. Extrapolating from this result, retrieving video streams from all 220 cameras would require approximately 805 Mbps. Such uplink speeds entail significant expense, and a substantial portion of the video streams end up as redundant resources. These constraints have driven the emergence of edge computing-based target detection algorithms and hardware solutions [4]. In recent times, Nvidia has unveiled a suite of edge computing platforms for target detection, exemplified by the Jetson Xavier, Jetson Nano, and Jetson AGX Xavier [5,6]. However, a notable gap remains between the cost and performance of these platforms. Moreover, their capabilities are typically confined to a limited number of camera video streams, which is incompatible with target detection across expansive, large-scale camera networks. These challenges call for a novel solution. In this paper, we introduce a scheme that incorporates high-performance CPUs and GPUs into the Intranet of the smart city infrastructure. The CPU is tasked with efficiently retrieving multiple camera video streams through multi-threading, while the GPU runs the target detection model and transmits the computed results to the management system. Processing within the Intranet and uploading only the results to the extranet substantially reduces network costs while enabling concurrent detection across multiple video streams. Lastly, we optimize the YOLOv5s model, customizing it to better suit pedestrian target detection in smart cities, achieving a well-balanced trade-off among accuracy, speed, and cost.
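The bandwidth figures above follow from simple linear scaling, which can be made explicit. The helper below is our own illustrative sketch (the function name and the linear-extrapolation assumption are ours, not a measured deployment result):

```python
# Rough uplink-bandwidth estimate for streaming camera feeds to the cloud.
# The 183 Mbps figure for 50 cameras is the measurement reported above;
# scaling linearly to 220 cameras is an approximation, since per-stream
# bitrates vary with scene complexity and encoder settings.
def required_bandwidth_mbps(num_cameras, measured_mbps=183.0, measured_cameras=50):
    """Linearly extrapolate uplink bandwidth from a measured sample."""
    per_camera = measured_mbps / measured_cameras  # ~3.66 Mbps per stream
    return per_camera * num_cameras

print(round(required_bandwidth_mbps(220)))  # → 805
```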
In the domain of deep learning, two fundamental types of target detection models exist. The first performs region proposals prior to target detection; such networks achieve excellent detection accuracy, albeit at the expense of a slower computing speed, for example, Mask R-CNN [7], Fast R-CNN [8], and Faster R-CNN [9,10]. The second is the single-stage detection model, which operates as a single neural network from input to output and thus ensures swift inference without generating candidate frames, for example, YOLO [11,12,13,14,15] and SSD [16,17]. This paper employs a YOLO-based target detection algorithm to achieve real-time target detection across multiple cameras. In the context of a smart city, objects farther from the camera occupy fewer pixels; in addition, smart city scenes are information-rich and prone to heavier occlusion. Consequently, target detection in the smart city domain represents a significant and valuable research endeavor.
When delving into target detection within urban environments, some scholars have improved existing detection models. Bodla et al. [18] increased the number of prediction frames by adapting the prediction frame scoring strategy, thereby enhancing the performance on occluded targets, although its generalization ability leaves room for improvement. Xue et al. [19] introduced a real-time pedestrian detection algorithm, the multimodal attention fusion YOLO, which adapts to nighttime pedestrian detection using the Darknet53 framework, establishes a loss function, and generates anchor frames through the K-means algorithm; their results underscore the method’s effectiveness, achieving a notable enhancement in pedestrian detection. Pustokhina et al. [20] proposed a strategy that merges Faster R-CNN with a hybrid Gaussian model in intricate backgrounds, aiming to eliminate the impact of video backgrounds on images, raise the image resolution, and ultimately enhance the effectiveness of pedestrian target detection; nonetheless, this approach suffers from a large model size and extended training times. Hsu et al. [21] pioneered a ratio-aware mechanism to fine-tune the aspect ratio of images, effectively addressing false pedestrian detections through segmentation of the initial image and notably enhancing detection accuracy; however, occluded and overlapping pedestrians still lead to both missed detections and erroneous identifications.
In summary, substantial progress has been achieved in pedestrian target detection within urban environments. However, most research efforts have focused on enhancing model performance by scaling up the model’s size [22], which introduces challenges such as a sluggish detection speed, a large model scale, and elevated hardware costs, diminishing practicality. In this paper, we address these challenges by building on the YOLOv5 model and aligning it with our specific requirements, culminating in the YOLOv5-MS target detection model. By optimizing the model’s Backbone structure, we reduce its size and thereby boost the inference speed. The integration of the SE module within the network enhances the detection accuracy. Furthermore, we employ the K-means technique to generate prior frames of varied sizes and amplify the performance using the Retinex image enhancement algorithm. Finally, adopting the Focal-EIOU loss strengthens the fit between this experimental setting and the YOLOv5 algorithm, leading to a more seamless integration.
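The K-means prior-frame generation mentioned above can be sketched as follows. This minimal version clusters (width, height) pairs with plain Euclidean k-means; YOLO implementations commonly use a 1 − IoU distance instead, so treat this as an illustrative simplification (function and variable names are ours):

```python
import random

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor sizes with plain k-means.
    boxes: list of (w, h) tuples from the training labels."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            # Assign each box to its nearest center (squared Euclidean distance).
            i = min(range(k), key=lambda j: (w - centers[j][0]) ** 2 + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        new_centers = []
        for i, c in enumerate(clusters):
            if c:  # recompute each center as the mean of its cluster
                new_centers.append((sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)))
            else:  # keep an empty cluster's center unchanged
                new_centers.append(centers[i])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return sorted(centers)

# Two well-separated box populations yield two anchors near their means.
anchors = kmeans_anchors([(10, 20), (12, 22), (11, 21), (100, 200), (102, 198), (101, 199)], k=2)
```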
2. Models
2.1. Optimized Video Stream Acquisition Method
In the YOLOv5 model, images are acquired by fetching multi-camera video streams simultaneously through multiple threads, and the model then performs inference by detecting each frame captured by the cameras separately. This approach gives rise to several challenges. Firstly, when hardware capabilities fall short of real-time detection requirements, image detection may be delayed; as the backlog of images accumulates, target detection may end up referring to images from hours ago, failing to uphold real-time standards. Secondly, excessive image accumulation can strain memory resources, undermining the model’s stability. Lastly, if a specific video stream experiences a prolonged delay, it may be automatically disconnected from the model, disrupting detection for that camera. In light of these scenarios, when detecting images from numerous cameras, this paper enhances camera video image acquisition. A multi-threaded strategy is employed in which each thread acquires one camera’s video stream and continuously extracts the latest frames from its respective camera. This separation ensures that a steady stream of the latest images is available during model image detection: after completing one round of image detection, the model can immediately detect the most recent real-time images, fulfilling the real-time target detection requirements of large-scale camera deployments. A visual representation of the video stream reception method is depicted in Figure 1.
By employing this approach, the aforementioned challenges can be effectively mitigated, enabling real-time processing of camera video streams even in large-scale detection scenarios. The detection performance of this method depends on the capacity of the computer hardware. If the hardware can detect one frame from each of 50 cameras within 0.5 s, then two real-time images per camera can be detected each second; the distinction between detecting thirty frames or only two frames per second has a negligible effect on the results. As a result, in this experiment, the proposed method is used to capture a wide array of camera streams, allowing the model to efficiently process up to 50 or even 100 camera streams simultaneously.
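The latest-frame acquisition strategy above can be sketched as follows. This is a minimal illustration using a simulated frame source in place of a real camera stream (in practice each thread would wrap something like OpenCV’s `cv2.VideoCapture(...).read()` on an RTSP URL); the class and function names are our own:

```python
import threading
import time
import itertools

class LatestFrameReader:
    """One reader thread per camera: continuously overwrite a single slot
    with the newest frame so the detector never processes a backlog."""
    def __init__(self, grab):
        self._grab = grab                  # callable returning the next frame
        self._frame = None
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._t = threading.Thread(target=self._run, daemon=True)
        self._t.start()

    def _run(self):
        while not self._stop.is_set():
            frame = self._grab()           # blocks until the next frame arrives
            with self._lock:
                self._frame = frame        # old, unread frames are discarded

    def latest(self):
        """Return the most recent frame (None if nothing received yet)."""
        with self._lock:
            return self._frame

    def stop(self):
        self._stop.set()
        self._t.join()

# Simulated camera producing numbered frames at roughly 100 fps.
_counter = itertools.count()
def fake_camera():
    time.sleep(0.01)
    return next(_counter)

reader = LatestFrameReader(fake_camera)
time.sleep(0.3)                 # the detector is busy with the previous batch
frame = reader.latest()         # returns a recent frame, not frame 0
reader.stop()
```

One `LatestFrameReader` per camera gives the detector a constant-size working set regardless of how far inference lags behind the streams.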
2.2. YOLO Basic Principle
The primary aim of this research is to achieve pedestrian target detection within smart cities to enhance urban security. With the model deployed at the edge, real-time responsiveness is of paramount importance. Furthermore, considering potentially long operational durations, both the model’s mean average precision and its stability are vital considerations. The YOLO series effectively satisfies these prerequisites. The core idea of YOLO is to treat object detection as a regression problem, using a single neural network to predict detection frames and target class probabilities directly from input images. YOLOv1, the initial version of the series, first resizes the image to a 448 × 448 resolution and divides it into 7 × 7 cells, predicting box confidences and category scores based on the position and content of each cell.
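As an illustration of the grid assignment just described, the following sketch maps an object’s center to the YOLOv1 cell responsible for predicting it; the helper function is our own, not part of any YOLO release:

```python
def responsible_cell(cx, cy, img_size=448, grid=7):
    """Map an object's center (in pixels) to the YOLOv1 grid cell responsible
    for predicting it, plus the normalized center offset within that cell."""
    cell = img_size / grid                 # 448 / 7 = 64 px per cell
    col, row = int(cx // cell), int(cy // cell)
    x_off = cx / cell - col                # offset in [0, 1) within the cell
    y_off = cy / cell - row
    return (row, col), (x_off, y_off)

# An object centered at (224, 100) falls in row 1, column 3.
cell, offsets = responsible_cell(224, 100)
```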
YOLOv5 is a contemporary, GPU-optimized target detection model. Within the current YOLOv5 framework, distinct variants exist, namely YOLOv5s, YOLOv5n, YOLOv5l [23], YOLOv5m, and YOLOv5x [24]. Among this array of network models of varying size, YOLOv5s stands out due to its purposeful network simplification, tailored for edge deployment to facilitate real-time target detection. Furthermore, YOLOv5s boasts notable advantages, including heightened detection stability, accuracy, and simplified deployment; its performance surpasses those of YOLOv5n, YOLOv7, and YOLOv8 in specific scenarios, rendering it an optimal selection for comprehensive inspection with large-scale camera systems. YOLOv5s comprises three primary components: the Backbone, the Neck, and the prediction head, whose effective interplay contributes to its commendable performance. YOLOv5 introduces several enhancements: Mosaic data augmentation at the input stage along with adaptive anchor frame computation [25]; the focus and CSP_X structures in the Backbone and the CSP2 structure in the Neck, which work cohesively to amplify the fusion of network features [26]; spatial pyramid pooling to fuse different receptive fields; the CIOU loss as the bounding box loss function [27]; and NMS non-maximal suppression to improve the handling of overlapping targets [28]. The structure of YOLOv5 is shown in Figure 2.
2.3. Backbone Structure
In complex settings, the diverse shooting angles of cameras capture pedestrians of varying sizes. Directly using the Backbone of the YOLOv5 algorithm for feature extraction is slow and increases the likelihood of missed or misidentified instances. Consequently, to enhance the recognition and detection of pedestrians spanning various target sizes, the convolutional layers of the Backbone are refined. Leveraging the reparameterization concept from RepVGG [29], the original convolution within the Backbone is replaced with the RepvggBlock, which effectively reduces the channel count within the convolutional layer. RepVGG builds upon the conventional VGG architecture while incorporating a residual structure [30]. It adopts different structures for the training and inference phases, reflecting the distinct principles governing each. The fundamental idea of RepVGG is to augment model performance by integrating a multi-branch structure into the training network and then folding it away through structural reparameterization.
The RepvggBlock comprises Conv3 × 3 + BN (Batch Normalization), Conv1 × 1 + BN, and identity branches. These branches are assigned weights prior to the activation function in subsequent network iterations. The RepvggBlock configuration is illustrated in Figure 3.
The core idea of RepvggBlock is as follows:
(1) The convolutional layer and BN layer are fused. The parametric calculation of the convolutional layer is shown in Formula (1), where W and B are the weights and biases, respectively:
Conv(x) = W·x + B.(1)
The parameter operation of the BN layer is shown in Formula (2), where U and ϑ are the mean and variance, respectively, Y is a learnable scale parameter, β is the bias, and ε is a small constant for numerical stability:
BN(x) = Y·(x − U)/√(ϑ + ε) + β.(2)
By substituting Formula (1) into Formula (2), the convolutional and BN layers can be combined into a single convolutional layer with a bias, as shown in Formula (3):
BN(Conv(x)) = (Y·W/√(ϑ + ε))·x + Y·(B − U)/√(ϑ + ε) + β.(3)
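The conv–BN fusion of step (1) can be sketched numerically. The following minimal NumPy illustration folds the BN statistics into the preceding convolution’s weights and bias; the function name and tensor shapes are our own illustrative choices, not the paper’s implementation:

```python
import numpy as np

def fuse_conv_bn(W, B, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.
    W: (C_out, C_in, k, k) conv weights; B: (C_out,) conv bias;
    gamma/beta/mean/var: (C_out,) BN parameters and running statistics."""
    scale = gamma / np.sqrt(var + eps)           # per-output-channel factor
    W_fused = W * scale[:, None, None, None]     # scale each output filter
    B_fused = (B - mean) * scale + beta          # fold statistics into the bias
    return W_fused, B_fused
```

Applying the fused convolution is then mathematically equivalent (up to ε) to applying the original convolution followed by BN, which is what makes the inference-time collapse possible.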
(2) The convolutional kernels are transformed to a 3 × 3 scale. Following the first step, two branches remain to be converted: the 1 × 1 convolutional layer and the identity branch. For the 1 × 1 convolution, the kernel is expanded from a 1 × 1 to a 3 × 3 convolutional kernel by setting the weights of the padded positions to 0. For the identity branch, an equivalent 3 × 3 convolution kernel is constructed with the center weight set to 1 and the remaining weights set to 0. This substitution is depicted in Figure 4. Ultimately, the convolution structure of every branch in the model is transformed into a 3 × 3 convolution, and the corresponding weights and biases are aggregated, amalgamating the branches into a unified 3 × 3 convolution.
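The branch conversion and merge in step (2) can be sketched as follows. This is an illustrative NumPy rendition (function names and shapes are ours, not the paper’s code), assuming the identity branch requires equal input and output channel counts:

```python
import numpy as np

def pad_1x1_to_3x3(W1):
    """Embed a 1x1 kernel (C_out, C_in, 1, 1) at the center of a 3x3 kernel,
    with the padded positions set to 0."""
    W3 = np.zeros((W1.shape[0], W1.shape[1], 3, 3), dtype=W1.dtype)
    W3[:, :, 1, 1] = W1[:, :, 0, 0]
    return W3

def identity_to_3x3(channels, dtype=np.float64):
    """A 3x3 kernel reproducing the identity branch (requires C_in == C_out):
    1 at the center of each channel's own filter, 0 elsewhere."""
    W = np.zeros((channels, channels, 3, 3), dtype=dtype)
    for c in range(channels):
        W[c, c, 1, 1] = 1.0
    return W

def merge_branches(W3, b3, W1, b1, channels):
    """Sum the 3x3, padded 1x1, and identity branches into one 3x3 conv."""
    W = W3 + pad_1x1_to_3x3(W1) + identity_to_3x3(channels, W3.dtype)
    b = b3 + b1                  # the identity branch contributes no bias
    return W, b
```

Because convolution is linear, summing the three kernels is equivalent to summing the three branch outputs, so the merged 3 × 3 convolution reproduces the training-time multi-branch block exactly.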
Following the principles of network structure reparameterization, the RepvggBlock is skillfully crafted to leverage multiple branches, enhancing the model’s detection efficacy during the training phase. Subsequently, these branches are consolidated into a unified structure for inference, thereby boosting the model’s inference speed. This characteristic aligns seamlessly with the detection requisites of multiple cameras within a smart city context.
2.4. Squeeze-and-Excitation Model
In this study, upon integrating the RepvggBlock module into the Backbone, a notable observation emerges: although the detection speed improves, the detection accuracy simultaneously declines. This outcome does not align with the goal of enhancing security within smart city environments. To further improve the model’s accuracy while satisfying its speed requirements, this study introduces an SE module into the Backbone structure to optimize the overall performance [31]. The SE (Squeeze-and-Excitation) module is a crucial component in deep learning networks. It boosts the power of Convolutional Neural Networks (CNNs) by dynamically recalibrating the feature maps during training: by learning how to scale each channel, it helps the network focus on valuable information, improving performance in tasks such as image classification and object detection. Through the incorporation of the SE module, the model achieves a synergistic combination that augments detection performance while retaining a compact model size, an attribute well aligned with the demands of large-scale camera detection in this study.
Figure 5 illustrates the incorporation of the SE module, specifically a channel attention module. The SE module obtains more critical feature information by learning a weight vector that, from the perspective of the channel domain, assigns different weights to different channels. It is added to the original residual block via extra pathways. The module first employs a global pooling layer to compute an initial descriptor for each channel; these descriptors are then refined through two fully connected layers and a sigmoid activation function to produce per-channel weights. In the final step, the original channels are multiplied by their respective channel weights. During network training, the module’s parameters are updated via gradient descent, enhancing the model’s detection performance. The squeezing operation is elucidated in Formula (4).
z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),(4)
where F_sq is the squeeze operation, u_c indicates the c-th channel of the input U ∈ R^(H × W × C), and c indexes the channels. The excitation operation is shown in Formula (5).
s = F_ex(z, W) = σ(W_2 δ(W_1 z)),(5)
where F_ex presents the excitation operation, δ is the ReLU activation, σ is the sigmoid function, W_1 ∈ R^((C/r) × C) and W_2 ∈ R^(C × (C/r)) are the weights of the two fully connected layers, and r indicates the hyperparameter, which is 16, indicating the dimension reduction coefficient of the first fully connected layer. The scale operation is shown in Formula (6).
x̃_c = F_scale(u_c, s_c) = s_c · u_c,(6)
where F_scale presents the scale operation.
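The squeeze, excitation, and scale steps referenced by Formulas (4)–(6) can be sketched end to end. The following minimal NumPy version assumes an (H, W, C) layout and omits the bias terms of the fully connected layers; the function names and shapes are illustrative, not the paper’s implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(U, W1, W2):
    """Squeeze-and-Excitation applied to a feature map U of shape (H, W, C).
    W1: (C, C//r) and W2: (C//r, C) are the two FC layers (r = reduction)."""
    z = U.mean(axis=(0, 1))                   # squeeze: global average pool -> (C,)
    s = sigmoid(np.maximum(z @ W1, 0) @ W2)   # excitation: FC -> ReLU -> FC -> sigmoid
    return U * s                              # scale: reweight each channel
```

With r = 16, as in this study, a 256-channel map is squeezed through a 16-unit bottleneck, so the attention pathway adds only a few thousand parameters per block.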
2.5. YOLOv5-MS Structure
The network model structure of YOLOv5-MS is presented in Figure 6. To enhance the model’s detection mAP and speed, YOLOv5s is subjected to the following improvements: (1) incorporating an SE module into the model to enhance its feature extraction capabilities; (2) substituting the original Conv convolution in the model with the RepvggBlock to decrease the model’s complexity and increase its detection speed; (3) introducing a focus structure to preserve essential information by reducing the dimensionality, thereby mitigating the risk of model overfitting to some extent.
5. Conclusions
This paper introduces an advanced pedestrian target detection model, YOLOv5-MS, built upon the YOLOv5 architecture and specifically designed for robust pedestrian target detection within a smart city context. Initially, YOLOv5 serves as the foundational framework, ensuring effective target detection performance. Subsequently, the acquisition of multiple video streams within the model is optimized. To further enhance efficiency, the RepvggBlock is introduced into the Backbone segment, replacing the original convolutional layer and thereby expediting the model’s inference. To enrich the model’s capabilities, an SE attention module is integrated into the network, enabling the extraction of more pertinent information from images and, consequently, augmenting the overall performance. This integrated approach advances pedestrian target detection within the context of smart cities. The model’s gains are also attributed to the integration of the K-means algorithm and the Retinex image enhancement technique, and the adoption of the Focal-EIOU loss in place of the previous CIOU loss contributes further. These enhancements are methodically validated through ablation experiments. The experimental outcomes demonstrate a notable 2.0% improvement in the model’s mAP, accompanied by a substantial 21.3% improvement in inference speed compared to the original model. Furthermore, the model outperforms other state-of-the-art models. These findings support improved pedestrian target detection with large-scale cameras in smart city environments.
Nevertheless, there is untapped potential for further improvement. In forthcoming research, we intend to explore optimizing the model’s structure while upholding the detection accuracy, enabling simultaneous detection across a larger number of cameras and rendering the model more suitable for pedestrian detection within complex urban environments. By fine-tuning the model’s architecture, we aspire to bolster its applicability and performance in the dynamic and intricate landscapes of smart cities. At the same time, we will focus on multi-target detection, a research direction that will allow us to address the diverse challenges within smart cities with a combination of high accuracy and cost-effectiveness.