1. Introduction
Thanks to advances in Deep Learning, computer vision applications have reached heights that were unthinkable a decade ago. Deep Learning allows computational models composed of multiple layers to learn data representations with multiple levels of abstraction [
1]. Over the years, the computational, energetic and infrastructure costs of Deep Learning architectures have grown to become unsustainable [
2] and inaccessible to many small or medium companies that cannot afford the cost of computation (e.g., “DeepMind trained its system to play Go[;] it was estimated to have cost $35 million”). In addition, in some real environments, usually in computer vision, inference tasks are designed to perform real-time processing [
3] and usually operate 24 h a day [
4], which makes them energetically expensive. Especially since many of these models run on a Graphics Processor Unit (GPU), one of the most power-consuming components of a computer [
5]. Add to this the increase in energy prices and constant fluctuations in these prices, and these systems can account for a large portion of the company’s total energy expenditure. This energy consumption has an impact not only on economics but also on the environment. Electricity consumption and its impact on climate change are one of humanity’s most significant challenges in the 21st century.
At the same time, the massive expansion of the Internet and its data centers has forced servers to perform increasingly expensive computational tasks while handling a growing number of requests [
6]. In addition, many companies have started offering online services to perform tasks that used to be carried out on local machines. This growth in cloud computing and Internet use has increased the demand for network performance. Standard NIC (Network Interface Card) devices assist the CPU in performing a variety of network functions, but in some scenarios the tasks performed on the servers are too heavy, and even the combination of servers with standard NICs is not enough to meet the speed demands of the network. To address these problems, the NIC was extended with the ability to handle more complex tasks. These devices are called Data Processing Units (DPUs) [
7].
In this study, we use a basic detection model running on the GPU, assisted by a DPU, to reduce the working time and energy expenditure of a dedicated armed-intrusion detection server designed to operate 24/7 on a CCTV stream.
Our key contributions are the following:
We explore the potential benefits of using a DPU to alleviate the workload on a computer vision inference task.
We present a study focused on measuring the reduced workload on a server dedicated to weapon detection on a real CCTV video stream recording.
The remainder of this paper is structured as follows,
Section 2 presents the background to DPUs and discusses scenarios in which they have been used in the past.
Section 3 explains the experimental setup, including the dataset used.
Section 4 describes the approach followed,
Section 5 presents the experimentation, and finally, the findings and conclusions are drawn in
Section 6.
2. Related Work
The most widely used definition of a DPU was coined by Alan Freedman (
https://www.computerlanguage.com/results.php?definition=Smart+NIC, accessed on 28 September 2022) who described it as “A network interface card that offloads processing tasks that the system CPU would normally handle. Using its onboard processor, the DPU can perform any combination of encryption/decryption, firewall, and network processing, making it ideal for high-traffic web servers.”
We can find some examples of DPUs as workload relievers in the state-of-the-art. Liu et al. [
7] conducted a study focused on characterizing the network and computational aspects of a 100 Gbps Ethernet environment. The benchmark involved a series of tests comparing the performance of this DPU with that of a server. Although the DPU failed to match server performance for specific tasks (intrusion detection, network address translation, virtual private networks), it performed much better in encryption/decryption, memory operations involving heavy contention, and interprocess communication tasks. In [
8], it is demonstrated that the DPU is capable of offloading IPsec encryption to its hardware accelerators, which offer significant performance improvement compared to encrypting Ethernet traffic on workstation CPUs. This method is promising for improving network performance and adding confidentiality, integrity, and authentication to Ethernet traffic. The work of Barsellotti et al. [
9] presents three different use cases for edge scenarios that exploit the recently introduced innovative programmability enabled by DPUs. Five ML models are, in turn, evaluated in the context of network intrusion detection, concluding that the use of DPUs enables the native integration of artificial intelligence-based operations directly into the network fabric.
In Deep Learning, DPUs have been used in different phases of model training, such as data augmentation or model validation. With this approach, Jain et al. [
10] were able to increase training performance by up to 15%. However, they did not find an offloading scheme that could optimally accelerate the training of all models in their study. In an extension of this work [
11], the same authors show a consistent improvement for Convolutional Neural Networks (CNNs) and Transformer models with weak and strong scaling on multiple nodes.
Many proposals can be found concerning computer vision detection, but they generally focus on detecting medium or large objects such as vehicles, cyclists, pedestrians or animals [
12,
13,
14,
15]. However, the detection of weapons by computer vision adds some difficulties, among other things, due to the small size of the objects to be detected [
16,
17]. In [
18], Salazar et al. propose a challenging dataset that includes these difficulties, together with a Deep Learning weapon detection model based on a Faster R-CNN (Region-based Convolutional Neural Network) with FPN (Feature Pyramid Network), which improved the state-of-the-art using two-phase training.
As in [
10,
11], our proposal makes use of the DPU in a Deep Learning environment; but unlike those works, where the DPU is used in the training phase of the model, in our case the card performs filtering tasks that help an already trained model reduce its inference workload. We were motivated to try a DPU as a filter for the video stream because of its ability to alleviate the load on the system. However, because the tasks and objectives of our experiment differ from those described in the previously mentioned DPU articles, a direct comparison is not feasible. Our goal in using a DPU as a filter is to reduce the working time of a server dedicated to video stream processing.
In previous works [
19,
20], frameworks have been proposed that use frame filtering-based methods to skip similar consecutive frames in order to reduce the computational cost of detection-driven applications using reinforcement learning. However, these solutions cannot be applied to our study on detecting weapons in public spaces because the appearance of weapons in such contexts is anomalous, making it impossible to perform reinforcement training. Additionally, using skip-length techniques in this context could potentially decrease detection accuracy. As far as we know, our work is the only one that uses a DPU device to reduce server workload and decrease power consumption in a CCTV weapon detection system.
4. Using DPU as Server-Intensive Task Offloader
4.1. Motivation: Reduce Server Workload on Computer Vision Tasks
Since the server takes 47 milliseconds to process one frame and the video stream arrives at 16 frames per second, the server is busy for 752 milliseconds of every second. In other words, the GPU works about 75.2% of the time in a 24/7 system, which entails significant power consumption.
Given the results of [
7,
10,
18], we decided to explore the advantages of using a DPU in this computer vision scenario. We aim to install a DPU on the server to reduce the server’s working time by using the DPU as a traffic light that alerts the server when needed. To do this, we need a mechanism to discriminate the video frames and classify them based on whether they need to be processed. Although the range of such mechanisms is not very wide, we have considered two options that could be effective. The first is to use a human detection model, as weapons need to be carried by people. Therefore, when the DPU’s human detection model classifies a frame as positive, it will alert the server to apply weapon detection. Another option is to install a motion detector in the DPU that alerts the server when it detects changes in the image. The latter mechanism is usually faster, but it also has a slightly lower hit rate as it may be triggered by false positives, such as the movement of leaves or animals entering the camera’s range of vision.
4.2. Our Approach
To test these mechanisms on the DPU side, we decided to use OpenCV, a widely used library whose effectiveness has been demonstrated previously in the literature [
24,
25]. This library has functions based on frame subtraction that we have used to perform motion detection. OpenCV also has some detection algorithms that we have used in this work to test human detection performance on the DPU, such as the Histogram of Oriented Gradients (HOG) detector or the Haar cascade classifier [
26,
27], both of which rely on hand-crafted image features. We also tested other detection algorithms, such as YOLO, a popular Deep Learning object detector known for its excellent speed and accuracy [
28,
29,
30]. Nonetheless, YOLO is designed to run on a GPU, and its inference time on a CPU (or a DPU, as in our case) increases significantly.
Figure 2 shows a comparison of the time required for each algorithm to analyse a single frame.
As we can observe from the comparison, motion detection takes significantly less time to perform than methods based on human detection, which is ideal for a device such as a DPU that is not powerful enough to support computationally expensive methods. Among the human detection algorithms, the Haar algorithm is the fastest, taking 70 milliseconds to process a single frame. The Haar cascade classifier algorithm [
31] uses a cascade of classifiers applied over sliding windows, computing features in each window to decide whether it contains a human. However, even with this algorithm, we would not be able to process the 16 FPS of the 1280 × 720 resolution video in real time (16 × 70 ms exceeds one second). Therefore, we decided to implement the motion detection algorithm on the DPU side, which takes only ten milliseconds per frame.
The motion detection algorithm applied is based on the classical Inter-frame difference [
32] that uses the absolute value of the difference obtained from the subtraction of two consecutive frames (Equation (
1)) to determine the movement using an established threshold (Equation (
2)). The algorithm detects a movement if the difference is greater than the threshold.
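The equation bodies are not reproduced in this version of the text; a plausible reconstruction from the description, using symbol names of our choosing (F_t and F_{t−1} for consecutive frames, D for the difference, M for the motion flag, T for the threshold), is:

```latex
D_t(u, v) = \lvert F_t(u, v) - F_{t-1}(u, v) \rvert \qquad \text{(1)}

M_t(u, v) =
\begin{cases}
1 & \text{if } D_t(u, v) > T \\
0 & \text{otherwise}
\end{cases} \qquad \text{(2)}
```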
The function calculates the absolute value of the difference between two consecutive frames (F_t and F_{t−1}), where u and v represent the coordinates of a pixel in the image. This difference is used to determine motion (M) based on the threshold T (empirically established for each camera using video intervals where no motion occurs).
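A minimal sketch of this inter-frame difference check, written here with NumPy (OpenCV’s cv2.absdiff and cv2.threshold provide the same operations). The frame sizes, the threshold value, and the aggregation of per-pixel decisions into a minimum changed-pixel count are illustrative assumptions, not the paper’s exact parameters:

```python
import numpy as np

def detect_motion(prev_frame: np.ndarray, frame: np.ndarray,
                  threshold: int = 25, min_changed_pixels: int = 50) -> bool:
    """Inter-frame difference: flag motion when enough pixels change
    by more than `threshold` grey levels between consecutive frames."""
    # |F_t - F_{t-1}|, computed in int16 to avoid uint8 wrap-around
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    # Count pixels whose difference exceeds the per-camera threshold T
    changed = int(np.count_nonzero(diff > threshold))
    return changed >= min_changed_pixels

# Example: a static scene vs. a scene with a bright object appearing
prev = np.zeros((720, 1280), dtype=np.uint8)
still = prev.copy()
moving = prev.copy()
moving[100:200, 100:200] = 200  # simulated moving object

print(detect_motion(prev, still))   # False: no pixel changed
print(detect_motion(prev, moving))  # True: 10,000 pixels changed
```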
The next step was to find a way to provide the necessary information to the server where fine detection is performed. When a server uses a DPU as a network card, two modes of operation are possible: either the information arrives at the server after passing through the DPU, or the information arrives simultaneously at both the server and the DPU. The first mode allows us to read the video stream on the DPU and, when fine processing is needed, bring the entire frame to the server using gRPC (Google Remote Procedure Calls). However, exchanging frames between the card and the server is a slow and cumbersome task, so we decided to use the second mode and have both the server and the DPU read the video stream. The following figure shows a graph of the information flow in our approach. By default, only the DPU analyses each frame. When the DPU detects motion in a frame, it passes the frame identifier to the CPU, and upon receipt, the GPU applies weapon detection to that frame.
In our final implementation, both the server and the DPU read the video stream, where each frame has an associated time identifier. However, instead of the server constantly applying the weapon detection model to each frame, the DPU does the hard work. As the frames arrive, the DPU applies the motion detection model. If no motion is detected, fine processing of the frame is unnecessary, and the server will not apply the weapon detector to it. However, if any motion is detected, the DPU notifies the server, passing the frame identifier via gRPC. Since the server is also reading the video stream, sending the frame itself from the DPU to the server is unnecessary; only the frame identifier is needed. From a weapon detection perspective, type B and C frames are discarded, as no weapon has been detected in them; in fact, the weapon detection model is never even applied to type C frames.
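This flow can be sketched end to end. For brevity, the gRPC notification is replaced here by a plain Python callback and frame identifiers are simple sequence numbers; both are illustrative simplifications of the actual deployment, and the motion test reuses the inter-frame difference described above:

```python
import numpy as np

def dpu_filter(stream, notify_server, threshold=25, min_changed=50):
    """DPU side: run motion detection on every frame and pass only the
    identifiers of frames with motion to the server."""
    prev = None
    for frame_id, frame in stream:
        if prev is not None:
            diff = np.abs(frame.astype(np.int16) - prev.astype(np.int16))
            if np.count_nonzero(diff > threshold) >= min_changed:
                notify_server(frame_id)  # gRPC call in the real system
        prev = frame

# Server side: apply the (expensive) weapon detector only on demand.
processed = []
def server_on_notify(frame_id):
    processed.append(frame_id)  # stand-in for GPU weapon detection

# Simulated stream: an object appears in frame 3 and is gone by frame 4
frames = [np.zeros((720, 1280), dtype=np.uint8) for _ in range(6)]
frames[3][:100, :100] = 200
dpu_filter(enumerate(frames), server_on_notify)
print(processed)  # → [3, 4]: the object appearing and disappearing
```

Frames 0–2 and 5 never reach the weapon detector, which is exactly the workload reduction the approach is after.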
As shown in
Figure 3, the action frames are from an indoor camera and belong to a public dataset [
18] containing a set of frames tagged during a mock attack. For privacy reasons, no public streams of sufficient quality to identify people are available, so the tests were performed using the public video stream from the University of Michigan. Although detecting weapons at such long distances is unlikely, this public stream allows us to test the solution’s efficiency using a DPU, and the system could be seamlessly transferred to private CCTV streams. Measurements were taken every 10 s using the “powertop” tool (
https://github.com/fenrus75/powertop, accessed on 28 September 2022) and “nvidia-smi” command (
https://developer.nvidia.com/nvidia-system-management-interface, accessed on 28 September 2022), respectively, to monitor the energy consumption of both the DPU and the GPU.
5. Results
As discussed in previous sections, our model uses a GPU to perform the processing. During the execution of the model, this card runs at full power, reaching its maximum power consumption level. However, since the server does not have a graphics interface, the power consumption of the graphics card is minimal when it is not used for video processing. It is crucial to consider that the workload reduction and energy savings will depend on the flow of people, which determines the proportion of activity periods. In a scenario with no flow of people, the most significant savings will be achieved, as the GPU’s energy cost will be reduced to a minimum. On the other hand, in a scenario with a constant flow of people (i.e., activity in 100% of the frames), not only will there be no savings, but the energy cost of the DPU will also need to be added to that of the GPU. To determine the point at which this methodology stops being energy effective, we calculate the percentage of time with activity at which the combined GPU and DPU consumption exceeds that of a traditional GPU-only system.
As shown in Equation (
3), the consumption produced during inactivity periods, along with consumption during periods of activity in which the DPU and GPU are working, should be lower than the consumption of a traditional detection system without filtering.
In the following equation, the percentage calculation is shown in detail:
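The equation body is missing from this version of the text; reconstructed from the surrounding description (symbol names are ours), the break-even condition can be written as:

```latex
t_{\mathrm{inactive}} \cdot P_{\mathrm{DPU}}
+ t_{\mathrm{active}} \cdot \left( P_{\mathrm{DPU}} + P_{\mathrm{GPU}} \right)
< P_{\mathrm{GPU}}
```

With t_inactive + t_active = 1, this simplifies to t_active < 1 − P_DPU / P_GPU, which yields the 97.67% activity threshold reported below.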
where t_inactive is the percentage of time in which there is no motion, and t_active is the remaining period, in which the DPU does detect motion. Moreover, P_DPU and P_GPU are the power consumption of the two devices. Therefore, considering that the GPU’s average consumption is 274.18 kWh and the DPU’s is 6.4 kWh, we obtain that if the period of activity is above 97.67% of the time, our system is not energetically efficient. For example, over 24 h, there would need to be motion for more than 23.5 h for the combined DPU and GPU consumption to exceed that of a traditional GPU-only system.
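With the reported average consumptions, the break-even activity level can be checked in a few lines (a sketch using the figures stated above; the paper rounds the daily value to 23.5 h):

```python
P_GPU = 274.18  # average GPU consumption reported above
P_DPU = 6.4     # average DPU consumption reported above

# The filtered system consumes P_DPU all the time plus P_GPU during
# the active fraction a; it stops paying off when
#   P_DPU + a * P_GPU > P_GPU   =>   a > 1 - P_DPU / P_GPU
break_even = 1 - P_DPU / P_GPU

print(f"{break_even:.2%}")         # 97.67%
print(f"{break_even * 24:.1f} h")  # ≈ 23.4 h of motion per day
```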
Based on the flow of people, we have established two thresholds and the following ranges of low, medium, and high savings as follows:
High-Activity: Busy scenario with a high flow of people. An interval of time is classified as High-Activity when there are people in the camera’s viewing range for more than 75% of the time. In this situation, the load on the server side is almost as high as without pre-filtering.
Medium-Activity: Not very busy scenario, with an intermittent flow of people. We consider an interval of time to belong to this range when people are within the camera’s field of view between 10% and 75% of the time.
Low-Activity: Clear or no-people-flow scenario. When there are people in the area of view of the camera for less than 10% of the total time.
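The three ranges above can be captured in a small helper (an illustrative function, not part of the paper’s code; the 10% and 75% thresholds are the ones just defined):

```python
def activity_level(people_time_fraction: float) -> str:
    """Classify a time interval by the fraction of time people are in view."""
    if not 0.0 <= people_time_fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if people_time_fraction > 0.75:
        return "high"
    if people_time_fraction >= 0.10:
        return "medium"
    return "low"

print(activity_level(0.80))  # high   (e.g., school start time)
print(activity_level(0.50))  # medium (intermittent flow of people)
print(activity_level(0.05))  # low    (e.g., night hours)
```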
The 24-h recording provided by the clock tower camera gives us intermediate activity for much of the day, but it also has intervals of time when the scenario has high-activity (e.g., school start time) and intervals of low-activity where the flow of people is virtually non-existent, such as in the night hours. However, security systems should remain active throughout the day in a public building.
Figure 4 shows the percentage of GPU activity recorded throughout the day on the clock tower camera. Half-hour intervals have been considered.
According to
Figure 5, there is a direct relationship between server workload and energy consumption. When the server does all the work, we obtain an average power consumption of 274.18 kWh, as frames are processed non-stop throughout the entire process. However, when we use pre-filtering, consumption drops drastically, because many frames do not need to be processed on the server. In the low-activity interval (from 9:00 p.m. to 7:00 a.m.), only the DPU performs video processing, reducing the energy consumption of the entire system. There is also a small high-activity peak between 15:45 and 16:15 due to the high flow of people. Note that during busy periods power consumption increases slightly compared to the server working alone: in this situation both the DPU and the GPU are working, so the DPU’s consumption has to be added to the server’s. However, given the DPU’s low power consumption, this increase is small.
Figure 6 shows the accumulation of the values in
Figure 5. As expected, the constant consumption of the server working alone results in an almost uniform accumulated curve: at each instant the consumption is practically the same, since the server analyses every video frame. Using the DPU to preprocess the video, however, reduces the number of frames that require fine processing.
In the recording used for the experimentation, a value of 39,353.58 kWh was obtained in the case of the server working alone and 16,970.57 kWh in the case of the server assisted by the DPU; that is, over the 24 h recording, the assisted system consumed only 43.12% of the energy of the server alone. This difference in power consumption will depend on the scenario since, as mentioned above, the increase or decrease in the flow of people has a direct relationship with the server’s workload and, consequently, with the power consumption.
As shown in
Figure 7, the most significant percentage of savings occurred during periods of low pedestrian traffic when detection tasks did not heavily tax the GPU.
In the low-activity intervals (which in our case make up 41.67% of the hours of the day), the savings achieved are enormous, reaching 98.07%. However, in the high-activity range, no significant savings have been achieved, as the number of frames to be analysed is very similar to those analysed by the server without the help of the DPU. In the video used in our test, a saving of 20.54% was achieved in the high-activity time intervals (which represent 2.08% of the total time of the day). For the medium-activity periods, an average saving of 68.7% has been achieved using pre-filtering (medium-activity intervals account for 56.35% of the total hours of the day).
As a final experimental study, the false positive ratio of our system was tested for the 24 h of video. As shown in
Figure 8, many of the weapon detector’s false positives occur during periods of inactivity, which is expected, since these periods are more prevalent than active ones. In a system without filtering, these false detections would trigger multiple incorrect alarms; yet for a weapon to appear on the scene, movement by people must first occur. Therefore, besides saving energy, the method proposed in this study also avoids a large share of these false positives.
6. Conclusions and Future Work
As far as we know, this is the first time a DPU has been used as a pre-filter in a computer vision detection task. The code used in this work is available on our GitHub account and can be adapted for use in other computer vision systems where the pre-filtering of input data is possible. As demonstrated in the previous sections, using a pre-filtering device can significantly reduce server workload and energy consumption. In our tests, energy savings ranged from 20.54% during periods of high-activity, to 68.7% during periods of medium-activity and up to 98.07% during periods of low-activity. This methodology is beneficial when there is no constant flow of people, but active surveillance is necessary at all times, such as in universities, public institutions, military buildings, and museums.
Additionally, the use of DPUs as load balancers allows for the distribution of workload in real-time video processing, as discussed in [
23]. While we could not test this method due to only having one server capable of weapon detection, it is worth noting that the DPUs could distribute video frames among multiple servers rather than discarding them.