1. Introduction
The recent growth of industrial applications for object detection stimulates the research community toward novel solutions. Intelligent video analysis is the core of several industry applications, such as transportation [1], sentiment analysis [2], and sport [3,4].
Deep learning lies today at the core of state-of-the-art techniques for object detection, such as Faster RCNN [5], YOLO [6] and SSD [7]. Thanks to GPUs, object-detection solutions based on deep learning can support real-time applications; the edge-computing market now offers a variety of relatively inexpensive devices for Artificial Intelligence (AI): microprocessors [8], hardware accelerators [9], up to complete Systems on Module (SoM), such as the Jetson series by NVIDIA [10], and machine-vision cameras such as the JeVois A33 and the Sipeed Maix Bit, used in [11]. These tools rely on GPUs and a collection of software optimisations to deploy computationally intensive tasks, such as AI inference, on resource-constrained hardware. Real-time object detection on embedded devices still represents a major issue, as that goal involves quite complex deep-learning architectures. In practice, one needs a trade-off between accuracy and latency to tune each method to the target scenario.
This paper addresses the detection of small objects, which typically take up a few tens of pixels. State-of-the-art approaches often exhibit poor performances when dealing with very small objects, due to the difficulty in discriminating their features from one another and from the background [12].
Figure 1 presents an example, including three candidate sub-regions extracted from three different frames of a tennis-match video. While the rightmost patch actually includes the ball, the other patches resemble a tennis ball but represent misclassification errors.
Human observers face a similar challenge when looking for tiny objects in a wide scene. The detection task, in fact, gets simpler if the target moves with respect to a still background, since the human vision system can combine motion information with the visual aspect of the object.
Figure 2 clarifies this concept: the image on the left is the frame (at time $t$) drawn from the tennis video, while the image on the right merges several consecutive frames up to time $t$. In the former case, the ball is hardly distinguishable even by a human viewer, not just for its small size, but also because motion blur hinders the detection of fast-moving objects. In the right-hand image, instead, motion information makes the tennis ball clearly detectable.
The approach presented here supports the detection of tiny moving objects in wide scenes on limited hardware resources. The method adjusts the basic building blocks of resource-constrained computer vision and proposes a custom deep neural network for the recognition task. The T-RexNet framework improves over generic hardware-aware detectors, which rely only on visual features, by combining those features with motion information. The framework processes three consecutive frames from the video source and prompts a set of bounding boxes around the detected objects. The overall architecture includes two stacked blocks, for feature extraction and subsequent object detection.
A dedicated pair of parallel convolutional paths in the network supports that image/motion fusion process. As compared to generic object detectors, the computational overhead brought about by the two-tiered feature-extraction network is mitigated by reducing the network depth. As a matter of fact, focusing on tiny objects makes it possible to leave out the deep layers operating at low resolution.
Single-Shot-Detector (SSD) architectures are quite popular for resource-constrained object detection. The custom feature-extraction module overcomes the well-known limitations of SSD in detecting tiny objects. The resulting feature-extraction architecture is quite shallow, and the object-detection block relies on one of the least demanding State-of-the-Art (SoA) solutions available. In summary, the integration of these two features yields a viable solution for the real-time detection of small objects on constrained devices.
Experimental results prove that, in that context, T-RexNet improves significantly over state-of-the-art methods for generic object detection. As compared to application-specific solutions, T-RexNet exhibits a satisfactory accuracy-versus-speed balance in several complex scenarios, such as aerial and civilian surveillance and high-speed detection, tackling medium-sized to tiny objects and varying target densities. In other words, it achieves high detection rates without sacrificing too much accuracy.
The paper is organised as follows. Section 2 overviews the state of the art in object detection, moving-object detection, and the specific domains used for testing. Section 3 presents the T-RexNet approach in detail. Section 4 discusses the test scenarios considered, Section 5 illustrates the experimental results, and Section 6 makes some concluding remarks. The project website with downloadable resources is available at http://sealab.diten.unige.it/ (accessed on 8 June 2020).
2. Tiny Moving Object Detection: State of the Art
The identification of small moving objects is a subset of the wider research field of object detection. Existing solutions and techniques can be arranged into three main groups, namely, Single-image solutions, Background-subtraction solutions, and Spatio-temporal CNNs.
2.1. Single-Image General-Purpose Solutions
Typical object-detection models handle one image at a time, even when spatio-temporal information might be available. State-of-the-art approaches, relying on deep learning, can be divided into region-based and single-shot detectors.
In the former models, such as R-FCN [13] and Faster R-CNN [5], a dedicated algorithm first extracts a set of Regions-of-Interest (ROIs), that is, sub-portions of the image that are likely to contain an object; then fine-detection and classification modules analyze each ROI. Single-shot detectors such as YOLOv3 [6], SSD [7] and DSSD [14], instead, avoid looping over several ROIs and tackle the input image in a single shot. These methods apply a library of predefined bounding boxes (anchor boxes), which have various shapes and sizes and cover the likely locations of objects in the image; the inference phase takes care of fine-tuning each anchor box in terms of size and position. Region-based detectors usually prove more accurate than single-shot detectors, but are computationally demanding, as they require a loop over each single ROI [15].
In the case of small objects at low resolutions, both region-based and single-shot detectors tend to exhibit poor performances. Several techniques have been proposed recently to overcome that issue [16]:
Multi-scale representation: high- and low-resolution feature maps stem from different levels of a feature-extraction network; after super-sampling low-resolution maps, features fuse together by applying either element-wise sum (Multi-scale deconvolutional single-shot detector (MDSSD) [17]) or concatenation (Diverse region-based CNN (DR-CNN) [18]).
Contextual information: the network explicitly takes into account the contextual information around a candidate object. For example, ContextNet [19] applies a custom region-proposal network specifically aimed at small objects, and for each candidate region an enlarged region is used to process contextual information.
Super resolution: generative adversarial networks generate a higher-resolution version of the candidate object, thus improving accuracy in the detection of small objects (Perceptual generative adversarial networks (PGAN) [20]).
Mixed methods: features with distinct scales are extracted from different layers of a convolutional neural network; they are concatenated together and then used to generate a series of pyramid features [21].
These methods all increase both computational and memory load, which brings about lower update frequencies and higher latency, and might ultimately compromise implementations on resource-constrained devices for embedded applications.
2.2. Background Subtraction and Frame-Difference Solutions
In complex applications such as aerial surveillance, camera views can cover wide areas. Target objects (e.g., pedestrians and cars) usually span just a few tens of pixels, and the detection techniques discussed above [22] prove ineffective. At the same time, in those applications the majority of input images are quasi-static and only target objects move in the scene; hence conventional background-subtraction approaches are widely adopted, even in the era of deep learning. The basic idea consists in working out the difference between a frame and the background model of the scene acquired by the same camera; the time-difference information highlights the changes caused by moving objects.
Methods differ in terms of computational cost, robustness and accuracy: Mixture of Gaussians (MOG) [23] approaches model each pixel as a random variable with a Gaussian mixture model; mean-filtering [24] techniques extract the background by averaging the values of each pixel over the last N frames, whereas frame-difference background subtraction [24] only considers the pixel differences between the current frame and the previous one. The latter approach is very fast but less robust to noise; moreover, by disregarding any sequence of past frames, frame differencing only applies when the camera is static or slowly moving.
Since these methods typically process gray-scale (or even B/W, after thresholding) images that highlight changes at a given time, the actual detection of moving objects requires some post-processing. This might include morphological transformations, blob detection [25], or more complex computations [26,27,28], to the detriment of detection speed.
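As a concrete illustration, the following minimal OpenCV sketch (the threshold and minimum blob area are illustrative values, not drawn from the cited works) chains frame differencing, thresholding, a morphological opening, and blob extraction:

```python
import cv2

def detect_moving_objects(prev_gray, curr_gray, thresh=25, min_area=20):
    """Frame-difference detection: threshold the absolute difference
    between two consecutive gray-scale frames, clean it up with
    morphology, and return bounding boxes of the remaining blobs."""
    diff = cv2.absdiff(curr_gray, prev_gray)               # per-pixel change
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```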
2.3. Spatio-Temporal Convolutional Neural Networks (CNNs)
The literature witnesses the growth of spatio-temporal CNNs, which take into account both visual and motion data. In MODNet [29], the authors proposed a two-stream neural network that processed input RGB images and optical flows, thus learning object detection and motion segmentation at the same time. The research presented in [30] adopted an end-to-end approach for video classification: a pseudo-3D neural network learned spatio-temporal information by considering multiple consecutive frames, which were processed by a series of convolutional filters in both the spatial (1 × 3 × 3) and the temporal (3 × 1 × 1) domains. Such 3D neural networks virtually replace explicit image pre-processing steps such as background subtraction or optical-flow computation.
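For clarity, the factorization idea can be sketched with a few Keras layers; the input shape and filter counts below are illustrative, not those of [30]:

```python
import tensorflow as tf
from tensorflow.keras import layers

# 8 consecutive RGB frames of 112 x 112 pixels (illustrative values).
frames = tf.keras.Input(shape=(8, 112, 112, 3))       # (T, H, W, C)
x = layers.Conv3D(32, kernel_size=(1, 3, 3), padding="same",
                  activation="relu")(frames)          # spatial 1x3x3 filtering
x = layers.Conv3D(32, kernel_size=(3, 1, 1), padding="same",
                  activation="relu")(x)               # temporal 3x1x1 filtering
model = tf.keras.Model(frames, x)                     # one pseudo-3D block
```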
A spatio-temporal CNN supported the detection of vehicles in Wide Area Motion Imagery (WAMI) [31]. In that 2-stage approach, a CNN first handled 5 consecutive images (taken by an aerial surveillance system) and highlighted promising regions; the second stage completed fine detection within each region. The TrackNet approach [3] applied spatio-temporal CNNs to track small, fast-moving objects in sport applications: a fully convolutional neural network could accurately track a tennis ball by processing 3 consecutive video frames (taken by a steady camera). The CNN prompted a heatmap of the possible positions of the ball; subsequent blob detection eventually yielded the predicted location.
Spatio-temporal CNNs for object detection can prove effective, but they also exhibit some drawbacks: they are often computationally heavy, the various approaches are normally tailored to specific applications, and application-independent detection of small objects has not been demonstrated yet.
2.4. Summary of Contribution
The methods discussed above all exhibit features that make them unsuitable for the real-time detection of small moving objects on resource-constrained devices; specific shortcomings include the inability to recognize tiny objects, impractical computational loads, or lack of general applicability. The approach described in this paper detects small moving objects while maintaining some crucial features: it is lightweight and suitable for embedded devices; its accuracy keeps comparable to SoA approaches and improves over them in particularly challenging conditions; the system is end-to-end trainable; and, finally, the method is application-independent, as it performs satisfactorily in different scenarios.
3. Methodology
T-RexNet combines several of the techniques mentioned above to detect small moving objects in a fast, lightweight manner. The system benefits from the versatility of an end-to-end, fully convolutional neural network; it processes frame differences to bring in motion information, relies on the efficiency of MobileNet-based convolutions to integrate visual and motion data, and adopts a single-shot detector to attain real-time performance. Thus T-RexNet can be regarded as a spatio-temporal, single-shot, fully convolutional deep neural network, as per Section 2. With only 2.38 M parameters, T-RexNet turns out to be one of the most lightweight networks in the object-detection field.
Figure 3 outlines the three-step structure of T-RexNet. Three time-consecutive gray-scale images $I_{t-2}$, $I_{t-1}$, $I_t$ make up the system input, where each image is a 2D matrix of pixel intensities at the corresponding time step. The algorithm first works out a pair of motion-augmented pictures, M and K, which undergo a feature-extraction process based on two separate, parallel convolutional paths. The actual object-detection results stem from the third, SSD-based step.
3.1. Step 1: Extracting Motion-Augmented Images
This module receives in input the three gray-scale frames $I_{t-2}$, $I_{t-1}$, $I_t$. Since a gray-scale image is represented as a matrix of size [H × W × 1], stacking three of them yields an [H × W × 3] matrix, which is equivalent to the size of a single colored image. In other words, compared to traditional object-detection methods, we substituted temporal data for color. The input of the network is processed in order to generate the pair {M, K} of motion-augmented images, as explained in the following.
The image M includes three channels that are worked out as:
$M^{(1)} = |I_t - I_{t-1}|$, $M^{(2)} = I_t$, $M^{(3)} = |I_{t-1} - I_{t-2}|$,
where the superscripts refer to the channel number and $|\cdot|$ is the absolute-value operator. Figure 4 illustrates the overall process in graphic form. Channel $M^{(2)}$ preserves visual features, while channels $M^{(1)}$ and $M^{(3)}$ bring in motion information via frame differencing, which proves much faster than conventional background-subtraction techniques. It must be noted that preserving single-frame visual features in one of the three channels makes the network able, in principle, to detect non-moving objects as well.
The image K is the concatenation of the first and the last channels of M, hence it only holds motion data without any visual feature.
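A minimal NumPy sketch of this step, assuming the channel layout given above (the current frame in the middle channel, the two absolute frame differences on the sides):

```python
import numpy as np

def motion_augmented_pair(i_prev2, i_prev1, i_curr):
    """Build the motion-augmented images {M, K} from three consecutive
    gray-scale frames (2D uint8 arrays of identical shape)."""
    f = [a.astype(np.int16) for a in (i_prev2, i_prev1, i_curr)]
    d1 = np.abs(f[2] - f[1]).astype(np.uint8)   # |I_t     - I_{t-1}|
    d2 = np.abs(f[1] - f[0]).astype(np.uint8)   # |I_{t-1} - I_{t-2}|
    M = np.stack([d1, i_curr, d2], axis=-1)     # [H x W x 3]: motion, visual, motion
    K = np.stack([d1, d2], axis=-1)             # [H x W x 2]: motion only
    return M, K
```

The intermediate cast to a signed type avoids the wraparound that subtracting uint8 arrays would otherwise cause.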
3.2. Step 2: Feature Extraction
Feature-extraction networks typically include stacks of convolutional layers and pooling layers, in which the lower layers capture the details of the image, whereas the topmost layers extract object-related information [32]. From a spatial point of view, the deeper a feature map lies in the network, the larger the receptive field of each of its “pixels”.
T-RexNet aims to detect small objects, hence high-level information can be disregarded and the number of stacked layers in the feature-extraction network reduces accordingly. This also entails a beneficial effect on latency. In principle, high-level features might provide context information and therefore help localise small objects; at the same time, reducing contextual information makes the feature extractor more independent of any specific scenario and therefore maximally flexible. Feature extraction in T-RexNet involves two convolutional paths that process visual-motion mixed data (image M) and motion-only data (image K), respectively.
The rightmost path in Figure 3 processes image M and relies on a custom network drawn from the MobileNet [33] model, a family of Neural-Network (NN) architectures specifically designed for low-latency execution on mobile devices, which yields a promising balance between computational cost and accuracy. T-RexNet inherits from MobileNet the use of the bottleneck residual block as its main building block, as shown in Figure 5, to limit the sensitivity to high-level, context-dependent information.
The leftmost path in Figure 3 takes into account the motion-related data held in image K. This architecture features a stack of several 2D convolutions, as per Figure 5. The stride is set to 2, hence the input image is downsampled to match the output resolution of the parallel convolutional path.
Finally, the outputs of the two paths are concatenated channel-wise.
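The following Keras sketch conveys this two-path structure; the input resolution, layer depths, filter counts, and strides are placeholders rather than the actual T-RexNet configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, expansion=6, stride=1):
    """MobileNetV2-style bottleneck residual block (simplified sketch)."""
    in_ch = x.shape[-1]
    y = layers.Conv2D(in_ch * expansion, 1, activation=tf.nn.relu6)(x)  # expand
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               activation=tf.nn.relu6)(y)               # depthwise 3x3
    y = layers.Conv2D(filters, 1)(y)                                    # linear projection
    if stride == 1 and in_ch == filters:
        y = layers.Add()([x, y])                                        # residual shortcut
    return y

m_in = tf.keras.Input(shape=(512, 512, 3))   # image M: motion + visual + motion
k_in = tf.keras.Input(shape=(512, 512, 2))   # image K: motion channels only

# Path 1: MobileNet-style bottleneck blocks on M.
m = layers.Conv2D(32, 3, strides=2, padding="same", activation=tf.nn.relu6)(m_in)
m = bottleneck_block(m, 32)
m = bottleneck_block(m, 64, stride=2)        # ends at 1/4 of the input resolution

# Path 2: plain strided 2D convolutions on K, matching the output resolution.
k = k_in
for filters in (16, 32):
    k = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(k)

features = layers.Concatenate(axis=-1)([m, k])    # channel-wise fusion
extractor = tf.keras.Model([m_in, k_in], features)
```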
3.3. Step 3: Object Detection
The object-detection block relies on SSD [7], which mitigates computational costs as compared with region-based approaches and better fits real-time applications. Since, in the inference phase, the method prompts predictions for the whole list of predefined anchors, the execution time turns out to be image-independent.
Detection in the basic SSD involves several feature maps, extracted at different levels of the feature-extraction network (the base network in [7]). This technique improves the robustness to different object scales. Since T-RexNet targets small objects, the output of the first stage involves just one feature map, to contain computational costs.
T-RexNet associates each element of the feature map (i.e., each position in the map grid) with the dimension/position information and the classification (car/pedestrian/background, etc.) of the corresponding anchors. The anchor size is set proportional to $S(I)$, where $I$ is the squared input image and $S(\cdot)$ is a function which returns the height and width of the image. The anchors' aspect ratios depend on the shapes of the target objects; the standard values are {0.5, 1, 2}, which correspond, respectively, to horizontal, squared, and vertical shapes.
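The following sketch generates such an anchor grid; the size fraction is a placeholder, since the proportionality constant is application-dependent, and the width/height convention assumes the aspect ratio is height over width:

```python
import numpy as np

def build_anchors(image_size, grid_size, size_fraction=0.05,
                  aspect_ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) in pixels: one anchor per
    aspect ratio, centered on every cell of the feature-map grid."""
    base = size_fraction * image_size        # anchor size ~ fraction of S(I)
    step = image_size / grid_size            # spacing between cell centers
    anchors = []
    for gy in range(grid_size):
        for gx in range(grid_size):
            cx, cy = (gx + 0.5) * step, (gy + 0.5) * step
            for ar in aspect_ratios:         # 0.5 horizontal, 1 square, 2 vertical
                w, h = base / np.sqrt(ar), base * np.sqrt(ar)
                anchors.append((cx, cy, w, h))
    return np.array(anchors)
```

For instance, build_anchors(512, 64) would yield 64 · 64 · 3 anchors, one triple per grid cell.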
5. Results
This section illustrates the results of the experiments performed in the three test scenarios. According to our previous findings [38], moving the same architecture from the Desktop to the embedded platform has a negligible impact on the detection accuracy, while it mostly affects speed and memory footprint. Therefore, in Section 5.1, Section 5.2 and Section 5.3 we first illustrate, for each scenario, our achievements on the Desktop platform; then, in Section 5.4, we analyze the impact of deploying T-RexNet on the Jetson Nano. Table 3, Table 4 and Table 5 give an overview of the comparisons, in terms of F1 scores, with existing methods in the literature.
5.1. Aerial Surveillance
Figure 7 shows the ROC curves achieved by T-RexNet and other state-of-the-art algorithms on the AOI 2 test set. The ROC curve for T-RexNet was added to the original plot reported in [31]. To assess the balance between recall and precision, the experiments applied various threshold values on the detection confidence. We remind the reader that $\text{Recall} = TP/(TP+FN)$ and $\text{Precision} = TP/(TP+FP)$, where $TP$, $FP$ and $FN$ denote True Positives, False Positives and False Negatives, respectively.
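In code, these metrics, together with the F1 score reported in Table 3, Table 4 and Table 5, read (standard definitions, not code from the paper):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```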
T-RexNet outperformed the other methods in terms of accuracy, with the exception of ClusterNet [31], which scored near-optimal performances. As reported in [31], however, ClusterNet required 2–3 s per image on a Titan X GPU board, depending on the number of selected regions; a time span of 3 s covered the inference phase to inspect the whole image for fine detection. By contrast, the inference time for T-RexNet was 310 ms per image on our Desktop platform featuring an NVIDIA GTX 1080 Ti board, which is similar to a Titan X in terms of hardware resources and computational performance.
5.2. Civilian Surveillance
Figure 8 gives the ROC curves scored by T-RexNet in object detection within one image. The obtained results are compared with the corresponding curves attained by SSD [7] (with MobileNetv2 [33] as backbone network) and Faster R-CNN [5] (with ResNet50 [39] as backbone network). The figure gives two curves for each method: Full curves refer to experiments on whole images, whereas Small curves refer to tests performed only on the upper halves of images, where perspective makes people appear smaller.
The ROC curves in Figure 8 witness that motion information greatly helped T-RexNet achieve the best performance. Moreover, T-RexNet was the only architecture that attained satisfactory results when focusing on tiny objects.
5.3. Tennis Ball Tracking
Figure 9 shows the ROC curves measured by applying T-RexNet to the test sets Court A, Court B, and Court C. The graph also gives the associated ROC curves obtained by MobileNetv2-SSD, which represents the single-image architecture from which T-RexNet evolved. The comparison points out the significant impact of involving motion data in the detection of the target object.
Experimental outcomes prove that T-RexNet features a remarkable improvement over state-of-the-art, application-independent approaches. When considering application-specific solutions, TrackNet [3], which generated our ground-truth labels, proved more accurate than T-RexNet in tennis-ball tracking. As reported in the original paper, TrackNet attained average F1 scores higher than 0.84, which is consistent with the tests performed in this research. At the same time, TrackNet proved significantly heavier than T-RexNet: Python implementations of both, running on the Desktop platform, resulted in 2.2 FPS for TrackNet and 47 FPS for T-RexNet, that is, ∼21 times faster. The limited resolution of the input images allowed increasing the batch size in the inference phase up to 10 consecutive frames while still fitting the memory of the test GPU. This batch approach allowed T-RexNet to run at 96 FPS, at the price of an increase in latency from 21 ms to 104 ms.
5.4. Deployment of T-RexNet on the Jetson Nano
This section presents the results of the deployment on the Jetson Nano. The rows of Table 6 report the power settings of the board, whereas the columns are grouped in pairs: the first pair reports the results for input size 512 × 512, the second refers to 300 × 300. Within each pair, the first column refers to a model optimized with FP16 representation, while the second refers to the original TensorFlow model.
The results reveal that T-RexNet can be deployed on embedded systems with real-time performance. In the Max-N configuration, the network can process a frame in 70.28 ms; in other words, the device can handle about 13 FPS, which is acceptable for many applications. The comparison with native TensorFlow solutions highlights the importance of optimization combined with FP16 representation. A similar observation holds for the 5 W power mode.
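The paper does not detail the optimization toolchain; one common route to an FP16-optimized model on the Jetson Nano is TensorFlow-TensorRT (TF-TRT) conversion, sketched below for a TF2 SavedModel (directory names are placeholders):

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel into a TensorRT-accelerated one with FP16 precision.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="trexnet_saved_model",   # placeholder path
    conversion_params=params)
converter.convert()                                # replace subgraphs with TRT ops
converter.save("trexnet_saved_model_fp16")         # deployable on the Jetson
```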
Memory requirements for this network are quite limited: the .pb file, that is, the TensorFlow ProtoBuf file containing the description of the network, measures around 3.0 MB. The memory strategy implemented on the Jetson Nano allocates a large amount of memory that is not directly dependent on the model size; accordingly, a direct measurement would yield biased results. Indeed, the literature proves that similar models can be deployed on devices with a smaller memory footprint [38].