For completeness, this section introduces convolutional neural networks, as representative state-of-the-art deep learning techniques for object detection, and the specific approach adopted here for compression-based model decomposition.
2.1. Convolutional Neural Networks
Convolutional neural networks have been proposed to implement object detection tasks. According to previous studies [18], CNNs are widely used to provide solutions for a variety of image processing and object detection problems and have shown outstanding performance [19]. CNNs have developed rapidly, as research groups worldwide frequently endeavour to implement and optimize new variants of the underlying algorithms.
The depth of a convolutional neural network is critical to the performance of a CNN-based model. In 2012, Krizhevsky et al. [20] applied the concept of deep convolutional neural networks (DCNNs), which generally perform better than traditional hand-crafted architectures, to ImageNet for the first time [21]. They proposed a new architecture named AlexNet, consisting of eight neural network layers: five convolutional layers and three fully connected layers [22]. Fundamentally, this offered a seminal recipe: a convolutional layer with an activation function, followed by max pooling, stacked repeatedly to realize a deep network, as sketched below.
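As a minimal sketch of this recipe (using PyTorch for illustration; the layer sizes below are indicative rather than AlexNet's exact configuration), such a network stacks convolution, activation and pooling blocks before its fully connected layers:

```python
import torch
import torch.nn as nn

# Illustrative AlexNet-style stack: convolution + ReLU activation
# followed by max pooling, repeated to build depth, then a fully
# connected layer for classification. Channel sizes are examples only.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.LazyLinear(1000),  # maps the pooled features to 1000 class scores
)

scores = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```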
Most successful applications of DCNNs exploit their capability to progressively learn more complex features. Notably, when the number of network layers is increased, the network can theoretically obtain better results through the extraction of more sophisticated feature patterns. However, experiments have demonstrated that, as deep convolutional networks become deeper, a degradation problem often occurs during the training period: when the depth of the network increases, the accuracy of the network saturates or even decreases. Such studies indicate that DCNNs commonly suffer from vanishing or exploding gradients, which make it difficult to train models with too many layers. Furthermore, He et al. [23] conducted an empirical experiment demonstrating that a maximum threshold exists for the depth of CNN models, plotting the training and test errors of a 20-layer CNN against those of a 56-layer CNN. The outcome of this investigation contradicts the previous theory that only overfitting would account for the failure of training, implying that adding extra unnecessary layers may also cause higher training and test errors in the network.
2.2. Object Detection Models
A modern CNN-based object detector is usually composed of three consecutive parts: a backbone, a neck and a head. The backbone, which is used for image feature extraction, may often be implemented via VGG [24], ResNet [25], or DenseNet [26]. The neck is used to exploit the features extracted from different stages by the backbone, normally consisting of several bottom-up paths and several top-down paths. Typical neck modules include a feature pyramid network (FPN) [27], a path aggregation network (PAN) [28], a BiFPN [29], and a NAS-FPN [30]. The head, which is used to predict the classes and bounding boxes of objects, is typically categorized into two types, namely, a one-stage detector and a two-stage detector. This composition is sketched below.
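Schematically, the three-part structure can be expressed as follows (a generic sketch in PyTorch, not any specific detector; the `backbone`, `neck` and `head` modules stand in for the concrete choices listed above):

```python
import torch.nn as nn

class Detector(nn.Module):
    """Schematic composition of a modern CNN-based object detector."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # e.g., VGG / ResNet / DenseNet
        self.neck = neck          # e.g., FPN / PAN / BiFPN / NAS-FPN
        self.head = head          # one-stage or two-stage prediction head

    def forward(self, images):
        features = self.backbone(images)  # multi-scale feature extraction
        fused = self.neck(features)       # bottom-up / top-down fusion
        return self.head(fused)           # class scores + bounding boxes
```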
Deep learning methods have exhibited encouraging characterization and modelling ability and can learn hierarchical feature representations automatically, with highly promising performance. Empowered by the outstanding feature learning and classification ability of DCNNs, detectors based on region-based convolutional neural networks (such as Fast R-CNN) have frequently been applied as detection frameworks. Nevertheless, certain single-stage detectors are also popular, as they are much faster to execute and simpler to implement than two-stage methods, despite their relatively lower accuracy. In this paper, both approaches are adopted to evaluate and contrast the performance of different algorithms.
The most representative two-stage detectors are those belonging to the R-CNN series (including Fast R-CNN [31], Faster R-CNN [32], R-FCN [33], and Libra R-CNN [34]), and the most representative one-stage models are YOLO [35], SSD [36], and RetinaNet [17]. Most of the two-stage methods follow the example of Faster R-CNN [32]. The region-based convolutional neural network (R-CNN) is the initial architecture that inspired the development of Faster R-CNN, exploiting information regarding regions of interest and passing it to a convolutional neural network [31]. R-CNN explores the areas that may contain an object, identifying and localizing objects by combining region proposals with CNNs. R-CNN has been used as a reference model for object detection in recent years; however, it is constrained to fixed-size input images, and its speed is limited. To reduce, if not eliminate, these limitations, He et al. [25] proposed the spatial pyramid pooling network (SPP-net), which enables the network to generate fixed-size outputs from arbitrarily sized images. Nevertheless, SPP-net also has notable drawbacks, since its training process remains an expensive multi-stage pipeline [37].
Fast R-CNN evolved from the progress of R-CNN and SPP-net [31]. Instead of repeatedly processing potentially interesting image regions hundreds of times, this method passes the original image through a pretrained network just once for end-to-end training. The procedure of selective search is retained, operating on the output feature map of the previous step [38]. It adopts a region-of-interest (RoI) pooling layer and a multi-task loss to estimate object classes with a softmax classifier and to predict bounding box localization by linear regression, respectively [37]. Fast R-CNN was further developed into Faster R-CNN, which employs a new network named the region proposal network (RPN) that shares full-image convolutional features with the detection network, thereby reducing detection processing time. This is very helpful since detection processes are, in general, extremely time-consuming, especially when generating detection frames (e.g., OpenCV AdaBoost deploys a sliding window and an image pyramid to produce the required frames). Faster R-CNN abandons this traditional approach and selective search; instead, it directly deploys the RPN to generate detection frames. This represents a major advancement, as Faster R-CNN significantly increases the speed of object detection.
Recently, a number of advanced algorithms have been developed which enhance Faster R-CNN by introducing different architectures with different features [27]. For instance, the feature pyramid network (FPN) resolves scale variance through pyramidal predictions [27], and Cascade R-CNN extends Faster R-CNN with additional refinement stages to produce a multi-stage detector [39]. Also, Mask R-CNN augments the bounding box branch with a mask branch for instance segmentation, becoming a classic milestone for another line of work [40]. Libra R-CNN explicitly alleviates the imbalance at the objective, feature and sample levels using an overall balanced design, which integrates three novel components [34]. Double-Head R-CNN includes a two-head structure, dividing the classification task and bounding box regression between a fully connected head and a convolution head [41]. These methods have made significant progress by addressing different challenges and scenarios.
Generally speaking, a so-called two-stage detector first selects candidate objects by selective search [38], a step referred to as region proposal, and then performs object recognition on the selected candidates to generate target regions. As the selected objects may differ in size, object recognition may involve classification only, or feature extraction plus classification, during the training period. After that, the network passes the region proposals through the pipeline to implement object classification and bounding box regression. Whilst such two-stage detectors can normally obtain the highest accuracy, these methods are often slower than one-stage methods.
In contrast to two-stage methods, single-stage object detectors have become popular since the introduction of YOLO (you only look once) and SSD (single shot MultiBox detector), which regard object detection as a simple regression task [36,42]. Such detectors take an input image and learn both the class probabilities and the bounding box coordinates [36]. Redmon et al. [42] designed YOLO such that only one forward propagation is required to make predictions, outputting the recognized objects with bounding boxes. It can achieve high accuracy while also operating in real time. However, YOLO performs poorly when detecting objects in edge areas. Developing such techniques further, fully convolutional one-stage object detection (FCOS) has been proposed as a classic anchor-free method that uses a simple and flexible framework [43]. It outperforms many one-stage and two-stage methods in accuracy while completely avoiding the complicated computation and hyper-parameter adjustment related to anchor boxes.
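By way of contrast with the two-stage sketch above, an anchor-free one-stage detector produces its predictions in a single forward pass; the sketch below assumes a torchvision version that ships `fcos_resnet50_fpn` (≥ 0.12):

```python
import torch
from torchvision.models.detection import fcos_resnet50_fpn

# Anchor-free one-stage detector: one forward propagation yields class
# probabilities and box coordinates directly, with no anchor boxes or
# region proposals to tune.
model = fcos_resnet50_fpn(weights="DEFAULT").eval()

with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])

print(predictions[0]["boxes"].shape, predictions[0]["labels"][:5])
```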
2.3. Low-Rank Decomposition Based Model Compression
Model compression and acceleration refer to the removal of redundant parameters from a neural network in order to obtain a small-scale model with fewer parameters and a more compact structure while largely preserving the algorithm's accuracy. The use of low-rank approximation to accelerate convolution has a long history (e.g., separable 1D filters were introduced using a dictionary learning approach [44]). Regarding deep neural network (DNN) models, efforts have also been made toward low-rank approximation, as reported in [45]. In that work, the speed of a single convolutional layer is increased by a factor of two, while the classification accuracy decreases by 1%. In [46], a different tensor decomposition scheme was proposed, achieving a 4.5-fold speedup with the same rate of accuracy loss.
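To make the idea concrete, the sketch below (our own illustration, not the exact scheme of [45] or [46]) reshapes a convolutional weight tensor into a matrix, truncates its SVD at rank r, and replaces the layer with a K×K convolution into r channels followed by a 1×1 convolution:

```python
import torch
import torch.nn as nn

def decompose_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Approximate a KxK convolution by a rank-r pair:
    a KxK conv into r channels followed by a 1x1 conv."""
    c_out, c_in, kh, kw = conv.weight.shape
    mat = conv.weight.detach().reshape(c_out, c_in * kh * kw)
    u, s, vh = torch.linalg.svd(mat, full_matrices=False)
    # Keep only the top-r singular components.
    first = nn.Conv2d(c_in, rank, (kh, kw), stride=conv.stride,
                      padding=conv.padding, bias=False)
    second = nn.Conv2d(rank, c_out, 1, bias=conv.bias is not None)
    first.weight.data = vh[:rank].reshape(rank, c_in, kh, kw)
    second.weight.data = (u[:, :rank] * s[:rank]).reshape(c_out, rank, 1, 1)
    if conv.bias is not None:
        second.bias.data = conv.bias.detach().clone()
    return nn.Sequential(first, second)

conv = nn.Conv2d(64, 128, 3, padding=1)
approx = decompose_conv(conv, rank=16)
x = torch.randn(1, 64, 32, 32)
print((conv(x) - approx(x)).abs().max())  # shrinks as rank grows
```

In practice the rank is chosen to trade the approximation error (and hence accuracy loss) against the reduction in multiply–add operations.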
There exist a number of low-rank methods for compressing 3D convolutional layers. For example, canonical polyadic (CP) mechanisms for kernel decomposition adopt nonlinear least squares to implement the expected decomposition [47]. Also, batch normalization (BN) has been employed to transform the activations of internal hidden units [48], with the aim of training low-rank constrained CNNs from scratch. Moreover, many approaches have been proposed to exploit low-rankness in fully connected layers, including the use of such methods to reduce the volume of dynamic parameters [49]. A specific development is for acoustic modelling, where low-rank matrix factorization of the final weight layer was introduced [50]. To obtain compact deep learning models for multiple tasks, truncated singular value decomposition (SVD) has been adapted to decompose fully connected layers [51]. Of direct interest to the present work is the attempt to adopt tensor train (TT) decomposition [52] to compress the convolutional layers and fully connected layers in a network. This facilitates significant compression rates with only a slight drop in accuracy.
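In the same spirit as the convolutional sketch above, a fully connected layer can be factorized by truncated SVD; the following is a generic sketch of that idea, not the exact procedure of [50] or [51]:

```python
import torch
import torch.nn as nn

def decompose_linear(fc: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an I->O linear layer (I*O weights) with two layers
    I->r and r->O (r*(I+O) weights), with r below the full rank."""
    u, s, vh = torch.linalg.svd(fc.weight.detach(), full_matrices=False)
    first = nn.Linear(fc.in_features, rank, bias=False)
    second = nn.Linear(rank, fc.out_features, bias=fc.bias is not None)
    first.weight.data = vh[:rank]                # shape (r, I)
    second.weight.data = u[:, :rank] * s[:rank]  # shape (O, r)
    if fc.bias is not None:
        second.bias.data = fc.bias.detach().clone()
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
compressed = decompose_linear(fc, rank=256)
# 4096*4096 ≈ 16.8M weights -> 256*(4096+4096) ≈ 2.1M weights
```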
2.4. Evaluation Metrics
Object detection models are expected to be fast, memory-efficient and highly accurate in prediction. To evaluate whether the present research achieves these objectives, the following commonly used performance metrics are utilized in the subsequent experimental investigations:
Precision and Recall: These are perhaps the most common performance indices used to assess the quality of the classification task. With $TP$, $FP$ and $FN$ denoting the numbers of true positives, false positives and false negatives, respectively, they are calculated as follows:

$\text{Precision} = \dfrac{TP}{TP + FP}, \qquad \text{Recall} = \dfrac{TP}{TP + FN}$
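A toy computation of these two indices (plain Python; the counts are illustrative):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Example: 80 correct detections, 20 false alarms, 10 missed objects.
p, r = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.89
```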
Average Precision (AP): This criterion integrates precision and recall, representing the area under the precision–recall curve, $AP = \int_0^1 p(r)\,dr$, where $p(r)$ is the precision at recall $r$. Its value lies within the range [0, 1], with values closer to 1 indicating better model performance. In the definition of mAP below, the "m" denotes the mean across different classes; that is, mAP is the average AP value over all classes.
Mean Average Precision (mAP): Object detection involves both localization and classification tasks. For localization, the intersection over union (IoU) metric is commonly used to measure accuracy. It quantifies the overlap between a predicted bounding box $B_p$ and a ground truth bounding box $B_{gt}$, with higher IoU values indicating closer agreement and higher prediction accuracy. The IoU formula is given by:

$IoU = \dfrac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$

With the AP of each class computed at a given IoU threshold, $mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$ over the $N$ classes.
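For axis-aligned boxes in (x1, y1, x2, y2) form, IoU can be computed as in the following standard sketch:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```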
Model Computational Complexity (FLOPs): The computational complexity of a model can be represented by FLOPs (floating-point operations). For a convolutional layer it is calculated as follows:

$\mathrm{FLOPs} = 2 H W (C_i K^2 + 1) C_o$

where $H$ and $W$ are the height and width of the output feature map, $K$ is the kernel size, and $C_i$ and $C_o$ are the numbers of input and output channels. This formula calculates the total sum of the multiplication and addition operations (the factor of 2), with the 1 accounting for the bias addition. Similarly, the computational complexity of each fully connected layer with input dimension $I$ and output dimension $O$ is estimated as:

$\mathrm{FLOPs} = 2 I O$
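As a worked example of these conventions (the layer sizes below are hypothetical):

```python
def conv_flops(h_out: int, w_out: int, c_in: int, k: int, c_out: int) -> int:
    """FLOPs of one convolutional layer: 2*H*W*(C_i*K^2 + 1)*C_o."""
    return 2 * h_out * w_out * (c_in * k * k + 1) * c_out

def fc_flops(n_in: int, n_out: int) -> int:
    """FLOPs of one fully connected layer: 2*I*O."""
    return 2 * n_in * n_out

# A 3x3 conv producing 56x56x256 features from 128 input channels:
print(conv_flops(56, 56, 128, 3, 256))  # 1,851,293,696 ≈ 1.85 GFLOPs
print(fc_flops(4096, 1000))             # 8,192,000 FLOPs
```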
Model Parameter Quantity (Params): This index is influenced by the number of model parameters. The parameter quantity within each convolutional layer is calculated using the formula:

$\mathrm{Params} = (K_h \times K_w \times C_i + 1) \times C_o$

where the 1 inside the parentheses accounts for a bias parameter, with the subscript $o$ representing the output, $i$ the input, and $C$ the number of channels ($K_h$ and $K_w$ denote the kernel height and width). If batch normalization is used, this term is not necessary. Similarly, the parameter quantity of each fully connected layer is calculated as:

$\mathrm{Params} = (I + 1) \times O$

where $I$ and $O$ denote the input and output dimensions, respectively.
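These formulas can be cross-checked against actual layers (a quick sanity check in PyTorch; the layer sizes are arbitrary):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3)
fc = nn.Linear(in_features=1024, out_features=10)

# (K_h*K_w*C_i + 1)*C_o and (I + 1)*O, matching the formulas above.
assert sum(p.numel() for p in conv.parameters()) == (3 * 3 * 128 + 1) * 256
assert sum(p.numel() for p in fc.parameters()) == (1024 + 1) * 10
print("parameter formulas verified")
```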