Article

DPSSD: Dual-Path Single-Shot Detector

1 School of Mechanical Engineering, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250300, China
2 School of Information and Automation Engineering, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250300, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(12), 4616; https://doi.org/10.3390/s22124616
Submission received: 9 May 2022 / Revised: 13 June 2022 / Accepted: 13 June 2022 / Published: 18 June 2022
(This article belongs to the Topic Artificial Intelligence in Sensors)

Abstract

Object detection is one of the most important and challenging branches of computer vision and is widely used in daily life, for example in surveillance security and autonomous driving. We propose a novel dual-path multi-scale object detection paradigm that extracts richer feature information for the detection task and mitigates the multi-scale object detection problem, and on this basis we design a single-stage general object detection algorithm called the Dual-Path Single-Shot Detector (DPSSD). The dual path, consisting of a residual path and a concatenation path, ensures that shallow features can be reused more easily to improve detection accuracy. Our improved dual-path network is better suited to multi-scale object detection tasks, and we combine it with a feature fusion module to form a multi-scale feature learning paradigm called the "Dual-Path Feature Pyramid". We trained models on the PASCAL VOC and COCO datasets with 320 × 320 and 512 × 512 pixel inputs, respectively, and performed inference experiments to validate the structures in the neural network. The experimental results show that our algorithm compares favorably with anchor-based single-stage object detection algorithms and achieves a competitive level of average accuracy. Researchers can replicate the reported results of this paper.

1. Introduction

After the success of deep convolutional neural networks (DCNNs) [1] in image classification, object detection algorithms also adopted deep-learning techniques and achieved significant progress [2,3]. These deep-learning-based algorithms perform much better than traditional algorithms because hand-crafted features are replaced with feature representations computed by convolutional neural networks. However, multi-scale feature learning remains a critical problem for detection algorithms based on deep learning. To address this problem and improve the performance of anchor-based single-stage multi-scale detectors, we conducted a literature review and a series of experiments.
In general, objects appear in complex environments and vary greatly in scale; in applications such as pedestrian detection, face detection and autonomous driving, the algorithm has to be robust to changes in object scale [4]. Learning robust and discriminative features is critical to obtaining good detection performance. There are four main paradigms for addressing the multi-scale feature learning problem: the image pyramid, the prediction pyramid, integrated features and the feature pyramid (Figure 1). SNIP [5] uses the image pyramid to solve the multi-scale problem, where each layer is responsible for a certain range of scales (Figure 1a). In this way, the same sample must be converted into different scales and repeatedly fed to the network for training, which results in many redundant calculations. By fusing shallow features rich in spatial detail with deep features rich in semantics, newly constructed features contain rich information and can therefore detect objects of different scales. The Single-Shot MultiBox Detector (SSD) [6] and the Multi-Scale Deep Convolutional Neural Network (MSCNN) [7] use both shallow features rich in geometric information and deep features rich in semantic information to predict objects at different scales, which we call multi-scale prediction using a prediction pyramid, where each layer is responsible for objects of a certain scale, as shown in Figure 1b. The Inside-Outside Network (ION) [8] and HyperNet [9] use integrated features to combine multiple layers of features into a single feature map and make the final prediction based on it (Figure 1c). The Feature Pyramid Network (FPN) [10] uses the feature pyramid to integrate features of different scales with lateral connections in a top-down fashion, building a set of scale-invariant feature maps on which multiple scale-dependent classifiers are trained. This method also combines deep semantic-rich features with shallow spatially rich features (Figure 1d). FPN has significantly improved the performance of object detection algorithms and achieved state-of-the-art results in multi-scale feature learning. However, these paradigms use only the information of a single feature map at each scale.
Feature fusion is the merging of different feature maps. In order to fully exploit the information of the feature maps at all levels and scales [4], we propose a paradigm for the multi-scale feature learning problem called the dual-path feature pyramid (Figure 1e), which uses the structure of the prediction pyramid together with two methods of feature fusion, i.e., the residual connection [1] and the concatenation connection [11]. Figure 2 shows the overall structure of our detector. "Element-wise sum" denotes matrix addition and is abbreviated as "Elts". The overall framework consists of a base network, a feature fusion module and a prediction module. We used the idea of dual-path networks [12] to design the base network of our single-stage detector and obtain robust and discriminative features. Our dual-path network generates feature maps at six different resolutions for multi-scale object detection. After experimental validation, we use a 3 × 3 convolution and a deconvolution operation to fuse two feature maps of adjacent resolutions, and the fusion module contains five sub-networks that produce five feature maps of different scales for prediction. The advantages of single-stage object detection algorithms include a simple training strategy, a simple network structure and fast computation [4]. The whole detector retains the advantages of the single-stage algorithm and can be trained end to end. Our object detector has the following innovations:
(a)
We are the first to introduce a dual-path network into a single-stage object detector, proposing a paradigm called the "Dual-Path Feature Pyramid", as shown in Figure 1e. It combines two feature fusion methods, i.e., the residual connection and the concatenation connection.
(b)
After experimental validation, a new feature fusion module was proposed to enhance the fusion of high-level semantic and low-level spatial features to further optimize the multi-scale feature learning problem.

2. Related Work

Currently, object detection algorithms can be divided into single-stage and two-stage approaches. Two-stage means that the detection process is split into a region proposal stage and a detection stage, while single-stage means that both are carried out simultaneously. Object detection has many applications, such as face detection, pedestrian detection and autonomous driving, and multi-scale detection is the key to realizing these applications [4].
Representative two-stage detectors include R-CNN [2], Fast R-CNN [13], Faster R-CNN [3] and R-FCN [14]. These algorithms first generate region proposals and then perform classification and bounding-box regression on the pooled proposal features. The detection accuracy of two-stage detectors is good, but their frameworks limit the detection speed. Other researchers have devoted themselves to single-stage detectors, such as OverFeat [15], SSD [6] and YOLO [16]. The advantage of these detectors is that no region proposals need to be generated: every position on the input image may contain the target object, and the end-to-end training method makes detection very fast. However, these methods have similar architectures for solving the multi-scale detection problem. SNIP [5] and R-CNN [2] adopt the structure of Figure 1a. Fast R-CNN [13], Faster R-CNN [3], OverFeat [15], SSD [6], R-SSD [17] and R-FCN [14] use the structure of Figure 1b. The Inside-Outside Network (ION) [8], HyperNet [9] and STDN [18] use integrated features to combine multiple layers of features into a single feature map and make the final prediction based on it (Figure 1c). DSSD [19] and FPN [10] use the feature pyramid paradigm to build multi-scale detectors. Recent work, including M2Det [20], BPN [21] and ASFF [22], has proposed efficient feature fusion networks under the feature pyramid paradigm (Figure 1d).
However, current feature pyramid paradigms do not take full advantage of feature information at different scales when constructing feature pyramids, which limits the performance of multi-scale detectors [4]. To solve this problem, we propose a new feature pyramid structure, shown in Figure 1e, which is mainly derived from the ideas of DSSD [19] and R-SSD [17]. Fu et al. proposed the Deconvolutional Single-Shot Detector (DSSD) [19], which adds a residual block to each feature map and then performs an element-wise product on feature maps of different scales for feature fusion. The advantage of DSSD is that it introduces a residual operation to scale the feature map. Jeong et al. proposed R-SSD [17], which uses rainbow concatenation through both pooling and deconvolution to improve the accuracy of the conventional SSD [6]. The advantage of R-SSD is that it uses concatenation to enrich the features at each level.
In sum, there are two main ways to further improve the detection accuracy of multi-scale objects. One is to use residual blocks to build various feature pyramid structures, as in DSSD [19]. The other is concatenation, which combines multi-layer features to detect objects, as in R-SSD [17]. We combine these two approaches and propose the dual-path feature pyramid to optimize the multi-scale object detection problem, as shown in Figure 1e.
We use our improved dual-path network to extract features at different resolutions, fuse the features of different levels with the feature fusion module, and enhance them through convolution and deconvolution operations. Our detector combines the advantages of the two methods and, finally, obtains six robust and discriminative features for prediction.

3. Dual-Path Single-Shot Detector

In this section, the entire structure of the neural network is described, and the internal structure of each module is further detailed. In addition, the whole process of model training will be introduced in detail, including the construction of the programming environment, the setting of the training hyper-parameters and the loss function.

3.1. Convolution Neural Network

Our proposed network consists of three parts: a base network for feature extraction (Conv3, Conv5, Conv6, Conv7, Conv8 and Conv9), a feature fusion module for fusing adjacent features and a prediction module. We first use our dual-path network to extract features at six resolutions. The output of the last layer of the base network, Conv9, has the lowest resolution and is fed directly into the prediction module, which uses 1 × 1 convolutions and a residual connection to form a linear combination of the channel features for classification and regression. Conv8 and Conv9 are then fed into the feature fusion module; the result is fused with Conv7 in the next feature fusion module, and this process is repeated until five fused feature maps of different resolutions are obtained, which are finally passed to the prediction modules for object classification and localization. A sketch of this forward pass is given below.
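The following is a minimal Python sketch of this forward pass. The module objects (base_network, fusion_modules, prediction_modules) and the function name are illustrative assumptions; the actual layer configurations follow Table 1 and Figures 2-6.

```python
def dpssd_forward(base_network, fusion_modules, prediction_modules, image):
    # The base network yields six feature maps, from highest to lowest resolution
    # (Conv3, Conv5, Conv6, Conv7, Conv8 and Conv9 in the paper's notation).
    c3, c5, c6, c7, c8, c9 = base_network(image)

    # The coarsest map (Conv9) is predicted on directly; the others are fused
    # pairwise from coarse to fine, reusing the previous fusion result.
    fused = [c9]
    coarser = c9
    for finer, ffm in zip([c8, c7, c6, c5, c3], fusion_modules):
        coarser = ffm(finer, coarser)  # 3 x 3 conv on the finer map, deconv on the coarser one, then sum
        fused.append(coarser)

    # Each of the six maps goes through its own prediction module, which returns
    # per-anchor class scores and box offsets.
    return [pm(f) for pm, f in zip(prediction_modules, fused)]
```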

3.1.1. Dual-Path Network

The improved dual-path network combines the core ideas of ResNet [1], ResNeXt [23] and DenseNet [11]. The first stage contains a 7 × 7 convolution layer and a max-pooling layer, and the remaining eight stages have similar structures. The first layer of each stage splits the features along the channel dimension into a residual-connection part and a concatenation-connection part and can optionally perform down-sampling. The later layers of each stage increase the number of feature channels, deepen the network and improve its learning capability. To retain as much sub-layer information as possible and to obtain features at different scales [12], we skip Conv1, Conv2 and Conv4 and select the outputs of Conv3, Conv5, Conv6, Conv7, Conv8 and Conv9 as the original feature maps; the structure is shown in Table 1.
In the remaining layers of each stage, the two groups of features output by the first layer are first merged along the channel dimension and then fed to the next layer. The output of the next layer is again divided into two groups, either by a 1 × 1 convolution or by channel separation, and the two groups are summed with and concatenated to the corresponding outputs of the first layer, respectively, as shown in Figure 3. Figure 3 shows the specific implementation of each layer in Table 1, i.e., the structure of the dual-path base network, and corresponds to the top row of Figure 2. Feature segmentation refers to cutting existing features along the channel dimension to obtain multiple groups of features; we conducted an ablation study of these two feature segmentation paradigms. A sketch of one dual-path layer is given below.
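The following PyTorch sketch illustrates one dual-path layer of the kind shown in Figure 3a. The channel sizes, group count and class name (DualPathBlock) are illustrative assumptions, not the exact configuration of Table 1: the bottleneck output is split along the channel dimension, one part is summed with the residual path and the other is concatenated to the dense path.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, res_channels=256, dense_channels=48,
                 mid_channels=96, dense_inc=16, groups=32):
        super().__init__()
        self.res_channels = res_channels
        in_channels = res_channels + dense_channels
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),   # conv_b: 1 x 1
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1,
                      groups=groups, bias=False),                  # conv_b_1: 3 x 3, grouped
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, res_channels + dense_inc, 1,
                      bias=False),                                 # conv_b_2: 1 x 1
            nn.BatchNorm2d(res_channels + dense_inc),
        )

    def forward(self, residual, dense):
        x = torch.cat([residual, dense], dim=1)       # merge the two paths channel-wise
        out = self.bottleneck(x)
        res_out, dense_out = torch.split(
            out, [self.res_channels, out.size(1) - self.res_channels], dim=1)
        residual = residual + res_out                 # residual path: element-wise sum
        dense = torch.cat([dense, dense_out], dim=1)  # concatenation path: channels grow by dense_inc
        return residual, dense

# Example: after one block, the dense path grows from 48 to 64 channels.
block = DualPathBlock()
res, dense = block(torch.randn(1, 256, 40, 40), torch.randn(1, 48, 40, 40))
print(res.shape, dense.shape)  # torch.Size([1, 256, 40, 40]) torch.Size([1, 64, 40, 40])
```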

3.1.2. Feature Fusion Module

The base network outputs feature maps of different resolutions that are responsible for predicting objects of different scales; for the assignment rule, we follow the anchor box matching strategy of SSD [6]. DSSD [19] and RetinaNet [24] have demonstrated that, for single-stage object detection algorithms, fusing features of different levels can improve detection performance. We therefore combined our base network with experimental tests to design an efficient feature fusion module, which accepts two input feature maps and generates one fused feature map. A 3 × 3 convolution is applied to the higher-resolution input, a 3 × 3 deconvolution is applied to the lower-resolution input, and the two results are summed to obtain the fused feature map, as shown in Figure 4. The effect of the deconvolution is similar to bilinear interpolation in that it increases the resolution of the feature map. The specific behaviour of the deconvolution kernel is described below and illustrated in Figure 5.
The deconvolution operation can be considered the inverse operation of convolution in terms of resolution. As is well known, there is a mathematical relationship between the resolution of the input and output in the convolution layer, and the mathematical expression is as follows:
$$O_n = \left\lfloor \frac{I_n + 2P_n - K_n}{S_n} \right\rfloor + 1, \quad n = h, w \tag{1}$$
where $O_n$ is the output size of the layer, $I_n$ is the input size, $P_n$ is the padding, $K_n$ is the kernel size, $S_n$ is the convolution stride and $n$ denotes the two spatial dimensions, height and width. The number of output channels depends on the number of convolution kernels.
As the inverse of convolution, the mathematical formula for the deconvolution is expressed as the following:
$$O_n = (I_n - 1) \times S_n + K_n - 2P_n + m, \quad n = h, w \tag{2}$$
where $m$ is the output padding of the layer, ranging from 0 to $S_n - 1$. Because of the floor rounding in Equation (1), one deconvolution input size corresponds to $S_n$ possible output sizes when $S_n$ is greater than 1, and the output padding $m$ selects one of them. Deconvolution appears to be the inverse process of convolution; however, apart from the feature resolution, there is no invertible relationship between the two in terms of numerical computation. The deconvolution layer is an ordinary learnable layer whose weights are also trained by gradient descent in the neural network. Therefore, each deconvolution layer could, in principle, be replaced by another convolution layer that performs the same recovery. A quick numerical check of these two formulas follows.
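The snippet below checks Equations (1) and (2) with PyTorch's Conv2d and ConvTranspose2d; the sizes are illustrative. A 3 × 3, stride-2 deconvolution with output_padding = 1 recovers the resolution halved by the corresponding convolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 20, 20)
down = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
# Eq. (1): floor((20 + 2*1 - 3) / 2) + 1 = 10
up = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
# Eq. (2): (10 - 1)*2 + 3 - 2*1 + 1 = 20
print(down(x).shape)       # torch.Size([1, 256, 10, 10])
print(up(down(x)).shape)   # torch.Size([1, 256, 20, 20])
```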
The feature fusion module can fuse features of different levels with different numbers of channels. We set up ablation experiments on the number of fusion channels, the fusion method and the number of fusion layers, corresponding to the variants shown in Figure 4; a sketch of the basic variant follows.
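The following is a minimal sketch of the fusion module of Figure 4a, assuming a common output width of 256 channels and an exact factor-of-two resolution gap between the two inputs (e.g. 40 × 40 and 20 × 20); the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """3 x 3 convolution on the higher-resolution map, 3 x 3 stride-2 deconvolution on
    the lower-resolution map, then element-wise sum ("Elts")."""
    def __init__(self, high_channels, low_channels, out_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(high_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(low_channels, out_channels, 3, stride=2,
                               padding=1, output_padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, high_res, low_res):
        return self.relu(self.conv(high_res) + self.deconv(low_res))

ffm = FeatureFusionModule(high_channels=512, low_channels=1024)
fused = ffm(torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))
print(fused.shape)  # torch.Size([1, 256, 40, 40])
```

For resolution pairs that are not an exact factor of two apart (e.g. 5 × 5 and 3 × 3, or 3 × 3 and 1 × 1), the padding and output padding of the deconvolution must be chosen according to Equation (2).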

3.1.3. Prediction Module

The prediction module has two sub-networks, one for classification and one for localization, which operate independently on each feature map. To ensure that the lower-level features are reused in the prediction, we designed a skip connection that sums the features of the first layer with those of the final output layer, as shown in Figure 6; the corresponding experimental results are shown in Table 2. A sketch of this head is given below.
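A minimal sketch of the prediction head of Figure 6a is shown below, assuming 256-channel inputs; the number of anchors, classes and trunk layers are illustrative rather than the exact settings of the paper.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, in_channels=256, num_anchors=8, num_classes=21):
        super().__init__()
        # Small residual trunk: its output is summed with the input (skip connection).
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 1, bias=False),
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 1, bias=False),
            nn.BatchNorm2d(in_channels),
        )
        self.relu = nn.ReLU(inplace=True)
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.loc_head = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, x):
        x = self.relu(x + self.trunk(x))           # residual connection reuses the input features
        return self.cls_head(x), self.loc_head(x)  # per-anchor class scores and box offsets

head = PredictionModule()
scores, offsets = head(torch.randn(1, 256, 40, 40))
print(scores.shape, offsets.shape)  # torch.Size([1, 168, 40, 40]) torch.Size([1, 32, 40, 40])
```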

3.2. Training Model

Our detector was developed with the PyTorch [25] framework and trained on an NVIDIA TITAN Xp GPU. Our training strategy is almost the same as that of SSD [6], including the data augmentation tricks of SSD [6] (random flip, random scale, random crop, random brightness and random rotation) and an SGD (stochastic gradient descent) solver. We pre-trained on ImageNet+5k, meaning that the network was pre-trained on ImageNet-5k before being fine-tuned on ImageNet-1k, and then trained on the PASCAL VOC and COCO datasets with a batch size of 14, a learning rate of 0.001 and 120,000 iterations, reducing the learning rate by a factor of 10 at the 80,000th and 100,000th iterations. This yields two models with input image resolutions of 320 and 512.
For the model trained with 320-pixel input images, the six feature maps use anchor-box step parameters of 8, 16, 32, 64, 107 and 320; minimum size parameters of 21, 45, 99, 153, 107 and 320; maximum size parameters of 45, 99, 153, 207, 261 and 315; and feature map resolutions of 1, 3, 5, 10, 20 and 40. For the experiments on the PASCAL VOC datasets, anchor-box aspect ratios of 1.6, 2.0 and 3.0 were used to generate eight anchor boxes per anchor point; for the experiments on the COCO datasets, following DSSD [19], aspect ratios of 2:3 were used to generate six anchor boxes per anchor point for comparison with the related models. A sketch of the anchor generation convention is given below.
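The following sketch shows the SSD-style anchor generation convention that this setup inherits from SSD [6], for one feature map of the 320-pixel model; the helper name and the normalization choice are illustrative. With ratios 1.6, 2.0 and 3.0 it produces 2 + 2 × 3 = 8 boxes per point, and summing 8 × (40² + 20² + 10² + 5² + 3² + 1²) over the six maps gives the 17,080 anchor boxes reported in Table 2.

```python
from math import sqrt
import torch

def anchors_for_level(fmap_size, step, min_size, max_size, ratios, image_size=320):
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) * step / image_size        # anchor point center, normalized
            cy = (i + 0.5) * step / image_size
            s = min_size / image_size
            boxes.append([cx, cy, s, s])              # square box at min_size
            s2 = sqrt(min_size * max_size) / image_size
            boxes.append([cx, cy, s2, s2])            # square box at the geometric mean
            for r in ratios:                          # a pair of boxes per aspect ratio
                boxes.append([cx, cy, s * sqrt(r), s / sqrt(r)])
                boxes.append([cx, cy, s / sqrt(r), s * sqrt(r)])
    return torch.tensor(boxes)                        # (cx, cy, w, h), normalized to [0, 1]

# The 40 x 40 map: 40 * 40 * 8 = 12,800 anchors
print(anchors_for_level(40, 8, 21, 45, [1.6, 2.0, 3.0]).shape)  # torch.Size([12800, 4])
```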
The training loss function is the combination of the Smooth L1 localization loss and the Softmax classification loss. Before training, the ground-truth boxes are encoded as offsets with respect to the anchor boxes, which effectively reduces the learning difficulty. The mathematical formulation is as follows:
$$L(X, c, l, a, g) = \frac{1}{N}\Big(L_{conf}(X, c) + \alpha L_{loc}(X, l, a, g)\Big) \tag{3}$$
$$L_{conf}(X, c) = -\sum_{i \in Pos}^{N} X_{ij}^{p} \log\big(\hat{c}_{i}^{p}\big) - \sum_{i \in Neg} \log\big(\hat{c}_{i}^{0}\big) \tag{4}$$
$$\text{where} \quad \hat{c}_{i}^{p} = \frac{\exp\big(c_{i}^{p}\big)}{\sum_{p}\exp\big(c_{i}^{p}\big)} \tag{5}$$
$$L_{loc}(l, a, g) = \sum_{i \in Pos}^{N} \; \sum_{m \in \{cx, cy, w, h\}} \mathrm{smooth}_{L1}\big(l_{i}^{m} - \hat{g}_{j}^{m}\big) \tag{6}$$
$$\hat{g}_{j}^{cx} = \big(g_{j}^{cx} - a_{i}^{cx}\big) / a_{i}^{w} \tag{7}$$
$$\hat{g}_{j}^{cy} = \big(g_{j}^{cy} - a_{i}^{cy}\big) / a_{i}^{h} \tag{8}$$
$$\hat{g}_{j}^{w} = \log\big(g_{j}^{w} / a_{i}^{w}\big) \tag{9}$$
$$\hat{g}_{j}^{h} = \log\big(g_{j}^{h} / a_{i}^{h}\big) \tag{10}$$
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{11}$$
where X is the prediction vector for classification, α balances the importance of the two losses, c is the classification label, l is the prediction vector for localization, a denotes the coordinates of the anchor box and g denotes the offset of the ground truth with respect to the anchor box. $L_{conf}$ is the classification loss, $L_{loc}$ is the localization loss, i is the index of the anchor boxes, j is the index of the ground truth in an image and p is the index of each category in the classification vector. In Equations (7)–(10), cx and cy are the coordinates of the center point of the bounding box, and w and h are its width and height.
Substituting Equations (7)–(11) into Equation (6) gives the complete expression of the localization loss, substituting Equation (5) into Equation (4) gives the classification loss, and, finally, substituting Equations (4) and (6) into Equation (3) gives the total loss. A sketch of the offset encoding and the loss computation is given below.
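The following is a minimal sketch of the offset encoding of Equations (7)–(10) and the loss of Equations (3)–(6) and (11), assuming the anchor-to-ground-truth matching has already been performed; the hard negative mining and variance scaling used in the SSD implementation are omitted, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_offsets(gt, anchors):
    """gt, anchors: (N, 4) tensors in (cx, cy, w, h) format, one ground-truth box per anchor."""
    g_cx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]   # Eq. (7)
    g_cy = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]   # Eq. (8)
    g_w = torch.log(gt[:, 2] / anchors[:, 2])           # Eq. (9)
    g_h = torch.log(gt[:, 3] / anchors[:, 3])           # Eq. (10)
    return torch.stack([g_cx, g_cy, g_w, g_h], dim=1)

def multibox_loss(cls_logits, loc_preds, labels, loc_targets, alpha=1.0):
    """cls_logits: (A, C), loc_preds: (A, 4), labels: (A,) with 0 = background."""
    pos = labels > 0
    n = pos.sum().clamp(min=1).float()                                 # N in Eq. (3)
    conf_loss = F.cross_entropy(cls_logits, labels, reduction="sum")   # softmax loss, Eqs. (4)-(5)
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos],
                                reduction="sum")                       # Eqs. (6) and (11)
    return (conf_loss + alpha * loc_loss) / n                          # Eq. (3)
```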

4. Experiments

4.1. Experiment Consideration

Our detector was evaluated on the PASCAL VOC [26] and COCO [27] datasets, the former with 20 object classes and the latter with 80 object classes. For the PASCAL VOC datasets, we followed the protocol in [10] and combined VOC 2007 trainval and VOC 2012 trainval as training sets for training and testing on the VOC 2007 test. For the COCO datasets, to compare with the previous algorithm, we combined train2014 and valminusminival2014 as training sets for training and testing on the test-dev2015 test set.
We used the mean average precision (mAP) as the core evaluation criterion. For PASCAL VOC, we report the mAP score at an IoU (Intersection over Union) threshold of 0.5. For COCO, we used the evaluation metrics provided by the dataset itself. The experiments on PASCAL VOC and COCO verify the effectiveness of our proposed dual-path pyramid paradigm, while the ablation experiments on PASCAL VOC explore different network structures of DPSSD. A short helper illustrating the IoU criterion is given below.
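For reference, the sketch below shows the IoU computation underlying the VOC criterion (a detection counts as correct when its IoU with a same-class ground-truth box is at least 0.5); the function name and box format are illustrative.

```python
def iou(box_a, box_b):
    """Boxes in (x1, y1, x2, y2) format; returns the intersection-over-union."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / (100 + 100 - 25) ≈ 0.143
```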
The GPU we used was a TITAN Xp (PCIe, SSE2), and the CPU was an Intel Core i7-8700K @ 3.70 GHz × 12. The training time of the model was 19 h on the PASCAL VOC datasets and 41 h on the COCO datasets.

4.2. Experiment on PASCAL VOC

We designed a dual-path network that generates feature maps of different depths and resolutions and fuses feature maps of different resolutions through a feature fusion module designed specifically for it, forming a multi-scale object detection paradigm that learns more discriminative features.
We compared it with similar algorithms that improve on SSD [6] to demonstrate that our proposed multi-scale feature learning paradigm achieves better detection results. SSD [6] belongs to structure (b) in Figure 1, STDN [18] to structure (c), DSSD [19] to structure (d) and DPSSD (ours) to structure (e). The experimental results are shown in Table 3. The accuracy of DPSSD320 is 2.6% higher than that of DSSD321, DPSSD512 is 1.4% higher than DSSD513, DPSSD320 is 1.9% higher than STDN321, DPSSD512 is 2.0% higher than STDN513, DPSSD320 is 3.7% higher than SSD300 and DPSSD512 is 3.4% higher than SSD512. Our model provides a significant improvement relative to its complexity and computational cost, and it enables the reuse of shallow features and the exploration of new features, which helps to generate robust and discriminative features with good detection performance.
These results reflect the accuracy improvement of our multi-scale detector and the effectiveness of the dual-path feature pyramid in object detection. We believe the reason is that our dual-path network, together with the feature fusion module, increases the amount of information in the feature maps of the pyramid at different scales and enhances the connection between low-level and high-level features, while maintaining a good computational speed, as shown in Figure 7.
The datasets contain a total of 20 object classes, and our model achieves good detection accuracy for aeroplane, bird, boat, bottle, car, chair, cow, person, plant and sofa. On the other categories, the gap between our method and the methods listed in the table is no more than 1.5%. This further validates the generalization ability of the model for multi-scale detection.

4.3. Ablation Experiment on the PASCAL VOC

We designed a series of comparative experiments on PASCAL VOC2007 [26] to verify the effectiveness and rationality of each module in DPSSD. The results are shown in Table 2.
In Table 2, DPN denotes the dual-path network, FFM denotes the feature fusion module and PM denotes the prediction module. DPN + PM means that we used our dual-path network to extract CNN features at different depths for object detection; DPN + FFM means that we additionally obtained multi-scale feature maps through a feature fusion module. Using only the base network, the mAP of DPSSD320 was 78.9%, which is 5.1% higher than that of SSD [6] and proves that the base network is effective.
From the first to the fifth rows of Table 2, we can see that the four feature fusion variants all improved the detection accuracy to some degree, and the best one increased the accuracy from 78.9% to 81.2%. We believe the reason is that recognizing an object's category and locating its position require both categorical and geometric cues, and the fusion of deep semantic features with shallow geometric-spatial features in the neural network matches the way humans recognize and localize objects in space.
The second and sixth rows of Table 2 show that a prediction module with a residual connection slightly improves the detection accuracy, from 80.9% to 81.2%. The reason is that the shortcut increases the reuse of features.
The channel division in each stage of our base network can be implemented either with a 1 × 1 convolution or with channel segmentation. As shown in the second and seventh rows of Table 2, the channel segmentation approach gives better detection results than the 1 × 1 convolution approach.

4.4. Experiment on the Microsoft COCO

We also trained two models, DPSSD320 and DPSSD512, on the Microsoft COCO datasets [27] to further evaluate our detectors; the results are shown in Table 4. We trained on the union of train2014 and valminusminival2014 and tested on test-dev2015. The different training sets of the methods listed in Table 4 do not affect the evaluation, because the training set only affects the training stage while the evaluation metrics in the table were computed on the same test set. The whole evaluation was carried out on the official server, and the ground-truth labels of the test set are not publicly available.
We focus on comparing the four methods of SSD [6], DSSD [19], STDN [18] and DPSSD (ours) because they are single-stage methods and are the same except for the structure of the feature pyramid. The average accuracy of SSD300, DSSD321, STDN300 and DPSSD320 (ours) on the test-dev2015 test set reached 25.1%, 28.0%, 28.0% and 30.6%, respectively. SSD512, DSSD513, STDN513 and DPSSD512 (ours) reached 28.8%, 33.2%, 31.8% and 33.9%, respectively. It can be seen that the average accuracy of our proposed model for multi-scale object detection holds an advantage. The experimental results further validate the effectiveness of our proposed dual-path feature pyramid paradigm. As shown in Figure 1e and Figure 3a, the dual-path convolution block can improve the efficiency of the feature pyramid in object detection.
However, our DPSSD512 was slightly lower than DSSD513 in terms of average accuracy and average recall for small and large objects. The density of the proposed regions and the accuracy of localization affect the recall rate [28]. We used six anchor boxes per anchor point for training, and the number of anchor boxes determines the density of the proposed regions: the denser the proposals, the fewer the missed detections and the higher the recall, but at a greatly increased computational cost. In addition, considering the detection performance on objects of different scales, the model achieves a detection accuracy of 51.2% for medium-scale objects and 20.6% and 64.3% for small and large objects, respectively. By comparison, the detection of small objects by DPSSD still has much room for improvement. Small objects occupy a small area in the image and therefore provide fewer features to exploit; the semantic information of the surrounding environment should be used to improve small-object detection [29]. Modeling the semantic relationship between the environment and objects with neural networks is our next research focus.

4.5. Experiment on Inference Speed

We tested the 4952 images of the PASCAL VOC2007 test set on a Titan Xp and an Intel Core i7-8700K CPU @ 3.70 GHz with a batch size of 1 to measure the inference speed of our DPSSD models. The main factors affecting the detection speed are the complexity of the model, the amount of computation and the data transfer speed of the hardware.
The results are shown in Table 5. For comparison, we ran the official code and trained models of SSD [6] and DSSD [19] and conducted the tests in the same hardware environment. We focus on comparing SSD (copied), DSSD (copied), STDN [18] and DPSSD (ours) because they differ only in the structure of the feature pyramid.
We plotted a scatter plot of accuracy versus speed, shown in Figure 7, to visualize the strengths and weaknesses of each algorithm; a good detector should lie close to the top-right corner of the graph. DPSSD320 achieves a good trade-off between speed and accuracy, while DPSSD512 is highly accurate but relatively slow. The dual-path feature pyramid is thus more effective than the other pyramid structures for object detection.

5. Conclusions

Our contribution is the validation of a new feature pyramid paradigm, the dual-path feature pyramid, which gives researchers a new way to construct their own feature pyramids and optimize multi-scale problems. We improved a dual-path network and a feature fusion module specifically for anchor-based object detection, which greatly improves the quality of the features extracted by convolutional neural networks with their powerful learning capability. To verify its effectiveness, we trained the Dual-Path Single-Shot Detector (DPSSD) on the PASCAL VOC and COCO datasets following the SSD [6] strategy and compared it with detectors built on different pyramid paradigms. The extensive experiments above show that the dual-path single-shot detector achieves a good trade-off between speed and accuracy: at 30.7 FPS, DPSSD320 obtained 81.2 mAP on VOC 2007, and at 21.3 FPS, DPSSD512 obtained 82.9 mAP. Our detector therefore retains advantages over comparable state-of-the-art detection algorithms.
In future work, we will continue our research on object detection. Specifically, we will address the problem of sample imbalance, explore techniques that can further improve the detection of small objects and investigate applications of object detectors in edge computing [29].

Author Contributions

Conceptualization, D.S. and Y.X.; methodology, D.S. and Y.X.; software, X.W. and Y.X.; validation, G.Y., C.Z. and M.Z.; formal analysis, Y.X. and D.H.; investigation, Y.X.; resources, D.S.; data curation, C.Z. and G.Y.; writing—original draft preparation, Y.X.; writing—review and editing, X.W. and P.Z.; visualization, M.Z. and D.H.; supervision, D.S. and P.Z.; project administration, D.S.; funding acquisition, D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of Shandong Provincial Major Scientific and Technological Innovation, grant No. 2019JZZY010444, No. 2019TSLH0315; in part by the Project of 20 Policies of Facilitate Scientific Research in Jinan Colleges, grant No. 2019GXRC063; in part by the Project of Shandong Province Higher Educational Science and Technology Program, grant No. J18KA345; and in part by the Natural Science Foundation of Shandong Province of China, grant No. ZR2020MF138.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Researchers can replicate the reported results of this paper. A detailed explanation is available at https://github.com/Willie-Xu/DPSSD (accessed on 11 February 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
4. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64.
5. Singh, B.; Davis, L.S. An analysis of scale invariance in object detection SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3578–3587.
6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
7. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 354–370.
8. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883.
9. Kong, T.; Yao, A.; Chen, Y.; Sun, F. HyperNet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 845–853.
10. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
11. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
12. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual path networks. Adv. Neural Inf. Process. Syst. 2017, 30.
13. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
14. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29.
15. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587.
18. Zhou, P.; Ni, B.; Geng, C.; Hu, J.; Xu, Y. Scale-transferrable object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 528–537.
19. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
20. Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2Det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9259–9266.
21. Wu, X.; Sahoo, D.; Zhang, D.; Zhu, J.; Hoi, S.C.H. Single-shot bidirectional pyramid networks for high-quality object detection. Neurocomputing 2020, 401, 1–9.
22. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
23. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
24. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
25. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
26. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
27. Lin, T.-Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
28. Oksuz, K.; Cam, B.C.; Kalkan, S.; Akbas, E. Imbalance problems in object detection: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3388–3415.
29. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318.
Figure 1. Five paradigms of multi-scale object detection. (a) Image Pyramid: it learns multiple detectors from images of different scales. (b) Prediction Pyramid: it predicts on multiple feature maps. (c) Integrated Features: they predict on a single feature map generated from multiple features. (d) Feature Pyramid: it combines the structure of the prediction pyramid and integrated features. (e) Dual-Path Feature Pyramid: it uses the structure of the prediction pyramid and two methods of feature fusion.
Figure 2. The architecture of the dual-path single-shot detector. We designed a dual-path network and a feature fusion module to obtain six fused high-level and low-level features. Finally, classification and bounding-box regression are carried out by 1 × 1 convolution. The figure shows that several layers of the base network are extracted as the features for predicting objects of different sizes.
Figure 3. Two paradigms of the dual-path block: (a) feature segmentation realized by channel merging; (b) feature segmentation realized by a 1 × 1 convolution operation.
Figure 4. Four paradigms of the feature fusion module. (a) Using a two-layer convolution operation, features are fused by sum. (b) Using a two-layer convolution operation, features are fused by product. (c) Using a one-layer convolution operation, features are fused by sum. (d) Changing the number of channels of the fused features and using a two-layer convolution operation, features are fused by sum.
Figure 5. The paradigm of the deconvolution operation.
Figure 6. Two paradigms of the prediction module: (a) prediction module with a residual connection; (b) prediction module without a residual connection.
Figure 7. Accuracy and speed on PASCAL VOC2007.
Table 1. Dual-path network architecture.

| Stage | Layer | Size/Stride | Groups | Output |
|---|---|---|---|---|
| Conv1_1 | conv | 7 × 7/2 | 1 | 80 × 80 × 64 |
| | maxpool | 3 × 3/2 | 1 | |
| Conv2_1 | conv_a | 1 × 1/1 | 1 | 80 × 80 × 256, 80 × 80 × 48 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv2_2 - Conv2_3 | conv_b | 1 × 1/1 | 1 | 80 × 80 × 256, 80 × 80 × 80 |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv3_1 | conv_a | 1 × 1/2 | 1 | 40 × 40 × 512, 40 × 40 × 96 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv3_2 - Conv3_4 | conv_b | 1 × 1/1 | 1 | 40 × 40 × 512, 40 × 40 × 128 (+32) |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv4_1 | conv_a | 1 × 1/2 | 1 | 20 × 20 × 1024, 20 × 20 × 72 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv4_2 - Conv4_20 | conv_b | 1 × 1/1 | 1 | 20 × 20 × 1024, 20 × 20 × 96 (+24) |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv5_1 | conv_a | 1 × 1/1 | 1 | 20 × 20 × 1024, 20 × 20 × 384 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv5_2 - Conv5_3 | conv_b | 1 × 1/1 | 1 | 20 × 20 × 1024, 20 × 20 × 512 (+128) |
| | conv_b_1 | 3 × 3/1 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv6_1 | conv_a | 1 × 1/2 | 1 | 10 × 10 × 1024, 10 × 10 × 384 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv7_1 | conv_a | 1 × 1/2 | 1 | 5 × 5 × 1024, 5 × 5 × 384 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv8_1 | conv_a | 1 × 1/2 | 1 | 3 × 3 × 1024, 3 × 3 × 384 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
| Conv9_1 | conv_a | 1 × 1/2 | 1 | 1 × 1 × 1024, 1 × 1 × 384 |
| | conv_b | 1 × 1/1 | 1 | |
| | conv_b_1 | 3 × 3/2 | 32 | |
| | conv_b_2 | 1 × 1/1 | 1 | |
Table 2. Ablation study on the PASCAL VOC 2007 test set.

| Method | mAP | Anchor Boxes | Input Resolution |
|---|---|---|---|
| DPN (a) + PM (a) | 78.9 | 17,080 | 320 × 320 |
| DPN (a) + FFM (a) + PM (a) | 81.2 | 17,080 | 320 × 320 |
| DPN (a) + FFM (b) + PM (a) | 80.6 | 17,080 | 320 × 320 |
| DPN (a) + FFM (c) + PM (a) | 80.8 | 17,080 | 320 × 320 |
| DPN (a) + FFM (d) + PM (a) | 80.5 | 17,080 | 320 × 320 |
| DPN (a) + FFM (a) + PM (b) | 80.9 | 17,080 | 320 × 320 |
| DPN (b) + FFM (a) + PM (a) | 67.1 | 17,080 | 320 × 320 |
Table 3. PASCAL VOC2007 test detection results. All models were trained on the joint training set of VOC 2007 trainval and 2012 trainval and were tested on the VOC 2007 test dataset.

| Method | SSD300 [6] | SSD512 [6] | STDN300 [18] | STDN321 [18] | STDN513 [18] | DSSD321 [19] | DSSD513 [19] | DPSSD320 (Ours) | DPSSD512 (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Network | VGG | VGG | DenseNet-169 | DenseNet-169 | DenseNet-169 | Residual-101 | Residual-101 | DPN | DPN |
| mAP | 77.5 | 79.5 | 78.1 | 79.3 | 80.9 | 78.6 | 81.5 | 81.2 | 82.9 |
| aero | 79.5 | 84.8 | 81.1 | 81.2 | 86.1 | 81.9 | 86.6 | 88.5 | 87.9 |
| bike | 83.9 | 85.1 | 86.9 | 88.3 | 89.3 | 84.9 | 86.2 | 87 | 88 |
| bird | 76 | 81.5 | 76.4 | 78.1 | 79.5 | 80.5 | 82.6 | 82.3 | 87.1 |
| boat | 69.6 | 73 | 69.2 | 72.2 | 74.3 | 68.4 | 74.9 | 76.2 | 79.9 |
| bottle | 50.5 | 57.8 | 52.4 | 54.3 | 61.9 | 53.9 | 62.5 | 56.5 | 66.3 |
| bus | 87 | 87.8 | 87.7 | 87.6 | 88.5 | 85.6 | 89 | 88.7 | 88.5 |
| car | 85.7 | 88.3 | 84.2 | 86.5 | 88.3 | 86.2 | 88.7 | 88.2 | 89 |
| cat | 88.1 | 87.4 | 88.3 | 88.8 | 89.4 | 88.9 | 88.8 | 88.4 | 88.4 |
| chair | 60.3 | 63.5 | 60.2 | 63.5 | 67.4 | 61.1 | 65.2 | 67.4 | 71.2 |
| cow | 81.5 | 85.4 | 81.3 | 83.2 | 86.5 | 83.5 | 87 | 84.6 | 87.3 |
| table | 77 | 73.2 | 77.6 | 79.4 | 79.5 | 78.7 | 78.7 | 77.3 | 79.2 |
| dog | 86.1 | 86.2 | 86.6 | 86.1 | 86.4 | 86.7 | 88.2 | 86.7 | 88 |
| horse | 87.5 | 86.7 | 88.9 | 89.3 | 89.2 | 88.7 | 89 | 89 | 89.1 |
| mbike | 83.9 | 83.9 | 87.8 | 88 | 88.5 | 86.7 | 87.5 | 87.8 | 87.3 |
| person | 79.4 | 82.5 | 76.8 | 77.3 | 79.3 | 79.7 | 83.7 | 80.9 | 85 |
| plant | 52.3 | 55.6 | 51.8 | 52.5 | 53 | 51.7 | 51.1 | 59.5 | 59 |
| sheep | 77.9 | 81.7 | 78.4 | 80.3 | 77.9 | 78 | 86.3 | 84.3 | 86.1 |
| sofa | 79.5 | 79 | 81.3 | 80.8 | 81.4 | 80.9 | 81.6 | 83.7 | 81.9 |
| train | 87.6 | 86.6 | 87.5 | 86.3 | 86.6 | 87.2 | 85.7 | 87 | 86.2 |
| tv | 76.8 | 80 | 77.8 | 82.1 | 85.5 | 79.4 | 83.7 | 80.6 | 82.8 |
Table 4. COCO test-dev2015 detection results.

| Method | Data | Network | AP (IoU 0.5:0.95) | AP (IoU 0.5) | AP (IoU 0.75) | AP (S) | AP (M) | AP (L) | AR (1) | AR (10) | AR (100) | AR (S) | AR (M) | AR (L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSD300 [6] | trainval35k | VGG | 25.1 | 43.1 | 25.8 | 6.6 | 25.9 | 41.4 | 23.7 | 35.1 | 37.2 | 11.2 | 40.4 | 58.4 |
| SSD512 [6] | trainval35k | VGG | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5 | 26.1 | 39.5 | 42.0 | 16.5 | 46.6 | 60.8 |
| DSSD321 [19] | trainval35k | Residual-101 | 28.0 | 46.1 | 29.2 | 7.4 | 28.1 | 47.6 | 25.5 | 37.1 | 39.4 | 12.7 | 42.0 | 62.6 |
| DSSD513 [19] | trainval35k | Residual-101 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 28.9 | 43.5 | 46.2 | 21.8 | 49.1 | 66.4 |
| STDN300 [18] | trainval | DenseNet-169 | 28.0 | 45.6 | 29.4 | 7.9 | 29.7 | 45.1 | 24.4 | 36.1 | 38.7 | 12.5 | 42.7 | 60.1 |
| STDN513 [18] | trainval | DenseNet-169 | 31.8 | 51.0 | 33.6 | 14.4 | 36.1 | 43.4 | 27.0 | 40.1 | 41.9 | 18.3 | 48.3 | 57.2 |
| DPSSD320 (ours) | trainval35k | DPN | 30.6 | 50.2 | 32.2 | 10.3 | 32.0 | 47.6 | 26.8 | 39.5 | 41.5 | 16.1 | 44.9 | 62.6 |
| DPSSD512 (ours) | trainval35k | DPN | 33.9 | 53.8 | 36.3 | 14.5 | 37.5 | 48.7 | 28.7 | 43.4 | 45.7 | 20.6 | 51.2 | 64.3 |
Table 5. The speed and accuracy of the algorithms are summarized as follows. The training data are the combination of VOC2007 trainval and VOC2012 trainval.

| Method | Base Network | mAP | Speed (fps) | Anchor Boxes | GPU | Input Resolution |
|---|---|---|---|---|---|---|
| SSD300 [6] | VGG16 | 77.5 | 46 | 8732 | Titan X | 300 × 300 |
| SSD512 [6] | VGG16 | 79.5 | 19 | 24,564 | Titan X | 512 × 512 |
| SSD300 (copied) | VGG16 | 77.6 | 49 | 8732 | Titan Xp | 300 × 300 |
| SSD512 (copied) | VGG16 | 79.7 | 24 | 24,564 | Titan Xp | 512 × 512 |
| DSSD321 [19] | Residual-101 | 78.6 | 9.5 | 17,080 | Titan X | 321 × 321 |
| DSSD513 [19] | Residual-101 | 81.5 | 5.5 | 43,688 | Titan X | 513 × 513 |
| DSSD321 (copied) | Residual-101 | 78.7 | 12.7 | 17,080 | Titan Xp | 321 × 321 |
| DSSD513 (copied) | Residual-101 | 81.3 | 9.8 | 43,688 | Titan Xp | 513 × 513 |
| STDN300 [18] | DenseNet-169 | 78.1 | 41.5 | 13,888 | Titan Xp | 300 × 300 |
| STDN321 [18] | DenseNet-169 | 79.2 | 40.1 | 17,080 | Titan Xp | 321 × 321 |
| STDN513 [18] | DenseNet-169 | 80.9 | 28.6 | 43,680 | Titan Xp | 513 × 513 |
| DPSSD320 (ours) | DPN | 81.2 | 30.7 | 17,080 | Titan Xp | 320 × 320 |
| DPSSD512 (ours) | DPN | 82.9 | 21.3 | 43,680 | Titan Xp | 512 × 512 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Citation: Shan, D.; Xu, Y.; Zhang, P.; Wang, X.; He, D.; Zhang, C.; Zhou, M.; Yu, G. DPSSD: Dual-Path Single-Shot Detector. Sensors 2022, 22, 4616. https://doi.org/10.3390/s22124616
