An Efficient Deep Convolutional Neural Network Approach for Object Detection and Recognition Using a Multi-Scale Anchor Box in Real-Time
Abstract
1. Introduction and Scope
1.1. Contributions
- (1) Design and implementation of a model for the accurate detection and recognition of objects in a video sequence.
- (2) Design of a model to better understand the parameters and hyper-parameters that affect the detection and recognition of objects of varying sizes and shapes.
- (3) Achievement of real-time object detection and recognition speeds while improving accuracy.
- (4) Development of implementations that take full advantage of the GPU.
1.2. Novelty
- (1) Experimentation to detect objects of varying sizes from a video sequence.
- (2) Comparison of the proposed model with existing work.
- (3) Generation of a multi-scale anchor box to obtain better results.
- (4) An efficient multi-scale anchor box approach in which the anchor boxes are arranged in descending order of scale (i.e., the large-scale anchor box is evaluated first, moving toward the small-scale anchor box). If no information is present in the large-scale anchor box, there is no need to move to a small-scale anchor box. This saves the execution time otherwise spent predicting a score for every anchor box, because it reduces the search space for object recognition; a minimal sketch of this early-exit idea follows.
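The sketch below illustrates the early-exit evaluation under stated assumptions: `evaluate_anchors`, `has_information`, `predict_score`, and the `area` attribute are hypothetical names introduced for illustration, not identifiers from the paper's code.

```python
def evaluate_anchors(anchor_boxes, has_information, predict_score):
    """Early-exit evaluation of multi-scale anchors at one image location.

    `anchor_boxes`, `has_information`, and `predict_score` are assumed
    inputs/helpers (hypothetical names, not from the paper's code).
    """
    scores = []
    # Largest anchors first: smaller anchors are nested inside them.
    for box in sorted(anchor_boxes, key=lambda b: b.area, reverse=True):
        # An empty large anchor implies the smaller anchors inside it are
        # empty too, so the remaining predictions can be skipped.
        if not has_information(box):
            break
        scores.append((box, predict_score(box)))
    return scores
```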
1.3. Outline
2. Background Study
2.1. Building Blocks of CNN
2.1.1. R-CNN
2.1.2. Fast R-CNN
2.1.3. Faster R-CNN
2.1.4. Mask R-CNN
2.1.5. YOLO Versions
2.1.6. RefineDet512
2.1.7. CenterNet
3. Methods
3.1. Proposed Architecture
3.2. Efficient Multi-Scale Anchor Box Approach
- (1) An optimized multi-scale anchor box detects whether information is present. This is done in the following manner:
- The Canny edge detector algorithm [23] is applied to the anchor box portion only. The Canny edge detector requires minimum and maximum threshold values to decide whether a detected edge is weak or strong. To automate this choice, the Otsu binary threshold [24] is used; it supplies the minValue and maxValue thresholds.
- The step above produces an edge image in which the frequencies of black and white pixels are counted. If there are fewer than 30 white pixels, information is absent; in all other cases, information is present. The threshold of 30 is a hyper-parameter chosen by trial and error; the value reported here is for the Pascal VOC-2007 dataset. A sketch of this check appears after the list.
- (2) To carry out small object detection in an optimized way, the anchor boxes are arranged in descending order of scale (i.e., the large-scale anchor box first, moving toward the small-scale anchor box). If no information is present in the large-scale anchor box, there is no need to move to a small-scale anchor box, which saves the execution time otherwise spent predicting a score for every anchor box.
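A minimal sketch of the presence check in step (1), using OpenCV. The paper specifies Otsu's method for the thresholds and Canny for the edges; taking half of the Otsu threshold as the low Canny threshold is a common convention and an assumption here, since the section does not spell out how minValue is derived. The function name `contains_information` is hypothetical, and the input is assumed to be the BGR crop inside the anchor box, matching the "anchor box portion only" restriction above.

```python
import cv2
import numpy as np

def contains_information(anchor_patch, white_pixel_threshold=30):
    """Return True if an anchor-box crop (BGR image) appears to contain information."""
    gray = cv2.cvtColor(anchor_patch, cv2.COLOR_BGR2GRAY)
    # Otsu's binarization returns the automatically chosen threshold.
    otsu_thresh, _ = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    # Assumption: half of the Otsu threshold serves as the low Canny value.
    edges = cv2.Canny(gray, 0.5 * otsu_thresh, otsu_thresh)
    # Fewer than 30 white (edge) pixels is treated as "no information";
    # 30 is the trial-and-error hyper-parameter reported for Pascal VOC-2007.
    return int(np.count_nonzero(edges)) >= white_pixel_threshold
```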
4. Experiments
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36.
- Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48.
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
- Palmer, R.; West, G.; Tan, T. Scale Proportionate Histograms of Oriented Gradients for Object Detection in Co-Registered Visual and Range Data. In Proceedings of the 2012 IEEE International Conference on Digital Image Computing Techniques and Applications (DICTA), Fremantle, Australia, 3–5 December 2012; pp. 1–8.
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Computer Vision–ECCV 2006, Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; Part I; Volume 3951, pp. 407–417.
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Yi, D.; Su, J.; Chen, W.-H. Probabilistic faster R-CNN with stochastic region proposing: Towards object detection and recognition in remote sensing imagery. Neurocomputing 2021, 459, 290–301.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 29 October 2017; pp. 2961–2969.
- Modwel, G.; Mehra, A.; Rakesh, N.; Mishra, K.K. A Robust Real Time Object Detection and Recognition Algorithm for Multiple Objects. Recent Adv. Comput. Sci. Commun. 2021, 14, 331–338.
- IoU Accuracy. Available online: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ (accessed on 25 July 2021).
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Salam, H.; Jaleel, H.; Hameedi, S. You Only Look Once (YOLOv3): Object Detection and Recognition for Indoor Environment. Multicult. Educ. 2021, 7, 171.
- Hong, Z.; Yang, T.; Tong, X.; Zhang, Y.; Jiang, S.; Zhou, R.; Han, Y.; Wang, J.; Yang, S.; Liu, S. Multi-Scale Ship Detection From SAR and Optical Imagery Via A More Accurate YOLOv3. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6083–6101.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- PyTorch Hub. Ultralytics/yolov5: v4.0—nn.SiLU() Activations, Weights & Biases Logging, PyTorch Hub Integration. 2021. Available online: https://codechina.csdn.net/ForrestGump92/yolov5/-/tree/silu (accessed on 26 November 2021).
- Han, G.; Zhang, X.; Li, C. Single shot object detection with top-down refinement. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3360–3364.
- Furlán, F.; Rubio, E.; Sossa, H.; Ponce, V. CNN based detectors on planetary environments: A performance evaluation. Front. Neurorobot. 2020, 14, 85.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 6568–6577.
- Canny Edge Detector Algorithm. Available online: https://towardsdatascience.com/canny-edge-detection-step-by-step-in-python-computer-vision-b49c3a2d8123 (accessed on 12 August 2021).
Model | Highlights | Limitations |
---|---|---|
R-CNN | Introduced to obtain better accuracy than HOG-based detectors. Uses selective search because a CNN cannot be run on the many patches created by a sliding-window detector. The first model to integrate region proposal (RP) methods with a CNN; improved performance over the previous state-of-the-art methods. | Training is expensive in time and space. Testing is slow. |
Fast R-CNN | End-to-end detector training; introduces the RoI pooling layer; the multi-task objective with both a classification head and a bounding-box regression head reduces overall training time and increases accuracy. | Execution time is too high for real-time applications. |
Faster R-CNN | Proposes an RPN in place of selective search for generating high-quality region proposals. Introduces translation-invariant, multi-scale anchor boxes as references in the RPN. An order of magnitude faster than Fast R-CNN without loss of performance. | Complex training, and falls short of real-time. |
Mask R-CNN | Extends Faster R-CNN by adding a branch that predicts the object mask alongside the existing bounding-box prediction branch; outstanding performance. | Falls short of real-time application. |
YOLO (You Only Look Once) | A classification score is predicted for every class for each box during training. The image is divided into an S × S grid, so the total number of predicted boxes is S × S × N. It sees the complete image at once. | Struggles with small objects. |
YOLO v1 | The first efficient unified detector, significantly faster than previous detectors. | Lower accuracy than the state-of-the-art detectors, and struggles with small object detection. |
YOLO v2 | Uses a number of existing strategies to improve accuracy and speed. | Not good at detecting small objects. |
YOLO v3 | Detection is carried out at three different scales by applying a 1 × 1 detection kernel to feature maps of three different sizes at three separate locations in the network. | An increase in grid size increases the number of anchor boxes, which in turn increases the detection time. |
RefineDet 512 | Refines the sizes and locations of the anchor boxes with an anchor refinement module and an object detection module. | As a two-step cascaded process, real-time detection is slow. |
CenterNet | Detects each object as a triplet of keypoints, which improves both recall and precision. It explores the central part of the proposal, i.e., the region close to the geometric centre. | An extra branch identifies the centre keypoint, which adds overhead to object detection and reduces real-time performance. |
Sr. No. | Model | Year | mAP (%) | Speed (FPS) |
---|---|---|---|---|
1 | R-CNN | 2014 | 66.0 | 0.05 |
2 | Fast R-CNN | 2015 | 70.0 | 0.5 |
3 | Faster R-CNN | 2015 | 73.2 | 7 |
4 | Mask R-CNN | 2017 | 78.2 | 0.1 |
5 | YOLO v1 | 2016 | 63.4 | 14 |
6 | YOLO v2 | 2017 | 76.01 | 12 |
7 | YOLO v3 | 2018 | 81.7 | 9 |
8 | YOLO v4 | 2020 | 83.5 | 10 |
9 | YOLO v5 | 2021 | 84.12 | 6 |
10 | RefineDet512 | 2018 | 81.8 | 6 |
11 | CenterNet | 2019 | 78.7 | 0.3 |
12 | CenterNet2 | 2021 | 81.69 | 2 |
13 | Proposed Model | 2021 | 84.49 | 11 |
Characteristics | Existing Model | Proposed Model |
---|---|---|
Grid | 7 × 7 grid | 19 × 19 grid |
Layers | 24 convolutional layers | 23 convolutional layers |
Anchor Boxes | Fixed-size anchor boxes | Multi-scale anchor boxes |
Bounding boxes/cell | 2 bounding boxes/cell | 9 bounding boxes/cell, each with 9 multi-scale anchors |
Dataset | PASCAL VOC 2007 dataset with 20 classes and 9963 images containing 24,640 annotated objects. |
Models | Grid Size | mAP (%) |
---|---|---|
YOLO | 7 × 7 | 63.4 |
 | 9 × 9 | 68.93 |
 | 11 × 11 | 77.39 |
Proposed Model | 15 × 15 | 80.29 |
 | 19 × 19 | 84.49 |
Layer No. | Layer Type | Filters | Kernel Size | Output | #Parameters | #FLOPS |
---|---|---|---|---|---|---|
1 | Convolutional | 32 | 3 × 3 | 608 × 608 × 32 | 864 | 319,389,696 |
2 | Max | - | 2 × 2 | 304 × 304 × 32 | 0 | 11,829,248 |
3 | Convolutional | 64 | 3 × 3 | 304 × 304 × 64 | 18,432 | 1,703,411,712 |
4 | Max | - | 2 × 2 | 152 × 152 × 64 | 0 | 5,914,624 |
5 | Convolutional | 128 | 3 × 3 | 152 × 152 × 128 | 73,728 | 1,703,411,712 |
6 | Convolutional | 64 | 1 × 1 | 152 × 152 × 64 | 8192 | 189,267,968 |
7 | Convolutional | 128 | 3 × 3 | 152 × 152 × 128 | 73,728 | 1,703,411,712 |
8 | Max | - | 2 × 2 | 76 × 76 × 128 | 0 | 2,957,312 |
9 | Convolutional | 256 | 3 × 3 | 76 × 76 × 256 | 294,912 | 1,703,411,712 |
10 | Convolutional | 128 | 1 × 1 | 76 × 76 × 128 | 32,768 | 189,267,968 |
11 | Convolutional | 256 | 3 × 3 | 76 × 76 × 256 | 294,912 | 1,703,411,712 |
12 | Convolutional | 128 | 1 × 1 | 76 × 76 × 128 | 32,768 | 189,267,968 |
13 | Convolutional | 256 | 3 × 3 | 76 × 76 × 256 | 294,912 | 1,703,411,712 |
14 | Max | - | 2 × 2 | 38 × 38 × 256 | 0 | 1,478,656 |
15 | Convolutional | 512 | 3 × 3 | 38 × 38 × 512 | 1,179,648 | 1,703,411,712 |
16 | Convolutional | 256 | 1 × 1 | 38 × 38 × 256 | 131,072 | 189,267,968 |
17 | Convolutional | 512 | 3 × 3 | 38 × 38 × 512 | 1,179,648 | 1,703,411,712 |
18 | Convolutional | 256 | 1 × 1 | 38 × 38 × 256 | 131,072 | 189,267,968 |
19 | Convolutional | 512 | 3 × 3 | 38 × 38 × 512 | 1,179,648 | 1,703,411,712 |
20 | Convolutional | 256 | 1 × 1 | 38 × 38 × 256 | 131,072 | 189,267,968 |
21 | Max | - | 2 × 2 | 19 × 19 × 256 | 0 | 739,328 |
22 | Convolutional | 512 | 3 × 3 | 19 × 19 × 512 | 2,359,296 | 851,705,856 |
23 | Convolutional | 1024 | 1 × 1 | 19 × 19 × 1024 | 524,288 | 189,267,968 |
24 | Convolutional | 512 | 3 × 3 | 19 × 19 × 512 | 4,718,592 | 1,703,411,712 |
25 | Convolutional | 1024 | 1 × 1 | 19 × 19 × 1024 | 524,288 | 189,267,968 |
26 | Convolutional | 512 | 3 × 3 | 19 × 19 × 512 | 4,718,592 | 1,703,411,712 |
27 | Convolutional | 1024 | 3 × 3 | 19 × 19 × 1024 | 4,718,592 | 1,703,411,712 |
28 | Convolutional | 675 | 1 × 1 | 19 × 19 × 675 | 691,200 | 249,523,200 |
Total | | | | | 23,312,224 (=23.31 M) | 23,398,622,208 (=23.39 BFLOPS) |
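The per-layer figures above appear to follow the standard counting rules for convolutions with no bias terms, with one multiply-accumulate counted as a single FLOP; this is an inference from the table's numbers, not a statement from the paper. A quick sanity check against layer 3:

```python
def conv_params(kernel, in_channels, filters):
    # Weights only; bias terms are not counted in the table.
    return kernel * kernel * in_channels * filters

def conv_flops(params, out_h, out_w):
    # One multiply-accumulate per weight per output position.
    return params * out_h * out_w

# Layer 3: 3 x 3 convolution, 32 -> 64 channels, 304 x 304 output.
params = conv_params(3, 32, 64)        # 18,432
flops = conv_flops(params, 304, 304)   # 1,703,411,712
print(params, flops)
```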
Class\Model | R-CNN | Fast R-CNN | Faster R-CNN | Mask R-CNN | YOLO v1 | YOLO v2 | YOLO v3 | YOLO v4 | YOLO v5 | RefineDet | CenterNet | CenterNet2 | Proposed Model |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aeroplane | 77.2 | 84.2 | 84.1 | 88.9 | 77 | 86.3 | 87.2 | 90.1 | 90.5 | 88.7 | 86.1 | 89.6 | 89.7 |
Bike | 76.4 | 78.9 | 81.2 | 86 | 67.2 | 84.3 | 88.3 | 87.6 | 89.6 | 87 | 87.2 | 90.8 | 91.4 |
Bird | 68.8 | 75.2 | 75.4 | 80.2 | 69.2 | 74.8 | 78.9 | 79.5 | 78.5 | 83.2 | 82.4 | 85.4 | 89.1 |
Boat | 50.3 | 53.6 | 56.3 | 61.1 | 43.4 | 59.2 | 65.4 | 70.2 | 75.4 | 76.5 | 67.1 | 73.2 | 69.4 |
Bottle | 36.7 | 49.2 | 62.7 | 67.5 | 42.3 | 68.1 | 72.5 | 74.5 | 77.3 | 68 | 60.1 | 56.4 | 75.2 |
Bus | 75.8 | 77.5 | 79.4 | 84.2 | 68.3 | 79.8 | 83.2 | 86.3 | 84.5 | 88.5 | 83.1 | 87.5 | 84.4 |
Car | 69.6 | 73.2 | 77.2 | 82 | 68.5 | 76.5 | 83.4 | 88.1 | 92.4 | 88.7 | 82.7 | 85.4 | 91.3 |
Cat | 87.3 | 85.8 | 84.9 | 89.7 | 81.4 | 90.6 | 94.2 | 95.3 | 98.5 | 89.2 | 87.7 | 88.5 | 92.7 |
Chair | 42.2 | 45.6 | 57.1 | 61.9 | 53.7 | 64.2 | 68.3 | 70.8 | 71.2 | 66.5 | 61.7 | 64.2 | 68.4 |
Cow | 70.2 | 77.1 | 78.6 | 83.4 | 60.8 | 78.2 | 81.2 | 83.4 | 81.2 | 87.9 | 82.4 | 89.4 | 83.8 |
Table | 52.2 | 53.1 | 62.2 | 70.1 | 58.2 | 63.7 | 71.3 | 72.3 | 75.4 | 75 | 68.4 | 78.3 | 74.9 |
Dog | 85.5 | 86.1 | 85.3 | 90.1 | 77.2 | 89.3 | 94.4 | 96.5 | 97 | 86.8 | 89.4 | 89.9 | 93.1 |
Horse | 78.5 | 80.4 | 82.1 | 86.9 | 72.3 | 82.6 | 87.6 | 83.5 | 82.4 | 89.2 | 84.9 | 88.5 | 89.4 |
Motorbike | 78.8 | 78.9 | 83.6 | 88.4 | 71.3 | 83.4 | 88.1 | 89.5 | 91.5 | 87.8 | 86.3 | 89.4 | 91.2 |
Person | 68.3 | 79.3 | 78.9 | 83.7 | 63.5 | 81.5 | 93.3 | 95.6 | 95.9 | 84.7 | 85.7 | 86.5 | 94.8 |
Plant | 33.1 | 40.1 | 44.2 | 49 | 48.9 | 52.8 | 71.3 | 75.4 | 76.5 | 56.2 | 62.3 | 60.5 | 74.1 |
Sheep | 66.3 | 72.6 | 73.4 | 78.2 | 59.4 | 77.6 | 78.3 | 76.4 | 74.3 | 83.2 | 84.3 | 85.6 | 79.3 |
Sofa | 63.7 | 68.4 | 62.3 | 67.1 | 54.8 | 69.8 | 75.2 | 79.5 | 80.3 | 78.7 | 73.1 | 78.4 | 78.7 |
Train | 76.2 | 80.3 | 81.2 | 87.2 | 73.9 | 85.1 | 85.4 | 85.9 | 84.7 | 88.1 | 85.4 | 85.8 | 89.4 |
TV | 62.9 | 60.5 | 73.8 | 78.6 | 56.7 | 72.4 | 86.5 | 89.6 | 85.3 | 82.3 | 73.9 | 80.5 | 89.5 |
mAP | 66 | 70 | 73.2 | 78.2 | 63.4 | 76.8 | 81.7 | 83.5 | 84.12 | 81.8 | 78.7 | 81.69 | 84.49 |
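As a consistency check, the mAP row is the arithmetic mean of the 20 per-class AP values; for the proposed model (last column):

```python
# Per-class AP values for the proposed model, read from the table above.
ap = [89.7, 91.4, 89.1, 69.4, 75.2, 84.4, 91.3, 92.7, 68.4, 83.8,
      74.9, 93.1, 89.4, 91.2, 94.8, 74.1, 79.3, 78.7, 89.4, 89.5]
print(round(sum(ap) / len(ap), 2))  # 84.49
```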
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).