Article

Teacher–Student Model Using Grounding DINO and You Only Look Once for Multi-Sensor-Based Object Detection

Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2232; https://doi.org/10.3390/app14062232
Submission received: 18 January 2024 / Revised: 24 February 2024 / Accepted: 28 February 2024 / Published: 7 March 2024
(This article belongs to the Special Issue Information Fusion and Its Applications for Smart Sensing)

Abstract
Object detection is a crucial research topic in the fields of computer vision and artificial intelligence, involving the identification and classification of objects within images. Recent deep learning models such as YOLO (You Only Look Once), Faster R-CNN, and SSD (Single Shot Detector) have demonstrated high object detection performance. This study utilizes the YOLOv8 model for real-time object detection in environments requiring fast inference speeds, specifically in CCTV and automotive dashcam scenarios. Experiments were conducted using the ‘Multi-Image Identical Situation and Object Identification Data’ provided by AI Hub, which consists of multi-image datasets captured in identical situations using CCTV, dashcams, and smartphones. Object detection experiments were performed on these three types of datasets. Although YOLO is useful, its performance on the AI Hub dataset requires improvement; therefore, Grounding DINO, a zero-shot object detector with high mAP performance, is employed. While Grounding DINO enables efficient auto-labeling, its processing speed is slower than YOLO’s, making it unsuitable for real-time object detection scenarios. This study conducts object detection experiments using publicly available labels and utilizes Grounding DINO as a teacher model for auto-labeling. The generated labels are then used to train YOLO as a student model, and the performance is compared and analyzed. Experimental results demonstrate that using auto-generated labels for object detection does not degrade performance, and that combining auto-labeling with manual labeling significantly enhances performance. Additionally, an analysis of datasets containing data from various devices, including CCTV, dashcams, and smartphones, reveals the impact of each device type on the recognition accuracy for the other devices. Through Grounding DINO, this study demonstrates that auto-labeling technology contributes to efficiency and performance enhancement in the field of object detection, presenting practical applicability.

1. Introduction

In contemporary society, various video sources, particularly CCTV, dashcams, and smartphones, are rapidly proliferating and widely employed across diverse fields such as public safety, traffic management, security systems, and autonomous vehicles [1,2,3,4,5,6,7]. CCTV is predominantly installed in public spaces, commercial facilities, residential areas, and beyond, while dashcams are primarily utilized for recording road situations and accidents [8,9,10,11,12,13,14]. Smartphones serve as portable video recording devices, enabling the swift documentation of events occurring in the vicinity [15,16,17].
In modern society, various video sources are employed for purposes such as enhancing safety, managing traffic flow, and investigating accidents. The innovative development in object detection technology plays a crucial role in advancing these video data applications [18,19,20,21,22,23,24,25,26,27,28,29]. The progress in object detection technology positively impacts diverse fields, including public safety and security, efficiency improvement in traffic systems, and the enhancement of autonomous vehicle performance [4,5,6,7,8,9,10,11,12,13,14].
This study compares the object detection performance of CCTV, dashcam, and smartphone images, analyzing the influence of each device type on the detection accuracy of the other devices. Real-time object detection is a key requirement for various applications, such as safety and security, traffic flow management, and emergency response, using CCTV, dashcam, and smartphone videos [22,23,26]. To address this, the experiments employ the YOLOv8 model, known for its superior performance in terms of speed and parameter count [30]. However, real-time object detection algorithms like YOLO, while efficient in speed and parameter count, may have lower detection rates than heavier models. Therefore, this study aims to improve performance while maintaining lightweight characteristics. Additionally, the field of object detection has progressed significantly with advances in computer vision and machine learning. Nevertheless, accurate and reliable object detection relies heavily on meticulous labeling, a process that is difficult, time-consuming, and prone to human error. In response to these challenges, this study explores automated methods for efficient object detection.
This study proposes utilizing Grounding DINO [29] to enhance object detection performance in CCTV, dashcam, and smartphone videos and alleviate the difficulties of manual annotation through the use of auto-labeling technology. Grounding DINO serves as a high-performance zero-shot object detector capable of efficient auto-labeling. However, it may not be suitable for real-time object detection scenarios due to slower processing speeds compared to YOLO [26].
This study conducts object detection experiments using publicly available labels and employs Grounding DINO as a teacher model for auto-labeling. The generated labels are then used to train YOLO as a student model for object detection experiments, allowing for a comparative analysis of performance.
In conclusion, this study affirms that there is no discernible performance degradation when employing automatically generated labels for object detection. The synergistic use of auto-labeling and manual labeling, employing a mixed-method approach, proves to be an effective strategy for performance enhancement. This research delves into the ways in which auto-labeling enhances the accuracy and efficiency of object detection in comparison to manual labeling. Furthermore, it investigates how different device types influence the detection accuracy across various devices. The findings underscore the significant contribution of auto-labeling technology, specifically Grounding DINO, to efficiency and performance improvement within the realm of object detection. These advancements are anticipated to yield positive impacts across diverse applications, ranging from autonomous driving to intelligent surveillance systems.

2. Related Work

2.1. Object Detection

Object detection has been considered a fundamental topic in computer vision, and the advancement of deep learning technology has revolutionized this field. Continuous research and development have explored various methodologies and models. Here, we provide a brief overview of key methodologies and models in object detection.
Traditional Approaches: Viola and Jones (2001) introduced fundamental work on object detection using Haar-like features and a cascaded classifier. This method laid the foundation for real-time face detection and inspired subsequent research [20].
Region-based CNNs: Girshick et al. (2014) introduced the Region-based CNN (R-CNN) [21], which changed the paradigm of object detection by using selective search to generate region proposals for CNN-based classification. Although effective, R-CNN had high computational costs. Girshick (2015) improved it with Fast R-CNN [22], introducing a region-of-interest pooling layer for end-to-end training and faster computation, and Ren et al. (2015) proposed Faster R-CNN [23], which replaced selective search with a learned region proposal network. Mask R-CNN [24] performs simultaneous object detection and instance segmentation (masking) in a deep learning-based model. Based on the Faster R-CNN architecture, it not only identifies the location and assigns a class to objects but also precisely segments the object’s boundaries at the pixel level to generate masks.
Single Shot Detectors (SSDs): Liu et al. (2016) proposed the SSD [25], a model predicting bounding boxes and class scores directly from multiple scale feature maps. The SSD achieved real-time object detection performance by passing the network only once.
You Only Look Once (YOLO): Redmon et al. (2016) introduced YOLO [26] as an innovative method for object detection. YOLO divides the input image into a grid and directly predicts bounding boxes and class probabilities, providing an excellent balance between speed and accuracy. It gained notable attention for its real-time performance, and with continuous development since its initial release, version 8 (v8) was launched in 2023.
EfficientDet: Tan et al. (2020) proposed EfficientDet [27], a model that balances accuracy and computational efficiency in object detection. It achieved top-level performance across various datasets by introducing a compound scaling method to optimize model parameters.
DINO (DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection): DINO [28] improves both performance and efficiency relative to earlier DETR-like models. This is achieved through contrastive denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction.
Grounding DINO [29]: This model performs zero-shot object detection, meaning it can operate on new classes without labeled examples during training. Unlike conventional object detection, which relies on models trained on annotated data, Grounding DINO can detect objects of new classes even without annotations for them. Its core concept is the interaction (‘grounding’) between the visual features of objects and their class names: it builds on pre-trained DINO to detect objects of new classes and links the visual features of detected objects to the corresponding class names.

2.2. Research on Utilizing Various Video Sources

Utilizing various video sources for object detection is a crucial area, considering the unique characteristics of each source for fast and accurate detection.
In modern society, diverse video sources such as CCTV, dashcams, and smartphones are rapidly spreading. Each of these sources has distinct characteristics, requiring special considerations for object detection. CCTV is widely installed in public places, commercial facilities, and residential areas. Dashcams are used to record situations on the road, while smartphones serve as portable video recording devices for swiftly capturing events in one’s surroundings.
Research on object detection in CCTV videos: Studies [8,9,11] address the detection and tracking of objects in real-time CCTV scenarios. One study [9] specifically explores detecting weapons in real-time CCTV footage using various models. Another study [10] introduces an approach that uses heterogeneous training data and data augmentation to maximize detection rates in CCTV scenes; it models and predicts the evolution of camera-specific parameters using the spatial transformation parameters of objects and optimizes the detector accordingly.
Research on object detection in dashcam videos: One study [11] presents an example of using state-of-the-art image processing algorithms on dashcam videos to safely detect traffic signals while driving. Another study [12] proposes the Dynamic Spatial Attention (DSA) Recurrent Neural Network (RNN) to collect and publish datasets for predicting accidents using dashcam videos.
Additionally, in another study [13], the utilization of dashboard cameras is proposed to develop a practical anomaly detection system. Focusing on driver safety issues such as lane departure and following distance, traditional model-based computer vision algorithms have limitations in addressing the diversity of risks on the road. This study emphasizes the importance of an approach specialized for dashcam data. Furthermore, another research study [14] aims to detect road anomalies, such as potholes, in dashcam videos to alert drivers about road irregularities and reduce accidents.
Research on object detection in smartphone videos: Various studies explore object detection using smartphone videos. One study focuses on detecting pests using smartphone videos [15], while others [16,17] propose real-time object recognition systems optimized for speed and minimal performance degradation in the constrained resource and power consumption environment of smartphones.
Furthermore, research [6,7] aims to improve the accuracy of object detection by combining information from various devices at different viewpoints. The growing number of studies on object detection and tracking using videos collected from multiple cameras demonstrates continuous and effective development in the field. These diverse studies underscore ongoing efforts to understand and enhance object detection in various environments, from CCTV and dashcam to smartphone images.

3. Proposed Method

Dataset: This study used datasets from ‘The Open AI Dataset Project (Multi-Video Same Situation and Object Identification Data)’. All data information can be accessed through AI-Hub (www.aihub.or.kr (accessed on 27 February 2024)). Figure 1, Figure 2 and Figure 3 represent images captured from CCTV, a dashcam, and a smartphone, respectively, in the dataset.
This dataset was collected using CCTV, dashcam, and smartphone devices in the same scenarios, with 12 images captured for each device in each scenario. Currently, it is in a limited beta version with restricted public access, containing incidents related specifically to collisions between humans and bicycles. The images are provided in full HD (1920 × 1080) resolution. As the data were collected in identical scenarios, the frame rate is consistent, ensuring a uniform number of images across all device types. However, some slight inconsistencies may exist in publicly available datasets due to issues such as omissions and errors. The object detection classes include (1) person, (2) scooter, (3) vehicle, and (4) bicycle. However, the scooter class is currently not included in the publicly available dataset. These three types of data can be denoted as CCTV (CT), dashcam (black box—BB), and smartphone (SP), respectively.
After a brief examination of the dataset to identify device-specific characteristics, it was observed that the CCTV footage appears to be captured from an overhead perspective, resembling actual CCTV footage. The dashcam footage exhibits darkness due to vehicle tinting, indicating an interior view from within the vehicle. In contrast, the smartphone footage shares a similar screen angle with dashcam footage but comparatively presents a cleaner screen.
The total dataset consists of 3811 images for CCTV, 3816 images for the dashcam, and 3804 images for the smartphone, showing a slight inconsistency in the total number of images for each device. Overall, there are 11,431 image–label pairs in the dataset. The dataset was manually split into training, validation, and test sets. According to Table 1, the training set is divided into CCTV (2088 images), black box (2088 images), and smartphone (2087 images), for a total of 6263 images; the validation set into CCTV (576 images), black box (576 images), and smartphone (576 images), for a total of 1728 images; and the test set into CCTV (1147 images), black box (1152 images), and smartphone (1141 images), for a total of 3440 images.
Training the YOLO [26] model: To train the YOLOv8 object detection model, text files with the same names as the image files are required. These text files must contain the class number of each object and its bounding box information. The bounding box format for the YOLO model requires normalized values in the order of the center coordinates (x, y), width (w), and height (h), i.e., [x, y, w, h]. Annotation information for the publicly available dataset is in the JSON file format and contains various pieces of detailed information that can be utilized for different research purposes. For this experiment, however, many of these details were unnecessary, and the bounding box format did not match the YOLO format. Therefore, as shown in Figure 4, a conversion was performed.
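For illustration, the conversion in Figure 4 can be sketched as follows. This is a minimal example assuming the JSON annotations store a class name and an absolute-pixel [x_min, y_min, width, height] box; the field names (`annotations`, `class`, `bbox`) are placeholders and may differ in the public dataset.

```python
import json
from pathlib import Path

# Class indices used for YOLO training: person, scooter, vehicle, bicycle.
CLASS_IDS = {"person": 0, "scooter": 1, "vehicle": 2, "bicycle": 3}
IMG_W, IMG_H = 1920, 1080  # full HD resolution of the dataset

def convert_annotation(json_path: Path, out_dir: Path) -> None:
    """Convert one JSON annotation file into a YOLO-format .txt label file.

    Assumes each entry carries a class name and a box given as
    [x_min, y_min, width, height] in absolute pixels (illustrative schema).
    """
    data = json.loads(json_path.read_text(encoding="utf-8"))
    lines = []
    for ann in data.get("annotations", []):
        cls = CLASS_IDS.get(ann["class"])
        if cls is None:
            continue  # skip classes not used in this experiment
        x_min, y_min, w, h = ann["bbox"]
        # YOLO expects normalized center coordinates and box size: [x, y, w, h].
        x_c = (x_min + w / 2) / IMG_W
        y_c = (y_min + h / 2) / IMG_H
        lines.append(f"{cls} {x_c:.6f} {y_c:.6f} {w / IMG_W:.6f} {h / IMG_H:.6f}")
    (out_dir / (json_path.stem + ".txt")).write_text("\n".join(lines), encoding="utf-8")
```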
Utilizing auto-labeling with Grounding DINO [29]: Although the YOLO model boasts excellent speed and parameter efficiency, its detection performance on this dataset did not meet expectations. Options for improvement included altering the model structure, employing more effective transfer learning, or enhancing label quality. Since the YOLO model is already well-developed and effectively pre-trained on the COCO dataset [31], modifying the model or its transfer learning setup was impractical. Therefore, the primary approach focused on enhancing the label quality of this dataset, with the expectation that it would contribute to performance improvement.
Recently, the zero-shot object detection model Grounding DINO has demonstrated outstanding detection performance. However, it runs at 8.37 FPS on an A100 GPU, far below YOLO’s 300 FPS or more, making it too slow for real-time object detection. To address this, the zero-shot detection results from Grounding DINO are instead used for auto-labeling. The advantages of auto-labeling include cost and time savings, as well as greater consistency compared to manual labeling.
Figure 5 shows the proposed method: images are provided to the Grounding DINO model without the training set labels, the model detects the four target classes (person, scooter, vehicle, and bicycle), and the bounding box information of these detections is used as YOLO’s training labels. To measure the reliability of the auto-labels, the mean average precision (mAP) against the training set labels was calculated, resulting in a high value of 0.789. This indicates that auto-labeling effectively generated labels similar to the manual labels. This approach functions like a teacher–student relationship, in which the Grounding DINO model imparts knowledge to the YOLO model.
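A sketch of the teacher step in Figure 5 is shown below, assuming the inference helpers from the official Grounding DINO repository (`load_model`, `load_image`, `predict`); the configuration paths, thresholds, and prompt format are illustrative. Conveniently, the boxes returned by `predict` are normalized (center x, center y, width, height), which already matches the YOLO label layout.

```python
from pathlib import Path

from groundingdino.util.inference import load_model, load_image, predict

# Paths and thresholds are illustrative; the text prompt lists the four target classes.
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
PROMPT = "person . scooter . vehicle . bicycle ."
CLASS_IDS = {"person": 0, "scooter": 1, "vehicle": 2, "bicycle": 3}

def auto_label(image_path: Path, out_dir: Path) -> None:
    """Run zero-shot detection and write the result as a YOLO-format label file."""
    image_source, image = load_image(str(image_path))
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=PROMPT,
        box_threshold=0.35,
        text_threshold=0.25,
    )
    lines = []
    for box, phrase in zip(boxes, phrases):
        cls = CLASS_IDS.get(phrase.strip())
        if cls is None:
            continue  # skip phrases that do not exactly match a target class
        # Boxes are normalized (cx, cy, w, h), matching the YOLO format.
        cx, cy, w, h = box.tolist()
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    (out_dir / (image_path.stem + ".txt")).write_text("\n".join(lines))
```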
Combining auto-labels with manual labels: When object detection was trained using the auto-labels generated by Grounding DINO, the performance did not degrade. While these auto-labels may be valuable from the perspective of dataset construction, on their own they did not significantly improve object detection performance in this experiment. Therefore, a method was considered to enhance the quality of the training labels by using auto-labels and manual labels simultaneously. The two were combined in a straightforward manner, without weighing the individual strengths and weaknesses of each method. Although auto-labels and manual labels each have advantages and disadvantages, a straightforward combination can produce overlapping coordinates for reliable, clearly visible objects, resulting in dual labeling of a single object, while less reliable objects are likely to appear in only one of the label sets. This approach effectively assigns greater weight to important objects.
The fusion method is illustrated by the fusion label in Figure 6, obtained by adding the object information of the manual label and the auto-label. Sets with a high probability of being the same object are indicated with blue and red boxes based on the class coordinates, while classes that are not displayed indicate objects not detected by auto-labeling. In the fusion label, objects with high reliability can therefore be interpreted as being double-labeled.
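Under this straightforward-combination approach, the fusion step amounts to concatenating, for each image, the manual and auto YOLO label files; a minimal sketch with hypothetical directory names is given below.

```python
from pathlib import Path

def fuse_labels(manual_dir: Path, auto_dir: Path, fused_dir: Path) -> None:
    """Concatenate manual and auto YOLO label files image by image.

    Objects detected by both sources appear twice in the fused file, which
    implicitly gives higher weight to reliably detected objects.
    """
    fused_dir.mkdir(parents=True, exist_ok=True)
    for manual_file in manual_dir.glob("*.txt"):
        auto_file = auto_dir / manual_file.name
        lines = manual_file.read_text().splitlines()
        if auto_file.exists():
            lines += auto_file.read_text().splitlines()
        (fused_dir / manual_file.name).write_text("\n".join(lines))

# Hypothetical directory layout for the three label sets.
fuse_labels(Path("labels/manual"), Path("labels/auto"), Path("labels/fusion"))
```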

4. Experiments and Results

4.1. Dataset Group Splitting

To investigate the impact of cross-device training on testing, the training and validation data were divided into seven groups (Table 2 and Table 3: group 1 trained on CT, BB, and SP altogether; group 2 trained on CT only; group 3 trained on BB only; group 4 trained on SP only; group 5 trained on CT and BB; group 6 trained on CT and SP; and group 7 trained on BB and SP). For result verification, the test group was divided into four categories for experimentation (Table 4: group 1 tested on CT, BB, and SP altogether; group 2 tested on CT only; group 3 tested on BB only; and group 4 tested on SP only). Details of the group division for training and testing can be found in Table 2, Table 3 and Table 4.
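One possible way to encode these splits, assuming per-device image folders and Ultralytics-style dataset YAML files (which accept lists of image directories), is sketched below; all paths are illustrative.

```python
from pathlib import Path
import yaml  # PyYAML

# Hypothetical per-device image folders for CCTV (CT), dashcam (BB), and smartphone (SP).
DEVICES = {"CT": "images/cctv", "BB": "images/blackbox", "SP": "images/smartphone"}
TRAIN_GROUPS = {
    1: ["CT", "BB", "SP"], 2: ["CT"], 3: ["BB"], 4: ["SP"],
    5: ["CT", "BB"], 6: ["CT", "SP"], 7: ["BB", "SP"],
}

# Write one dataset YAML per training group; each lists the device folders to use.
for gid, devices in TRAIN_GROUPS.items():
    cfg = {
        "train": [f"{DEVICES[d]}/train" for d in devices],
        "val": [f"{DEVICES[d]}/val" for d in devices],
        "names": {0: "person", 1: "scooter", 2: "vehicle", 3: "bicycle"},
    }
    Path(f"data_group{gid}.yaml").write_text(yaml.safe_dump(cfg))
```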

4.2. Manual Label-Based Object Detection Experiment

The first experiment involves conducting object detection using manually labeled data provided by AI Hub. The experimental setup utilizes the YOLOv8n.pt model, which loads pre-trained parameters from the extensive COCO dataset. Training is performed with a batch size of 32 over 100 epochs.
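This setup corresponds roughly to the following Ultralytics API calls; the dataset file name refers to the hypothetical group YAMLs sketched earlier, and the input image size is an assumption, since it is not stated in the text.

```python
from ultralytics import YOLO

# Load YOLOv8n weights pre-trained on COCO and fine-tune on training group 1.
model = YOLO("yolov8n.pt")
model.train(
    data="data_group1.yaml",  # hypothetical dataset file for training group 1
    epochs=100,
    batch=32,
    imgsz=640,  # assumption: default Ultralytics input size
)
```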
Figure 7 presents the results of object detection experiments trained with manual labels for the entirety of data group 1. The visual confirmation of stable training is evident as the loss decreases and the mAP values increase during training. Additionally, the confusion matrix allows for an examination of the class distribution in the data and the number of correctly predicted classes. Figure 8 shows the confusion matrix for the object detection experiment trained with manual labels for group 1 (left: original matrix; right: normalized matrix). The class distribution in the data is confirmed to be people, bicycles, and vehicles, with no scooters. Furthermore, in terms of precision, bicycles exhibit the highest results.
Figure 9 illustrates the results of predicting bounding boxes for batches of training data, visually confirming the detection of all objects without any omissions.
Up to this point, we have examined performance metrics during training. From the next section onward, the focus shifts to the results on the test dataset. Experiments were conducted by dividing the data into seven training groups and four testing groups, and metrics such as precision, recall, mAP50, and mAP50-95 were measured. To facilitate comparison across groups, emphasis is primarily placed on mAP50, a commonly used performance metric. Although Table 3 describes the division of the validation groups, these groups are referred to as training groups in the result analysis.
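The per-group test metrics (precision, recall, mAP50, and mAP50-95) can be collected with the Ultralytics validation API roughly as follows; the test-group dataset file is hypothetical, and the attribute names follow the library’s detection-metrics object.

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # weights from a finished training run
metrics = model.val(data="test_group2.yaml", split="test")  # hypothetical test-group file

print(f"precision: {metrics.box.mp:.3f}")   # mean precision over classes
print(f"recall:    {metrics.box.mr:.3f}")   # mean recall over classes
print(f"mAP50:     {metrics.box.map50:.3f}")
print(f"mAP50-95:  {metrics.box.map:.3f}")
```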
Table 5 presents the test results for manual label-based training across the different groups. The dashcam (black box—BB) and smartphone (SP) exhibit similar trends, possibly because their views of the scene are similar. However, the overall performance on dashcam footage is consistently lower than on smartphone footage, possibly because dashcam footage is darkened by window tinting. Additionally, on test group 2, training groups 1, 5, and 6 show a decline in performance compared to training group 2, despite having larger training sets. This decline is attributed to training with data from devices whose viewing angles differ from those of CCTV footage. Training group 3 shows the lowest overall performance, suggesting potential issues with the quality of the dashcam data or difficulties in identification.
The subsequent experiment was conducted without utilizing a pre-trained model. Although it is generally acknowledged that fine-tuning a model pre-trained on a large dataset yields superior performance, there is a possibility of performance degradation due to differences in data domains or issues like overfitting.
The results can be observed in Table 6, and in all experiments, the performance was consistently lower than the results obtained using a pre-trained model, as shown in Table 5. To assess whether the insufficient amount of training data was a contributing factor, we conducted training for group 1 up to 200 epochs. The results of training up to 200 epochs were superior to those of training up to 100 epochs, but it was evident that they were significantly inferior to the model trained for 100 epochs with pre-trained weights. From this experiment, we conclude that fine-tuning with a pre-trained model leads to superior performance.
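For the from-scratch baseline, the same API can presumably start from the architecture definition rather than the COCO-pre-trained weights; a brief sketch under that assumption:

```python
from ultralytics import YOLO

# Build YOLOv8n from its architecture config (random initialization, no COCO weights).
model = YOLO("yolov8n.yaml")
model.train(data="data_group1.yaml", epochs=200, batch=32)  # 100- and 200-epoch runs were compared
```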

4.3. Auto-Label-Based Object Detection Experiment

Next, we trained and evaluated object detection using the auto-labels obtained through Grounding DINO. The results can be found in Table 7. While there were instances where the performance was lower compared to object detection using manual labels, it was observed that the performance was higher for group 1, trained on the entire dataset. Although determining overall superiority is challenging, the performance was generally satisfactory.

4.4. Combining Manual and Auto-Labels for Object Detection Experiment

Finally, we trained and evaluated object detection using labels that combine the manual and auto-labels. The results are presented in Table 8: the overall performance was highest when training with the combined labels, compared to training with either the manual labels or the auto-labels alone. This suggests that assigning greater weight to important objects contributed to the superior performance in the combined-label scenario.
When training group 2 with auto-labels, the overall evaluation performance was lower compared to manual labels. It is interpreted that the auto-labels generated by Grounding DINO, trained on a large-scale dataset, may not be well-suited for angles looking from top to bottom, such as in CCTV footage. It is assumed that the quality of auto-labels is relatively lower for CCTV videos. Detection test results trained with manual labels show a significant performance difference among device groups, whereas test results using auto-labels reduce the gap between groups. This indicates that auto-labels provide consistent labeling without significant bias compared to manual labels.
In summary, no performance degradation was observed when using auto-labels for object detection, and performance could be enhanced by combining auto-labels with manual labels. To improve accuracy and efficiency, auto-labels were compared with manual labels, and the impact of the different device types on one another was analyzed.

5. Conclusions and Future Works

In this study, it was confirmed that there is no degradation in object detection performance when utilizing automatically generated labels. Furthermore, combining auto-labels with manual labels resulted in an enhancement of object detection performance. These results are applicable only when objects can be detected by Grounding DINO. Additionally, experiments and analyses were conducted using a dataset that includes data collected from various devices such as CCTV, dashcams, and smartphones. This aimed to investigate the impact of each device type on the accuracy of object detection. The auto-labeling technology of Grounding DINO demonstrated its efficiency and performance improvement in the field of object detection, providing evidence for its practical applicability.
Future research should focus on the integration and effective utilization of images obtained from various devices, as they offer broader perspectives and rich information. Additionally, beyond simple combination, there is a need to explore more synergistic and efficient methods for integrating auto-labels and manual labels.

Author Contributions

J.S.: conceptualization, methodology, software, investigation, data curation, writing—original draft preparation, visualization. H.J.: conceptualization, project administration, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-2020-0-01808) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation). This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. RS-2023-00241123).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71467 (accessed on 27 February 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lyon, D. Surveillance technology and surveillance society. Mod. Technol. 2003, 161, 184. [Google Scholar]
  2. Lyon, D. Surveillance, power and everyday life. In Emerging Digital Spaces in Contemporary Society: Properties of Technology; Palgrave Macmillan: London, UK, 2010; pp. 107–120. [Google Scholar]
  3. Javed, A.R.; Shahzad, F.; ur Rehman, S.; Zikria, Y.B.; Razzak, I.; Jalil, Z.; Xu, G. Future smart cities: Requirements, emerging technologies, applications, challenges, and future aspects. Cities 2022, 129, 103794. [Google Scholar] [CrossRef]
  4. Murugesan, M.; Thilagamani, S. Efficient anomaly detection in surveillance videos based on multi layer perception recurrent neural network. Microprocess. Microsyst. 2020, 79, 103303. [Google Scholar] [CrossRef]
  5. Jha, S.; Seo, C.; Yang, E.; Joshi, G.P. Real time object detection and tracking system for video surveillance system. Multimed. Tools Appl. 2021, 80, 3981–3996. [Google Scholar] [CrossRef]
  6. Hashmi, M.F.; Pal, R.; Saxena, R.; Keskar, A.G. A new approach for real time object detection and tracking on high resolution and multi-camera surveillance videos using GPU. J. Cent. South Univ. 2016, 23, 130–144. [Google Scholar] [CrossRef]
  7. Strbac, B.; Gostovic, M.; Lukac, Z.; Samardzija, D. YOLO multi-camera object detection and distance estimation. In Proceedings of the 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Novi Sad, Serbia, 26–27 May 2020; pp. 26–30. [Google Scholar]
  8. Chandan, G.; Jain, A.; Jain, H. Real time object detection and tracking using Deep Learning and OpenCV. In Proceedings of the 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 11–12 July 2018; pp. 1305–1308. [Google Scholar]
  9. Bhatti, M.T.; Khan, M.G.; Aslam, M.; Fiaz, M.J. Weapon detection in real-time cctv videos using deep learning. IEEE Access 2021, 9, 34366–34382. [Google Scholar] [CrossRef]
  10. Dimou, A.; Medentzidou, P.; Garcia, F.A.; Daras, P. Multi-target detection in CCTV footage for tracking applications using deep learning techniques. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 928–932. [Google Scholar]
  11. Gavrilescu, R.; Zet, C.; Foșalău, C.; Skoczylas, M.; Cotovanu, D. Faster R-CNN: An approach to real-time object detection. In Proceedings of the 2018 International Conference and Exposition on Electrical and Power Engineering (EPE), Iasi, Romania, 18–19 October 2018; pp. 165–168. [Google Scholar]
  12. Chan, F.H.; Chen, Y.T.; Xiang, Y.; Sun, M. Anticipating accidents in dashcam videos. In Proceedings of the Computer Vision—ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part IV 13. Springer: Berlin/Heidelberg, Germany, 2017; pp. 136–153. [Google Scholar]
  13. Haresh, S.; Kumar, S.; Zia, M.Z.; Tran, Q.H. Towards anomaly detection in dashcam videos. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1407–1414. [Google Scholar]
  14. Sen, S.; Chakraborty, D.; Ghosh, B.; Roy, B.D.; Das, K.; Anand, J.; Maiti, A. Pothole Detection System Using Object Detection through Dash Cam Video Feed. In Proceedings of the 2023 International Conference for Advancement in Technology (ICONAT), Goa, India, 24–26 January 2023; pp. 1–6. [Google Scholar]
  15. Chen, J.W.; Lin, W.J.; Cheng, H.J.; Hung, C.L.; Lin, C.Y.; Chen, S.P. A smartphone-based application for scale pest detection using multiple-object detection methods. Electronics 2021, 10, 372. [Google Scholar] [CrossRef]
  16. Jeong, K.; Moon, H. Object detection using FAST corner detector based on smartphone platforms. In Proceedings of the 2011 First ACIS/JNU International Conference on Computers, Networks, Systems and Industrial Engineering, Jeju, Republic of Korea, 23–25 May 2011; pp. 111–115. [Google Scholar]
  17. Martinez-Alpiste, I.; Golcarenarenji, G.; Wang, Q.; Alcaraz-Calero, J.M. Smartphone-based real-time object recognition architecture for portable and constrained systems. J. Real-Time Image Process. 2022, 19, 103–115. [Google Scholar] [CrossRef]
  18. Aziz, L.; Salam MS, B.H.; Sheikh, U.U.; Ayub, S. Exploring deep learning-based architecture, strategies, applications and current trends in generic object detection: A comprehensive review. IEEE Access 2020, 8, 170461–170495. [Google Scholar] [CrossRef]
  19. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools Appl. 2020, 79, 23729–23791. [Google Scholar] [CrossRef]
  20. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001; pp. I-511–I-518. [Google Scholar]
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  22. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; p. 28. [Google Scholar]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  27. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  28. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  29. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Zhang, L. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
  30. Terven, J.; Cordova-esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Figure 1. Dataset example (CCTV).
Figure 2. Dataset example (dashcam).
Figure 3. Dataset example (smartphone).
Figure 4. Conversion of bounding box format and labels for YOLO model.
Figure 5. Teacher–student model using Grounding DINO and YOLO.
Figure 6. Example of combining manual labels and auto-labels.
Figure 7. Training results for object detection using manual labels for group 1.
Figure 8. Confusion matrix for object detection training with manual labels for group 1.
Figure 9. Object detection prediction results on training data.
Table 1. Overall dataset composition and partitioning.

             CT     BB     SP     Total
Train        2088   2088   2087   6263
Validation   576    576    576    1728
Test         1147   1152   1141   3440
Total        3811   3816   3804   11,431
Table 2. Training dataset group splitting.

                                 CT     BB     SP     Total
Training Group 1 (CT, BB, SP)    2088   2088   2087   6263
Training Group 2 (CT)            2088   –      –      2088
Training Group 3 (BB)            –      2088   –      2088
Training Group 4 (SP)            –      –      2087   2087
Training Group 5 (CT, BB)        2088   2088   –      4176
Training Group 6 (CT, SP)        2088   –      2087   4175
Training Group 7 (BB, SP)        –      2088   2087   4175
Table 3. Validation dataset group splitting.

                                   CT    BB    SP    Total
Validation Group 1 (CT, BB, SP)    576   576   576   1728
Validation Group 2 (CT)            576   –     –     576
Validation Group 3 (BB)            –     576   –     576
Validation Group 4 (SP)            –     –     576   576
Validation Group 5 (CT, BB)        576   576   –     1152
Validation Group 6 (CT, SP)        576   –     576   1152
Validation Group 7 (BB, SP)        –     576   576   1152
Table 4. Test dataset group splitting.

                            CT     BB     SP     Total
Test Group 1 (CT, BB, SP)   1147   1152   1141   3440
Test Group 2 (CT)           1147   –      –      1147
Test Group 3 (BB)           –      1152   –      1152
Test Group 4 (SP)           –      –      1141   1141
Table 5. Group-wise testing results for manual labeling training outcomes.

Training Group           Test Group 1   Test Group 2   Test Group 3   Test Group 4
                         mAP50          mAP50          mAP50          mAP50
Group 1 (CT, BB, SP)     0.618          0.618          0.555          0.706
Group 2 (CT)             0.494          0.636          0.368          0.479
Group 3 (BB)             0.459          0.306          0.478          0.662
Group 4 (SP)             0.558          0.418          0.597          0.732
Group 5 (CT, BB)         0.548          0.612          0.498          0.541
Group 6 (CT, SP)         0.627          0.591          0.626          0.706
Group 7 (BB, SP)         0.579          0.434          0.586          0.783
Table 6. Results of manual labeling training without pre-trained models.

Training Group                     Test Group 1   Test Group 2   Test Group 3   Test Group 4
                                   mAP50          mAP50          mAP50          mAP50
Group 1 (CT, BB, SP), 100 epochs   0.467          0.474          0.402          0.536
Group 1 (CT, BB, SP), 200 epochs   0.508          0.496          0.499          0.537
Group 2 (CT)                       0.276          0.393          0.187          0.287
Group 3 (BB)                       0.117          0.0391         0.123          0.234
Group 4 (SP)                       0.264          0.194          0.263          0.389
Group 5 (CT, BB)                   0.419          0.477          0.345          0.397
Group 6 (CT, SP)                   0.476          0.464          0.420          0.578
Group 7 (BB, SP)                   0.302          0.200          0.312          0.433
Table 7. Training results using auto-labels generated by Grounding DINO.

Training Group           Test Group 1   Test Group 2   Test Group 3   Test Group 4
                         mAP50          mAP50          mAP50          mAP50
Group 1 (CT, BB, SP)     0.644          0.627          0.639          0.676
Group 2 (CT)             0.473          0.597          0.374          0.423
Group 3 (BB)             0.458          0.345          0.495          0.555
Group 4 (SP)             0.530          0.322          0.578          0.732
Group 5 (CT, BB)         0.567          0.573          0.536          0.593
Group 6 (CT, SP)         0.609          0.579          0.565          0.697
Group 7 (BB, SP)         0.585          0.477          0.597          0.732
Table 8. Combined training results of auto-labels and manual labels.

Training Group           Test Group 1   Test Group 2   Test Group 3   Test Group 4
                         mAP50          mAP50          mAP50          mAP50
Group 1 (CT, BB, SP)     0.652          0.608          0.635          0.757
Group 2 (CT)             0.435          0.567          0.363          0.339
Group 3 (BB)             0.465          0.365          0.485          0.607
Group 4 (SP)             0.534          0.375          0.530          0.757
Group 5 (CT, BB)         0.611          0.628          0.539          0.700
Group 6 (CT, SP)         0.664          0.647          0.624          0.760
Group 7 (BB, SP)         0.569          0.437          0.580          0.771
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Son, J.; Jung, H. Teacher–Student Model Using Grounding DINO and You Only Look Once for Multi-Sensor-Based Object Detection. Appl. Sci. 2024, 14, 2232. https://doi.org/10.3390/app14062232
