1. Introduction
As of 2020, the United States had the highest rate of traffic accident fatalities per 100,000 population among OECD countries at 11.6, followed by Mexico at 10.6 and Chile at 9.3, while South Korea recorded 6.0 and Switzerland 2.6 [1]. In this context, the probability of a fatality on the road is higher in secondary traffic accidents than in primary ones, making it crucial to identify accidents quickly so that subsequent secondary accidents can be prevented. Consequently, in the field of artificial intelligence, technologies are being actively developed to quickly detect traffic accidents or accurately classify accident types [2,3,4,5,6].
Accident classification technology, which determines whether an accident occurs in a given image or in a specific segment of a video, involves a process of anomaly detection that separates abnormal events from normal ones [7,8,9,10,11]. The predictive model used for this purpose must be able to distinguish the abnormal portions of the clip sequence that makes up a video. What constitutes an abnormal event can vary depending on the training environment and may include, for example, a car driving onto a sidewalk or vehicles colliding.
Meanwhile, with the growth of video content platforms such as YouTube and the rapid expansion of the related market, video data have increased explosively. As both the amount and the duration of video content on these platforms grow, so does the demand for content summarized in shorter forms, which underscores the need for technology that identifies meaningful segments. The segments of interest within a video can differ from user to user. To address this, technologies have been developed that allow users to input a query so that a predictive model can selectively find the related segments in the video. A query is a sentence that defines the meaning of a video segment from the user’s perspective; for example, “Cars collide” could be such a query. The video segment related to the user-defined query is then used as a video highlight, serving as training data for the predictive model. Notable technologies for video highlight detection include Moment-DETR [12], UMT [13], and QD-DETR [14].
In this paper, we propose to apply video highlight detection networks to a traffic accident dataset in order to identify accident segments within video sequences and to investigate the main aspects that must be considered during this application. To achieve this, we introduce the concept of cross-modality, leveraging the interaction between video, text, and audio data. The video highlight detection networks are used to predict video segments relevant to user-defined text queries: they take the video and audio features extracted from the video and the text features extracted from the query as inputs, and they produce outputs consisting of highlighted time intervals, defined by start and end times, along with saliency scores.
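For concreteness, the following sketch illustrates this input/output interface under the assumption of pre-extracted features. The class name `HighlightDetector`, the feature dimensions, and the simple encoder are hypothetical placeholders, not the actual Moment-DETR, UMT, or QD-DETR implementations.

```python
import torch
import torch.nn as nn

class HighlightDetector(nn.Module):
    """Hypothetical interface sketch of a query-based highlight detection network.

    Inputs (pre-extracted features):
      video_feats: (batch, num_clips, d_video)   e.g., SlowFast/ResNet/S3D clip features
      audio_feats: (batch, num_clips, d_audio)   e.g., PANNs audio features
      query_feats: (batch, num_tokens, d_text)   e.g., CLIP text-token features
    Outputs:
      spans:    (batch, num_clips, 2)  normalized (start, end) span proposals per clip token
      saliency: (batch, num_clips)     per-clip relevance score for the query
    """

    def __init__(self, d_video=2304, d_audio=2048, d_text=512, d_model=256):
        super().__init__()
        self.proj_av = nn.Linear(d_video + d_audio, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.span_head = nn.Linear(d_model, 2)       # (start, end) per clip token
        self.saliency_head = nn.Linear(d_model, 1)   # saliency score per clip token

    def forward(self, video_feats, audio_feats, query_feats):
        clips = self.proj_av(torch.cat([video_feats, audio_feats], dim=-1))
        tokens = torch.cat([clips, self.proj_t(query_feats)], dim=1)
        encoded = self.encoder(tokens)
        clip_enc = encoded[:, :clips.size(1)]        # keep only the clip positions
        spans = self.span_head(clip_enc).sigmoid()   # normalized (start, end) proposals
        saliency = self.saliency_head(clip_enc).squeeze(-1)
        return spans, saliency

# Example: one video of 75 two-second clips and a 12-token query.
v = torch.randn(1, 75, 2304); a = torch.randn(1, 75, 2048); q = torch.randn(1, 12, 512)
spans, saliency = HighlightDetector()(v, a, q)   # shapes: (1, 75, 2) and (1, 75)
```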
Existing video highlight detection models, such as Moment-DETR, UMT, and QD-DETR [12,13,14], have never been applied to the classification of accident videos. These models are typically used with datasets composed of activities like cooking and travel (e.g., QVHighlights [12]), aiming to categorize specific daily life events. Our work introduces the first application of video highlight detection models for the task of traffic accident video classification.
We generate the supplementary information necessary to facilitate the use of existing traffic accident datasets for video highlight detection. This additional information comprises annotations of the start and end segments of traffic accidents, queries describing accident types, and saliency scores. The datasets cover a range of weather conditions, such as clear, snowy, and rainy, as well as various types of traffic accidents, which helps prevent traffic accident video classification models from becoming biased toward detecting accidents only in specific situations.
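To illustrate the form of this supplementary information, the entry below shows a QVHighlights-style annotation for one accident video. The field names and values are simplified, indicative examples rather than the exact schema of our annotation files.

```python
# Illustrative, QVHighlights-style annotation entry for one accident video.
# Field names and values are simplified examples, not the exact schema of our files.
annotation = {
    "vid": "dashcam_000123",                  # hypothetical video identifier
    "duration": 150,                          # video length in seconds
    "query": "A car collides with a motorcycle at an intersection.",
    "relevant_windows": [[62, 70]],           # accident start/end times in seconds
    "relevant_clip_ids": [31, 32, 33, 34],    # 2-second clips covered by the window
    "saliency_scores": [[3, 4, 4], [4, 4, 4], [4, 3, 4], [2, 3, 3]],  # 0-4 per annotator
}
```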
Furthermore, we create accident classification models using three distinct primary video feature extraction methods [15] and analyze the accident detection performance of each model.
In summary, the contributions of this paper are as follows:
For the first time, we introduce a video highlight detection network to the task of traffic accident video classification.
We generate the additional information needed to make the existing traffic accident dataset suitable for video highlight detection.
We analyze the performance of traffic accident video classification from the perspectives of cross-modality interaction, self-attention and cross-attention, feature extraction, and negative loss.
2. Background
Several methods have been proposed for effectively extracting visual features from video data. The SlowFast network [16] is a neural architecture designed for video classification, capable of detecting objects in images or videos and identifying specific actions or scenes within video data. It was first introduced by Facebook AI Research in 2019 and took first place in the CVPR 2019 AVA Challenge for action recognition owing to its strong performance. The 2D ResNet network [17] is a deep neural network architecture used for image classification, object detection, and region segmentation; it was proposed by Microsoft Research Asia in 2015 and performed strongly in the large-scale ImageNet image recognition competition. The MIL-NCE pre-trained S3D network [18] extracts features from videos while learning to correct the misalignment between narration and visual content in narrated videos. S3D, used as the backbone, replaces full 3D convolutions with separable spatial and temporal convolutions to reduce computational cost [19].
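As a concrete example of the 2D route, the sketch below extracts per-frame ResNet-50 features with torchvision and averages them into a clip-level descriptor. The choice of weights, frame count, and pooling is an illustrative assumption, not the exact pipeline of the cited works.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Minimal sketch: per-frame 2D ResNet features pooled into one clip-level vector.
# Frames are assumed to be pre-decoded as a (num_frames, 3, H, W) float tensor in [0, 1].
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()            # drop the classifier; keep 2048-d pooled features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames: torch.Tensor) -> torch.Tensor:
    """Return (num_frames, 2048) ResNet features for a stack of decoded frames."""
    return resnet(preprocess(frames))

frames = torch.rand(16, 3, 360, 640)                  # 16 dummy frames of a 2-second clip
clip_feature = frame_features(frames).mean(dim=0)     # (2048,) clip-level descriptor
```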
The PANNs [20] method has been proposed for the effective extraction of audio features from audio data. This network is trained on the large-scale AudioSet dataset to recognize audio patterns and is designed with systems such as Wavegram-CNN and Wavegram-Logmel-CNN, which allow for the effective extraction of diverse audio characteristics. The Wavegram-CNN, a system for audio tagging in the time domain, incorporates a feature known as the Wavegram to learn frequency information and employs 2D CNNs to capture time-frequency invariant patterns. The Wavegram-Logmel-CNN combines the frequency information of the Wavegram with the log mel spectrogram traditionally used in audio analysis to extract a wide range of information across the time and frequency domains.
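The snippet below sketches a generic log-mel front end in the spirit of the log mel branch of Wavegram-Logmel-CNN, computed with torchaudio. It is an illustrative front end only and does not reproduce the PANNs architecture; the 32 kHz sample rate and mel parameters are assumptions.

```python
import torch
import torchaudio

# Generic log-mel audio front end, in the spirit of the log mel input used by
# Wavegram-Logmel-CNN; this is not the PANNs network itself.
SAMPLE_RATE = 32000   # assumed sample rate

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=320, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

def logmel_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (channels, samples) -> (channels, n_mels, time) log-mel spectrogram."""
    return to_db(mel(waveform))

waveform = torch.randn(1, 2 * SAMPLE_RATE)    # two seconds of dummy mono audio
features = logmel_features(waveform)          # shape: (1, 64, ~201)
# A 2D CNN backbone would then map this spectrogram to a clip-level audio embedding.
```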
A variety of methods have been suggested for learning the relationship between video and text. Among them is the CLIP network [21], which is designed to identify the text that best describes a given image or video frame. CLIP is a multi-modal, zero-shot model developed and released by OpenAI in 2021 and is a notable example of the integration of computer vision and natural language processing. Its multi-modal structure trains on more than one type of data toward a shared objective function, and its zero-shot learning strategy allows the model to classify data that it has not previously seen.
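The sketch below extracts CLIP text features for a query and CLIP image features for sampled frames using the Hugging Face transformers implementation; the checkpoint name and the cosine-similarity scoring are illustrative choices rather than the exact setup of the cited works.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch: CLIP text features for a query and image features for sampled video frames.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

query = "Cars collide"
frames = [Image.new("RGB", (640, 360)) for _ in range(4)]   # stand-ins for decoded frames

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**text_inputs)       # (1, 512)

    image_inputs = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**image_inputs)   # (4, 512)

# Cosine similarity between the query and each frame gives a simple query-frame relevance score.
sims = torch.nn.functional.cosine_similarity(frame_feats, text_feat)
```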
As previously mentioned, among the leading technologies for detecting highlights in videos are Moment-DETR [12], UMT [13], and QD-DETR [14]. Moment-DETR utilizes an encoder–decoder transformer architecture, while UMT is trained with a multi-modal architecture that encompasses both video and audio data. QD-DETR employs cross-attention layers and a negative loss function to strengthen the relevance between queries and video clips, and it can be trained on video-only or video–audio multi-modal data. The two primary tasks in video highlight detection are moment retrieval and highlight detection.
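To make the query-clip cross-attention idea concrete, the snippet below shows a generic cross-attention step in which clip tokens attend to the text-query tokens; it is a simplified illustration, not the actual QD-DETR layer.

```python
import torch
import torch.nn as nn

# Generic query-to-clip cross-attention: clip tokens attend to text-query tokens,
# so every clip representation becomes conditioned on the user query.
d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

clip_tokens = torch.randn(1, 75, d_model)    # 75 projected video-clip features
query_tokens = torch.randn(1, 12, d_model)   # 12 projected text tokens of the query

# Query = clips, Key/Value = text tokens: the output keeps one token per clip,
# each mixed with the query information it attends to.
fused_clips, attn_weights = cross_attn(clip_tokens, query_tokens, query_tokens)
print(fused_clips.shape)    # torch.Size([1, 75, 256])
```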
Moment retrieval is the process of finding moments in a video that are relevant to a provided natural language query. In a video highlight detection network, this involves pinpointing the start and end times of a segment within the temporal domain of the video sequence. Queries are typically about specific activities, and datasets often have a strong bias toward the beginning of the video rather than the end. The QVHighlights dataset, introduced in 2021, was designed to counter this bias by creating queries for multiple moments throughout the videos rather than only near the beginning. A model's moment retrieval results depend on the highlight annotations in the training data: if the highlight segments in the training data are annotated in two-second units, the model's moment retrieval predictions will likewise fall on two-second intervals.
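Moment retrieval is typically evaluated by the temporal overlap between predicted and ground-truth intervals. The helper below computes this temporal IoU; it reflects common evaluation practice and is not tied to any one of the cited models.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted window of (60, 68) against a ground-truth accident window of (62, 70):
print(temporal_iou((60, 68), (62, 70)))   # 0.6, counted as correct at an IoU >= 0.5 threshold
```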
Highlight detection is the process of identifying interesting or important segments within a video [22,23,24,25]. Traditional datasets [26] do not provide personalized highlights because they lack queries related to specific video segments. However, video highlight detection models such as Moment-DETR, UMT, and QD-DETR [12,13,14] derive meaningful information from saliency score annotations, which rate the relevance between clips and queries. Human annotators therefore rate the relevance between a query and a clip with saliency scores on an integer scale from 0 to 4 (Very Bad, Bad, Fair, Good, Very Good) [12].
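In practice, the per-annotator scores are aggregated into a single per-clip training target. One simple aggregation, assuming three annotators per clip and normalization to [0, 1], is sketched below; the cited models may use a different scheme.

```python
# Aggregate per-annotator saliency scores (0-4) into one training target per clip.
# Averaging and normalizing to [0, 1] is one simple choice; the cited models may differ.
saliency_scores = [[3, 4, 4], [4, 4, 4], [4, 3, 4], [2, 3, 3]]   # 4 clips x 3 annotators

targets = [sum(scores) / (len(scores) * 4.0) for scores in saliency_scores]
print(targets)   # [0.9166..., 1.0, 0.9166..., 0.6666...]
```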
Multi-modal learning is a method that utilizes data with diverse characteristics, not just one type of data. This is similar to the way humans use multiple sensory organs to solve problems. Typically, modalities refer to various data such as images, videos, text, audio, and more. Multi-modal learning is utilized in various application areas, with research being conducted to improve performance in autonomous vehicles, music analysis, speech recognition, natural language processing, and image caption generation, among others.
In short, the adopted feature extraction methods cover video, audio, and text features, and in all three cases pre-trained models are commonly used as the extractors. The aforementioned video highlight detection methods all employ transformer-based networks and share the common characteristic of multi-modal learning.
In this paper, we leverage the previously mentioned video highlight detection models and the interaction of various modalities. Moment-DETR learns from video and text data, while UMT and QD-DETR consider interactions between video, audio, and text data to build robust video highlight detection models. Moment-DETR, UMT, and QD-DETR share a common characteristic of detecting segments related to queries about daily life, such as travel and cooking, within a video. However, for the first time, we apply these existing video highlight detection models to the task of traffic accident video classification and analyze the detailed design considerations that arise during this application.
6. Conclusions
We are the first to apply existing video highlight detection techniques to traffic accident classification and analyze their performance. Additionally, we have generated the supplementary information necessary for video highlight detection within the traffic accident dataset. Our paper analyzes the impact on performance of utilizing heterogeneous information from video alone or from a combination of video and audio. The traffic accident detection performance of the models is compared and analyzed across three different video feature extraction methods, and the optimal video feature extraction method for traffic accident classification is presented for each model. We also examine the differences between self-attention and cross-attention when combining video and text information and consider the effect of the existing negative loss on traffic accident detection performance.
In the future, it will be possible to implement a real-time traffic accident detection and response system based on cross-modality augmented reality. Vehicle-mounted cameras, together with query generation modules, will be able to present augmented-reality information to drivers in real time. Moreover, by exploiting the interaction between video and audio data, such an augmented reality system will be able to detect accidents in the real world accurately. Once an accident is detected, the system can immediately display the time of the accident, the type of accident, and its saliency to the driver to help prevent secondary accidents.
In the field of traffic accident detection, efficiently combining video, audio, and text features is crucial. Moving forward, we will research the design of an efficient model structure that accommodates these three modalities. We plan to expand our models by combining video with a large language model (LLM) and with a large auditory model (LAM) to create a comprehensive system. Through the use of a large language model, we aim to develop a system that accurately recognizes and interprets text information within videos, such as road signs, traffic lights, and vehicle license plates, to gain a detailed understanding of accident scenarios. Additionally, we intend to enhance the accuracy of accident classification by analyzing various auditory signals from the road, such as the patterns of screeching brakes and collision sounds, through a large auditory model.