1. Introduction
The proliferation of cameras and the growing availability of cheap storage, coupled with the increasing demand for security, have fuelled the development of ever more complex video surveillance systems. While capturing images of possible transgressions has been greatly facilitated, relying on human controllers (e.g., security guards) to analyse repetitive, monotonous footage, often from multiple cameras at once, is a critical weakness. Such an exhausting effort makes it very difficult for the controller to remain vigilant at all times, which might lead to abnormal events going unnoticed. In safety-critical domains, such as the detection of suspect packages in airports or train stations, neglected occurrences may have dangerous results.
The identification of unexpected events, behaviours or objects can be recast as an anomaly detection problem [1,2]. The application of deep anomaly detection methods is essential to develop new surveillance and monitoring systems that do not rely solely on human supervision, reducing the risk of the aforementioned drawbacks. Despite years of research and development, the detection of anomalies in videos remains challenging, and it differs from the traditional classification problem in two major respects. Firstly, new kinds of anomalies are constantly arising, making it virtually impossible to enumerate them all. Secondly, collecting sufficient negative samples is costly due to their rarity. A popular approach to deep anomaly detection uses videos of normal events as training data; abnormal events are then detected at test time as those that do not conform to the trained model [3,4]. This approach circumvents the difficulty of gathering sufficient samples of anomalies. Most of the frames that these systems analyse depict normal scenarios; the gathered data therefore reflect this low ratio of abnormal snippets, and fully supervised approaches are not viable.
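The train-on-normal-only scheme described above can be illustrated with a minimal sketch. Here, PCA reconstruction error stands in for the reconstruction error of a deep autoencoder; the data, function names and threshold choice are purely illustrative, not taken from any of the reviewed works.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" training data: feature vectors (e.g., pooled frame features)
# that lie close to a low-dimensional subspace of normal variation.
normal = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.8], [0.0, 0.1]])

# Fit the normality model on normal samples only: no anomalies needed.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
pc = vt[0]  # dominant direction of normal variation

def anomaly_score(x):
    """Reconstruction error after projecting onto the normal subspace."""
    centered = x - mean
    recon = np.outer(centered @ pc, pc)
    return np.linalg.norm(centered - recon, axis=1)

# Threshold calibrated on normal data; test samples above it are flagged.
threshold = np.percentile(anomaly_score(normal), 99)
test_normal = np.array([[1.0, 0.8]])    # conforms to the training pattern
test_abnormal = np.array([[0.0, 3.0]])  # does not conform: high score
```

Deep models replace the linear projection with a learned encoder-decoder, but the scoring principle, deviation from a model fitted exclusively to normality, is the same.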
Deep learning approaches to the detection of visual data instances that markedly digress from regular sequences have mostly focused on outdoor video-surveillance scenarios, mainly regarding abnormal behaviour and suspicious or abandoned object detection. A pertinent research opportunity for anomaly detection that has been overlooked in the available literature is in-vehicle monitoring, especially when using solely visual data. With the increasing relevance of public transport in urban mobility, several funded projects aiming to develop autonomous surveillance systems in this area have appeared. For instance, Prevent PCP involves some of the biggest transport operators in Europe, which consider that a concerted effort is required to develop systems capable of detecting abnormal behaviours that put passengers’ safety at risk. In the initial stages of this project, the lack of task-oriented datasets has been noted; an effort to acquire and label the required footage to build a dedicated dataset has been planned. However, in-vehicle monitoring is not limited to public transport, in which large crowds must be monitored; the advent of Shared Autonomous Vehicles [5], which do not have a driver responsible for maintaining the well-being of passengers, must be accompanied by competent and reliable autonomous in-vehicle surveillance systems.
The development of robust solutions for in-vehicle monitoring is not straightforward, as the conditions in which they must operate are very challenging and different from those faced by methods covering outdoor video surveillance. Nonetheless, a recapitulation of relevant state-of-the-art techniques and available resources is essential to investigate their applicability and to achieve a deeper understanding of the potential issues raised by the new scenario that these methods did not contemplate. The same principle must be applied to the analysis of available datasets for training and benchmarking such models. It is essential to determine whether any portion of the available datasets is representative of the real-world settings that the systems will face; if not, the possibility of repurposing and adapting these instances should be studied. The current challenges of developing an application-oriented solution to in-vehicle monitoring have two distinct origins. On the one hand, there are potential issues directly linked to the characteristics of the application, most notably the absence of public datasets explicitly dedicated to in-vehicle monitoring; the importance of actor independence in Shared Autonomous Vehicles, moving backgrounds and frequent illumination changes caused by the movement of the vehicle are also important factors to consider. On the other hand, there are limitations transversal to every anomaly detection technique proposed so far. As Pang et al. [6] note, a series of complex detection challenges remain largely unsolved and are yet to be fully addressed by deep anomaly detection. The first of these is the low anomaly detection recall rate caused by the rare and heterogeneous nature of anomalies; as they are difficult to identify, sophisticated anomalies are missed. Additionally, since the candidate pool of anomalies is often unbounded, the strategies that these methods employ to deal with novelty must not be overlooked.
As a consequence of the previously mentioned difficulty of gathering abnormal samples for training or validation, there is a significant effort to achieve high data efficiency for learning normality and abnormality. Fully supervised anomaly detection is, for the time being, a virtually impossible endeavour, mainly due to the high cost of collecting large-scale data or generating sufficiently broad artificial datasets. When some labels for anomaly classes are available, they might be incomplete, inexact (e.g., coarse-grained) or inaccurate. The subject of actor independence is relevant as well, as the actors present in the training data could introduce bias due to their lack of representativeness (e.g., height, gender, age, type of clothing). In addition to learning expressive representations from small amounts of data, it is also essential to learn models that generalise to novel anomalies. This theme extends to noise-resilient anomaly detection, with noise meaning mislabelled data or unlabelled anomalies. The amount of noise not only differs significantly from dataset to dataset, but is also irregularly distributed in the data space. Noise-resilient models can leverage such incomplete data to achieve better performance and robustness.
Most currently developed methods are committed to detecting individual anomalous instances, often regarded as point anomalies. However, more complex anomalies, such as conditional and group anomalies, comprise objectively different dynamics and behaviours. Conditional anomalies also refer to individual anomalous instances, but they only represent abnormal behaviour when they occur in a specific context. Group anomalies are anomalous as a whole, although the isolated behaviour of each member might not be abnormal. Furthermore, many applications require the detection of anomalies across multiple data sources, whether heterogeneous (e.g., video and audio) or not (e.g., multiple surveillance cameras). The complexity of these systems is yet to be properly addressed by deep anomaly detection strategies, even though high-dimensional anomaly detection has been a long-standing problem [7]. Identifying intricate feature interactions and couplings is already challenging on its own, let alone when temporal and spatial interdependency relationships must also be considered.
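The group-anomaly notion can be made concrete with a toy example (all numbers and names illustrative): each actor's speed is individually plausible, but every actor moving fast at once, as in a fleeing crowd, is anomalous only as a group.

```python
import numpy as np

def group_score(speeds, limit=2.0):
    """Fraction of actors exceeding a per-actor speed limit. A single
    exceedance is unremarkable in isolation; a high fraction flags the
    whole group as anomalous."""
    return float(np.mean(np.asarray(speeds) > limit))

calm = [0.5, 1.0, 2.5, 0.8]     # one runner: normal in isolation
fleeing = [2.8, 3.1, 2.6, 2.9]  # everyone running at once: group anomaly

group_score(calm)     # 0.25
group_score(fleeing)  # 1.0
```

A point-anomaly detector scoring each actor independently would treat both scenes identically, which is precisely why group anomalies require aggregate modelling.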
The success of Machine Learning (ML) has led to a growing interest in the development of Artificial Intelligence (AI) applications capable of providing explanations for their decisions, often called Explainable AI [8]. This information is essential for users to trust, understand and manage these applications. However, every explanation is set within a context that depends on what is expected of the AI system. For in-vehicle monitoring, anomaly identification should be one of the main points of interest, and it should generally be paired with anomaly classification. The detected anomalies should be coupled with cues that demonstrate why a specific data instance is abnormal. The simplest implementation of this practice is to spatially identify the anomaly in a frame (e.g., with a bounding box or a GradCAM activation map). However, most anomaly detection studies focus on detection performance only, ignoring the capability of illustrating the identified anomalies. The complexity of the anomalies calls for visually interpretable anomaly detection models, as the provided cues could be essential to identify problems such as under-represented groups. Accordingly, this work presents the first aggregated critical review on the applicability of deep video anomaly detection to in-vehicle monitoring, making the following three major contributions:
Review of a large number of state-of-the-art methods for deep video anomaly detection, aiming to explain their frameworks and implementations, thus providing a deeper understanding of potential issues raised by the new scenario that these methods did not contemplate. Performance benchmarks were compiled, as well as publicly accessible source code, to evaluate ease of applicability;
Review of a large number of datasets with real anomalies that are used to benchmark state-of-the-art models, investigating whether any portion of the available datasets is representative of the real-world settings of in-vehicle monitoring or whether the sequences can be repurposed for this matter. As public datasets dedicated to in-vehicle monitoring are lacking, this analysis is vital;
Initiation of an important discussion on application-oriented issues related to deep anomaly detection for in-vehicle monitoring. Other surveys and reviews have disregarded this scenario and its specificities, despite its relevance, as shown by the listed funded projects that seek application-oriented solutions for in-vehicle monitoring. Possible solutions are proposed as directions for future work.
This document is organised as follows: Section 2 presents the working principles, different approaches, and state-of-the-art works in deep anomaly detection for video sequences. Section 3 reviews currently used datasets to train and benchmark models for anomaly detection. Moreover, Section 4 discusses funded projects that seek the exploration of in-vehicle monitoring, listing available resources that serve as a starting point for developing application-oriented solutions. Section 5 examines the challenges that this new scenario of application faces as well as the opportunities that arise with its exploration. Finally, Section 6 presents the main conclusions of this review.
5. Challenges, Approaches and Opportunities for In-Vehicle Monitoring
Anomaly detection in confined spaces, such as the interior of vehicles, is an interesting new application scenario for these methods. However, as the work of Augusto et al. [5] demonstrates, the development of solutions for this use case is still fully dependent on the availability of private datasets. The authors used a subset of a dataset provided by Bosch Car Multimedia containing videos of nine different actor pairs performing various activities in the backseat of a vehicle. Every video featured two actors in every frame, and the anomalies present in the subset are strictly related to violent interactions between two individuals (e.g., slapping and punching). However, the relevance of objects was not considered in this work, whether as a danger to the passengers or simply as an item left behind by one of them. The latter is of significant importance in the suggested shared autonomous vehicle scenario.
Creating new datasets or expanding existing ones appears to be an immediate need when considering new applications for anomaly detection. The former is a complex and costly task that implies allocating resources for staging and recording the desired interactions. An additional bureaucratic effort is also required to obtain permission from the actors involved. Moreover, post-recording labelling is time-consuming. Hence, an attractive option relies on synthetic data that could be generated for direct use or to augment available data. The work of Acsintoae et al. [55] offers an interesting approach to the translation of simulated objects into real-world datasets using a CycleGAN [56]. Similar hybrid strategies could be employed to circumvent the lack of data for in-vehicle monitoring applications. Furthermore, such strategies could pre-emptively add artificial variety to the available video sequences. As mentioned, the videos provided by Bosch Car Multimedia that were used by Augusto et al. [5] contained only nine different actor pairs. The work of Capozzi et al. [70] has linked the lack of actor independence with the underperformance of the trained models, as a bias develops linking certain actors to certain actions instead of learning the pattern of the action. Moreover, a larger pool of actor characteristics, artificially increased or not, is essential to expose the model to diverse scenarios. Some additional challenges arise from the type and model of the vehicle that is used. For instance, the shape of the windows affects the background and light conditions in the captured scenes. Furthermore, the seats of the vehicle influence the range of movements of the passengers as well as their pose. The diversity of vehicles and actors is thus an essential factor in producing a robust model.
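One inexpensive way to add such artificial variety is photometric augmentation. The sketch below (function name and parameter ranges are illustrative assumptions, not taken from the reviewed works) applies a random global gain and bias per clip to mimic passing headlights or sunlight occlusions:

```python
import numpy as np

def illumination_jitter(frames, rng, gain_range=(0.6, 1.4), bias_range=(-0.1, 0.1)):
    """Apply one random global gain/bias to a whole clip to mimic
    vehicle-induced lighting changes; frames are floats in [0, 1]."""
    gain = rng.uniform(*gain_range)
    bias = rng.uniform(*bias_range)
    return np.clip(frames * gain + bias, 0.0, 1.0)

rng = np.random.default_rng(2)
clip = rng.uniform(0.3, 0.7, size=(16, 8, 8))  # 16 toy frames of 8x8 pixels
aug = illumination_jitter(clip, rng)           # same clip, different lighting
```

Applying the same perturbation to every frame of a clip preserves temporal coherence, so the augmented sequence still represents a plausible recording rather than per-frame flicker.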
Choosing the best model for a new use case such as anomaly detection inside a vehicle is not straightforward. The typical scenario of the reviewed publicly available datasets does not faithfully represent the new environment in which anomalies must be detected; therefore, their use does not produce an authentic benchmark of the proposed methods. Most of these sequences were captured with stationary video cameras recording static backgrounds. Although cameras inside vehicles are also stationary, the windows of a moving vehicle produce a partially moving background in the recorded sequence. The distance between the cameras and the subjects is much smaller inside a vehicle, increasing the effect of geometric distortions on the captured information. Additionally, the headlights of other vehicles, public illumination and occlusions of sunlight produce more frequent illumination perturbations in the scene than those found in datasets that focus on, for instance, a pedestrian walkway. The behaviour of the available models in such scenes is uncertain, as their authors did not have to specifically build and test tools for such problems; however, these problems cannot be neglected when building a successful application for this use case. Furthermore, in the available datasets, especially those regarding pedestrians and crowds, the entire body of each actor is visible; therefore, the models can benefit from this information to detect anomalies. Inside confined spaces, this might not be possible. In the in-vehicle scenario, due to the limited available camera positions, part of the legs of the passengers is occluded, as Figure 5c demonstrates. Therefore, in such tasks, the models are limited to partial information regarding the human actors and the area in which the actions take place.
None of the datasets analysed in Section 3 provides a convenient tool for training and benchmarking a model for anomaly detection inside a vehicle or similarly confined spaces. The datasets comprising sequences of pedestrians and crowds were mostly recorded outdoors and cover a large area compared to the new scenario of interest. Additionally, the normal samples in these sequences consist of people walking or simply standing, actions that would be considered abnormal inside a car. As far as confined spaces are concerned, the datasets that present real-world anomalies, UCF-Crime [40] and XD-Violence [59], possess some scenes that fit this context. However, the available labels give no information regarding the location in which the sequences take place; therefore, an additional labelling effort would be required. Moreover, they lack coherence in terms of camera placement or the type of confined space presented, as these videos were extracted from films or the internet. On the other hand, SVIRO-Uncertainty [63] depicts an in-vehicle scenario, despite not presenting relevant information for anomaly detection in terms of abnormal actions perpetrated by the passengers. Its potential remains limited to the detection of dangerous or abandoned objects, which is a subset of anomaly detection.
A common issue with the proposed deep anomaly detection techniques was noted by Pang et al. [6]: most anomaly detection studies focus on detection performance only, ignoring the capability of illustrating the identified anomalies. Although it would be relevant to classify the abnormal behaviour that was detected, the detection could represent a novel anomaly. Hence, it is crucial to at least provide spatial cues that demonstrate the specific data portion that is anomalous. These cues might prove useful as a tool for interpreting such complex models and identifying scenarios in which they might fail. Furthermore, the works of Liu et al. [53] and Landi et al. [54] have proven that locality is a powerful instrument to improve performance and reduce background bias. Some re-labelling was required to construct both attention-driven models, but robust results were achieved. Additionally, the model proposed by Landi et al. [54] was able to provide spatiotemporal proposals for unseen surveillance videos leveraging only video-level labels, a useful feature for the needed expansion of datasets.
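The simplest form of such a spatial cue can be sketched as a per-block deviation map, a toy stand-in for learned attention or GradCAM-style maps (all data, names and block sizes illustrative):

```python
import numpy as np

def block_anomaly_map(frame, normal_model, block=8):
    """Coarse spatial cue: mean absolute deviation from a model of the
    normal frame, pooled over non-overlapping blocks."""
    diff = np.abs(frame - normal_model)
    h, w = diff.shape
    hb, wb = h // block, w // block
    return diff[:hb * block, :wb * block].reshape(hb, block, wb, block).mean(axis=(1, 3))

rng = np.random.default_rng(1)
normal_model = rng.normal(0.0, 0.01, size=(32, 32))  # stand-in for learned normality
frame = normal_model.copy()
frame[16:24, 8:16] += 1.0                            # inject a localised anomaly

amap = block_anomaly_map(frame, normal_model)
peak = np.unravel_index(np.argmax(amap), amap.shape)  # block (2, 1): the injected region
```

Even this crude map turns a scalar anomaly score into an interpretable localisation, which is the kind of cue the discussion above argues detection systems should expose.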
6. Conclusions
In this article, various deep learning methods for anomaly detection in videos were discussed. Studying the defining characteristics of state-of-the-art methods is important not only to gain a better understanding of the general problem of anomaly detection but also to understand how the offered solutions could fit into the new scenario of interest: in-vehicle monitoring. The major contributions of the analysed works are briefly summarised in Table 4. Additionally, Table 6 provides a compilation of the available source code; the code present in these repositories comprises an interesting starting point for replicating and improving these models for new applications.
The analysis of state-of-the-art techniques provided a deeper understanding of the background of these models and its influence on their current limitations regarding in-vehicle monitoring. The focus on crowded scenes and outdoor spaces led to a failure to consider problems associated with the nature of this new scenario. For instance, the surveillance of Shared Autonomous Vehicles requires much closer recording of the subjects, which raises questions about the importance of actor independence and the effect of geometric distortions, caused by the camera lens, on the captured information. Moreover, these models assume a mostly static background, although the movement of the car and the presence of windows result in moving backgrounds. Additionally, frequent illumination changes (e.g., a cloud covering the sun) have a more intense impact on the visual information in such scenarios.
The main limitation of implementing anomaly detection solutions for in-vehicle monitoring is the lack of data samples explicitly dedicated to the detection of abnormal behaviours inside a vehicle or similarly confined spaces. There are currently no public datasets that could be directly used as a tool for training and benchmarking such models, and the development of solutions for this use case is still fully dependent on the availability of private datasets. Although newer datasets have been adapted to benchmark models created for anomaly detection tasks, their original focus was action recognition. The reviewed synthetic datasets presented high-quality images with extensive annotations but do not comprise data instances compatible with this task. Although this is a severe challenge, it also provides a great opportunity to study techniques for data augmentation and generation. In this paper, several approaches were proposed for future work. They can be summarised in a two-stage process. Firstly, available datasets, mainly those covering diverse real-world anomalies, must be extensively studied to find instances representative of the real-world settings that the systems will face, providing an initial reference. Secondly, the expansion and adaptation of similar instances should be contemplated. This could be achieved through the translation of simulated objects or actors, tackling not only the lack of available sequences but also their reduced diversity of actors, actions and illumination conditions. This approach could also reduce the labelling effort of new captures.
This review initiates an important discussion on application-oriented issues related to deep anomaly detection for in-vehicle monitoring, a field with high potential for exploration in future works. Other surveys and reviews have disregarded this scenario and its specificities, despite its relevance, as shown by funded projects, such as Prevent PCP, that aim to take advantage of innovative solutions for applying anomaly detection to this scenario. Moreover, in-vehicle monitoring increases the interest in optimising anomaly detection models for embedded systems, as implementation requires the capability of running locally on resource-limited hardware.