1. Introduction
The rise in the development of autonomous vehicles underpins essential safety concerns particularly for vulnerable road users (VRUs) such as pedestrians and cyclists. Concerns have been mounting specifically surrounding whether the autonomous vehicle is able to take them into consideration while operating on public roads. Therefore, it is critical that the autonomous vehicle can detect, classify and predict the intention of the VRUs in real time, and required action is taken not to compromise the safety of other road users. To achieve this, deep learning (DL) techniques have recently been employed for detection and pose estimation to predict the intention of pedestrians and cyclists. For example, Convolutional Neural Networks (CNNs), a type of DL technique, have been highly successful in the field of object detection, particularly, pedestrian detection [
1,
2,
3,
4,
5]. Recent advances of such DL techniques have outperformed previous methods of computer vision problems (see [
6,
7,
8,
9,
10] reduced number of refs). Some DL techniques used for pedestrian detection have achieved miss rates of less than 10% [
11]. Although the miss rate is significantly low, they are yet to reach human levels of detection, and therefore significant research is still necessary. Until detection levels are improved, autonomous vehicles remain a danger to VRUs.
According to the World Health Organisation (WHO), nearly half of road traffic fatalities are experienced by pedestrians and cyclists than any other road users as they do not have any special means of protection (i.e., helmets, clothing, etc.) [
12]. To be able to predict the intention of a pedestrian using identification and pose estimation techniques would provide a higher level of safety for all road users. In 2013, WHO reported that it is expected that traffic accidents will be the fifth leading cause of death by 2030, rising from the current eighth position [
13,
14]. In 2013, VRUs make up more than a quarter of victims of traffic accidents. Of the deaths recorded due to traffic accidents, 42% were pedestrians and 16% were cyclists, with 69% of these fatal accidents occurring in urban locations. In 2017, of all fatalities due to road traffic accidents, 21% were pedestrians, and 8% were cyclists [
15]. In the UK, pedestrians and cyclists accounted for 26% and 6% of road traffic fatalities, respectively [
16] in 2017. Most accidents occurred in rural roads (55%) and Urban areas (37%). It is also worth noting that half of the accidents involving pedestrians occur at night [
17,
18].
Autonomous vehicles aim to make the roads safer for the VRUs through accurate detection. Although detection systems have become more accurate, they have yet to reach human levels. To improve the accuracy of detection systems, the challenges that need to be overcome include occlusion, crowding, weather and lighting conditions. The flowchart in
Figure 1 represents the tasks required by the autonomous vehicles to safely detect and estimate the future actions of VRUs. This process allows the vehicle to safely navigate with respect to the VRU. The interaction between the autonomous vehicle and its surroundings is achieved via sensors which collect information primarily to detect and track objects. The sensor input in
Figure 1 is affected by external sources, which can reduce efficiency. Typically, the sensing method relies on a vision-based approach such as visible band cameras (operating at the spectrum of 400 nm to 700 nm) [
1,
2,
3,
4]. Sensing based on the visible light spectrum is susceptible to ambient light, shadowing and weather conditions. During low-light conditions due to the time of the day, weather, shadowing, etc. can reduce the accuracy of the sensors. A common approach to overcome this problem is to create multiple sensor systems using sensor fusion (e.g., combining visible and infrared band camera) to increase the robustness and accuracy [
3,
4]. The thermal sensor detects the thermal radiation from an object, which allows the detection and tracking of pedestrians and cyclists in low-light conditions.
The accuracy of the classification, detection and pose estimation are based on the quality of the sensor data information. The focus of this paper is to provide an overview of the current pedestrian and cyclist detection and intent estimation techniques and compare the existing techniques. Building upon the vast existing literature in the field of computer vision and object detection, pedestrian and cyclist detection will be explored and discussed. The detection stage allows for identification and location of such objects in images and video frames [
19], therefore making it a vital part of autonomous vehicles [
20,
21,
22,
23]. Detection results are then used for tracking and pose estimation of the pedestrians/cyclists. As DL techniques for VRU detection and intent estimation will be the primary focus, this will not encompass tracking techniques.
The purpose of this survey is to provide comprehensive review of the recent studies undertaken in both pedestrian and cyclist detection and pose estimation based on state-of-the-art sensor fusion and DL techniques. There is limited work focused on cyclists detection compared to pedestrian detection. There is also limited work on using multispectral data for VRU detection. Using sensor fusion techniques with DL can lead to improved results based on previous state-of-the-art methods. Therefore, it is critical to find an optimal fusion technique to improve the detection accuracy of the system. Once detected, pose estimation techniques can be applied to the VRUs.
The organisation of the paper is as follows:
Section 2 will highlight the challenges and importance of detection and intent estimation for autonomous vehicles.
Section 3 and
Section 4 will provide a brief history of object detection techniques and its typical detection pipeline.
Section 5 will explore the state-of-art techniques based on DL currently used in pedestrian and cyclist detection.
Section 6 discusses the architectures of the DL-based detectors for pedestrians and cyclists.
Section 7 outlines the datasets used for pedestrian and cyclist detection.
Section 8 will discuss DL-based sensor fusion approaches for an improving detection.
Section 9 introduces the latest DL approaches that are used for pose estimation and intent estimation. Concluding remarks and future works will be presented in
Section 10.
2. Challenges of Detection and Intent Estimation
Advanced driver assistance system (ADAS) technology, such as cruise control, emergency braking and lane departure system have brought a certain level of safety for vehicles and other road users. Automatic speed control (cruise control) was developed in the early 1990s, based on electronic cruise control technology that was introduced in the late 1960s. It was not widely implemented until the 1980s [
24]. From the cruise control technology, adaptive cruise control was developed. It uses sensors to detect vehicles in front to adjust speed to maintain a distance between the vehicles. These sensors have also been used for emergency braking if an object is detected within a given range. Lane departure systems are used for warning the driver of potentially unintended lane changes. Initially designed for semi-truck drivers, they were adopted by consumer vehicles in 2001 as part of the lane keeping support system by Nissan [
24]. By monitoring the use of the indicators, the system detects if a lane departure is intentional. If the vehicle begins to change lanes without the use of indicators, the system warns the driver. The systems discussed above are dependent on driver intervention and focus on a single aspect of dangers on the road. However, the technology discussed above cannot provide a sufficient level of safety for a fully autonomous vehicle. So, further research is required to increase detection accuracy for autonomous vehicles.
For a fully automated vehicle, the detection of dangers associated with pedestrian/cyclist detection should be a continuous operation, as represented in
Figure 1. This cannot be achieved by a driver driven vehicle as the driver cannot maintain a continuous level of awareness to their surroundings. Even with the considerable progress on autonomous vehicles, further development is pivotal for pedestrian and cyclist detection to address safety concerns. Therefore, this continues to be an area that is being investigated and explored, as in [
25,
26,
27] reduced ref grouping.
3. Detection Techniques: A Brief History
Detection techniques, especially for pedestrians, has been widely researched with several techniques. The first instance of object detection is known as the region of interest (ROI) [
28]. Once the potential location of the desired object (i.e., pedestrian or cyclist) is identified in an image, feature extraction takes place. These features can include edges, shapes, curvature, etc. These features are sent to a classifier for classification [
28] (see
Figure 1).
The Background Subtraction (BS) approach was the first technique applied for detecting a moving object. In this approach, the moving objects are identified by comparing the current frame with the reference frame, known as the background image [
23,
29]. This method is simple to implement but is susceptible to environmental conditions such as light intensities (i.e., time of day, shadowing) and dynamic backgrounds [
30]. To improve the detection and tracking, a number of advanced techniques such as the sliding window, objectiveness and selective search were developed [
31].
Algorithms for feature extraction and classification for object detection can be either hand-crafted or DL-based methods. Hand-crafted methods for feature extraction are based on models that were manually designed on low-level features to propose ROIs [
19]. These models were based on techniques such as BS, the histogram of oriented gradients (HOG) features [
32,
33] or local binary pattern (LBP) [
34]. Hand-crafted methods can be limited and not very robust as complex features can be difficult to hand-craft. DL techniques allow the network to determine features. This can provide a higher level of abstraction.
Then classifiers, such as a Support Vector Machine (SVM) [
19,
35,
36,
37,
38], a decision tree [
19,
36,
37,
38] or a deep network [
39,
40] are used to classify the object (e.g., pedestrian, cyclist) in the image or video sequence. Deep networks have shown promising results in pedestrian detection, outperforming some traditional methods of pedestrian detection. DL-based techniques will be discussed in later sections.
Some of the more commonly used hand-crafted techniques for pedestrian and cyclist detection are discussed below. Haar-like features detect the changes in intensities in the horizontal, vertical and diagonal directions to detect the object [
23,
41]. Viola and Jones (VJ) implemented the Haar-like features detection approach, while also taking into account the intensity information from the video frame [
30,
42]. Introduced in 2003, it used the sudden changes in pixel intensities to detect the shape of an object [
42,
43,
44]. The VJ detector was one of the earlier techniques designed for pedestrian detection [
42]. It used box-shaped filters for feature extraction which is then fed into a classifier based on adaptive boosting known as AdaBoost [
45]. Dalal and Triggs presented the HOG (detector which uses a linear SVM for classification [
25,
32,
43,
44]). The HOG detector finds an object’s shape and appearance based on the intensities of the local gradients or the orientation of the edge [
23,
32]. The HOG detector became a building block for the Deformable Part Model (DPM) detector in later works [
25,
35,
44,
46,
47]. DPM was used to weaken the effects of deformation of non-rigid objects [
48]. DPM is a popular method for object detection and works well with varying and occluded appearances [
48,
49]. Based on the DPM, many other object detection methods have been proposed [
50]. DPM was implemented in [
25] to simultaneously detect and classify both pedestrians and cyclists using an innovative detection approach with a deep network for classification and localisation. The detection method, upper body-multiple potential regions (UB-MPR ), focused on the UB of the pedestrian/cyclist for object candidate abstraction as the UB of these road users are normally similar and visible. The potential object regions were extracted using multiple potential regions (MPR) for the UB of the candidate. These potential objects were then sent to a Fast Region-Convolutional Neural Network (R-CNN) [
51] for classification. A Fast R-CNN is a DL approach which will be discussed in later sections. A similar approach for using DPM was found in [
52]. Methods for detecting pedestrians can be employed for cyclist detection as in [
53,
54]. LBP uses a neighbourhood of each pixel to extract features [
23,
34]. This method is very robust compared with the methods above and therefore has become very popular.
5. Deep Learning for Pedestrian and Cyclist Detection
A subset of artificial intelligence and machine learning, deep learning (DL) was first introduced in the 1990s but has only recently been able to be used due to advancements and decline in costs of computational equipment (e.g., graphics processing units (GPUs)) and efficient training algorithms [
44,
62]. In particular, the Convolutional Neural Networks (CNN) algorithms have been used in the field of computer vision and image analysis [
59] for object detection [
51], image classification [
7] and face recognition [
63]. CNN approaches have been considered state-of-the-art in this field of computer vision.
Convolutional Neural Networks (CNNs) are a type of DL technique that has high performance in many fields as object recognition and classification. These objects can include faces and handwritten numerals and letters. The robustness of CNNs stems from the fact that they are able to extract information from raw-pixel content and learn features automatically [
44]. It does this by performing various operations, typically some combination of filtering, pooling and non-linear activation. One benefit of using CNNs for feature extraction, when compared to hand-crafted methods, is that CNNs learn features from the images without explicit programming.
Since 2012, new approaches based on DL techniques have developed for pedestrian detection such as AlexNet [
1], a CNN technique developed by Alex Krizhevsky and named after the developer [
64]. AlexNet was trained used in an ImageNet dataset. For ImageNet, the custom is to report two error rates; top-1 (full testset) and top-5 (fraction of testset). AlexNet error rates were 37.5% and 17.0% for top-1 and top-5 respectively. Prior to AlexNet’s results, the best performance in terms of error rates were 47.1% and 28.2%. These results aided in the designing of hardware to improve the performance of CNNs for an increased accuracy in detection as well as the affordability of training of the CNNs.
DL uses multiple layers, which are able to extract features, such as edges or patterns in images and use these features to classify an object. In this way, deep neural networks such as CNNs are used for feature learning to recognise objects such as pedestrians [
59,
65,
66,
67]. Feed-forward neural networks comprise of a series of computational nodes known as neurons that are interconnected for information processing. This is also known as multi-layer perceptron (MLP). The nodes form layers that are interconnected through parameter values called weights. The neuron functions as a logistic regression classifier. The neurons use non-linear operations to transform input data and create a decision boundary in which the data can be linearly separable. An illustration of a single perceptron can be found in
Figure 3. Multiple layers of these perceptrons create an MLP or neural network (
Figure 4). The neural network in
Figure 4 is a fully connected network. This means that each neuron receives an input from each neuron from the previous layer. For CNN, convolution layers exist within the hidden layers to perform the convolutional computations.
DL aims to detect objects in a single/multiple frames similar to how humans detect and interact with objects [
44]. However, the detection of pedestrians and cyclists has been a major challenge in computer vision. With the recent software and hardware advancements, there has been a real progression in this field. There are many survey papers for pedestrian detection [
46,
68,
69,
70,
71,
72,
73,
74] and tracking systems, including the sensor technology and processing techniques. The use of a monocular camera for capturing images of pedestrians was used in [
75]. A review of techniques of pedestrian detection techniques is compared, including some DL techniques, namely, the Convolutional Neural Network (CNN) in [
43]. However, with the recent adoption of Deep Learning (DL) techniques, state-of-the-art survey for pedestrian detection and tracking using these DL techniques should be conducted [
23].
With the introduction of DL techniques (mainly CNN-based), deep network architectures are able to propose ROIs and extract the features for classification with a single step [
23,
64,
76,
77]. By this way, the need for traditional region proposal feature extraction techniques becomes obsolete. As deep networks can achieve a higher level of abstraction than traditional methods, higher accuracy and faster run-time can be achieved by deep network-based detectors [
44]. This is one of the benefits of using DL for object detection. However, the training of these deep networks structure requires longer to build as deep networks require large annotated datasets for training. DL-based object detection has yielded encouraging results in the field of pedestrian detection and general object detection [
50,
64,
78,
79] (will be discussed in later sections).
Convolutional Neural Network
Prior to current state-of-the-art neural networks being introduced, basic neural networks (such as in
Figure 4) would sometimes find it difficult to extract useful features from raw data from sensors. To find significant features, hand-crafted methods were used [
65,
66] (as discussed in
Section 3). To overcome this and increase the performance of the neural network, Convolutional Neural Networks (CNNs) were implemented [
59,
65,
66,
67] (see
Figure 5). CNNs are based on the feed-forward neural network (where the output of a neuron would be the input of another set of neurons in the preceding layer). CNNs use convolutional operations to extract features from the input data (e.g., images, videos); with each layer using a kernel (filter) to extract input features. The activation value of the neurons in the layers represents the filtered input data. Different regions of the input are processed using convolutional operations to detect patterns in the data. Feature maps are then generated after the convolutional operation is performed across the entire input data [
65]. The feature map is a representation of the activation of different parts in the image. It is used to set the parametrisation of the weights and biases of the layers, allowing the learning of features. Max pooling is typically used after the convolution to reduce the size of the input. This reduces the computation requirement as the parameters of the input are reduced. This also aids in over-fitting.
The convolutional operator that is used is dependent on the type of input data. 2D kernels (i.e., filters) are used for 2D temporal sequences (e.g., videos) and 1D kernels are used for 1D temporal sequences. When the CNN’s kernel is used in this way, they can be used as classifiers [
65]. With several layers, CNNs are able to represent data in a hierarchical fashion. As the layers become deeper, the input data is represented in a more abstract manner, something hand-crafted feature extractors would find very difficult or impossible to achieve. This has allowed CNNs to become more of a standard practice in many fields, such as computer vision (e.g., object detection) [
64,
80,
81] and speech recognition [
82].
The network can automatically learn to extract useful information (i.e., features) from images/frames. As the CNN is a DL technique, there will be numerous neurons and layers. Each layer will learn different levels of abstraction. The first few layers learn lower level features such as edges, curves or patterns. The deep layers will attempt to combine the features to identify objects in the frame [
44]. The classifying layer typically consists of a number neurons. The number of neurons is dependent on the number of desired outputs (i.e., number of classes). For example, the classes could be pedestrian, cyclist or car, which means three classes are required. The higher the output value for one of these classifier neurons, the higher the chance that a pedestrian or cyclist is successfully detected. It is important to understand that this gives the deep network the ability to learn features without explicit programming. The learned information is stored within adjustable parameters of the network known as weights and biases. To train the network to learn features, a dataset is used. The dataset will feed numerous number of images that include the object that is to be detected. In this way, features are extracted and learned by the network. However, as the network learns based on only the dataset provided, it can be limited. Therefore, to design a more robust and accurate CNN, a very large annotated dataset is required.
6. Deep Learning Architectures for Pedestrian and Cyclist Detection
DL approaches for pedestrian and cyclist detection can be one of the following two categories: a two stages (region proposal approach) detector or a single stage (non-region proposal approach) detector. The single stage detector aims to remove the need for traditional region proposal feature extraction by processing these steps within a single network. The single stage detector can be simpler to train, with a higher computational efficiency [
5]. In this approach, a proposal of regions is first completed and then the deep network conducts the classification.
With the progress of DL and its development and success in pedestrian detection, detection accuracy has improved. The DL techniques used for pedestrian detection can include region proposal as part of the system. Some of the region proposal-based techniques include Region-CNN (R-CNN) [
79], Regional-Fast Convolutional Network (R-FCN) [
83] and Faster R-CNN [
59]. Non-region proposal-based techniques include Single Shot Detector (SSD) [
84,
85,
86] and You Only Look Once (YOLO) [
87]. All of these pedestrian detection techniques are based on CNN, which has become the standard for pedestrian detection. For the task of classification, detection techniques can be placed into one of these families: DPM variants, decision forests and deep neural networks [
47,
48]. These techniques can also be applied for cyclist detection as they are visibly similar to pedestrians [
25].
CNN is a popular technique for object classification in pedestrian detection systems [
59,
65]. An in-depth review of DL techniques is provided in later sections. A region proposal technique (such as the hand-crafted techniques described in the previous section) can be used alongside CNNs for object detection [
59,
65,
66,
67,
79]. The region proposal technique is used to suggest where an object may exist in the image. The proposed regions are fed into a classifier (e.g., CNN) to determine the class of the object. Studies of the use of deep networks for pedestrian detection applications can be found in [
88,
89,
90,
91,
92].
There are also non-region proposal-based DL techniques [
59,
79,
83].
Figure 6 depicts the difference between the two types of architecture for object detection.
Figure 6a is a detector based on traditional region proposal and feature extraction techniques, where only the classifier is the deep network.
Figure 6b represents a deep network that is able to complete region proposal and feature extraction as well as classification in a single step. This is known as a single step detector. In 2009 the Caltech dataset was introduced for benchmarking the various techniques for pedestrian detection. The ConvNet (a CNN-based approach) was introduced in 2013 with competitive results when compared to previous pedestrian detection techniques mentioned [
43].
6.1. Two-Stage Detectors
Region proposal-based CNNs (i.e., R-CNN, Fast R-CNN) has provided positive results for general object detection. An example of such a system is the use of selective search [
60] in [
79] for generating ROIs. The accuracy of this type of network is dependent on the region proposal technique that is applied as these ROIs are used for classification. Approaches have been made to improve the speed of two-stage detectors as in [
51], where feature maps are generated when the deep network extracts features from the ROIs [
59]. These techniques have since been adopted and variations of the techniques have been applied with encouraging results [
5].
For example, in [
93] the Average Precision (AP) achieved a higher score than the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) evaluation [
94]. The resulting AP increased by 9% to 16.7%. With the development of Fast R-CNN [
79] and Faster R-CNN, computational speed has increased. Fast R-CNN is based on R-CNN, also designed by [
79]. R-CNN uses selective search (a region proposal technique) to generate 2000 region proposals rather than some large number of region proposals. These proposals are fed into a CNN to feature extraction and then an SVM for classification. There were a few issues with this approach. Namely, classifying 2000 region proposals still takes a large amount of time. Also, this technique could not be implemented in real time as it took 47s per image. To overcome these issues, [
95] (the same author who proposed R-CNN) introduced Fast R-CNN. A similar approach to the R-CNN, however, this time the image is fed into a CNN to generate a feature map. This feature map is used to identify region proposals (i.e., ROIs). This was faster than the R-CNN technique as the convolution is completed once per image rather than for 2000 region proposals, which could have more than 2000 region proposals. Fast R-CNN was found to be approximately 2 magnitudes faster than techniques based on R-CNN [
95].
Faster R-CNN, proposed in [
59], which lets the network learn region proposals, rather than use selective search, as selective search can be a time consuming process. Based on the Fast R-CNN technique, images are fed into a CNN to generate feature maps. However, instead of using selective search for identifying region proposals, a sub-network is used to predict region proposals. This sub-network, termed Region Proposal Network (RPN), learns region proposals using DL algorithms. RPN provided a mean average precision (MAP) of 75.9%, which is approximately 10.2% better compared to selective search results on the VOC (visible Object Classes) 2012 dataset [
59]. However, these networks have also shown that computation of region proposal for object detection has a bottleneck as they are dependent on traditional techniques of region proposal.
Region-based CNNs share convolutions across proposals to reduce computational costs as in [
79,
96]. However, with the Fast R-CNN region proposal could be a bottleneck for the advancement for real-time detectors. To overcome this issue, a Region Proposal Network (RPN), which shares convolutional features with the detection network was proposed in [
59]. This allows for region proposals that are almost computationally cost-free. The RPN is a CNN that functions by predicting object bounds (region proposals) and scores for those bounds simultaneously. This provides the detector with high-quality region proposals. This design performed at near real-time frame rates, improving the quality and object detection accuracy for general DL-based object detection.
For example, the Faster R-CNN, a two-stage detector, comprises of a region proposal network (RPN) and a classification sub-network. The RPN uses DL techniques to learn features in images, allowing it detect potential region proposals. These region proposals are then fed into a classifier to determine the class of the object. The Faster R-CNN has had state-of-the-art performance results on datasets, such as the PASCAL-VOC and Caltech datasets. Most notably, Fast R-CNN [
25,
40] and Faster R-CNN [
25,
40,
48,
79] have been used for pedestrian detection [
59,
97]. For example, in [
40], there was an approximately 23% error reduction using a type of R-CNN approach when compared to some state-of-the-art techniques for pedestrian detection. These types of results illustrate the effectiveness of these techniques.
6.2. Single Stage Detectors
As promising as two-stage detector may be, for them to be able to process sizeable proposals, the computation is typically heavy in the second stage (i.e., classifier stage). So, Single Stage Detectors (SSDs) have been proposed that do not rely on region proposal in the hope that they would increase the speed of the system. SSD, such as You Only Look Once (YOLO), are designed in such a way that a single network predicts region proposals as well as the class of those region proposals [
84]. This design saves a significant amount of computational time, allowing it to perform 3× faster than the state-of-the-art Fast R-CNN while achieving higher accuracy in [
59]. Approaches for using deep networks for region proposal can be found in [
80,
98,
99].
The two-stage techniques are implemented to increase the accuracy and speed of the network (when compared to R-CNN approaches), whereas, the single stage techniques focus on the overall speed of the system, allowing them to be better suited for real-time applications [
19]. A comparison of the DL architectures can be found in [
100] and summarised in
Table 1. edited table below.
7. Sensors Fusion Techniques Using Deep Learning
Even with recent development and advancement made in computer vision for pedestrian and cyclist detection, there are several challenges that need to be addressed [
101,
102,
103,
104]. One of the biggest problems is accuracy; which is affected by cluttered backgrounds, environmental conditions, occlusions [
105], and poor visibility [
104,
106].
The environment around the autonomous vehicle is perceived through sensors. These sensors collect the environmental information that is then used for detecting any pedestrians or cyclists. Sensors can be classified as either active or passive. Active sensors typically require a device to be attached to the object that is to be detected and tracked. Although active sensors provide simple processing, they have been typically applied to controlled environments [
23]. For uncontrolled environments, passive sensors are more suitable as they use natural-based signal sources (e.g., natural light, thermal readings). Therefore there is no requirement for a device to be attached to the object, making it less intrusive than active sensors. Passive sensors would be more effective for autonomous vehicle applications as the environment in which they operate will be uncontrolled and attaching tracking devices would not be feasible. Some examples of the types of sensors employed in vehicles for environmental sensing are visible cameras, thermal cameras, LiDAR and RADAR. Implementation of these sensors is also described in
Table 2. The primary focus of this review will encompass visible and thermal sensors as it has become apparent that further research surrounding these sensors are required.
7.1. Visible-Based Sensors
For the detection of pedestrians and cyclists, visible sensors are often used as they are able to capture high-resolution images [
107,
108]. The images provide useful information for detection and classification, such as colour and texture. Cameras also typically provide more information than active sensors [
25]. Visible cameras have been applied to multiple tasks for vehicles, such as lane detection, distance detection from other vehicles and traffic sign detection.
In terms of 2D and 3D visible cameras, a 2D video would provide enough information to perform object detection. It could potentially even allow for tracking using a bi-dimensional approach [
23,
112]. However, 2D cameras lose a large amount of information when used with a bi-dimensional approach [
23], and therefore may not be suitable as that lost data may have held useful scene information. The 3D cameras create a virtual environment in which the pedestrian coordinates can be represented in 3D space. Unlike a 2D system, a 3D system uses a stereo camera (i.e., multiple lenses), allowing the camera to capture 3D views based on multiple points of views.
Even though vision-based detection has been extensively researched in recent years, there are still issues and challenges while the detection system is in operations [
25,
106]. These are caused by the appearance of the pedestrians/cyclists due to occlusion, pose, crowded scenes and clothing. Cyclist detection, however, can be more difficult than pedestrian detection as cyclists can have a greater number of possible orientations. To overcome this challenge, combining visible cameras with other sensors could be beneficial.
7.2. Thermal-Based Sensors
Even though visible sensors have difficulty functioning when there is a low level of light (i.e., night-time, bad weather), they are the most commonly used sensors for pedestrian and cyclist detection applications [
106]. To overcome the shortcomings of visible sensors, thermal cameras could be used in conjunction with visible cameras as, unlike visible cameras, thermal cameras are not significantly affected by ambient lighting [
113]. A type of 3D system that could use both visible and thermal information to provide more accurate detection and tracking, known as an RGB-D was proposed in [
114,
115]. RGB data provides textural and appearance information of the object being detected, while the depth data (e.g., thermal data) can provide additional information of the shape of the object [
106]. This approach was implemented in [
106] by fusing an RGB camera with a depth camera, which detects heat signatures using a thermal sensor [
116,
117].
There are two types of thermal sensors that can be used for pedestrian and cyclist detection applications, the Near-IR (infrared) camera and the Far-IR (also known as thermal) camera. Near-IR has wavelength of 0.75–1.3
m and Far-IR cameras have a wavelength of 7.5–13
m. Pedestrians and cyclists would appear more visible to the thermal cameras than the near-IR, as the pedestrian/cyclist body heat radiates in the long-wavelength (approximately 9.3
m), making the thermal camera ideal [
108,
110,
118]. The value of the radiation emitted from a human is not particularly affected by other illuminations in the environment (e.g., street-lamps, artificial lighting). This can improve the accuracy of the detector, as demonstrated in [
106,
119,
120,
121,
122,
123]. In [
123], a commercial visible camera with a resolution 640x480 was used with a thermal camera. Testing was completed during various times of the day and weather conditions (i.e., morning, night, afternoon, rain etc.). In one of the tests, using both visible and thermal cameras, the accuracy of the system was 98.13%. When comparing this result using the visible and thermal cameras separately, the accuracy was 72.11% and 95.91% respectively.
Comparison of the benefits and drawbacks of visible and thermal cameras, RADAR and LiDAR can be found in
Table 3 and
Figure 7. Combining the sensor information can offset some of the inefficiencies of the individual sensors, such as a higher detection accuracy throughout the day, even in crowded scenarios.
7.3. Sensor Fusion
Visible sensors are effective, however, they are less reliable in low-light situations [
86]. The study in [
86,
97], suggests combining visible and thermal cameras together to increase the detection accuracy. It should be noted that thermal sensors are not very effective under high-temperature conditions and clothing can affect the pedestrian’s or cyclist’s thermal footprint [
124].
There has been a large amount of research conducted surrounding the most reliable approach in using both colour information of the visible cameras and the thermal information of the thermal cameras [
86,
97,
125,
126,
127,
128]. These studies discuss the drawbacks of visible sensors due to their dependence on illumination and the benefits of adding thermal data would provide for increased accuracy. The KAIST dataset is largely used for multispectral pedestrian detection evaluation due to its large amount of high-quality images in both the visible and IR spectrum. In [
86], fusion techniques using a CNN as the detector were discussed.
Figure 8 compares evaluation based on the KAIST dataset and typical vision-based datasets (Caltech Diamler, etc.). It demonstrates that, although recent vision-based approaches are efficient during the day, their accuracy decreases at night. Therefore, using multispectral information can aid in reducing this inaccuracy, especially at night-time. For further information of the studies used to generate the graph see [
88,
89,
129,
130,
131] for VS data for daytime, [
106,
125,
128,
132] for VS data for night-time and [
86,
97,
127,
133,
134,
135,
136] for multispectral data.
Figure 8 demonstrates, that although there has been advancements in pedestrian/cyclist detection, improvements are still required to reduce the miss rate and increase the accuracy of detection systems. Multispectral information can be used to achieve higher accuracy, however, further investigation is required.
Vision-based pedestrian detection has been widely researched, which has provided over 60 methods evaluated on the Caltech dataset alone [
25]. Despite sensor fusion techniques having been recently applied for pedestrian detection, there is not much research conducted for cyclist detection using the same multispectral approach. This is an important aspect to be considered as cyclists are among the VRUs that are affected by road traffic accidents.
7.4. DL for Sensor Fusion
With the development of R-CNN, Fast R-CNN and Faster-CNN, CNNs have become a standard technique for detection and classification applications. As CNNs have provided positive results in the field of computer vision, studies have been undertaken for using CNNs with multispectral data for pedestrian detection.
The fusion of the sensor information can be either at pixel-level, early fusion (feature-level), late fusion (decision-level), or halfway fusion [
86] (see
Figure 9). To fuse the data at pixel-level, vision-based images are converted into the HIS (Hue-Intensity-Saturation) colour space. Thermal images are intensity images; therefore, the fusion of the thermal images and the visible images takes place in the intensity (I) component. The images are then reconstructed with the new
I value. Some pixel-level fusion methods includes wavelet-based transform [
137,
138], curvelet transform [
139] and Laplacian Pyramid fusion [
140]. Pixel-level fusion is typically not used with DL-based approach for sensor fusion as the fusion takes place outside of the deep network. Therefore, early fusion, late fusion and halfway fusion are the typical architectures for DL-based sensor fusion. For feature-level (early fusion) visible and thermal images are combined together as a 4-channel input for the deep network (see
Figure 9a). The network would then learn the relationships between the image sources [
125]. In decision-level (late fusion), feature extraction takes place for both image sources into sub-networks (
Figure 9b). These features are then fused before being fed into network layers that classify the object. Halfway fusion involves feeding the colour and thermal data separately into the same network. The data is then fused inside the network (
Figure 9c).
In [
141], decision-level fusion techniques were used to combine the results of visible and thermal images for detection and tracking purposes. Hwang et al. [
106] proposed a detector for pedestrians using a aggregated channel features (ACF) technique based on fused features of visible and thermal images. The benefits of using multispectral detection techniques are demonstrated by [
97]. It was found that combining visible and thermal data produced the best results, however, during low-light conditions (e.g., night), the thermal sensor performed better on its own. Combination of the visible and thermal data actually performed worse at night with an increase of the Average Miss Rate (AMR) by 3%. Overall, data fusion decreases AMR by 5% compared to visible and thermal data used on their own during the daytime. This was unexpected as it was thought that thermal data would not add to the feature detection of visible images. Evaluation of the KAIST dataset produced competitive results (64.17% AMR) compared to state-of-the-art Caltech evaluation protocol (65.75% AMR) for pedestrian detection. It should be noted that using more than a single sensor causes an overall increase in system complexity due to alignment and synchronisation of the cameras [
97].
Fusion architectures were compared in [
125], with halfway fusion proving to be the most effective, with a 3.5% lower miss rate (MR) than the other two architectures. Also, using a single form of sensor information (i.e., visible or thermal data) was shown to be worse than the Halfway fusion model with an increased MR by 11%. The evaluation was completed on the KAIST dataset. In [
128], investigation for the optimal fusion technique for CNN-based pedestrian detection was undertaken. Fusion architectures were tested using a Faster R-CNN. Two types of fusion techniques were examined; feature-level and decision-level. Pixel-level techniques were not considered in this study. In another study [
86], the performance of pixel-level fusion with early and late fusion techniques were considered. In [
86], fusion architectures were implemented with an SSD. The results indicate that the Pixel-level Fusion does not perform well, but for early and late fusion, multispectral information can achieve better performance for pedestrian detection. Based on the KAIST dataset, using the wavelet transform for pixel-level fusion provided a lower Miss Rate (MR) for the early and late fusion of 9% and 5% respectively.
In [
97], assessment of the gain of accuracy fused visible and thermal images was conducted. To achieve this, the results from visible images, thermal images and then a combination of the two was compared. An early fusion technique approach was used for the study. Evaluation and the results were compared using the KAIST [
142] multispectral dataset. The study found that a combination of features of visible and thermal images produces a better detector than with visible images or thermal images alone in the daytime. This result was not what was expected as it was believed that thermal images would not improve the features in the visible images. There is also a slight improvement in the night-time for the combination of the images.
8. Datasets
Due to its real-world application and significance, pedestrian and cyclist detection has been a widely studied problem. The key challenges that associated with this field have been variations in pose, scaling and occlusion. These effects can be seen in major datasets, such as the Caltech dataset, where several pedestrians are affected by occlusion [
25]. Pedestrians and cyclists are traditionally considered separately, which can lead to having the input image scanned multiple times to detect the two objects independently. This not only increases computational costs, but it can further cause detection errors where the pedestrian and cyclists are misclassified due to the similarity in appearance.
A dataset is used to train a deep network for pedestrian and cyclist detection. They can also be used for benchmarking and comparing the accuracy and performance of pedestrian detection techniques, as in
Table 4. For deep network models, large annotated datasets are required to produce an accurate system [
50]. The pedestrians, or any other object in the dataset, needs to have bounding boxes that are annotated. For general object detection, ImageNet has proven to be a sufficient dataset to train a CNN [
44,
143,
144].
Datasets are required for training the network to learn features for classification of pedestrians and cyclists. There exists several datasets for pedestrians, however, cyclists datasets are limited.
8.1. Pedestrian Datasets
Although vision-based approaches cannot collect the same level of information at night-day or low-light conditions, most of the detectors are based on colour images. This is due in part to the fact that many of datasets that are used for benchmarking are of colour images [
106]. Some of the most commonly used datasets for evaluating pedestrian detection techniques are Caltech-USA [
173], KITTI [
104], ETH [
174] , TUD-Brussels [
175]. These datasets are vision-based datasets [
43]. Of these datasets, Caltech and KITTI are the major benchmarks for pedestrian detection as they are large datasets with pedestrians in scenes that pose challenges for detection due to crowded scenes, occlusion, etc. [
25].
The KAIST [
142], a multispectral dataset, combines data collected using visible and thermal cameras. The KAIST dataset aims to improve the training of deep networks for detection, as most of the large-scale (i.e., Caltech, KITTI, etc.) use only RGB-based image. The issue with using only colour images is that it is assumed that the autonomous vehicle would be functioning in a well-lit condition, which is not always the case in practice. Therefore, thermal images can be used when the conditions do not allow visible images to function as designed. As KITTI has been so widely used, [
142] was influenced to provide a similar quality dataset. They also used KITTI as a ground truth and valuation criteria.
8.2. Cyclist Datasets
There is significantly more dataset for pedestrians than there is for cyclists. A public dataset for cyclists was introduced by [
48] known as the Tsinghua-Daimler Cyclist Benchmark. Prior to this, there was no challenging dataset for cyclists. Although there was an object detection benchmark as part of the KITTI dataset, however, it contained less than 2000 cyclist instances. This could be seen as insufficient for training a detector and evaluation (i.e., testing). Cyclists have been often disregarded due to their similarities with pedestrians. However, cyclists can be just as vulnerable as pedestrians, and therefore Tsinghua-Daimler Cyclist dataset was introduced. Based on the Tsinghua-Daimler Cyclist dataset, a new dataset was presented in [
25,
48], which contains both pedestrians and cyclists. They added a pedestrian dataset into the cyclist dataset as no dataset for both pedestrians and cyclists exist at this time. A comparison of the datasets mentioned above can be found in
Table 5. It should be noted that some other pedestrian datasets, such as KAIST, do contain cyclists but not enough to train and evaluate a network.
9. Deep Learning for Intention Estimation
Intent estimation is the latter part of the detection and intent estimation system (
Figure 1). Avoiding incidents involving autonomous vehicles and VRUs is a critical aspect of the fully automated vehicles [
14,
181,
182]. Even with success in pedestrian detection using deep networks [
5], it is not enough to simply identify the pedestrian or cyclist. The autonomous vehicle must also predict if there is a chance of any harm that may befall the pedestrian/cyclist due to the autonomous vehicle action or inaction [
181]. Predicting the motion and future behaviour of VRUs can aid in improving safety for autonomous vehicles and other road users, and, therefore, can be considered as a critical part of self-driving vehicles [
183]. Once detection of the pedestrian/cyclist is completed, it must be considered whether the pedestrian may be in danger given their location with respect to the current distance, motion and path of the autonomous vehicle. Even when detection and localisation of the pedestrian are achieved, pose estimation and tracking must be considered as it can allow the vehicle to take actions or manoeuvres to prevent accidents if the pedestrian’s or cyclist’s intentions are considered to cross the path of the vehicle [
184]. However, this can be difficult, especially for long-term predictions. A major cause for this is the agile nature of a human, who can very quickly change speed and direction, and may not allow the vehicle to have sufficient time to react. This limits the reliability of the prediction system as seen in [
185]. Reduced accuracy in prediction could mean that the vehicle could misinterpret or not react in time to a pedestrian’s or cyclist’s sudden movements. In [
183], minimisation of the false detections is described for improved accuracy for long-term predictions. DL has been used for intention estimation [
186,
187] with promising results as will be discussed in this section. Some of the literature for intent estimation can be found in [
178,
183,
188,
189,
190,
191,
192,
193,
194,
195,
196,
197,
198] which is summarised in
Table 6.
As illustrated in
Table 6 and throughout this section, most research focuses on short-term path prediction based on visible cameras [
185,
189,
203]. Although there has been studies undertaken for using thermal sensors to tracking pedestrians/cyclists (see [
204,
205]), there has not been as much focus on using thermal data for pedestrian/cyclist intent estimation. Therefore it may be useful to further investigate the improvements that may be brought from sensor fusion for VRU intent estimation in the same way as sensor fusion for VRU detection was discussed in
Section 7.
As stated in [
183,
206], it was found that hand-crafted feature descriptors would not be able to provide the level of accuracy required to pedestrian detection. As DL approaches are able to extract features directly from the input data, making them effective in pedestrian/cyclist detection applications, this motivated the implementation of a deep network architecture for the purpose of pedestrian intention estimation in [
183]. The network was coupled with a long-short-term-memory (LSTM) network to improve the accuracy of the system. LSTMs are used for learning in time series [
207]. More on this work can be found in [
183].
Another approach uses Cartesian coordinates with a Bayesian Network [
185] and Gaussian process regression [
208]. The Bayesian Network used multivariate Gaussian distributions for the relative position and velocity with respect to the vehicle for a predicted time. There has also been research undertaken using body features that can aid in predicting a pedestrian’s future behaviour such as walking, standing, bending, jogging and running [
188,
209,
210]. The head orientation of the pedestrian was considered in [
183,
203]. This feature allows for determining the awareness of the pedestrian situation and the approaching vehicle.
Before the advent of deep learning techniques, typically pedestrian trajectories using Kalman Filters or naïve movement models using human gait estimation and analysis of simple heuristics [
211,
212]. However, due to the improbability of proper adaptation and handling to changes in pedestrian movement, these techniques provided poor results in terms of predicting future pedestrian movements [
1,
178].
Other than DL approaches, there have been two typical approaches for predicting the future actions of a pedestrian [
178]. One approach is based on dynamical motion modelling (DMM) [
178,
189,
203]. This approach is able to predict the motion trajectories for different scenarios. However, the model assumes that all trajectories inhibit similar dynamics. This leads the model to have a lower accuracy when predicting long-term motion modelling for intent estimation. Planning-based models (PBMs) [
213,
214], has shown better results for long-term predictions. However, the model requires the final destination information of the VRU, which is a difficult task for a moving vehicle to infer. Although the DMM and PBM approaches have proven to be quite powerful, they are dependent on hand-crafted features (e.g., HOG, Haar-like, DPM, etc.). As discussed earlier, hand-crafted features for detection are limited. The same can be said for hand-crafted features for intent estimation. When the autonomous vehicle processes an unseen or complex situation, it may not be able to react properly as the hand-crafted features create a generalisation of certain situations. That is because feature selections and parameters when designed by an expert, rather than using real-time information collected from sensors. This causes these techniques to be rather restricted and under-perform in previously unseen situations. To solve this problem, [
202] proposed a data-driven (i.e., DL) approach, where the initial motion trajectories [
178] are used for long-term intent estimation, enabling for predicting future positions of a VRU up to 4 seconds ahead. A Recurrent Neural Network (RNN)+Long Short-Term Memory (LSTM) approach was adopted. The RNN+LSTM method is typically used for time series problem.
9.1. DMM
Dynamical motion modelling (DMM) is the general approach for the future location of pedestrians based on motion trajectory [
202,
215]. In [
178], an Extended Kalman filter, a type of Bayesian filter, was used for short intent estimation (<2 s). Further details of this approach and some of the other approaches discussed in this section can be found in
Table 6, a Dynamic Bayesian Network (DBN) was used for intention estimation of a pedestrian that is walking on the curb [
203]. As part of the DBN, a Switching Linear Dynamical System (SLDS) was also implemented to predict the changes in the pedestrian’s motion. In [
199], the pedestrian intention was predicted using pose estimation based on dynamical models and behaviour classification using Balanced Gaussian Process Dynamical Models (B-GPDM) and naïve Bayes classifiers. Another approach uses a dynamic model with a HOG feature descriptor with Linear SVM detector [
32]. It uses an Interacting Multiple Model based on Kalman Filters (IMM-KF) for future predictions of the pedestrian. However, simpler methods such as constant speed velocity models can provide comparable results to IMM-KF, which is a more complex technique [
181]. The results were improved in [
189] by using a Gaussian process dynamic model with probabilistic hierarchical trajectories. This approach uses the silhouette of the pedestrian and attempts to predict its future progress. The approach predicted the pedestrian’s action with respect to the path of the vehicle. These methods require that common trajectories of pedestrians are learned and then these are classified. This means that the technique may not be reliable in previously unseen scenarios. Therefore, as pedestrians and other VRUs are able to quickly change direction and motion, DMMs decrease in reliability for increased prediction lengths [
200].
9.2. PBM
Unlike DMMs, planning-based models (PBMs) for pedestrian’s future movements do not model the intentions of the targets explicitly [
178]. Instead, they assume that the target (i.e., pedestrian) has the intention to reach a particular destination. For example, in [
200], a model was proposed for long-term intention prediction of a pedestrian. In this model, the pedestrian’s goal is to reach a location is predicted based on the estimation of the probability distribution of the future positions of the pedestrian. The model is based on a probabilistic path planning technique. A grid occupancy map is used to estimate the destination of the pedestrian based on the position and orientation of the pedestrian. The model was trained using supervised learning with pedestrian trajectories with matching grid map. Although this technique used a DL approach to learn the future movements of the pedestrian, it does not use a data-driven approach, which again can cause issues when experiencing unforeseen scenarios.
9.3. DL Approach
In [
181], a data-driven approach using a deep network is proposed which uses the skeleton features of the pedestrian to estimate future intention. The evaluation provided results similar to [
189] in terms of classification of whether the pedestrian will cross or stop when approaching the road, but the data-driven approach in [
181] is a simpler method and requires less dense information, i.e., requiring monocular information rather than stereo and dense optical flow as in [
178,
189]. In other approaches, stereo cameras [
191] and LiDAR [
183] are used to predict pedestrian intentions based on their silhouette. Head and body orientation estimation have also been used for intention estimation for pedestrians in [
190,
192,
193,
216]. Fang et al [
181] argue that it is unclear how these estimations provide accurate intention estimation or if they provide significant additional time for reactive manoeuvres. For example, before the collision, pedestrians typically look in the direction of the oncoming vehicle [
196]. In [
181] a pose estimation technique is employed and the orientation of the body and head of the pedestrian is considered as in [
193]. It is suggested in [
192], that head orientation is not particularly useful for pedestrians that are intending to stop or cross the road. Delays may be caused when predicting pedestrian intentions due to insufficient information on the posture and body movements [
194], which also makes the data-driven approach more effective as it will use the information provided from the sensors in real time.
An approach using a vision-based technique to evaluate the pose of a pedestrian over several frames to establish the risk to the pedestrian with respect to the vehicle in [
181]. The approach uses a CNN-based technique to detect and estimate the pose of pedestrian based on the work in [
217]. It is also mentioned in [
181] that high-level features, such as skeleton joints, can provide more information than low-levels, such as HOG and Histogram of Optical Flow (HOF) [
218]. The overall design used a monocular camera for a pedestrian detector and 2D pedestrian pose estimation for determining the intentions of the pedestrian [
219,
220,
221], whereas [
220,
221] used machine learning techniques that can also be applied using DL techniques. This technique is simpler to implement than some other pedestrian intention estimation techniques that require stereo cameras and optical flow to function. For future work, [
181] suggests the application of the technique in situations where there are multiple pedestrians and with occlusion for evaluation. However, to achieve this application, a dataset would need to be produced to sufficiently evaluate the technique. In [
222], a thermal sensor was used for more accurate results in low-light scenarios. The proposed method combined body features (e.g., standing, walking) with head orientation features. This method uses the distance between a pedestrian and curb (DPC), the lateral moving speed (LMS) and head orientation (HO) of the pedestrian. The approach provided the lowest error rate (22.03%) than any other combination of the features. This approach outperformed other prediction methods, such as a Markovian model [
223] and a DBN approach [
203].
RNNs have been used for sequence-based prediction in various applications, such as human gait analysis [
224], handwriting imitation [
225] and human interactions [
226]. The RNN uses a feedback loop to capture temporal information by going into an internal state known as the hidden unit. RNNs allow for data to be fed back to the previous layers [
183]. However, this type of RNN can become inaccurate in long-term predictions. Therefore, the RNN+LSTM architecture [
207], which provides for a memory unit for the RNN in the LSTM units. The stored information values can change depending on the previous outputs and new inputs. These LSTM units consist of a cell state and four gate layers, with each gate consisting of an activation function that takes into account the current state and the previous state as an input. There can be multiple memory layers, depending on the application. The other three gates are the input, output and forget. The input gate selects the input that is sent to the memory, the output is based on the memory state and input and the forget gate selects the information that is to be discarded by the memory. A detailed discussion of the architecture of an RNN+LSTM architecture can be found in [
202]. The proposed method in [
202] was data-driven for long-term pedestrian intention prediction using a stacked LSTM architecture and evaluated on the Daimler pedestrian motion trajectory dataset with promising results for intention estimation.
10. Conclusions
Visible data is the typical type of data that is used for VRU detection and intent estimation. It is argued that visible data is not very robust on its own as its reliability diminishes in low-light conditions. It was suggested by many authors that the fusion of thermal data with visible data could improve the reliability and accuracy of detection, making a more robust overall system. Fusion techniques have provided positive results, but efforts will continue to find the ideal fusion technique. What is also lacking is the dataset for cyclists, both in visible and thermal datasets. On the other hand, detection techniques for pedestrians can be adapted for cyclists. Datasets for pedestrians and cyclists in a multispectral dataset can aid in improving the accuracy and speed of object detection techniques.
Detection is the preliminary phase for the purposes of intent estimation, enabling it to identify the pedestrian/cyclist in the surrounding environment. As detection has been a challenging problem in computer vision, there is a significant amount of literature on the topic. However, this remains a problem that is yet unsolved. DL aims to aid in overcoming this challenge. DL, largely CNN, has provided a more effective method of pedestrian and cyclist detection when compared to the traditional methods that were depended on hand-crafted descriptors for region proposal, feature extraction and classification. The ability to outperform the traditional approaches is partly due to the higher level of abstraction that is achieved by the deep network, which can be imitated through hand-crafted techniques. The techniques that we have mentioned need further attention so that they can operate in real time.
However, intent estimation techniques have not received the same attention. This is due to detection being the initial step to identifying the desired object. Once an object is successfully classified as the desired object, it can be tracked so that pose/orientation estimations can be examined. Using this information, an accurate intent estimation can be achieved. DL techniques are also being employed for intention estimation approaches as traditional models of path prediction and motion modelling are not sufficient. Traditional techniques are not as robust as the data-driven techniques based on RNN+LSTM methods. Data-driven can react to unseen situations, these reactions enable it to more effective and accurate in real time.