1. Introduction
Deep learning (DL) has experienced rapid growth in recent years due to its universal learning approach, robustness, generalization, and scalability [1]. In the area of DL, recurrent neural networks (RNN) and convolutional neural networks (CNN) are the most commonly used, with CNN being the most prevalent in many applications, including image classification, speech recognition, obstacle detection, computer vision, and sensor fusion, among others.
Moreover, autonomous driving cars (ADC) constitute a constantly progressing research field, and fully automated public cars will likely become a reality soon. According to some official predictions, most cars are expected to be fully autonomous by the year 2035 [2]. In this scenario, one of the key safety factors in ADC is the detection of pedestrians under different weather conditions and circumstances. For instance, the proper recognition and detection of pedestrians in the vicinity of the ADC are crucial in order to avoid potential road accidents [1,3]. Reliable pedestrian detection in the surroundings of autonomous vehicles is therefore considered essential for widening the use of this vehicle technology.
Detection using pure stereo vision systems, even in normal weather conditions, is challenging due to intrinsic factors such as intensity and light variations, as well as differences in pedestrian clothing. Nevertheless, much work has been done using stereo vision systems for pedestrian detection. For example, a method using a stereo vision system for extracting, classifying, and tracking pedestrians is presented in [4]. This method is mainly based on four directional features combined with a classifier, which increases robustness against small affine deformations of objects.
In addition, a recent review of object detection for autonomous vehicles (AVs) was carried out in [5]. In this review, the capabilities of different sensors, such as radar, camera, ultrasonic, infrared, and lidar, were analyzed in different weather situations. The fusion of different sensors for object detection in AVs was suggested, with the Kalman filter method being recommended.
The review suggests that, to date, no individual sensor can handle all environmental situations such as rain, fog, snow, and so forth. This calls for the fusion of different sensor modalities to extend the field of view of any individual sensor. Sensor fusion is a broad field whose applications range from military [6] and medical [7] uses to remote sensing [8], self-driving vehicles [9], mobile robots [10], and others. Sensor fusion can be classified according to the input type used in the fusion network, the source of the fused data, the sensor configuration, the fusion architecture, and whether classical or deep-learning-based algorithms are used [11].
The detection of pedestrians using classical methods can be found in the literature. For instance, Shakeri et al. [12] used lidar and/or vision systems to enhance a region of interest (RoI) by fusing an RGB image with a depth stereo vision system for better pedestrian detection. Naoki et al. [13] used a 3D lidar pointcloud to create an information map of the people in motion surrounding the vehicle. Torresan et al. [14] fused thermal images with video to detect people in bounding boxes. Musleh et al. [15] used a stereo camera vision system and a laser scanner, together with a hybrid sensor fusion method, to localize objects. Pedestrians were identified using poly-lines and pattern recognition related to leg positions, dense disparity maps, and UV disparity, and were likewise marked with bounding boxes.
Sensor fusion techniques involving CNNs for pedestrian detection have also been explored in the literature. For example, in [16], a CNN that combined HHA features (horizontal disparity, height above ground, and angle) from a lidar pointcloud with RGB images was used for detecting people. Additionally, a survey comparing state-of-the-art deep learning methods for pedestrian detection was presented in [17]. In this work, CNN methods such as the Region-Based CNN (R-CNN), the Fast R-CNN, the Faster R-CNN, the Single Shot Multi-Box Detector (SSD), and You Only Look Once (YOLO) were evaluated and compared. However, the networks were trained using different datasets and tools, so the comparison may not be very accurate. Furthermore, the Faster R-CNN was used for multi-spectral pedestrian detection in [18], where thermal and color images were fused to provide the complementary information required for daytime and nighttime detection. A method based on the Faster Region-Based Convolutional Neural Network (R-CNN) for pedestrian detection at night using a visible-light camera was proposed by Jong et al. in [19]. In addition, an approach that combined a classical feature descriptor, the Histogram of Oriented Gradients (HOG), with deep learning for object detection showed better performance than a single CNN; instead of extracting features from a large dataset with a CNN, an HOG was used, and objects were again detected with bounding boxes (Gao, 2020).
So far, the methods mentioned previously have relied on placing bounding boxes around detected and classified objects. However, pedestrians need to be detected at the pixel level to better understand the surroundings, especially in the process of building autonomous cars [20]. Semantic segmentation refers to the action of classifying the objects in an image at a pixel level, or pixel-wise. In other words, each pixel is classified individually and assigned to the class that best represents it. This gives a more detailed understanding of the imagery than image classification or object detection, which can be crucial for detecting people in autonomous driving cars and in other fields such as robotics or image search engines [20,21].
Sensor fusion approaches based on semantic segmentation have been reported in [22]. In this approach, a semantic segmentation algorithm was used to effectively fuse 3D pointclouds with images. Another approach [21] used a network consisting of three CNN encoder–decoder sub-networks to fuse RGB images, lidar, and radar for road detection. Additionally, Khaled et al. [23] used two networks, SqueezeSeg and PointSeg, for semantic segmentation, applying feature-level fusion to combine a 3D lidar with an RGB camera for pedestrian detection.
Semantic segmentation is a well-developed technique for image data [23], and this work leverages the recent sensor fusion work carried out in [21] to handle multi-modal fusion. In [21], an asymmetrical CNN architecture was used that consisted of three encoder sub-networks, each assigned to a particular sensor stream: camera, lidar, and radar. The stream data were then fused by a fully connected layer whose outputs were upsampled by a decoder layer that recovered the fused data by performing pixel-wise classification. The upsampling was done in order to recover the lost spatial information and generate a dense pixel-wise segmentation map that accurately captured the fine-grained details of the input image [24]. The stream layers were designed to be as compact as possible, based on the complexity of the incoming data from each sensor. The convolutional neural network presented in [21] was used to detect roads in severe weather conditions. An attempt was therefore made in the present work to apply the same network to pedestrian detection, but without success; hence, the strategy was changed to a similar methodology.
Thus, the main contribution of this paper is to explore the feasibility of fusing RGB images with lidar and radar pointclouds for pedestrian detection using a small dataset. It is widely recognized that larger datasets lead to better training, while small datasets can result in overfitting [25]. Nevertheless, studies [26,27] have shown that data size is not necessarily an obstacle to high-performing models. To address this issue, a novel and practical deep learning CNN architecture for semantic pixel-wise segmentation, based on SegNet [24], is proposed. The network consists of three SegNet sub-networks that downsample the inputs of each sensor, a fully connected (fc) neural network (NN) that fuses the sensor data, and a decoder network that upsamples the data. The proposed method therefore focuses on detecting people at the pixel level. The task of identifying pedestrians is treated as semantic segmentation and involves producing pixel-level classifications based on a dataset that has been labeled at the pixel level; here, there is only one class of interest, namely pedestrians. Moreover, the inclusion of radar in the fusion process gives the advantage of being able to detect pedestrians in severe weather conditions. Additionally, an extrinsic calibration method for radar with lidar and an RGB camera, based on the work done in [28,29], is presented.
The remainder of this article is organized as follows. Section 2 deals with the sensor calibration between lidar and radar, where an extrinsic calibration matrix was found based on the singular value decomposition (SVD). Section 3 shows the deep learning model architecture used in this work. Section 4 presents the results of the calibration method and the results after training the network. Finally, Section 5 summarizes the performance of the calibration method and the network, and conclusions are drawn, together with future work. A GitHub ROS repository is available at [30].
4. Results
The system used to handle the simulations was composed of an L3CAM lidar, a UMRR-96 Type 153 radar, and a GE66 Raider laptop with an Intel® Core™ i9-10980HK CPU and an NVIDIA GeForce RTX 3070 8 GB GPU. The Robot Operating System (ROS1) Noetic on Ubuntu 20.04.5 LTS was used to collect sensor data, compute the extrinsic parameters, and align the sensors. Moreover, the CNN was simulated in a Jupyter Notebook using Python 3 and a conda environment. More specifically, the simulations were divided into two parts: the extrinsic parameters matrix and the network simulations.
4.1. Extrinsic Parameters Matrix
A ROS1 wrapper was developed to collect lidar data, and the ROS1 SmartMicro driver was used to gather radar data. As a result, a dataset of 11 lidar–RGB–radar (LRR) images was acquired, 8 of which were taken indoors and 3 outdoors.
Figure 9 shows the RGB image of the calibration board, and Figure 10 illustrates the lidar pointcloud in white and the sparser radar pointcloud as colored cubes. It is worth mentioning that the radar was used in long-range mode to reduce noise and to better detect the corner reflector, which is shown by the purple cube in Figure 10, placed almost in the middle of the four circles.
Then, Algorithm 1 was applied to each lidar–RGB image to obtain the lidar coordinates of the middle point of the four circles. On the other hand, the ROS1 radar driver gives the coordinates of the corner reflector placed behind this middle point. Figure 11 depicts the position of the lidar point as a blue sphere, while the radar coordinate is shown as a brown cube. The colored pointcloud is the plane parallel to the board.
The two point sets, i.e., the lidar and radar point sets, are shown in Figure 12. It can be seen that the two sets of points were not aligned; therefore, alignment was necessary. Algorithm 2 was applied to the two datasets to find the extrinsic matrix parameters. Afterwards, Equation (9) was applied to the radar dataset to align it and transfer it to the lidar frame. Figure 13 shows the radar set after correction.
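For reference, the SVD-based alignment of two matched 3D point sets is commonly implemented as the Kabsch/Procrustes procedure. The NumPy sketch below follows that standard recipe and may differ in detail from Algorithm 2; the point arrays are placeholders for the matched lidar and radar coordinates.

```python
# Sketch of SVD-based rigid alignment between matched radar and lidar points
# (standard Kabsch/Procrustes procedure); details may differ from Algorithm 2.
import numpy as np

def estimate_extrinsics(radar_pts, lidar_pts):
    """Find R, t such that R @ radar + t best matches lidar (both Nx3, matched rows)."""
    mu_r, mu_l = radar_pts.mean(axis=0), lidar_pts.mean(axis=0)
    H = (radar_pts - mu_r).T @ (lidar_pts - mu_l)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_l - R @ mu_r
    return R, t

def apply_extrinsics(R, t, radar_pts):
    # Transfer radar points into the lidar frame (cf. Equation (9)).
    return radar_pts @ R.T + t

# Mean residual norm before and after correction (cf. Figure 14):
# before = np.linalg.norm(lidar_pts - radar_pts, axis=1).mean()
# after  = np.linalg.norm(lidar_pts - apply_extrinsics(R, t, radar_pts), axis=1).mean()
```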
The norm between the lidar and radar points was calculated before and after correction by Equation (9), as shown in Figure 14. The first eight images were taken inside a laboratory; therefore, their distances fluctuated between 1.5 m and 2.0 m. The last three images were taken outside in a parking lot, where distances fluctuated around 14.0 m. Moreover, it can be seen from the blue line that the norms were reduced. The figure also shows the mean values of the two norms; as expected, the mean norm after correction was smaller than the mean norm before correction.
The transformation matrix was applied to the radar coordinates given by the corner reflector, as can be seen by the blue sphere in Figure 11. Moreover, an outdoor LRR image was taken from a parking lot with people, as illustrated in Figure 15. The colored spheres represent the radar dataset before correction, and the white spheres represent the radar dataset after correction by Equation (9).
4.2. Network
The CNN was trained using a custom dataset taken in a parking lot with people walking in front of the system. The data files were timestamped in the format yyyy-MM-dd:hh:mm:ss:zz, where yyyy denotes the year, MM the month, dd the day, hh the hour, mm the minute, ss the second, and zz the milliseconds. This format was chosen because the lidar, RGB, and radar frequencies are 6 Hz, 10 Hz, and 18 Hz, respectively. For instance, a typical filename is 20221017_131347_782.pcd, where hh = 13, mm = 13, and ss = 47. This means that the same second (47) will be repeated 6 times for the lidar, 10 times for the RGB, and 18 times for the radar, while the value of zz = 782 will be different for each reading, making it easier to synchronize them.
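As an illustration of how such timestamps support synchronization, the following Python sketch pairs each lidar frame with the RGB and radar frames closest in time. The filename parsing and the helper names (stamp, nearest, synchronize) are hypothetical and do not describe the actual synchronization tool used in this work.

```python
# Hypothetical sketch: pair each lidar frame with the nearest RGB and radar frames
# in time, based on filenames of the form yyyyMMdd_hhmmss_zzz (e.g. 20221017_131347_782.pcd).
from datetime import datetime
from pathlib import Path

def stamp(path):
    # "20221017_131347_782" -> seconds since epoch (the zz field treated as milliseconds)
    base, ms = Path(path).stem.rsplit("_", 1)
    return datetime.strptime(base, "%Y%m%d_%H%M%S").timestamp() + int(ms) / 1000.0

def nearest(target_t, files):
    # File whose timestamp is closest to the target time.
    return min(files, key=lambda f: abs(stamp(f) - target_t))

def synchronize(lidar_files, rgb_files, radar_files):
    # The lidar is the slowest stream (6 Hz), so it drives the pairing.
    triples = []
    for lf in sorted(lidar_files, key=stamp):
        t = stamp(lf)
        triples.append((lf, nearest(t, rgb_files), nearest(t, radar_files)))
    return triples
```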
Afterward, the dataset was synchronized, producing a total of 224 LRR images. Then, 30 LRR images were selected and expanded to 60 by flipping them and adding the flipped copies to the original set. The program 'labelme' was used to label the RGB images with a single 'person' class. In other words, 'labelme' can create polygons around the persons and save them as a JSON file, which can be exported as a PNG image with a single person class. The lidar images were saved in PCD format, extracted, and converted to 2D grayscale images with a 16-bit depth; they were interpolated to improve their quality. The radar images were also saved in PCD format, extracted, and converted to 2D grayscale images with a 16-bit depth, with the location of each point projected vertically.
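The conversion from a pointcloud to a 2D 16-bit grayscale image could, for instance, look like the sketch below. The projection axes, grid size, and range scaling are assumptions, and the conversion actually used in this work may differ; in particular, the interpolation step applied to the lidar images is omitted here.

```python
# Illustrative sketch: orthographic projection of an Nx3 pointcloud into a 2D
# 16-bit grayscale image, with pixel intensity encoding range. Grid size, axes,
# and scaling are assumptions; the actual conversion in the paper may differ.
import numpy as np

def pointcloud_to_depth_image(points, size=(256, 256), max_range=30.0):
    """points: Nx3 array (x lateral, y vertical, z forward) in metres."""
    img = np.zeros(size, dtype=np.uint16)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (z > 0) & (z < max_range)
    # Map lateral/vertical coordinates to pixel indices.
    u = np.clip(((x[keep] / max_range + 0.5) * (size[1] - 1)).astype(int), 0, size[1] - 1)
    v = np.clip(((0.5 - y[keep] / max_range) * (size[0] - 1)).astype(int), 0, size[0] - 1)
    # Encode range as a 16-bit intensity.
    depth = (z[keep] / max_range * np.iinfo(np.uint16).max).astype(np.uint16)
    img[v, u] = depth
    return img
```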
The training, evaluation, and testing modes were executed on a Jupyter Notebook with PyTorch version 1.11 and the GPU activated. The LRR images were downsampled to a size of 256 × 256. Moreover, due to the 8 GB capacity of the GPU, the batch size was set to one. The network was then trained for 300 epochs. Additionally, 10 LRR images were used for validation and 10 LRR images for testing, giving a total dataset of 80 LRR images.
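For concreteness, a stripped-down version of such a training and validation procedure (batch size of one, pixel-wise cross-entropy loss, best checkpoint kept according to the validation loss) is sketched below; the model, dataloader objects, and optimizer choice are placeholders rather than the exact setup used here.

```python
# Skeleton of the training/validation procedure described in the text; `model`,
# `train_loader`, and `val_loader` are placeholders for the actual objects.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=300, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
    criterion = nn.CrossEntropyLoss()
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for rgb, lidar, radar, target in train_loader:          # batch size of 1
            optimizer.zero_grad()
            logits = model(rgb.to(device), lidar.to(device), radar.to(device))
            loss = criterion(logits, target.to(device))          # pixel-wise cross-entropy
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(r.to(device), l.to(device), d.to(device)),
                                     t.to(device)).item()
                           for r, l, d, t in val_loader) / len(val_loader)
        if val_loss < best_val:                                   # keep the best checkpoint
            best_val = val_loss
            torch.save(model.state_dict(), "best_model.pt")
```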
The results of the training mode are shown in Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20. Figure 16 shows the RGB image applied to the EN, whereas Figure 17 illustrates the ground truth image. Moreover, the lidar and radar images are shown in Figure 18 and Figure 19. Finally, Figure 20 depicts the resulting image fusing the lidar, radar, and RGB. Furthermore, the time elapsed during the training of the network was 7019.40 s, while the mean inference time of the model in testing mode was 0.010 s.
The target and prediction during training were stored in a vector. Later on, the intersection over union (IoU) was evaluated for each frame to determine the percentage of overlap between the prediction and the target. To evaluate the results of the network in training mode, the pixel accuracy and intersection over union (IoU) metrics for image segmentation were used, and the cross-entropy loss function was applied.
The IoU refers to the ratio of the overlap area in pixels to the union of the target mask and the prediction mask in pixels and is represented by Equation (10). The pixel accuracy refers to the ratio of the correctly identified positives and negatives to the size of the image and is represented by Equation (11). Both Equations (10) and (11) are defined in terms of pixel counts, where the true positive (TP) is the area of intersection between the ground truth (GT) and the segmentation mask (S), the false positive (FP) is the predicted area outside the ground truth, and the false negative (FN) is the number of pixels in the ground truth area that the model failed to predict. These parameters can be obtained using Equation (12): if an element of the confusion vector is equal to 1, there is an overlap, and the TP is computed by adding all the ones; the same procedure applies to the FP, TN, and FN.
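A compact way to obtain these counts and the two metrics from binary masks, consistent with the definitions above, is sketched here; the mask format (boolean arrays with True marking pedestrian pixels) is an assumption.

```python
# Sketch of the IoU / pixel-accuracy computation from binary masks, following the
# definitions of TP, FP, FN, and TN given above.
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: boolean arrays of the same shape (True = pedestrian pixel)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    iou = tp / max(tp + fp + fn, 1)                      # cf. Equation (10)
    pixel_accuracy = (tp + tn) / (tp + tn + fp + fn)     # cf. Equation (11)
    return iou, pixel_accuracy
```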
Figure 21 shows the IoU versus the pixel accuracy for the 60 LRR fused images. The average loss over the 300 training epochs is shown in Figure 22.
The results of the evaluation mode are shown in Figure 23, where the blue line indicates the average loss of the training mode and the orange line indicates the average evaluation loss. It can be seen that the model started overfitting around epoch 50, with a lowest loss value of 1.411; at this point, the model was saved for use in the testing mode. In addition, the IoU and the pixel accuracy for the testing mode are shown in Figure 24, where the mean IoU was 94.4% and the pixel accuracy was 96.2%.
The results of the testing mode are displayed in Figure 25, where the ground truth and the model's output are overlaid. The model's detection of pedestrians is shown in red, which represents the overlapping area with the ground truth. The white spots near and between the pedestrians are part of the model's output but do not overlap with the ground truth. Light blue depicts weakly detected pedestrians, which was a result of the model's overfitting and the limited size of the training data. This can be confirmed by the loss of 1.413 between the output and the ground truth.
5. Conclusions
In conclusion, this article explored, as a preliminary or pilot step, the feasibility of fusing lidar and radar pointclouds with RGB images for pedestrian detection by using a sensor fusion, pixel-wise semantic segmentation SegNet-based CNN. The proposed method has the advantage of detecting pedestrians at the pixel level, which is not possible with conventional methods that use bounding boxes.
Experimental results using a custom dataset showed that the proposed architecture achieved high accuracy and performance, making it suitable for systems with limited computational resources. However, since this is a preliminary study, the results must be interpreted with caution.
An extrinsic calibration method based on SVD for lidar and radar in a multimodal sensor fusion architecture that includes an RGB camera was also proposed and was demonstrated to reduce the mean norm between the lidar and radar points by 38.354%.
For future work, we plan to investigate the behavior of the proposed architecture with a larger dataset and to test the model under different weather conditions. Additionally, the architecture will be evaluated in a real scenario by mounting the sensors on a driving car. Overall, this preliminary study provides a promising starting point for further research and development in the field of sensor fusion for pedestrian detection.