1. Introduction
The way a society maintains its infrastructure affects the productivity of both local communities and the nation as a whole. Infrastructure such as roads has long been a major factor in, and a reflection of, the overall strength of a nation’s economy. However, maintaining road quality is challenging, and poor maintenance can stagnate productivity and even cause on-road accidents. A 2004 AAA report on North America [1] showed that over 25,000 crashes per year were caused by vehicle parts, cargo, or other materials unintentionally discharged from vehicles onto the roadway, resulting in approximately 80–90 fatalities; a study by Tefft covering 2011–2014 [2] reported over 200,000 police-reported crashes due to road debris, resulting in 39,000 injuries and 500 deaths. Detecting roads that need repair or cleaning is therefore vital to preventing accidents. However, it is difficult for a driver to spot small hazardous objects and road damage in time [3].
We propose a road hazard detection system that detects hazards on the road and allows officials to respond to them quickly. We believe this will considerably decrease the rate of accidents caused by debris or potholes. The expectation is that the method will provide officials with accurate detections of hazards that can cause accidents or damage.
There have been many attempts by researchers to solve the road hazard problem. Detection of potholes and cracks on the road has been explored, and algorithms with good performance have been developed [4,5,6,7]. Detection of objects on the road, such as planks or tires, has also been explored with excellent performance; however, detection may not always be real-time [8,9,10,11]. Different sensor types have been applied to the problem, such as passive cameras, active radar, or laser sensors. While active range sensors provide high accuracy in terms of pointwise distance and velocity measurement, they typically suffer from low resolution and high cost. In contrast, cameras provide very high spatial resolution at relatively low cost. However, detecting small obstacles is a challenging problem from a computer vision perspective, since the objects in question cover tiny image areas and come in all possible shapes and appearances. Another popular method for object classification and detection is semantic segmentation, which classifies objects at the pixel level; it is often used as a masking tool, since it can effectively separate an object from other objects and from the background. A study by Pinheiro and Collobert [12] explored segmenting images with CNNs; the authors combined the well-known OverFeat network developed by Sermanet et al. [13] with a simple CNN to train the segmentation model, achieving up to 56.25% on the Visual Object Classes (VOC) 2008 dataset, 57.01% on VOC 2009, and 56.12% on VOC 2010. Chen et al. [14] developed a more sophisticated state-of-the-art CNN architecture and achieved impressive accuracies of 71.57% and 71.40% on the VOC 2012 and Cityscapes [15] datasets, respectively.
The study of Jo and Ryu [4] described a pothole maintenance system to detect and repair potholes and a method to detect potholes on resource-constrained devices (i.e., embedded systems). The proposed system attaches a black-box camera to cars to detect potholes. When the camera detects a pothole, the pothole’s location and a description are sent to a database that transportation officials can access in order to respond. This system would be beneficial due to its reduced cost compared to laser-based methods and its faster response time. The pothole detection method consists of three steps: pre-processing, candidate feature selection, and cascade detection. The algorithm achieved good accuracy and was robust against false positives. One issue was that the embedded system’s hardware caused problems with lane detection, but this had no noticeable effect on predictive performance in testing. A study by Eisenbach et al. [5] proposed deep convolutional neural networks that performed very well but are not fit for real-time image capture on vehicles traveling at medium speed. Their tests comparing traditional deep convolutional neural networks with the shallow convolutional neural network described by Zhang et al. [6] showed that performance does not depend strictly on network size: the shallow network’s performance improved when the image block size was increased from 64 × 64 to 99 × 99, and the network is small enough to run on embedded systems in real time. Pauly et al. [7] tested two convolutional neural networks, one with four layers and one with five, on two datasets of crack images taken at different locations. The results show little difference in performance between the networks when the same dataset is sampled for both training and testing, but the five-layer network outperformed the four-layer network when one dataset was used for training and the other for testing. Deeper neural networks are therefore better for generalized detection, but this may incur greater training and inference time, since the time to process an image increases with the number of layers.
In this paper, we investigate different object detection and semantic segmentation algorithms to determine which one best detects road damage while running independently, at “real-time” speed, on an embedded system. We describe the methods and the properties of the data used in this paper, evaluate the performance of the detectors in our experiments, and present the trade-off between YOLO and the Single Shot MultiBox Detector (SSD) in accuracy and speed. Finally, we present an argument that supports the results obtained from tests of the two most recent YOLO versions [16,17], semantic segmentation [18], and SSD [19].
4. Discussion and Conclusions
Overall, the detector performs above average considering the small dataset on which it was trained. We consider these results acceptable for the size of our dataset: conventional detectors are trained on the Pascal VOC 2007 + 2012 datasets, which total 9963 + 11,530 (21,493) images containing 24,640 + 27,450 (52,090) labels divided among 20 object classes.
Table 6 below shows the mAP of selected detectors on this dataset. However, the true lack of power in these detectors shows when they are tested on a much larger dataset such as MS COCO, which contains more than 200,000 images and over 500,000 object instances among 80 object classes; Table 7 shows that the detectors struggle to push their mAP above 50%. Comparing YOLOv2 against YOLOv3, we observed a significant increase in detection accuracy, although it came with a trade-off in detection speed: YOLOv3 runs at almost one third of the speed YOLOv2 achieved. We therefore still need to weigh both ends of the spectrum, since detection must happen in real time, but higher accuracy means fewer false negative detections and thus a better system. Image segmentation can provide pixel-perfect hazard predictions, which could be used to classify the hazards further, but it comes at a high computational cost that results in a less-than-optimal “real-time” system. The SSD detector was not able to perform as well as the previous detectors, but it was the only one that achieved “real-time” processing on our embedded system, the Google Coral. This model’s inference time is about 30 ms per image, which allows our model to process up to 30 FPS; a car using our system could drive at 30 mph and detect hazards roughly every 16 inches. Using this system, a city could deploy several Google Corals on vehicles that regularly travel the streets. The low accuracy shown in the results can be further improved by adding more data to the dataset, which can be achieved easily once the system is deployed and begins detecting other hazards on the road.
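The throughput figures above follow from simple arithmetic. A quick sanity check (our own illustration; it assumes exactly 30 ms per frame and a constant 30 mph, so the figures round slightly differently depending on the assumptions used):

```python
# Sanity check of the embedded-system throughput figures:
# ~30 ms inference per image on the Google Coral, car traveling at 30 mph.
INFERENCE_MS = 30.0
SPEED_MPH = 30.0

fps = 1000.0 / INFERENCE_MS                           # frames per second
inches_per_second = SPEED_MPH * 5280 * 12 / 3600.0    # 30 mph in inches/s
inches_per_frame = inches_per_second * INFERENCE_MS / 1000.0  # road covered per frame

print(f"{fps:.1f} FPS, {inches_per_frame:.1f} inches between frames")
# → 33.3 FPS, 15.8 inches between frames
```

In other words, 30 ms per image implies roughly 33 frames per second, and at 30 mph (528 inches per second) the vehicle covers just under 16 inches of road between consecutive frames.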
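For reference, the mAP values discussed above are per-class average precisions averaged over all classes. A minimal sketch of the Pascal VOC 2007-style 11-point interpolated AP (our own illustration, not the paper’s evaluation code; it assumes per-class precision/recall points have already been computed from ranked detections):

```python
def voc11_ap(recalls, precisions):
    """11-point interpolated AP: average the maximum precision attained
    at recall >= t for t = 0.0, 0.1, ..., 1.0 (Pascal VOC 2007 protocol)."""
    ap = 0.0
    for t in (i / 10 for i in range(11)):
        attained = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(attained, default=0.0)
    return ap / 11.0

def mean_ap(per_class_pr):
    """mAP is simply the mean of per-class APs."""
    aps = [voc11_ap(r, p) for r, p in per_class_pr]
    return sum(aps) / len(aps)
```

For example, a class whose detector reaches only 50% recall at perfect precision scores 6/11 ≈ 0.545, since precision is counted as zero at the unattained recall thresholds 0.6 through 1.0.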