**1. Introduction**

To improve road traffic safety, autonomous vehicles have become a mainstream direction of future traffic development worldwide. Target recognition is one of the fundamental tasks for ensuring the safe driving of autonomous vehicles, and it relies on various sensors. In recent years, the most popular sensors have been LiDAR and color cameras, owing to their excellent performance in obstacle detection and modeling.

Color cameras can capture images of real-time traffic scenes, and target detection is then used to locate the targets. Compared with traditional target detection methods, deep learning-based detection methods can provide more accurate information and have therefore gradually become a research trend. In deep learning, convolutional neural networks combine artificial neural networks and convolutional algorithms to identify a variety of targets, and they are robust to a certain degree of distortion and deformation [1]. You Only Look Once (YOLO) is a real-time target detection model based on a convolutional neural network. Owing to its ability to learn from massive data, its capability to extract point-to-point features and its good real-time recognition performance [2], YOLO has become a benchmark in the field of target detection. Gao et al. [3] clustered the selected initial candidate boxes, reorganized the feature maps, and expanded the number of horizontal candidate boxes to construct the YOLO-based pedestrian (YOLO-P) detector, which reduced the miss rate for pedestrians. However, the YOLO model was limited to static image detection, which greatly limits the detection of dynamic changes in pedestrians. Thus, based on the original YOLO, Yang et al. [4] merged it with the detection algorithms DPM (Deformable Part Model) and R-FCN (Region-based Fully Convolutional Network), designed an extraction algorithm that could reduce the loss of feature information, and then used this algorithm to identify situations involving privacy in the smart home environment. However, this algorithm divides the recognition image into a 14 × 14 grid. Although dim objects can be extracted, the workload does not meet the real-time requirement. Nguyen et al. [5] extracted the information features of grayscale images and used them as the input layer of the YOLO model. However, extracting information with the alternating direction multiplier method to form the input layer takes much more time, which greatly limits its application.

LiDAR can obtain three-dimensional information of the driving environment, which gives it unique advantages in detecting and tracking obstacles, measuring speed, and navigating and positioning the vehicle. Dynamic obstacle detection and tracking is a research hotspot in the field of LiDAR, and many scholars have studied it. Azim et al. [6] proposed the ratio characteristics method to distinguish moving obstacles. However, it only uses numerical values to judge the type of object, which might result in a high miss rate when the regional point cloud data are sparse or the detection region is blocked. Zhou et al. [7] used a distance-based vehicle clustering algorithm to identify vehicles based on multi-feature information fusion after confirming the feature information, and used a deterministic method to perform the target correlation. However, the multi-feature information fusion is cumbersome, the rules are not clear, and the correlation method cannot handle the appearance and disappearance of targets. Asvadi et al. [8] proposed a 3D voxel-based representation method, and used a discriminative analysis method to model obstacles. This method is relatively novel, and can be used to merge the color information from images in the future to provide more robust static/moving obstacle detection.

All of the above studies use a single sensor for target detection. The image information of a color camera is affected by the ambient light, and LiDAR cannot give full play to its advantages in foggy and hot weather. Thus, the performance and recognition accuracy of a single sensor are low in the complex urban traffic environment and cannot meet the security needs of autonomous vehicles.

To adapt to the complexity and variability of the traffic environment, some studies use a color camera and LiDAR to detect the target simultaneously on the autonomous vehicle, and then provide sufficient environmental information for the vehicle through a fusion method. Asvadi et al. [9] used a convolutional neural network method to extract the obstacle information based on three detectors designed by combining the dense depth map and dense reflection map output from the 3D LiDAR and the color images output from the camera. Xue et al. [10] proposed a vision-centered multi-sensor fusion framework for autonomous driving in traffic environment perception and integrated the sensor information of LiDAR to achieve efficient autonomous positioning and obstacle perception through geometric and semantic constraints, but the process and algorithm of multiple sensor fusion are too complex to meet real-time requirements. In addition, references [9,10] did not consider the existence of dimmer targets such as pedestrians and non-motor vehicles.

Based on the above analysis, this paper presents a multi-sensor (color camera and LiDAR) and multi-modality (color image and LiDAR depth image) real-time target detection system. Firstly, color image and depth image of the obstacle are obtained using color camera and LiDAR, respectively, and are input into the improved YOLO detection model frame. Then, after the convolution and pooling processing, the detection bounding box for each mode is output. Finally, the two types of detection bounding boxes are fused on the decision-level to obtain the accurate detection target.

In particular, the contributions of this article are as follows:


#### **2. System Method Overview**

#### *2.1. LiDAR and Color Camera*

The sensors used in this paper include color camera and LiDAR, as shown in Figure 1.

**Figure 1.** Installation layout of two sensors.

The LiDAR is a Velodyne (Velodyne LiDAR, San Jose, CA, USA) 64-line three-dimensional radar system, which sends a detection signal (laser beam) to a target and then compares the received signal reflected from the target (the echo of the target) with the transmitted signal. After proper processing, the relevant information of the target can be obtained. The LiDAR is installed at the top center of the vehicle and detects environmental information through high-speed rotation scanning [11]. The LiDAR head emits 64 laser beams, which are divided into four groups of 16 laser emitters each [12]. The head rotation angle is 360° and the detectable distance is 120 m [13]. The 64-line LiDAR has 64 fixed laser transmitters. Through a fixed pitch angle, it can get the surrounding environmental information for each Δ*t* and output a series of three-dimensional coordinate points. Then, the 64 points (*p*1, *p*2, ..., *p*64) acquired by the transmitters are marked, and the distance from each point in the scene to the LiDAR is used as the pixel value to obtain a depth image. The color camera is installed under the top LiDAR. The position of the camera is adjusted according to the axis of the transverse and longitudinal center of the camera image and the transverse and longitudinal orthogonal plane formed with the laser projector, so that the camcorder angle and the yaw angle are approximately 0, and the pitch angle is approximately 0. Color images can be obtained directly from the color camera, but the images output from the LiDAR and the camera must be matched in time and space to synchronize the information of the two.
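As a rough illustration of how per-point distances can become depth-image pixels, the following sketch writes the range of each LiDAR point into a 2D grid. The layout (one row per laser ring, one column per azimuth bin), the function name `build_depth_image` and the default image width are assumptions made for illustration; the text above does not specify them.

```python
import math
from typing import Iterable, List, Tuple

# Each point is (ring_index, x, y, z); the ring index (0..63) identifying the
# emitting laser is assumed to be available from the sensor driver.
LidarPoint = Tuple[int, float, float, float]


def build_depth_image(points: Iterable[LidarPoint],
                      num_rings: int = 64,
                      num_columns: int = 360) -> List[List[float]]:
    """Return a num_rings x num_columns grid whose pixels hold point ranges."""
    image = [[0.0] * num_columns for _ in range(num_rings)]
    for ring, x, y, z in points:
        # Distance from the point to the LiDAR origin is used as the pixel value.
        distance = math.sqrt(x * x + y * y + z * z)
        # Horizontal angle of the point, mapped to [0, 360) degrees.
        azimuth = math.degrees(math.atan2(y, x)) % 360.0
        column = int(azimuth / 360.0 * num_columns) % num_columns
        image[ring][column] = distance
    return image


depth = build_depth_image([(0, 3.0, 4.0, 0.0), (1, -1.0, 0.0, 0.0)])
```

Pixels that receive no return stay at 0.0 in this sketch; a real pipeline would also need a policy for several points falling into the same pixel.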

#### *2.2. Image Calibration and Synchronization*

To integrate information in the vehicle environment perceptual system, information calibration and synchronization need to be completed.

#### 2.2.1. Information Calibration

(1) The installation calibration of LiDAR: The midpoints of the front bumper and windshield can be measured with a tape measure, and, according to these two midpoints, the straight line of the central axis of the test vehicle can be marked by the laser thrower. Then, on the central axis, a straight line perpendicular to the central axis is marked at a distance of 10 m from the rear axle of the test vehicle; the longitudinal axis of the radar center can be measured by a ruler and corrected with the longitudinal beam of the laser thrower, which is perpendicular to the ground, so that the longitudinal axis and the beam coincide and the lateral shift of the radar is approximately 0 m. The horizontal beam of the laser thrower is made to coincide with the transverse axis of the radar, so that the lateral shift of the radar is approximately 0 m.

(2) The installation calibration of camera: The position of the camera is adjusted according to the axis of the transverse and longitudinal center of the camera image and the transverse and longitudinal orthogonal plane formed with the laser projector, so that the camcorder angle and the yaw angle are approximated to 0. Then, the plumb line is used to adjust the pitch angle of the camera to approximately 0.

#### 2.2.2. Information Synchronization

(1) Space matching

Space matching requires the space alignment of vehicle sensors. Assuming that the Velodyne coordinate system is *Ov* − *XvYvZv* and the color camera coordinate system is *Op* − *XpYpZp*, the coordinate system is in translational relationship with respect to the Velodyne coordinate system. The fixing angle between the sensors is adjusted to unify the camera coordinates to the Velodyne coordinate system. Assuming that the vertical height of the LiDAR and color camera is Δ*h*, the conversion relationship of a point "M" in space is as follows:

$$
\begin{bmatrix} X_V^m \\ Y_V^m \\ Z_V^m \end{bmatrix} = \begin{bmatrix} X_P^m \\ Y_P^m \\ Z_P^m \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \Delta h \end{bmatrix} \tag{1}
$$
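Equation (1) is a pure translation along the vertical axis, which can be sketched in a few lines. The function names and the sample value of Δ*h* below are hypothetical and chosen only for illustration.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def camera_to_velodyne(point_cam: Point3D, delta_h: float) -> Point3D:
    """Apply Equation (1): add the vertical offset delta_h to the Z coordinate."""
    x_p, y_p, z_p = point_cam
    return (x_p, y_p, z_p + delta_h)


def velodyne_to_camera(point_velo: Point3D, delta_h: float) -> Point3D:
    """Inverse of Equation (1): subtract the vertical offset again."""
    x_v, y_v, z_v = point_velo
    return (x_v, y_v, z_v - delta_h)


# Hypothetical offset of 0.5 m between the two sensors.
m_velo = camera_to_velodyne((1.0, 2.0, 3.0), delta_h=0.5)
m_back = velodyne_to_camera(m_velo, delta_h=0.5)
```

Because the calibration in Section 2.2.1 drives the relative angles to approximately 0, no rotation matrix appears in this sketch; a setup with residual rotation would need one.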

(2) Time matching

Time matching is achieved by creating a separate data collection thread for the LiDAR and for the camera. Setting the same acquisition frame rate of 30 fps for both threads matches the data in time.

#### *2.3. The Process of Target Detection*

The target detection process based on sensor fusion is shown in Figure 2. After collecting information from the traffic scene, the LiDAR and the color camera output the depth image and the color image, respectively, and input them into the improved YOLO algorithm (the algorithm has been trained by many images collected by LiDAR and color camera) to construct target detection Models 1 and 2. Then, the decision-level fusion is performed to obtain the final target recognition model, which realizes the multi-sensor information fusion.

**Figure 2.** The flow chart of multi-modal target detection.

#### **3. Obstacle Detection Method**

#### *3.1. The Original YOLO Algorithm*

You Only Look Once (YOLO) uses a single convolutional neural network to predict the bounding boxes and the target categories from full images; it divides the input image into *S* × *S* cells and predicts multiple bounding boxes with their class probabilities for each cell. The architecture of YOLO is composed of an input layer, convolution layers, pooling layers, fully connected layers and an output layer. The convolution layers extract the image features, the fully connected layers predict the position of the target in the image and the estimated probability values of the target categories, and the pooling layers are responsible for reducing the pixels of the slice.

The YOLO network architecture is shown in Figure 3 [14].

**Figure 3.** The YOLO network architecture. The detection network has 24 convolutional layers followed by two fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pre-train the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input images) and then double the resolution for detection.

Assume that *B* is the number of sliding windows used for each cell to predict objects and *C* is the total number of categories; then the dimension of the output layer is *S* × *S* × (*B* × 5 + *C*).
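The output dimension can be checked with a short calculation. In the sketch below, *S* = 7 and *C* = 6 follow the settings described later in this paper, while *B* = 2 is an assumption borrowed from the original YOLO configuration and is not stated in the text.

```python
from typing import Tuple


def yolo_output_shape(s: int, b: int, c: int) -> Tuple[int, int, int]:
    """Shape of the YOLO output layer: S x S x (B * 5 + C)."""
    # Each of the B windows contributes 5 values (x, y, w, h, confidence).
    return (s, s, b * 5 + c)


def yolo_output_size(s: int, b: int, c: int) -> int:
    """Total number of values in the output layer."""
    rows, cols, depth = yolo_output_shape(s, b, c)
    return rows * cols * depth


shape = yolo_output_shape(s=7, b=2, c=6)   # B = 2 is an assumed value
size = yolo_output_size(s=7, b=2, c=6)
```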

The output model of each detected border is as follows:

$$T = (x, y, w, h, c) \tag{2}$$

where (*x*, *y*) represents the center coordinates of the bounding box and (*w*, *h*) represents the width and height of the detection bounding box. These four values are normalized with respect to the width and height of the image. *c* is the confidence score, which reflects both the probability that the current window contains a detection object and the accuracy of the detection, and the formula is as follows:

$$
c = P_o \times P_{\text{IOU}} \tag{3}
$$

where *Po* indicates the probability of including the detection object in the sliding window, *P*IOU indicates the overlapping area ratio of the sliding window and the real detected object.

$$P_{\text{IOU}} = \frac{\text{Area}\left(BB_i \cap BB_g\right)}{\text{Area}\left(BB_i \cup BB_g\right)}\tag{4}$$

In the formula, *BBi* is the detection bounding box, and *BBg* is the reference standard box based on the training label.
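Equations (3) and (4) can be illustrated with axis-aligned boxes. The corner format `(x_min, y_min, x_max, y_max)` and the function names below are assumptions made for this sketch; the paper itself parameterizes boxes by center, width and height.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(bb_i: Box, bb_g: Box) -> float:
    """Equation (4): intersection area divided by union area."""
    inter_w = max(0.0, min(bb_i[2], bb_g[2]) - max(bb_i[0], bb_g[0]))
    inter_h = max(0.0, min(bb_i[3], bb_g[3]) - max(bb_i[1], bb_g[1]))
    inter = inter_w * inter_h
    union = box_area(bb_i) + box_area(bb_g) - inter
    return inter / union if union > 0.0 else 0.0


def confidence_score(p_object: float, p_iou: float) -> float:
    """Equation (3): c = P_o * P_IOU."""
    return p_object * p_iou


overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
score = confidence_score(0.9, overlap)
```

In this example the two 2 × 2 boxes share a 1 × 1 region, so the intersection is 1 and the union is 4 + 4 − 1 = 7.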

For the regression method in the YOLO, the loss function can be calculated as follows:

$$\begin{split} F(\text{loss}) &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\ &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2 \end{split} \tag{5}$$

$\mathbf{1}_{i}^{\text{obj}}$ denotes that grid cell *i* contains part of a traffic object. $\mathbf{1}_{ij}^{\text{obj}}$ refers to the *j*-th bounding box in grid cell *i*. Conversely, $\mathbf{1}_{ij}^{\text{noobj}}$ refers to the *j*-th bounding box in grid cell *i* that does not contain any part of a traffic object. The time complexity of Formula (5) is *O*((*k* + *c*) × *S*<sup>2</sup>), which is calculated for one image.

#### *3.2. The Improved YOLO Algorithm*

In the application process of the original YOLO algorithm, the following issues are found:


Based on the above deficiencies, this paper improves the original YOLO algorithm as follows: (1) To eliminate the problem of redundant time caused by the identification of undesired targets, and according to the size and driving characteristics of common targets in traffic scenes, the total number of categories is set to six types, including {bus, car, truck, non-motor vehicle, pedestrian and others}. (2) For the issue of non-motor vehicle and pedestrian detection, this paper proposes a secondary image detection scheme. Then, the cell division of the image is kept as 7 × 7, the sliding window convolution kernel is set as 3 × 3.

The whole target detection process of the improved YOLO algorithm is shown in Figure 4, and the steps are as follows:

(3a) When the distance *l* between the target marked as {others} and the autonomous vehicle does not exceed the safety distance *l*0 (the distance beyond which a target does not affect decision making and can be ignored), i.e., *l* ≤ *l*0, the slider region classified as {others} is marked, and the region is subdivided into 9 × 9 cells. The secondary convolution operation is then performed. When the confidence score *c* of the secondary detection is higher than the threshold *τ*1, the border model of {others} is output, and the category is changed from {others} to {non-motor vehicle} or {pedestrian}. When the confidence score *c* of the secondary detection is lower than the threshold *τ*1, it is determined that the target does not belong to the classification item, and the target is eliminated.

(3b) When *l* > *l*0, this target is kept as {others}. It does not require a secondary convolution operation.
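The decision logic of steps (3a) and (3b) can be sketched as follows. The function name `resolve_label`, the callable `secondary_detect` standing in for the 9 × 9 secondary convolution, and the treatment of a score exactly equal to *τ*1 (handled here as elimination) are assumptions for illustration; the text only covers scores strictly above or below the threshold.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the secondary detection on the 9 x 9 subdivided
# region: it returns the refined label and its confidence score.
SecondaryDetector = Callable[[], Tuple[str, float]]


def resolve_label(label: str,
                  distance: float,
                  safety_distance: float,
                  secondary_detect: SecondaryDetector,
                  tau_1: float) -> Optional[str]:
    """Return the final label, or None when the target is eliminated."""
    if label != "others":
        return label                      # bus, car, truck, ... are kept as detected
    if distance > safety_distance:
        return "others"                   # step (3b): no secondary detection needed
    refined_label, score = secondary_detect()   # step (3a)
    if score > tau_1:
        return refined_label              # e.g. "pedestrian" or "non-motor vehicle"
    return None                           # not a classification item: eliminate


near_hit = resolve_label("others", 5.0, 10.0, lambda: ("pedestrian", 0.9), 0.5)
near_miss = resolve_label("others", 5.0, 10.0, lambda: ("pedestrian", 0.3), 0.5)
far_target = resolve_label("others", 20.0, 10.0, lambda: ("pedestrian", 0.9), 0.5)
```

Passing the secondary detector as a callable keeps the extra convolution from running for targets beyond the safety distance, which mirrors the time saving that step (3b) aims for.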

**Figure 4.** The flow chart of secondary image detection program. Object ∈ large means that targets are {bus, car, truck}.

The original YOLO algorithm fails to distinguish and recognize the targets according to their characteristics, and may lose some targets. The improved YOLO algorithm attempts to detect a target twice within a certain distance, taking into account that pedestrians and non-motor vehicles are dim targets. Thus, it can reduce the miss rate of targets, output a more comprehensive scene model and help ensure the safe driving of vehicles.

#### **4. Decision-Level Fusion of the Detection Information**

After inputting the depth image and color image into the improved YOLO model algorithm, the detected target frame and confidence score are output, and then the final target model is output based on the fusion distance measurement matrix for decision level fusion.

#### *4.1. Theory of Data Fusion*

It is assumed that multiple sensors measure the same parameter, and the data measured by the *i*-th sensor and the *j*-th sensor are *Xi* and *Xj*, both of which obey the Gaussian distribution. Their pdf (probability density function) curves are used as the characteristic functions of the sensors and are denoted as *pi*(*x*), *pj*(*x*). *xi* and *xj* are the observations of *Xi* and *Xj*, respectively. To reflect the deviation between *xi* and *xj*, the confidence distance measure is introduced [15]:

$$d_{ij} = 2\int_{x_i}^{x_j} p_i(x/x_i)\,dx \tag{6}$$

$$d_{ji} = 2\int_{x_j}^{x_i} p_j(x/x_j)\,dx \tag{7}$$

Among them:

$$p_i(x/x_i) = \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left\{-\frac{1}{2} \left[\frac{x - x_i}{\sigma_i}\right]^2\right\} \tag{8}$$

$$p_j(x/x_j) = \frac{1}{\sqrt{2\pi}\sigma_j} \exp\left\{-\frac{1}{2} \left[\frac{x - x_j}{\sigma_j}\right]^2\right\} \tag{9}$$

The value of *dij* is called the confidence distance measure of the *i* sensor and the *j* sensor observation, and its value can be directly obtained by means of the error function erf (*θ*), namely:

$$d_{ij} = \text{erf}\left[\frac{x_j - x_i}{\sqrt{2}\sigma_i}\right] \tag{10}$$

$$d_{ji} = \text{erf}\left[\frac{x_i - x_j}{\sqrt{2}\sigma_j}\right] \tag{11}$$
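The equivalence between the integral form in Equation (6) and the closed form in Equation (10) can be checked numerically. The sketch below uses Python's standard `math.erf` and a simple midpoint rule; the function names, the step count and the sample observations are illustrative assumptions.

```python
import math


def confidence_distance(x_i: float, x_j: float, sigma_i: float) -> float:
    """Equation (10): closed form of the confidence distance measure d_ij."""
    return math.erf((x_j - x_i) / (math.sqrt(2.0) * sigma_i))


def confidence_distance_numeric(x_i: float, x_j: float, sigma_i: float,
                                steps: int = 10000) -> float:
    """Equation (6): twice the integral of the Gaussian pdf from x_i to x_j."""
    width = (x_j - x_i) / steps          # signed step, so the sign follows x_j - x_i
    total = 0.0
    for k in range(steps):
        x = x_i + (k + 0.5) * width      # midpoint of the k-th sub-interval
        z = (x - x_i) / sigma_i
        total += math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma_i)
    return 2.0 * total * width


d_closed = confidence_distance(0.0, 1.0, 1.0)
d_numeric = confidence_distance_numeric(0.0, 1.0, 1.0)
```

With observations one standard deviation apart, both forms give roughly 0.683, the familiar one-sigma probability mass of a Gaussian. Swapping the two observations flips the sign of the result.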

If there are n sensors measuring the same indicator parameter, the confidence distance measure *dij* (*i*, *j* = 1, 2, ..., *n*) constitutes the confidence distance matrix *Dn* of the multi-sensor data:

$$D_n = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{bmatrix} \tag{12}$$

The general fusion method is to use experience to give an upper bound *βij* of fusion, and then the degree of fusion between sensors is:

$$r_{ij} = \begin{cases} 1, & d_{ij} \le \beta_{ij} \\ 0, & d_{ij} > \beta_{ij} \end{cases} \tag{13}$$

In this paper, there are two sensors, i.e., LiDAR and color camera, so *i*, *j* = 1, 2. Then, taking *βij* = 0.5 [16], *r*12 is set as the degree of fusion between the two sensors. Figure 5 explains the fusion process.
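A minimal sketch of Equations (12) and (13) for the two-sensor case is given below, with *βij* = 0.5 as in the text. The function names, observations and standard deviations are made-up values for illustration. The sketch applies Equation (13) literally to the signed value of *dij*; whether the magnitude should be compared instead is not settled by the text.

```python
import math
from typing import List, Sequence


def confidence_distance_matrix(observations: Sequence[float],
                               sigmas: Sequence[float]) -> List[List[float]]:
    """Equation (12): matrix D_n of pairwise confidence distances d_ij."""
    n = len(observations)
    return [
        [math.erf((observations[j] - observations[i]) / (math.sqrt(2.0) * sigmas[i]))
         for j in range(n)]
        for i in range(n)
    ]


def fusion_degree_matrix(distances: Sequence[Sequence[float]],
                         beta: float = 0.5) -> List[List[int]]:
    """Equation (13): r_ij = 1 if d_ij <= beta, otherwise 0."""
    return [[1 if d <= beta else 0 for d in row] for row in distances]


# Hypothetical observations of the same quantity by LiDAR and camera.
d_close = confidence_distance_matrix([10.0, 10.2], [1.0, 1.0])
d_far = confidence_distance_matrix([10.0, 12.0], [1.0, 1.0])
r12_close = fusion_degree_matrix(d_close)[0][1]
r12_far = fusion_degree_matrix(d_far)[0][1]
```

For the close pair the distance stays below 0.5 and *r*12 = 1, while for the distant pair it exceeds 0.5 and *r*12 = 0.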


**Figure 5.** Decision-level fusion diagram of detection model. Blue area (*BB*1) is the model output from the depth image. Green area (*BB*2) is the model output from the color image. Red area (*BB*') is the final detection model. When *r*12 = 0, the fusion process is shown in (**a**). The models not to be fused are shown in (**b**). When *r*12 = 1, the fusion process is shown in (**c**).

The confidence scores of the two models are fused by simple averaging:

$$x = \frac{c_1 + c_2}{2} \tag{14}$$

where *c*1 is the confidence score of target Model 1, and *c*2 is the confidence score of target Model 2. In addition, it should be noted that, when there is only one bounding box, to reduce the missed detection rate, this bounding box information is retained as the final output result. The final target detection model can be output through decision-level fusion and confidence scores.
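The score fusion rule, including the single-box case, can be sketched as follows. Representing a missing detection by `None` and the function name `fuse_confidence` are assumptions made for this sketch.

```python
from typing import Optional


def fuse_confidence(c1: Optional[float], c2: Optional[float]) -> Optional[float]:
    """Equation (14) plus the single-box rule; None means no detection."""
    if c1 is None and c2 is None:
        return None                  # neither model detected the target
    if c1 is None:
        return c2                    # only Model 2 detected it: keep its score
    if c2 is None:
        return c1                    # only Model 1 detected it: keep its score
    return (c1 + c2) / 2.0           # both detected it: simple average


both = fuse_confidence(0.8, 0.6)
camera_only = fuse_confidence(None, 0.51)
```

Keeping the single available score, rather than averaging it with zero, is what prevents a target seen by only one sensor from being suppressed.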

#### *4.2. The Case of the Target Fusion Process*

An example of the target fusion process is shown in Figure 6, and the confidence scores obtained using different sensors can be seen in Table 1.

(1) For target a, according to the decision-level fusion scheme, the result *r*12 ≤ 0 is obtained; then, the overlapping area is taken as the final detection model, and the confidence score after fusion is 0.82, as shown in Figure 6C (a').

(2) For target b, according to the decision-level fusion scheme, the result *r*12 ≥ 0 is obtained; then, the union of all regions is taken as the final detection model, and the confidence score after fusion is 0.54, as shown in Figure 6C (b').

(3) For target c, since there is no such information in Figure 6A, and Figure 6B identifies the pedestrian information on the right, according to the fusion rule, the bounding box information of c in Figure 6B is retained as the final output result, and the confidence score is kept as 0.51, as shown in Figure 6C (c').


**Figure 6.** An example of target detection fusion process. (**A**) is a processed depth image. The models detected a and b are shown with blue. (**B**) is a color image. The models detected a, b and c are shown with green. (**C**) is the final target model after fusion. The models fused a', b' and c' are shown with red.


**Table 1.** Confidence scores obtained using different sensors.

#### **5. Results and Discussion**

#### *5.1. Conditional Configuration*

The target detection training dataset included 3000 frames of images with a resolution of 1500 × 630 and was divided into six different categories: bus, car, truck, non-motor vehicle, pedestrian and others. The dataset was partitioned into three subsets: 60% as the training set (1800 observations), 20% as the validation set (600 observations), and 20% as the testing set (600 observations).
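The 60/20/20 partition can be reproduced with a short sketch. Shuffling the frame indices with a fixed seed is an assumption made here; the paper does not state how frames were assigned to the subsets.

```python
import random
from typing import List, Tuple


def split_indices(total: int,
                  train_pct: int = 60,
                  val_pct: int = 20,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split frame indices into training, validation and testing subsets."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)      # assumed random assignment
    n_train = total * train_pct // 100
    n_val = total * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # the remainder forms the test set
    return train, val, test


train_set, val_set, test_set = split_indices(3000)
```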

The autonomous vehicles collected data on and off campus. The shooting equipment included a color camera and a Velodyne 64-line LiDAR. The camera was synchronized with a 10 Hz spinning LiDAR. The Velodyne has 64-layer vertical resolution, 0.09° angular resolution, 2 cm of distance accuracy, and captures 100 k points per cycle [9]. The processing platform was a PC with an i5 processor (Intel Corporation, Santa Clara, CA, USA) and a GPU (NVIDIA, Santa Clara, CA, USA). The improved YOLO algorithm was implemented on the Darknet framework, using Python (Python 3.6.0, JetBrains, Prague, The Czech Republic) for programming.

#### *5.2. Time Performance Testing*

The whole process included the extraction of depth image and color image, and they were, respectively, substituted into the improved YOLO algorithm and the proposed decision-level fusion scheme as the input layer. The improved YOLO algorithm involved the image grid's secondary detection process and is therefore slightly slower than the normal recognition process. The amount of computation to implement the different steps of the environment and algorithm is shown in Figure 7. In the figure, it can be seen that the average time to process each frame is 81 ms (about 13 fps). Considering that the operating frequency of the camera and Velodyne LiDAR is about 10 Hz, it can meet the real-time requirements of traffic scenes.

**Figure 7.** Processing time for each step of the inspection system (in ms).

#### *5.3. Training Model Parameters Analysis*

The training of the model takes more time, so the setting of related parameters in the model has a great impact on performance and accuracy. Because the YOLO model involved in this article has been modified from the initial model, the relevant parameters in the original model need to be reconfigured through training tests.

The training step will affect the training time and the setting of other parameters. For this purpose, eight steps of training scale were designed. Under the learning rate of 0.001 given by YOLO, the confidence prediction score, actual score, and recognition time of the model are statistically analyzed. Table 2 shows the performance of the *BB*2 model, and Figure 8 shows the example results of the *BB*2 model under D1 (green solid line), D3 (violet solid line), D7 (yellow solid line) and D8 (red solid line).

Table 2 shows that, with the increase of training steps, the confidence score for the *BB*2 model is constantly increasing, and the actual confidence level is also in a rising trend. When the training step reaches 10,000, the actual confidence score arrives at the highest value of 0.947. However, when the training step reaches 20,000, the actual confidence score begins to fall, and the recognition time also slightly increases, which is related to the configuration of model and the selection of learning rate.


**Table 2.** Performance of *BB*2 model under different steps.

Figure 8 shows the vehicle identification with the training steps of 4000, 6000, 10,000, and 20,000. The yellow dotted box indicates the recognition result when the training step is 10,000. Clearly, the model box basically covers the entire target with almost no redundant area. Based on the above analysis, the number of steps set in this paper is 10,000.

**Figure 8.** Performance comparison of *BB*2 model under 4 kinds of training steps.

The learning rate determines the speed at which the parameters are moved to the optimal value. To find the optimal learning rate, the model performances with the learning rates of 10<sup>−7</sup>, 10<sup>−6</sup>, 10<sup>−5</sup>, 10<sup>−4</sup>, 10<sup>−3</sup>, 10<sup>−2</sup>, 10<sup>−1</sup> and 1 are estimated, respectively, when the training step is set to 10,000.

Table 3 shows the estimated confidence scores and final scores of the output detection models *BB*1 and *BB*2 under different learning rates. Figure 9 shows the change trend of the confidence score. Table 3 and Figure 9 show that, as the learning rate decreases, both the confidence prediction score and the actual score of the model first rise and then fall. When the learning rate reaches D3 (10<sup>−2</sup>), the confidence score reaches its maximum value, and the confidence level remains within a stable range as the learning rate changes. Based on the above analysis, when the learning rate is 10<sup>−2</sup>, the proposed model obtains a more accurate recognition rate.


**Table 3.** Model performance under different learning rates.

**Figure 9.** Performance trends under different learning rates.

#### *5.4. Evaluation of Experiment Results*

The paper takes the IOU as the evaluation criteria of recognition accuracy obtained by comparing the *BBi* (*i* = 1, 2) of output model and the *BBg* of actual target model, and defines three evaluation grades:


**Figure 10.** The definition of evaluation grade. The yellow area is the identified effective area. The black frame area is model's total area. The above proportion is the ratio between yellow area and black area.

To avoid the influence caused by the imbalance of all kinds of samples, the precision and recall were introduced to evaluate the box model under the above three levels:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{15}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{16}$$

In the formula, TP, FP, and FN indicate the numbers of true positives (correctly detected targets), false positives (wrongly detected targets) and false negatives (missed targets), respectively. The Precision–Recall diagram for each model *BBi* (*i* = 1, 2) is calculated, as shown in Figure 11A,B.
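Equations (15) and (16) translate directly into code. The counts in the example are made-up numbers for illustration and are not results from this paper; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Equations (15) and (16): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed targets.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
```

Sweeping the confidence threshold and recomputing these two values at each setting yields one point per threshold on a Precision–Recall curve.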

When the recall is less than 0.4, all the accuracy under the three levels is high; when the recall reaches around 0.6, only the accuracy of the level hard decreases sharply and tends to zero, while the accuracy of the other two levels is basically maintained at a relatively high level. Therefore, when the requirements of level for target detection is not very high, the method proposed in this paper can fully satisfy the needs of vehicle detection under real road conditions.

**Figure 11.** Detection performance of the target. (**A**) is the performance relationship of model *BB*1. (**B**) is the performance relationship of model *BB*2.

#### *5.5. Method Comparison*

The method proposed in this paper is compared with the current more advanced algorithms. The indicators are mainly mAP (mean average precision) and FPS (frames per second). The results obtained are shown in Table 4.


**Table 4.** Comparison of the training results of all algorithms.

In Table 4, the recognition accuracy of the improved algorithm proposed in this paper is better than that of the original YOLO algorithm. This is related to the fusion decision of the two images and the proposed secondary image detection scheme. To ensure the accuracy, the detection frame number of the improved YOLO dropped from 45 to 13, and the running time increased, but it can fully meet the normal target detection requirements and ensure the normal driving of autonomous vehicles.

#### **6. Conclusions**

This paper presents a detection fusion system that integrates LiDAR and a color camera. Based on the original YOLO algorithm, a secondary detection scheme is proposed to improve the YOLO algorithm for dim targets such as non-motor vehicles and pedestrians. Then, decision-level fusion of the sensors is introduced to fuse the color image of the color camera and the depth image of the LiDAR to improve the accuracy of target detection. The final experimental results show that, when the training step is set to 10,000 and the learning rate is 0.01, the performance of the model proposed in this paper is optimal and the Precision–Recall performance relationship could satisfy the target detection in most cases. In addition, in terms of algorithm comparison, under the requirement of both accuracy and real-time performance, the method of this paper performs better and has relatively large research prospects.

Since the samples needed in this paper are collected from several traffic scenes, the coverage of the traffic scenes is relatively narrow. In the future research work, we will gradually expand the complexity of the scenario and make further improvements to the YOLO algorithm. In the next experimental session, the influence of environmental factors will be considered, because the image-based identification method is greatly affected by light. At different distances (0–20 m, 20–50 m, 50–100 m, and >100 m), the intensity level of light is different, so how to deal with the problem of light intensity and image resolution is the primary basis for target detection.

**Author Contributions:** Conceptualization, J.Z.; Data curation, J.H. and Y.L.; Formal analysis, J.H.; Funding acquisition, S.W.; Investigation, J.H. and J.Z.; Methodology, Y.L.; Project administration, Y.L.; Resources, J.H.; Software, J.Z.; Supervision, J.Z. and S.W.; Validation, S.W.; Visualization, Y.L.; Writing—original draft, J.H.; and Writing—review and editing, J.Z., S.W. and S.L.

**Funding:** This work was financially supported by the National Natural Science Foundation of China under Grant No. 71801144, and the Science & Technology Innovation Fund for Graduate Students of Shandong University of Science and Technology under Grant No. SDKDYC180373.

**Conflicts of Interest:** The authors declare no conflict of interest.
