**3. Methodology**

Figure 1 shows an overview of the proposed framework with its three modules: a data processing module, a pavement marking detection module, and a visibility analysis module.

**Figure 1.** Overview of the Framework for Condition Analysis of Pavement Markings.

#### *3.1. Data Processing Module*

#### 3.1.1. Data Acquisition

Since deep learning algorithms are data-driven, a training dataset must be prepared for the network model. Owing to the rapid development of autonomous driving, many public datasets covering various driving situations have been collected, such as BDD100K, KITTI, Caltech Lanes, and VPGNet [15–18]. However, these datasets mainly focus on lane detection rather than pavement markings. The VPGNet dataset provides annotations for both lanes and pavement markings, but its pixel-wise annotations are inappropriate for the detection module used in this study. Thus, a system for automatically gathering images or videos of pavements had to be set up. An action camera mounted behind the front windshield of a car driving on the roadways of Indianapolis, U.S.A. was used to record high-definition (HD) videos. Generally, the camera can capture 90% of the view in front of the moving vehicle, including pavements, transportation systems, and nearby street views, but only the data on pavements were used in this study. Several trips were taken to record a large amount of video data. The collected dataset covered various lighting and weather conditions, such as daytime, nighttime, sunny days, and rainy days, and different road environments, such as highways and urban areas. After screening all the video data, more than 1000 high-quality frames were extracted, of which about 200 were used for testing and the remainder for training, which maintained a good training-testing ratio.

#### 3.1.2. Data Annotation

Since the primary goal of this study was to make the computer recognize the pavement markings in the road view videos or images, a labeled dataset had to be prepared for the model training process. After comparing multiple open-source labeling tools, the visual object tagging tool (VoTT) was chosen to perform the data annotation. VoTT is a powerful open-source labeling tool released by Microsoft [19]. This software provides automatic labeling based on a pre-trained network, which can significantly reduce the annotation workload. It also supports many export formats for the annotation results, which makes the labeled sample set suitable for various deep learning development frameworks. Figure 2 shows an example of the labeling process.

**Figure 2.** The software interface for data annotation tasks.

The VOC XML file format was chosen to generate annotations for each imported image. The key step in this procedure is to develop categories for the pavement markings. This study mainly focused on arrow-like pavement markings, such as those for left turns, right turns, etc. However, up to 10 categories of pavement markings were additionally annotated with rectangular bounding boxes for future research. These categories are described in Figure 3. The annotated data were divided into a training dataset and a testing dataset at a ratio of 0.9:0.1.

**Figure 3.** Types of labeled pavement markings.

#### *3.2. Pavement Markings Detection Module*

In the field of computer vision, many novel object recognition frameworks have been studied in recent years. Among them, the most studied are deep learning-based models. Generally, according to the recognition principle, existing object detection models can be divided into two categories: two-stage frameworks and one-stage frameworks [20].

In two-stage frameworks, the visual target is detected in two main steps. First, abundant candidate regions that may cover the targets are proposed, and then the validity of such regions is determined. R-CNN, Fast-RCNN, and Faster-RCNN are the representative two-stage frameworks, all of which have a high detection precision [21–23]. However, since they take time to generate candidate regions, their detection efficiency is relatively low, which makes them unsuitable for real-time applications. To make up for this deficiency, researchers proposed the one-stage framework.

Compared to the two-stage framework, the one-stage framework eliminates the candidate-region proposal phase, and simultaneously performs localization and classification by treating object detection as a regression problem. Moreover, with the help of CNNs, the one-stage framework can be constructed as an end-to-end network so that inferences can be made with simple matrix computations. Although this type of framework is slightly inferior to the two-stage framework in detection accuracy, its detection speed is dozens of times faster. One of the representative one-stage frameworks, You Only Look Once (YOLO), achieves a balance between detection accuracy and speed [24]. After continuous updating and improvement, the detection accuracy of YOLOv3 has already caught up with that of most two-stage frameworks. This is why YOLOv3 was chosen as the pavement marking detection model in this study.

#### 3.2.1. Demonstration of the YOLO Framework

Previous studies, such as those on Region-CNN (R-CNN) and its derivative methods, used multiple steps to complete the detection, and each independent stage had to be trained separately, which slowed down the execution and made optimizing the training process difficult. YOLO uses an end-to-end design to transform the object detection task into a single regression problem, and directly obtains the coordinates and classification probabilities of the targets from raw image data. Although Faster-RCNN also directly takes the entire image as an input, it still uses the proposal-and-classifier idea of the R-CNN model. The YOLO algorithm brings a new solution to the object detection problem. It only scans the sample image once and uses deep CNNs to perform both the classification and the localization. The detection speed of YOLO can reach 45 frames per second, which basically meets the requirement of real-time video detection applications.

YOLO divides the input image into *S* × *S* sub-cells, each of which can detect objects individually. If the center point of an object falls in a certain sub-cell, the possibility of including the object in that sub-cell is higher than in the adjacent sub-cells; in other words, this sub-cell is responsible for the object. Each sub-cell predicts *B* bounding boxes and a confidence score for each bounding box. In detail, each prediction is a five-dimensional array, namely, (*x*, *y*, *w*, *h*, *c*)<sup>T</sup>, where (*x*, *y*) is the offset of the center point of the bounding box relative to the upper left corner of the current sub-cell; (*w*, *h*) are the width and height of the bounding box relative to the entire image; and *c* is the confidence value. In the YOLO framework, the confidence score has two parts: the possibility that there is an object in the current cell, and the Intersection over Union (IoU) value between the predicted box and the reference one. Suppose the possibility of the existence of the object is Pr(*Obj*), and the IoU value between the predicted box and the reference box is *IoU*(*pred*, *truth*); the formula for the confidence score is shown as Equation (1).

$$Confidence = \Pr(Obj) \times IoU(pred, truth) \tag{1}$$

Suppose that *box<sub>p</sub>* is the predicted bounding box, and *box<sub>t</sub>* is the reference bounding box. Then the IoU value can be calculated using the following formula.

$$IoU(pred, truth) = \frac{\left|box\_p \cap box\_t\right|}{\left|box\_p \cup box\_t\right|} \tag{2}$$

In addition, YOLO outputs the individual conditional probability of *C* object categories for each cell. The final output of the YOLO network is a vector with *S* × *S* × (5 × *B* + *C*) nodes.
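Equations (1) and (2) can be illustrated with a short sketch (a hypothetical helper for illustration only; boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A fully overlapping pair yields an IoU of 1, disjoint boxes yield 0, and partial overlaps fall in between, which is exactly the quantity multiplied by Pr(*Obj*) in Equation (1).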

YOLO adopted the classic CNN structure, which first extracts spatial features through convolutional layers and then computes predictions with fully connected layers. This type of architecture limits the number of predictable target categories, which makes the YOLO model insufficient for multi-object detection. Moreover, since YOLO randomly selects the initial prediction boxes for each cell, it cannot accurately locate and capture the objects. To overcome these difficulties and enhance its performance, Redmon et al. further modified its structure, applied novel features, and proposed improved models such as YOLOv2 and YOLOv3 [25,26].

The YOLOv2 network discarded the fully connected layers of YOLO, transformed YOLO into a fully convolutional network, and used anchor boxes to assist in the prediction of the final detection bounding boxes. It predefined a set of anchor boxes with different sizes and aspect ratios in each cell to cover different positions and multiple scales of the entire image. These anchor boxes were used as initial candidate regions, which were distinguished according to the presence or absence of targets inside them through the network. The position of the predicted bounding boxes was also continuously fine-tuned [27]. To fit the characteristics of the training samples, YOLOv2 used the k-means clustering algorithm to automatically learn the best initial anchor boxes from the training dataset. Moreover, YOLOv2 applied the Batch Normalization (B.N.) operation to the network structure. B.N. decreased the shift in the unit values in the hidden layers, and thus improved the stability of the neural network [28]. The B.N. regularization can prevent overfitting of the model, which makes the YOLOv2 network easier to converge.

Compared to YOLOv2, YOLOv3 mainly integrated some advanced techniques. While maintaining fast detection, it further improved the detection accuracy and the ability to recognize small targets. YOLOv3 adopted a novel framework called Darknet-53 as its main network. Darknet-53 contained a total of 53 convolutional layers and adopted the skip-connection structure inspired by ResNet [29]. The much deeper CNN helped improve feature extraction. Motivated by the idea of multilayer feature fusion, YOLOv3 used the up-sampling method to re-extract information from the previous feature maps, and performed feature fusion with different-scale feature maps. In this way, more fine-grained information can be obtained, which improved the accuracy of the detection of small objects.

#### 3.2.2. Structure of YOLOv3

Figure 4 shows the YOLOv3 network structure, which has two parts: Darknet-53 and the multi-scale prediction module. Darknet-53 extracts features from the input image, the size of which is set at 416 × 416. It consists of successive 1 × 1 and 3 × 3 convolutional layers, without any fully connected layers. Each convolutional layer is followed by a B.N. layer and a LeakyReLU activation function, which together are regarded as the DBL block. In addition, Darknet-53 applies residual blocks in some layers. The main distinction of the residual block is that it adds a direct connection from the block entrance to the block exit, which helps the model converge more easily, even if the network is very deep. When the feature extraction step is completed, the feature maps are used for multi-scale object detection. In this part, YOLOv3 extracts three feature maps of different scales in the middle, middle-bottom, and bottom layers. In these layers, concatenation operations are used to fuse the multi-scale features. In the end, three predictions of different scales are obtained, each of which contains the information of three anchor boxes. Each anchor box is represented as a vector of (5 + *numclass*) dimensions, in which the first five values indicate the coordinates and the confidence score, and *numclass* refers to the number of object categories. In this study, five kinds of arrow-like pavement markings were considered.

**Figure 4.** Structure of the YOLOv3 network. (https://plos.figshare.com/articles/YOLOv3\_architecture\_ /8322632/1).
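As a concrete check of the dimensions described above, the per-scale output shape can be sketched as follows (a minimal illustration; the helper name and the grid sizes 13, 26, and 52 for a 416 × 416 input are assumptions based on the standard YOLOv3 design, not code from this study):

```python
def yolov3_head_shape(grid_size, num_anchors=3, num_class=5):
    # Each cell predicts num_anchors boxes; each box carries
    # (x, y, w, h, confidence) plus num_class class probabilities.
    return (grid_size, grid_size, num_anchors * (5 + num_class))

# The three detection scales assumed for a 416 x 416 input image.
for s in (13, 26, 52):
    print(s, yolov3_head_shape(s))
```

With the five arrow-like marking categories used in this study, each scale therefore outputs a grid of 3 × (5 + 5) = 30 channels per cell.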

#### *3.3. Visibility Analysis Module*

After the data collected by the dashboard camera are labeled, a YOLOv3-based pavement marking detection module can be constructed and trained. The target pavement markings can be extracted and exported as small image patches. Thus, the next step is to design a visibility analysis module to help determine the condition of the pavement markings.

Pavement markings are painted mainly to notify drivers in advance. As such, a significant property of pavement markings is their brightness. However, brightness is an absolute value affected by many factors, such as the weather and the illumination. Since the human visual system is more sensitive to contrast than to absolute luminance, the intensity contrast was chosen as the metric for the visibility of pavement markings [30]. In this study, contrast was defined as the difference between the average intensity of the pavement marking and the average intensity of the surrounding pavement. The main pipeline of this visibility analysis module is shown in Figure 5.

**Figure 5.** A demonstration of the pipeline of the visibility analysis module.

#### 3.3.1. Finding Contours

As the pavement markings are already exported as image patches, the first step is to separate the pavement markings from the pavement. Since only arrow-like markings were considered in this study, the portion with the marking can be detached easily from the image, as long as the outer contour of the marking is found. The contour can be described as a curve that joins all the continuous points along the boundary with the same color or intensity.

The contour tracing algorithm used in this part was proposed by Suzuki et al. [30]. It was one of the first algorithms to define the hierarchical relationships of the borders and to differentiate the outer borders from the hole borders. This method has been integrated into the OpenCV Library [31]. The input image should be a binary image, which means the image has only two values: 0 and 1, with 0 representing the black background and 1 the bright foreground or object. Thus, the traced borders correspond to the edges of the foreground regions.

Assume that *p<sub>ij</sub>* denotes the pixel value at position (*i*, *j*) in the image. Two variables, the Newest Border Number (*NBD*) and the Last Newest Border Number (*LNBD*), are created to record the relationships between the borders during the scanning process. The algorithm uses a row-by-row, left-to-right scanning scheme and processes each pixel where *p<sub>ij</sub>* > 0, updating *NBD* and *LNBD* along the way.

	- 2.1. Starting from pixel (*i*<sub>2</sub>, *j*<sub>2</sub>), traverse the neighborhoods of pixel (*i*, *j*) in a clockwise direction. In this study, the 4-connected case is selected to determine the neighborhoods, which means only the points connected horizontally and vertically are regarded as neighborhoods. If a non-zero pixel exists, denote it as (*i*<sub>1</sub>, *j*<sub>1</sub>). Otherwise, let *p*<sub>*ij*</sub> = −*NBD* and jump to Step 3.
	- 2.2. Assign (*i*<sub>2</sub>, *j*<sub>2</sub>) ← (*i*<sub>1</sub>, *j*<sub>1</sub>) and (*i*<sub>3</sub>, *j*<sub>3</sub>) ← (*i*, *j*).
	- 2.3. Taking pixel (*i*<sub>3</sub>, *j*<sub>3</sub>) as the center, traverse the neighborhoods in a counterclockwise direction from the next element after (*i*<sub>2</sub>, *j*<sub>2</sub>) to find the first non-zero pixel, and assign it as (*i*<sub>4</sub>, *j*<sub>4</sub>).
	- 2.4. Update the value *p*<sub>*i*3,*j*3</sub> according to Step 2.4 in Figure 6.
	- 2.5. If *p*<sub>*i*3,*j*3+1</sub> = 0, update *p*<sub>*i*3,*j*3</sub> ← −*NBD*.
	- 2.6. If *p*<sub>*i*3,*j*3+1</sub> ≠ 0 and *p*<sub>*i*3,*j*3</sub> = 1, update *p*<sub>*i*3,*j*3</sub> ← *NBD*.
	- 2.7. If the current condition satisfies (*i*<sub>4</sub>, *j*<sub>4</sub>) = (*i*, *j*) and (*i*<sub>3</sub>, *j*<sub>3</sub>) = (*i*<sub>1</sub>, *j*<sub>1</sub>), which means the trace has returned to the starting point, jump to Step 3. Otherwise, assign (*i*<sub>2</sub>, *j*<sub>2</sub>) ← (*i*<sub>3</sub>, *j*<sub>3</sub>) and (*i*<sub>3</sub>, *j*<sub>3</sub>) ← (*i*<sub>4</sub>, *j*<sub>4</sub>) and return to Sub-step 2.3.

**Figure 6.** An illustration of Steps 1 and 2.1–2.4 of the contour tracing algorithm and its final output.

Figure 6 shows the contour tracing algorithm. By using this approach, the outer border, or contour, of the arrow-like pavement marking can be found. However, due to uneven lighting or faded markings, the detected contours may not be closed curves, as shown in Figure 7b. Such incomplete contours cannot be used to separate the pavement marking portion. To solve this problem, the dilation operation is performed before the contours are traced.

**Figure 7.** Results of the visibility analysis module. (**a**) Original patch, including the pavement marking; (**b**) Found contours without the dilation operation; (**c**) Found contours with the dilation operation; (**d**) Generated image mask for the marking; and (**e**) Generated image mask for the pavement.

Dilation is one of the morphological image processing methods, opposite to erosion [32]. The basic effect of the dilation operator on a binary image is the gradual enlargement of the boundaries of the foreground pixels, so that holes in the foreground regions become smaller. The dilation operator takes two pieces of data as inputs. The first input is the image to be dilated, and the second input is a set of coordinate points known as a kernel. The kernel determines the precise effect of the dilation on the input image. In this study, the kernel is presumed to be a 3 × 3 square, with the origin at its center. To compute the dilation output of a binary image, each background pixel (i.e., 0-value) is processed in turn. For each background pixel, if at least one coordinate point inside the kernel coincides with a foreground pixel (i.e., 1-value), the background pixel is flipped to the foreground value; otherwise, the next background pixel is processed. Figure 8 shows the effect of a dilation using a 3 × 3 kernel. By applying the dilation method before detecting the contours of the pavement marking patches, the holes in the markings are largely eliminated, and the outer border becomes consistent and complete, which can be easily observed in Figure 7b,c.

**Figure 8.** An example of the e ffect of the dilation operation (https://homepages.inf.ed.ac.uk/rbf/HIPR2/ dilate.htm).
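The dilation rule described above can be sketched in a few lines of NumPy (a minimal reimplementation for illustration only; in practice a library routine such as OpenCV's dilation would be used):

```python
import numpy as np

def dilate(img, kernel=np.ones((3, 3), dtype=bool)):
    """Binary dilation: a pixel becomes foreground if any kernel
    position centered on it overlaps a foreground pixel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Pad with background so the kernel can slide over the borders.
    padded = np.pad(img.astype(bool), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=bool)
    # OR together the shifted copies selected by the kernel.
    for dy in range(kh):
        for dx in range(kw):
            if kernel[dy, dx]:
                out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out.astype(img.dtype)
```

For example, a single foreground pixel grows into a full 3 × 3 square under the default kernel, which is how thin gaps in a faded marking boundary get closed.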

#### 3.3.2. Constructing Masks

Once the complete outer border of the pavement marking is obtained, the next step is to detach the pavement marking from the surrounding pavement. In practical scenarios, the pavement marking cannot be physically separated from the image patch due to its arbitrary shape. The most common way to achieve the target is to use masks to indicate the region segmentation. Since there are only two categories of objects, i.e., the pavement markings and the pavement, in this study, two masks had to be generated for each image patch.

Image masking is a non-destructive process of image editing that is universally employed in graphics software such as Photoshop to hide or reveal some portions of an image. Masking involves setting some of the pixel values in an image to 0 or another background value. Ordinary masks have only 1 and 0 values, and areas with a 0 value should be hidden (i.e., masked). Examples of masks generated for pavement markings are shown in Figure 7d,e.
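As a minimal sketch, assuming the traced contour has already been filled into a binary image (the `filled` array below is a hypothetical stand-in for that result), the two masks are simply complements of each other:

```python
import numpy as np

# Hypothetical filled contour: 1 marks the arrow-shaped marking region.
filled = np.zeros((6, 8), dtype=np.uint8)
filled[2:5, 2:6] = 1

marking_mask = filled.astype(bool)   # reveals the marking, hides pavement
pavement_mask = ~marking_mask        # complement: reveals pavement only
```

Applying `marking_mask` to the grayscale patch keeps only the marking pixels, and `pavement_mask` keeps only the surrounding pavement, as shown in Figure 7d,e.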

#### 3.3.3. Computing the Intensity Contrast

According to the pipeline of the visibility analysis module, the final step is to calculate the contrast between the pavement markings and the surrounding pavement. The straightforward way to determine the contrast value is to simply compute the difference between the average intensities of the markings and the pavement. However, this procedure does not adapt to changes in the overall luminance. For instance, a luminance difference of 60 grayscales in a dark scenario (e.g., at night) should be more significant than the same luminance difference in a bright scenario (e.g., a sunny day). The human eye senses brightness approximately logarithmically over a moderate range, which means the human visual system is more sensitive to intensity changes in dark circumstances than in bright environments [33]. Thus, in this study, the intensity contrast was computed using the Weber contrast method, the formula for which is:

$$\text{Contrast}(M, P) = \frac{\overline{I\_M} - \overline{I\_P}}{\overline{I\_P}}, \quad \overline{I\_M} = \frac{\sum\_{v \in \text{Marking}} I\_v}{N\_{\text{Marking}}}, \quad \overline{I\_P} = \frac{\sum\_{v \in \text{Pavement}} I\_v}{N\_{\text{Pavement}}} \tag{3}$$

where *I<sub>M</sub>* and *I<sub>P</sub>* are the average intensity values of the pavement marking and the surrounding pavement, respectively, and *N<sub>region</sub>* is the number of pixels in the corresponding region.
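Given the two masks from the previous step, Equation (3) reduces to a few lines (a sketch with a hypothetical helper name; the grayscale patch and marking mask are assumed inputs):

```python
import numpy as np

def weber_contrast(gray, marking_mask):
    """Equation (3): (mean marking intensity - mean pavement intensity)
    divided by the mean pavement intensity."""
    i_m = gray[marking_mask].mean()    # average intensity over the marking
    i_p = gray[~marking_mask].mean()   # average intensity over the pavement
    return (i_m - i_p) / i_p
```

A freshly painted white arrow on dark asphalt yields a large positive contrast, while a faded marking whose intensity approaches that of the pavement yields a value near zero.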

#### **4. Experimental Validation of the Framework**

#### *4.1. Experiment Settings*

The pavement marking detection model needs to be trained with a labeled dataset to enhance its performance. In this study, a Windows 10 personal computer with an Nvidia GeForce RTX 2060 Super GPU and a total memory of 16 GB was used to perform the training and validation procedures. The deep learning framework used to build, train, and evaluate the detection network is the TensorFlow platform, one of the most popular software libraries for machine learning tasks [34].

On actual roads, left-turn markings are much more common than right-turn markings, which leads to an imbalanced ratio between these two kinds of pavement markings in the training dataset. If a classification network is trained without fixing this problem, the model could be severely biased [35]. Thus, in this study, data augmentation was performed before the model was trained. Specifically, each image of a left-turn (right-turn) marking was mirrored horizontally to produce a new right-turn (left-turn) marking. By applying this strategy to the whole training dataset, the numbers of the two markings become equal. An example of this data augmentation method is shown in Figure 9.

**Figure 9.** An example of data augmentation.
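This label-aware flip can be sketched as follows (the label names and helper are hypothetical, not from the study's code; the key point is that mirroring must swap the turn-direction label):

```python
import numpy as np

# Hypothetical label names; mirroring horizontally swaps turn directions.
FLIP_LABEL = {"left_turn": "right_turn", "right_turn": "left_turn"}

def flip_sample(image, label):
    """Mirror the image about its vertical axis and swap the turn label.
    Labels not in FLIP_LABEL (e.g., straight arrows) are kept unchanged."""
    return np.fliplr(image), FLIP_LABEL.get(label, label)
```

Note that, unlike generic horizontal-flip augmentation, the label must be updated along with the pixels; otherwise, the flip would silently mislabel the sample.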

#### *4.2. Model Training*

The neural network is trained by first calculating the loss through a forward inference, and then updating the related parameters based on the derivative of the loss to make the predictions as accurate as possible. Therefore, the design of the loss function is significant. In the YOLOv3 algorithm, the loss function has mainly three parts: the location offset of the predicted boxes, the deviation of the target confidence score, and the target classification error. The formula for the loss function is:

$$L(l, g, O, o, C, c) = \lambda\_1 L\_{loc}(l, g) + \lambda\_2 L\_{conf}(o, c) + \lambda\_3 L\_{cla}(O, C), \tag{4}$$

where *λ*<sub>1</sub>–*λ*<sub>3</sub> are the scaling factors.

The location loss function uses the sum of the square errors between the true offset and the predicted offset, which is formulated as:

$$L\_{loc}(l, g) = \sum\_{m \in \{x, y, w, h\}} \left(\hat{l}^{\,m} - \hat{g}^{\,m}\right)^{2} \tag{5}$$

where *l̂* and *ĝ* represent the coordinate offsets of the predicted bounding box and the reference bounding box, respectively. Both *l̂* and *ĝ* have four parameters: *x* for the offset along the *x*-axis, *y* for the offset along the *y*-axis, *w* for the box width, and *h* for the box height.

The target confidence score indicates the probability that the predicted box contains the target, which is computed as:

$$L\_{conf}(o, c) = -\sum\_{i} \left( o\_i \ln(c\_i) + (1 - o\_i) \ln(1 - c\_i) \right). \tag{6}$$

The function *L<sub>conf</sub>* uses the binary cross-entropy loss, where *o<sub>i</sub>* ∈ {0, 1} indicates whether the target actually exists in the predicted rectangle *i*: 1 means yes, and 0 means no. *c<sub>i</sub>* ∈ [0, 1] denotes the estimated probability that there is a target in rectangle *i*.
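The confidence loss of Equation (6) can be sketched numerically as follows (a hypothetical helper; the small `eps` clip, an assumption added here, avoids log(0) for hard 0/1 predictions):

```python
import numpy as np

def conf_loss(o, c, eps=1e-7):
    """Binary cross-entropy of Equation (6), summed over all predicted boxes.
    o: 0/1 target-presence labels; c: predicted probabilities in [0, 1]."""
    o = np.asarray(o, dtype=float)
    c = np.clip(np.asarray(c, dtype=float), eps, 1 - eps)  # guard log(0)
    return -np.sum(o * np.log(c) + (1 - o) * np.log(1 - c))
```

A confident correct prediction contributes almost nothing to the sum, while an uncertain prediction (e.g., *c<sub>i</sub>* = 0.5 for a box that does contain a target) contributes ln 2 ≈ 0.693.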

The formulation of the target classification error in this study slightly differs from that in the YOLOv3 network. In the YOLOv3 network, the authors still used the binary cross-entropy loss function, as they considered that an object could belong to more than one category in complicated real scenes. However, in this study, the categories of the pavement markings are mutually exclusive. Thus, the multi-class cross-entropy loss function was used to measure the target classification error, the mathematical expression of which is:

$$L\_{cla}(O, C) = -\sum\_{i \in pos} \sum\_{j \in cla} \left( O\_{ij} \ln(C\_{ij}) + (1 - O\_{ij}) \ln(1 - C\_{ij}) \right), \tag{7}$$

where *O<sub>ij</sub>* ∈ {0, 1} indicates whether the predicted box *i* contains an object of category *j*, and *C<sub>ij</sub>* ∈ [0, 1] represents the estimated probability of that event.

Pan and Yang (2010) found that in the machine learning field, the knowledge gained while solving one problem can be applied to another different but related problem, which is called transfer learning [36]. For instance, the knowledge obtained while learning to recognize cars could be useful for recognizing trucks. In this study, the pavement marking detection network was not trained from scratch. Instead, a model pre-trained to recognize objects in the MS COCO dataset was used for the initialization. The MS COCO dataset, published by Lin et al., contains large-scale object detection data and annotations [37]. The model pre-trained on the COCO dataset can provide the machine with some general knowledge of object detection tasks. Starting from the pre-trained network, a fine-tuning procedure is conducted by feeding the collected data to the machine to make it capable of recognizing pavement markings. The training process runs for 50 epochs.

With the help of the TensorBoard integrated into the TensorFlow platform, users can monitor the training progress in real time. It can export figures to indicate the trends of specific parameters or predefined metrics. Figure 10 shows the trend of three different losses during the training process. The figure shows a decreasing trend for all the losses.

**Figure 10.** The trends of various loss functions during the training process monitored by TensorBoard.

#### *4.3. Model Inference and Performance*

After the training, the produced model is evaluated on the testing dataset. At the end of each training epoch, the network structure and the corresponding parameters are stored as the checkpoint file. For the evaluation, the checkpoint file with the least loss is chosen to be restored. The testing sample images are directly fed to the model as the inputs, and then the machine will automatically detect and locate the pavement markings in the image. Once the arrow-like pavement markings are recognized in the image, the detected areas are extracted to perform the visibility analysis. In this study, the function of the visibility analysis module was integrated into the evaluation of the pavement marking detection module. Thus, for each input image, the model drew the predicted bounding boxes, and added text to indicate the estimated category, the confidence score, and the contrast score on the image. Some examples of the evaluation of testing images are shown in Figure 11.

**Figure 11.** Visual results evaluated on the testing samples.

From the figure, it can be seen that most of the pavement markings are correctly located and classified, and the contrast value provides a good measure of the visibility of the markings. The two subfigures on the left belong to the cloudy scenario, and the two on the right represent the sunny case. The pavements in the two subfigures on the left are both dark, but due to the poor marking condition, the contrast values of the top subfigure (i.e., 1.0, 0.4) are much lower than those of the bottom subfigure (i.e., 2.1, 1.9, 2.3). It can be observed that the pavement markings in the bottom subfigure are much more recognizable than those in the top subfigure, which validates the effectiveness of the contrast value for analyzing the visibility of pavement markings. Similarly, for the two subfigures on the right, all the detected pavement markings are in good condition; nevertheless, the contrast values of the bottom subfigure (i.e., 0.9, 1.1) are higher than those of the top subfigure (i.e., 0.2, 0.3), because the pavement in the bottom image is darker. This means the markings in the bottom-right subfigure are easier to identify than those in the top-right subfigure. The high brightness of the pavement can reduce the visibility of the markings on it, as the markings are generally painted white.

For the quantitative evaluation of the performance of the pavement marking detection model in this study, the mean average precision (mAP) was used. The results of the object detection system were divided into the following four categories by comparing the estimation with the reference label: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Table 1 defines these four metrics.


**Table 1.** Four categories of the metrics.

The precision refers to the proportion of the correct results in the identified positive samples, and the recall denotes the ratio of the correctly identified positive samples to all the positive samples. The formulas for these two metrics are as follows.

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN} \tag{8}$$

To determine if a prediction box correctly located the target, an IoU threshold was predefined before the model was evaluated. As long as the IoU value between the estimated bounding box and the ground truth was greater than the threshold, the prediction was considered a correct detection. When the threshold value was adjusted, both the precision and the recall changed. As the threshold decreased, the recall value continued to increase, while the precision decreased after reaching a certain level. According to this pattern, the precision-recall curve, i.e., the PR curve, was drawn [38]. The AP value refers to the area under the PR curve, and the mAP value indicates the average AP over the multiple categories.
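Equation (8) can be sketched directly from the detection counts (a hypothetical helper for illustration; the zero-denominator guards are an assumption added here):

```python
def precision_recall(tp, fp, fn):
    """Equation (8): precision and recall from detection counts.
    Returns 0.0 for a metric whose denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, a model with 8 true positives, 2 false positives, and 4 false negatives has a precision of 0.8 but a recall of only about 0.67, illustrating why both metrics (and the PR curve they trace out) are needed.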

Figure 12 shows the results of the quantitative validation of the detection model on the testing dataset. As shown in the top-left subfigure, there are 203 sample images and 223 pavement marking objects included in the evaluation dataset. It can be seen that the distribution of the different pavement markings is imbalanced. Thus, collecting more images and enlarging the dataset are directions for future work. The bottom-left subfigure shows the number of true/false predictions on the testing samples for each category, where the red portion represents the false predictions and the green portion the true predictions. Given the numbers shown in the figure, it can be surmised that the detection model works properly, since most of the identified pavement markings were correctly classified. The right subfigure provides the average precision values for each category. The mAP value reflects the overall performance of the detection module. However, the low mAP value indicates that there is still room to improve the model.

From the validation results on the testing samples, it was observed that some left-turn and right-turn markings were misclassified as the other category. After exploring the whole project, the cause of this issue was found to be a code issue. Since YOLOv3 is a representative framework in the object detection field, there are many open-source implementations, and the pavement marking detection model in this project was also trained with open-source code. In the data preprocessing step of that code, some training samples are randomly chosen and flipped horizontally to enhance the diversity of the training data. This is a common and useful data augmentation operation for general objects, whose categories are unchanged by horizontal flipping. For pavement markings, however, the flip may transform a marking into another type, e.g., a left-turn marking into a right-turn marking. Thus, the flip operation in the code generated wrong training samples, misled the machine, and hindered the performance of the model.

**Figure 12.** The quantitative evaluation information on the testing dataset of the trained model.

After removing this flip operation from the code and re-training the model, the new quantitative validation results shown in Figure 13 were obtained. Comparing Figures 12 and 13, the performance of the model is greatly enhanced, i.e., there is a 24% increase in the mAP value. The evaluation results demonstrate the effectiveness of the YOLOv3 model in the pavement marking recognition task.

**Figure 13.** The quantitative evaluation information on the testing dataset of the improved model.
