Article

CTM-YOLOv8n: A Lightweight Pedestrian Traffic-Sign Detection and Recognition Model with Advanced Optimization

by Qiang Chen 1,2, Zhongmou Dai 1, Yi Xu 3,4,* and Yuezhen Gao 5

1 School of Automobile and Transportation, Tianjin University of Technology and Education, Tianjin 300222, China
2 National & Local Joint Engineering Research Center for Intelligent Vehicle Road Collaboration and Safety Technology, Tianjin 300222, China
3 School of Transportation and Vehicle Engineering, Shandong University of Technology, Zibo 255000, China
4 Qingte Group Co., Ltd., Qingdao 266106, China
5 Department of Civil Engineering, University of Alberta, 116 St NW, Edmonton, AB T6G 2E1, Canada
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2024, 15(7), 285; https://doi.org/10.3390/wevj15070285
Submission received: 5 June 2024 / Revised: 21 June 2024 / Accepted: 24 June 2024 / Published: 26 June 2024

Abstract:
Traffic-sign detection and recognition (TSDR) is crucial to preventing harm to pedestrians, especially children, from intelligent connected vehicles and has become a research hotspot. However, due to motion blurring, partial occlusion, and small sign sizes, pedestrian TSDR faces increasingly significant challenges. To overcome these difficulties, a CTM-YOLOv8n model is proposed based on the YOLOv8n model. To extract spatial features more efficiently and make the network faster, the C2f Faster module is constructed to replace the C2f module in the head; it applies filters to only a few input channels while leaving the remaining ones untouched. To enhance small-sign detection, a tiny-object-detection (TOD) layer is designed and added to the first C2f layer in the backbone. Meanwhile, the seventh Conv layer, the eighth C2f layer, and the connected detection head are deleted to reduce the number of model parameters. Finally, the original CIoU loss is replaced by the MPDIoU loss, which is better for training deep models. For the experiments, the dataset is augmented; it comprises the 'w55' and 'w57' categories from the TT100K dataset together with two types of traffic signs collected around schools in Tianjin. Empirical results demonstrate the efficacy of our model, showing enhancements of 5.2% in precision, 10.8% in recall, 7.0% in F1 score, and 4.8% in mAP@0.50. Meanwhile, the number of parameters is reduced to 0.89 M, which is only 30% of the YOLOv8n model. Furthermore, the proposed CTM-YOLOv8n model shows superior performance when tested against other advanced TSDR models.

1. Introduction

According to data provided by the WHO [1], road traffic accidents kill approximately 1.3 million people around the world each year and leave 20–50 million people with nonfatal injuries. More importantly, road traffic injuries are the leading cause of death for children and young adults aged 5 to 29, who are particularly vulnerable on the world's roads. In Tianjin, road traffic accidents account for about 9 of every 1000 abnormal deaths. Many accidents were caused by drivers who were not paying attention to traffic signs at critical moments or by adverse conditions that impeded visibility [2,3].
Traffic signs, with distinct designs based on particular shapes and colors with symbols inside, are used to inform, guide, restrict, and warn drivers, helping to make driving safer and more convenient. Today, intelligent connected vehicles (ICVs) [4] can independently assist driving operations, which greatly relieves the driver's workload and reduces the accident rate. The traffic-sign detection and recognition (TSDR) capability of ICVs can effectively prevent traffic collisions and accidents by notifying drivers of potential traffic problems [5,6]. Consequently, the continued development of TSDR systems for pedestrian-related signs, especially those protecting children, remains important for making driving safer and is gradually becoming a hot research topic in industry and academia. However, in real traffic scenarios, pedestrian TSDR faces many difficulties due to motion blurring, partial occlusion, and small sign sizes.
TSDR technology is usually used to identify the type and actual meaning of a traffic target that may have been localized in video frames. As such, TSDR methods can be divided into the following broad categories [7], namely, template matching, traditional machine learning, and deep learning.
Template-matching-based methods rely on extracting features from visual information such as color, edge, and shape. Hechri et al. [8] proposed a method that searches for the best match for TSDR using a sliding window with the same scale as the traffic signs in the database. Wang et al. [9] proposed a bitmap method to detect red circular traffic signs; after color segmentation of the detected images, region-of-interest (ROI) shape detection is performed using edge information. However, due to the complex and changing environment of road traffic, such methods have poor real-time performance and insufficient robustness.
Traditional machine learning-based methods, such as support vector machines (SVMs), linear discriminant analysis, and ensemble classifiers, classify detected traffic signs relatively independently. Yang et al. [10] applied four-class SVM classifiers based on a radial basis function kernel to recognize traffic-sign categories, using color information and histograms of oriented gradient features. In addition, a computational framework for the traffic-sign detection, tracking, and recognition task was introduced using a mono-camera mounted on a moving vehicle in non-stationary environments [11]. An SVM can be trained easily and can recognize traffic signs accurately, but its efficiency is low compared with many other methods.
In recent years, deep learning-based TSDR has demonstrated raw-data-based representation learning and achieved excellent performance in improving accuracy and recognition rate. These methods fall into two main categories: two-stage algorithms represented by the region-based convolutional neural network (R-CNN) [12] family (Fast R-CNN [13], Faster R-CNN [14], and Cascaded R-CNN [15]) and one-stage algorithms such as the Single-Shot MultiBox Detector (SSD) [16], You Only Look Once (YOLO) [17], and TSingNet [18]. Due to the limited computational capacity available in real-world applications, most studies have focused on one-stage algorithms for TSDR.
One-stage algorithms can simultaneously predict object classes and generate bounding boxes and are therefore well suited to detection tasks with high real-time requirements [19]. Moreover, the input image is uniformly sampled using different aspect ratios and scales at different positions. Wan et al. [20] improved the recognition of small traffic signs by pruning the network and improving the loss function of YOLOv3. Gu et al. [21] proposed a lightweight real-time traffic-sign detection integration framework based on YOLOv4, which reduces latency by cutting network computational overhead and promotes information transfer and sharing at diverse levels. Ren et al. [22] proposed Meta-YOLO by constructing a two-stage meta-learner model that can learn the update direction and the learning rate. Zhang et al. [23] proposed ReYOLO, which replaces the conventional block with modules constructed through structural reparameterization and designs a novel weighting mechanism that can be embedded in a feature pyramid to efficiently detect small and ambiguous traffic signs in the wild. Chu et al. [24] proposed TRD-YOLO, which includes global feature extraction capability and a lightweight multi-branch detection head to improve the accuracy of small-traffic-sign detection. Han et al. [25] incorporated a decoupled head, space-to-depth convolution (SPD-Conv) modules, and a context augmentation module (CAM) into YOLOv5 to address poor localization, low accuracy, and missed detections. Lai et al. [26] constructed STC-YOLO using a small-object-detection layer, multi-head attention, and a normalized Gaussian Wasserstein distance (NWD) metric, which is suitable for complex scenes. Liu et al. [27] proposed UCN-YOLOv5, which enhances the feature extraction of the network by using the core RSU module of U2Net and constructs the C3_CN2 structure by integrating ConvNeXt-V2 and the C3 module. Ren et al. [28] proposed a GS-FPN structure, integrating the convolutional block attention module and a new, lightweight GSConv module to replace the original feature fusion structure in YOLOv5, reducing the information loss of the feature map and ultimately achieving improved recognition performance. Song et al. [29] proposed Sign-YOLO to achieve an optimal balance between accuracy and real-time performance, combining a coordinate attention module and High-BiFPN to enhance the neck structure of YOLOv5s and integrate multi-scale semantics. Meanwhile, to target TSDR in complex road conditions, improved YOLOv5-based models have been constructed by modifying the backbone, neck, head, or loss [30,31,32,33,34,35,36]. Li et al. [37] improved YOLOv7 by integrating self-attention and convolutional mix modules into an added layer for small-target detection. She et al. [38] improved YOLOv7-Tiny, utilizing the channel attention mechanism to construct a new SliceSample module that reduces the loss of feature information; this module not only focuses on the correlation between adjacent channels but also adopts two branches to improve the model's ability to extract information from the feature maps. Overall, deep learning-based one-stage algorithms show high application potential and research value in TSDR.
Although the above methods have improved detection accuracy, practical application scenarios require more attention to pedestrian-related signs and fewer deployment parameters for the model. Therefore, a model for TSDR should focus on narrowing these gaps, which still require further research. Based on these needs, the latest one-stage algorithm, YOLOv8n, is used as the basis in this paper, as it can well meet the practical requirements for detection speed and network parameters.
To enhance the performance of YOLOv8n on TSDR, this study proposes a CTM-YOLOv8n model (C2f Faster module, tiny-object-detection layer, and MPDIoU, You Only Look Once Version 8) based on YOLOv8n. The specific contributions of this study are as follows.
  • To avoid harm to pedestrians, especially children, from ICVs in Tianjin, data augmentation is carried out; the dataset combines the 'w55' and 'w57' categories from the TT100K dataset with two types of traffic signs collected around schools in Tianjin. Moreover, corruption-based augmentation is applied, resulting in a dataset of 1700 images.
  • The C2f Faster module is constructed using a PConv (a novel partial convolution that extracts spatial features more efficiently by reducing redundant computation and memory access simultaneously) layer followed by Conv 1 × 1 layers, which can reduce the amount of computation and parameters.
  • A tiny-object-detection (TOD) layer consisting of the 'Upsample', 'Concat', 'C2f Faster', and 'Detect' modules is added to the first C2f layer of the backbone to improve small-object detection in TSDR. To reduce the number of model parameters, the seventh Conv layer, the eighth C2f layer, and the detection head connected to it are deleted. Moreover, the C2f module in the head is replaced by the C2f Faster module.
  • The original CIoU loss function of YOLOv8n is replaced by MPDIoU, which minimizes the distances between the top-left and bottom-right points of the predicted bounding box and the ground-truth bounding box, effectively improving the accuracy of pedestrian TSDR and enabling better training of deep models.
The remainder of this paper is organized as follows. In Section 2, the method of data augmentation is presented and the improved YOLOv8n model is introduced. Detailed experimental setup and comparison of experimental results with other models are given in Section 3. The discussions are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Materials and Methods

2.1. Dataset Augmentation

To carry out comprehensive experiments, the TT100K dataset [39] was used; it is a large-scale traffic-sign-detection dataset released by Tsinghua University and Tencent Corporation. The TT100K dataset includes 30,000 traffic-sign instances ranging in size from 16 × 20 to 160 × 160 pixels in images with a resolution of 2048 × 2048 pixels. Small traffic signs are the most common in the TT100K dataset, as shown in Figure 1.
To avoid traffic accidents that cause harm to pedestrians, especially children, the categories 'w55' and 'w57' in the TT100K dataset were chosen for this experiment. Furthermore, two types of traffic signs around schools in Tianjin were collected in 2023, and 1000 images were selected from this collection. To reflect the complex environments encountered in pedestrian TSDR, corruption-based augmentation [26], such as motion blur, Gaussian noise, and fog, was applied to enhance the TT100K dataset. The additional data collection process and data augmentation are shown in Figure 2.
After data augmentation, the dataset comprised a total of 1700 images. Among these, 1530 images were allocated to the training set and 170 to the test set. Sample sign images and category names are shown in Figure 3. The label data are detailed in Figure 4.
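To illustrate the corruption-style augmentation described above (motion blur, Gaussian noise, and fog), the following is a minimal sketch assuming the albumentations library; the transform parameters and the stand-in image are illustrative assumptions, not the settings actually used in this study.

```python
import albumentations as A
import numpy as np

# Corruption-style augmentations (motion blur, Gaussian noise, fog).
# All parameter values below are illustrative assumptions.
corrupt = A.Compose(
    [
        A.MotionBlur(blur_limit=7, p=0.5),
        A.GaussNoise(var_limit=(10.0, 50.0), p=0.5),
        A.RandomFog(fog_coef_lower=0.1, fog_coef_upper=0.4, alpha_coef=0.08, p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)  # stand-in image
sample = corrupt(image=image, bboxes=[[0.5, 0.5, 0.05, 0.08]], class_labels=["w55"])
print(sample["image"].shape, sample["bboxes"])
```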

2.2. YOLOv8n Network Structure

As shown in Figure 5, considering the real-time requirements, the lightweight YOLOv8n [40] was selected as the baseline and optimized model.
The YOLOv8n model is the smallest variant of YOLOv8, a one-stage algorithm based on regression techniques that is designed to be fast and accurate, making it an excellent choice for object detection tasks. The structure of the YOLOv8n network is divided primarily into the input, the backbone, and the head.
The input is used to receive images and apply various data augmentation operations for subsequent processing, such as scaling, adjusting hues, mosaic, etc. By randomly cropping and scaling four images, the mosaic data augmentation method concatenates them into one image for training.
The backbone, consisting of Conv, C2f, and SPPF modules, serves as the main feature extraction layer. Convolution, batch normalization (BN), and the activation function are performed by the Conv module. The C2f module is designed for acquiring residual features. The SPPF module, known as spatial pyramid pooling fast, reduces computational complexity. The SiLU activation function is used in this work; as a non-linear function, it helps the network learn image features and complex patterns by introducing non-linear factors, thereby improving computational efficiency.
The head is responsible for generating the final prediction by effectively integrating and fusing feature maps from different levels, promoting functional integration and fusion at different scales and levels of abstraction. In addition, DFL (distribution focal loss) is used to extract more detailed information. Non-maximum suppression (NMS) is applied to filter out redundant outcomes, thereby preserving dependable and precise prediction results. The model preprocesses the received image in the input layer by scaling and adjusting the color tone. Afterwards, four images are randomly merged through mosaic data augmentation into one training image, thereby reducing data processing time. Therefore, there is no need to consider the impact of data distortion and different environments in the backbone and head layers.

2.3. Improvement of CTM-YOLOv8n

2.3.1. C2f Faster Module

In YOLOv8n, the C2f module can enhance CNN feature fusion ability, which is used to efficiently perform feature extraction, resulting in superior quality results, as shown in Figure 6.
To design faster neural networks, Chen et al. [41] proposed a novel partial convolution (PConv) module that extracts spatial features more efficiently by reducing redundant computation and memory access simultaneously, as shown in Figure 7. PConv is fast and efficient by applying filters on only a few input channels while leaving the remaining ones untouched.
Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of input channels and $H$ and $W$ represent the height and width of the feature map, respectively, the Conv module applies $C$ filters $W \in \mathbb{R}^{K \times K \times C}$ to compute the output, as shown in Figure 7b. This results in much higher memory access, which can cause non-negligible delay and slow down the overall computation. The PConv module applies a regular Conv on only a part of the input channels for spatial feature extraction and leaves the remaining channels untouched, that is, $C_1$ ($C_1 < C$) filters $W \in \mathbb{R}^{K \times K \times C_1}$ are used to compute the output, as shown in Figure 7a. Consequently, the PConv module reduces computational redundancy and memory access simultaneously. The C2f Faster module is constructed using the PConv layer followed by Conv 1 × 1 layers, which reduces the amount of computation and the number of parameters, as shown in Figure 8.
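As a concrete illustration of the partial-convolution idea [41], the following is a minimal PyTorch sketch; the split ratio, channel widths, and block wiring are assumptions for illustration and simplify the actual C2f Faster module.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a k x k conv on the first c1 channels, identity on the rest."""
    def __init__(self, channels: int, ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.c1 = int(channels * ratio)      # channels that are convolved
        self.c2 = channels - self.c1         # channels passed through untouched
        self.conv = nn.Conv2d(self.c1, self.c1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c1, self.c2], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterBlock(nn.Module):
    """PConv followed by two 1 x 1 convolutions with a residual connection (sketch)."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.pw1 = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                 nn.BatchNorm2d(hidden), nn.SiLU())
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw2(self.pw1(self.pconv(x)))

x = torch.randn(1, 64, 160, 160)
print(FasterBlock(64)(x).shape)  # torch.Size([1, 64, 160, 160])
```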

2.3.2. Improved Tiny-Object-Detection Layer

To enhance small-pedestrian-sign detection in TSDR, the backbone is modified with the improved tiny-object-detection (TOD) layer. First, a tiny-object-detection layer is added to the first C2f layer in the backbone, which produces a 160 × 160 feature map. This TOD layer consists of the 'Upsample', 'Concat', 'C2f Faster', and 'Detect' modules. Second, the seventh Conv layer and the eighth C2f layer are deleted; therefore, the number of network layers in the backbone is reduced from 10 to 8, and the number of model parameters is further reduced. Third, the 'Upsample', 'Concat', 'C2f Faster', and 'Detect' modules that connect to the eighth C2f layer in the head are also deleted. The structure of the CTM-YOLOv8n model is shown in Figure 9.
After modifying the network structure, there are only three C2f modules and four Conv modules in the backbone. The C2f module in the head is replaced by the C2f Faster module. The TOD layer can effectively extract key features, reduce the influence of background feature information, and improve the detection accuracy for the same input image. More importantly, it can effectively reduce the number of model parameters.
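To make the role of the TOD layer concrete, the following is a conceptual PyTorch sketch of how an extra 160 × 160 detection branch can be formed by upsampling a deeper feature map and concatenating it with the stride-4 output of the first C2f layer; the tensor shapes, channel widths, and the 1 × 1 fusion layer are assumptions for illustration, not the exact CTM-YOLOv8n wiring.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps for a 640 x 640 input (channel widths are assumptions).
p2 = torch.randn(1, 32, 160, 160)  # stride-4 features from the first C2f layer
p3 = torch.randn(1, 64, 80, 80)    # stride-8 features from deeper in the network

upsample = nn.Upsample(scale_factor=2, mode="nearest")
fuse = nn.Conv2d(32 + 64, 64, kernel_size=1)  # stand-in for the C2f Faster block

tod_feature = fuse(torch.cat((upsample(p3), p2), dim=1))
print(tod_feature.shape)  # torch.Size([1, 64, 160, 160]), fed to the extra Detect head
```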

2.3.3. Loss Function Improvements

During training, the loss value is used as a reference for backpropagation to bring the predicted value closer to the actual value; the larger the error, the larger the loss value. The original loss function of YOLOv8n is CIoU (complete intersection over union), which provides a comprehensive measure considering both the distance between the central points and the aspect ratio. However, the aspect ratio in CIoU is defined as a relative value rather than an absolute value. To effectively improve the precision of pedestrian TSDR, the loss function is replaced by MPDIoU [42].
MPDIoU is based on the minimum distance of the points, which contains all the relevant factors considered in the existing loss functions, such as the overlapping or non-overlapping areas, the central point's distance, and deviations in width and height, while simplifying the calculation process. $A^{prd} = (x_1^{prd}, y_1^{prd}, x_2^{prd}, y_2^{prd})$ and $A^{gt} = (x_1^{gt}, y_1^{gt}, x_2^{gt}, y_2^{gt})$ denote the predicted and ground-truth box coordinates, respectively. $(x_1^{prd}, y_1^{prd})$ and $(x_2^{prd}, y_2^{prd})$ denote the top-left and bottom-right point coordinates of $A^{prd}$, and $(x_1^{gt}, y_1^{gt})$ and $(x_2^{gt}, y_2^{gt})$ denote those of $A^{gt}$. $w$ and $h$ denote the width and height of the input image. The calculation factors of the MPDIoU loss function are shown in Figure 10. The calculation formula for the MPDIoU loss is as follows.
$$L_{MPDIoU} = 1 - \left( IoU - \frac{d_1^2}{h^2 + w^2} - \frac{d_2^2}{h^2 + w^2} \right),$$
where
$$d_1^2 = (x_1^{prd} - x_1^{gt})^2 + (y_1^{prd} - y_1^{gt})^2, \quad d_2^2 = (x_2^{prd} - x_2^{gt})^2 + (y_2^{prd} - y_2^{gt})^2, \quad IoU = \frac{|A^{gt} \cap A^{prd}|}{|A^{gt} \cup A^{prd}|}.$$
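The following is a minimal PyTorch sketch of the MPDIoU loss defined above for boxes in (x1, y1, x2, y2) corner format; it is an illustration of the equations, not the implementation integrated into the model.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, gt: torch.Tensor, img_w: float, img_h: float) -> torch.Tensor:
    """MPDIoU loss for (N, 4) boxes in (x1, y1, x2, y2) corner format."""
    # Intersection-over-union
    x1 = torch.max(pred[:, 0], gt[:, 0])
    y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2])
    y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_pred + area_gt - inter + 1e-7)

    # Squared distances between the matching top-left and bottom-right corners
    d1 = (pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2
    d2 = (pred[:, 2] - gt[:, 2]) ** 2 + (pred[:, 3] - gt[:, 3]) ** 2
    norm = img_w ** 2 + img_h ** 2

    mpdiou = iou - d1 / norm - d2 / norm
    return (1.0 - mpdiou).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 58.0]])
print(mpdiou_loss(pred, gt, img_w=640, img_h=640))
```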

2.4. Evaluation Indicators

To quantitatively compare performance, the experimental evaluation metrics include precision, recall, F1 score, AP, and mAP. These metrics measure how much two bounding boxes overlap and determine whether a prediction is correct or not. Precision reflects how accurate the model is at detecting objects. Recall reflects how complete the model is in detecting objects. The F1 score is the harmonic mean of precision and recall. AP is the average of precision values at different recall levels for each class and reflects how well the model can detect objects across various confidence thresholds. mAP denotes the mean of the AP values over all classes.
In addition, Params (the number of parameters), GFLOPs (giga floating-point operations), and FPS (frames per second) were used as evaluation indicators.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%,$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%,$$
$$F1\ \mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\%,$$
$$AP = \int_0^1 P(R)\, dR \times 100\%,$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$
where TP represents positive samples that are detected correctly, FP represents negative samples that are incorrectly detected as positive, and FN represents positive samples that are missed. A prediction is a TP if its IoU with the ground-truth box is above the threshold and an FP otherwise. N represents the number of classes, and $AP_i$ is the average precision of class i.
In this study, we used two types of mAP:
  • mAP@0.50: Calculated using an IoU threshold of 0.5;
  • mAP@0.95: Calculated using thresholds from 0.5 to 0.95, increasing by 0.05.
These metrics provide a comprehensive and accurate analysis of the proposed method, considering both accuracy and efficiency.
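For reference, the following is a minimal Python sketch of how precision, recall, F1 score, and mAP follow from TP/FP/FN counts and per-class AP values at a fixed IoU threshold; the counts and AP values are illustrative, not this study's results.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 score from counts at a fixed IoU threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def mean_ap(ap_per_class):
    """mAP is the mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only.
print(detection_metrics(tp=85, fp=10, fn=12))
print(mean_ap([0.93, 0.95]))  # e.g. AP@0.50 of two classes
```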

3. Results

3.1. Experimental Setup

For the experiments, a machine equipped with an Intel(R) Core(TM) i7-11800H processor and an NVIDIA GeForce RTX 3060 GPU was used. For training the model, the Adam optimizer is employed, starting with a learning rate of 0.001 and increasing to 0.01; to enhance the speed of parameter updates, the SGD (stochastic gradient descent) algorithm is also used. The weight decay is set to 0.0005 to avoid overfitting or underfitting the model, and the batch size is set to 16. The dataset used in this article was collected around schools in Tianjin, with a total of 1700 images containing two types of traffic signs. The dataset includes 384 obstructed traffic-sign images, 835 traffic-sign images under strong illumination, and 481 traffic-sign images under backlight. The experimental parameter settings are summarized in Table 1.
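For reproducibility, the following is a minimal sketch of a training run with the hyperparameters in Table 1, assuming the Ultralytics YOLOv8 Python API; the file names ctm-yolov8n.yaml (the modified model definition) and data.yaml (the two-class dataset) are hypothetical placeholders, and the image size is an assumption.

```python
from ultralytics import YOLO

# Hypothetical file names for the modified model definition and the dataset config.
model = YOLO("ctm-yolov8n.yaml")
model.train(
    data="data.yaml",
    epochs=200,            # Table 1
    batch=16,              # Table 1
    optimizer="SGD",
    lr0=0.001,             # initial learning rate
    momentum=0.937,        # Table 1
    weight_decay=0.0005,
    imgsz=640,             # assumed input size
)
```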

3.2. Study Comparison before and after Optimization

To demonstrate the significant advantage of the CTM-YOLOv8n model in TSDR over the baseline YOLOv8n model, a series of comparative experiments were conducted. The findings are detailed in Table 2, Table 3 and Table 4.
As shown in Table 2, the CTM-YOLOv8n model is superior to the YOLOv8n model in TSDR, showing enhancements of 9.9% in precision, 0.9% in recall, 7.0% in F1 score, 7.3% in mAP@0.50, and 6.1% in mAP@0.95. According to the experimental results, although this method reduces the weight of the model, it does not significantly degrade detection performance and improves the detection accuracy for traffic signs. As detailed in Table 3, the CTM-YOLOv8n model significantly outperforms the baseline YOLOv8n model, showing enhancements of 5.2% in precision, 10.8% in recall, 7.0% in F1 score, 4.8% in mAP@0.50, and 2.2% in mAP@0.95. However, this did not lead to higher computational requirements: the number of parameters is reduced to 0.89 M, which is only 30% of the YOLOv8n model. This is because the seventh Conv layer, the eighth C2f layer, and the connected detection head are removed, reducing the number of model parameters without affecting the recognition accuracy of the classes 'w55' and 'w57'; in addition, using the C2f Faster module to extract spatial features improves the network speed without affecting the recognition of the 'w55' and 'w57' classes. From Table 4, the mAP@0.50 of 'w55' and 'w57' increases by 8.4% and 5.5%, respectively, and the mAP@0.95 of the traffic-sign categories 'w55' and 'w57' increases by 2.4% and 0.6%, respectively.
A heat map is a data visualization technique used to represent the relative density or intensity of each data point in a dataset. In fields such as image processing and object detection, heat maps are commonly used to represent the model's response or attention distribution over the input data. To highlight the efficacy of the proposed model in detecting tiny traffic signs, heat maps from both the YOLOv8n and CTM-YOLOv8n models are compared, as shown in Figure 11. In particular, as shown in Figure 11c, the TOD layer in CTM-YOLOv8n provides more detailed attention to tiny traffic signs.
As shown in Figure 12, the area under the curve (AUC) is a key indicator of model performance, demonstrating the comprehensive efficacy across varying precision and recall configurations. The comparison highlights that the refined model exhibits a larger AUC, indicating superior detection capabilities.

3.3. Ablation Experiment

To illustrate the effectiveness and lightweight characteristics of each enhancement in the CTM-YOLOv8n model, ablation experiments were carried out on the proposed structure to elucidate the influence of each enhancement module, with the findings presented in Figure 13. The results of these various modifications are detailed in Table 5.
Compared to YOLOv8n, incorporating the C2f Faster module led to a reduction in parameters by 0.70 M and computational demands by 1.8 GFLOPs while achieving a 1.4% improvement in precision, with minimal impact on mAP@0.50. These results indicate the efficacy of the C2f Faster module in reducing the overall complexity and parameters of the model.
YOLOv8n+TOD, which adds the TOD layer to the first C2f layer in the backbone, increases the precision, recall, F1 score, mAP@0.50, and mAP@0.95 by 4.7%, 6.8%, 0.06, 4.4%, and 4.2%, respectively, along with a reduction of 1.32 M in parameters compared to YOLOv8n. This enhancement enables the model to adeptly identify and concentrate on crucial segments of the input with more detailed attention to tiny traffic signs.
Compared to YOLOv8n+C2f Faster, YOLOv8n+TOD+C2f Faster reduces the parameters by 1.41 M while achieving improvements of 1.9%, 1.6%, 0.03, 4.4%, and 3.4% in precision, recall, F1 score, mAP@0.50, and mAP@0.95, respectively. Moreover, the computational demand is reduced by 0.7 GFLOPs compared to YOLOv8n+TOD. This is because the C2f module in the head is replaced by the C2f Faster module to reduce the number of model parameters.
With the replacement of the original CIoU loss function by MPDIoU, the CTM-YOLOv8n model's precision, recall, F1 score, mAP@0.50, and mAP@0.95 increase by 1.9%, 7.6%, 0.03, 0.9%, and 0.2%, respectively. This enhancement does not change the parameters or GFLOPs of the model compared to YOLOv8n+TOD+C2f Faster.

3.4. Comparative Study with State-of-the-Art Models

The performance of the CTM-YOLOv8n model in TSDR was assessed by comparing it with established models such as Faster R-CNN, SSD, STC-YOLO, Improved YOLOv5, SEDG-YOLOv5, YOLOv7-tiny, and CR-YOLOv8. The results are detailed in Table 6.
Table 6 presents a detailed comparison of the CTM-YOLOv8n model with other SOTA detection methods. CTM-YOLOv8n exhibits a 0.9% improvement in mAP@0.50 over YOLOv7-tiny, along with a reduction of 22.4 M in parameters and an increase of 94 in FPS. CTM-YOLOv8n also outperforms CR-YOLOv8, showing a 7.4% higher mAP@0.50 and a 58 FPS increase. Although the FPS of the SEDG-YOLOv5 model is higher than that of our model, its mAP@0.50 is lower by 3.3%, with 1.88 M more parameters. The Faster R-CNN model mainly relies on single-scale feature extraction, and its detection ability is poor. While the SSD model shows enhanced detection performance in certain respects, its dependence on fixed-size anchor boxes leads to missed detections of various targets. The YOLOv5 model, which uses FPN+PAN for multi-scale recognition, suffers from inefficient utilization of the output feature maps. The YOLOv7 model enhances network performance and computational efficiency with its innovative ELAN architecture, but at the cost of more parameters due to its many strided convolution and pooling layers. In conclusion, the CTM-YOLOv8n model proposed in this study outperforms existing models in detection performance, proving to be more adept at fulfilling the requirements of pedestrian TSDR.

3.5. Real-Test Experiments

To demonstrate the efficiency of the CTM-YOLOv8n model, a test dataset was used. The comparison results of pedestrian TSDR using the final trained model are shown in Figure 14. The detection results are denoted by rectangular boxes, each accompanied by the corresponding category label and associated confidence level. The CTM-YOLOv8n model exhibits outstanding performance, accurately discerning pedestrian traffic signs of diverse scales with high confidence, indicating a superior ability to detect small objects. Moreover, the trained CTM-YOLOv8n model shows good recognition ability, strong anti-interference ability, and high recognition accuracy under different backgrounds and interferences. To verify whether the modifications to the model affect its capacity to handle the original object classes, we compare the traffic-sign detection results of the CTM-YOLOv8n model and the YOLOv8n model on the TT100K database under partially occluded, backlit, and related conditions.
In the illuminated scene, a traffic sign of category 'w57' is detected with a confidence level of 0.6 in Figure 14(a1), while in Figure 14(b1) the same sign is detected with a confidence level of 0.8, indicating better detection accuracy; traffic signs of category 'pl' in the image can still be detected. The comparison of test results under directional light and occlusion is shown in Figure 14(a2,b2), where the traffic-sign categories 'w57' and 'w55' are detected with confidence levels of 0.8 and 0.3, respectively. With the YOLOv8n model, the 'w57' sign on the right side of the image was not recognized, and the 'pl' sign on the right side was misidentified as 'w55'; the CTM-YOLOv8n model effectively solves this type of problem. The comparison of test results under backlight conditions is shown in Figure 14(a3,b3). In Figure 14(a3), the blurred traffic sign of category 'w57' is detected with a confidence of 0.3; correspondingly, in Figure 14(b3), the confidence level for 'w57' is 0.4, which is higher than that of the YOLOv8n model. The detection of other traffic signs is not affected, and more categories of traffic signs can be detected. Our proposed CTM-YOLOv8n model enhances the detection accuracy of traffic signs in the TT100K database without affecting the capacity for other object classes.
The comparison of test results under directional light is shown in Figure 15(a1,b1), where the traffic sign of category 'w55' and 'w57' is detected with confidences of 0.9 and 0.8, respectively. The comparison under backlight conditions is shown in Figure 15(a2,b2): in Figure 15(a2), the blurred traffic sign of category 'w57' is detected with a confidence of 0.7, whereas it is not detected in Figure 15(b2). In the partial-occlusion scenario, the traffic sign of category 'w55' is detected with a confidence of 0.5 in Figure 15(a3), while it is not detected in Figure 15(b3). Figure 15 shows that the CTM-YOLOv8n model can effectively perform TSDR in real traffic scenarios despite motion blurring, partial occlusion, cluttered backgrounds, and small sign sizes. Table 7 shows the parameter comparison of the models under different environmental conditions.
Table 7 provides a detailed comparison between the CTM-YOLOv8n model and the YOLOv8n model. CTM-YOLOv8n performs better than YOLOv8n under directional-light, backlight, and partial-occlusion conditions: mAP@0.50 increases by 8.1%, 9.4%, and 8.2%, the parameters decrease by 15.4 M, 13.49 M, and 14.31 M, and FPS increases by 62, 64, and 63 frames per second, respectively. In summary, our proposed CTM-YOLOv8n model has a simpler structure, outperforming the YOLOv8n model in detection accuracy and computational speed and better meeting the requirements of pedestrian TSDR. Due to limited experimental conditions, this article only conducts pedestrian TSDR under directional-light, backlight, and occlusion conditions; in the future, testing under different weather conditions can be added.

4. Discussion

Our research demonstrates innovative efforts to address the issues of TSDR using the YOLOv8n model, focusing on the detection of small pedestrian traffic signs. Improvements have been made by constructing the C2f Faster module using PConv, adding a TOD layer to the first C2f layer in the backbone, and changing the loss function from CIoU to MPDIoU. Moreover, two types of traffic signs around schools in Tianjin were collected and combined with the TT100K dataset, and the data were augmented to reflect complex environments.
An empirical study was performed to validate the effectiveness of the presented CTM-YOLOv8n, as covered in Section 3.2 and Section 3.3. The results show that, by incorporating the C2f Faster module and adding the TOD layer to the first C2f layer in the backbone, our proposed model performed better in small-traffic-sign detection and obtained the highest mAP@0.50, along with a significantly reduced number of parameters. By changing the original loss function from CIoU to MPDIoU, the precision and mAP@0.50 of the CTM-YOLOv8n model were further improved without changing the numbers of parameters and GFLOPs. It can be seen from Table 5 that each additional module enhances the network detection accuracy. The C2f Faster module applies filters to only a few input channels while leaving the remaining ones untouched to extract spatial features more efficiently and make the network faster. The TOD layer is added to the first C2f layer in the backbone to improve small-object detection in TSDR, and the seventh Conv layer, the eighth C2f layer, and the connected detection head are deleted to reduce the number of model parameters. Ultimately, there are only three C2f modules and four Conv modules in the backbone, and the C2f module in the head is replaced by the C2f Faster module. The MPDIoU loss function minimizes the top-left- and bottom-right-point distances between the predicted bounding box and the ground-truth bounding box for better training of deep models.
The detection accuracy of the CTM-YOLOv8n model is significantly higher than that of several published state-of-the-art approaches, as detailed in Section 3.4. CTM-YOLOv8n exhibits an improvement in mAP@0.50, along with a reduction in the number of parameters and an increase in FPS. As shown in Figure 14, the CTM-YOLOv8n model can effectively perform pedestrian TSDR in real traffic scenarios involving motion blurring, partial occlusion, cluttered backgrounds, and small sign sizes.
Despite advances in pedestrian TSDR performance by the CTM-YOLOv8n model, opportunities for enhancing detection speed and dataset enrichment remain. The inclusion of the TOD layer has not only improved feature extraction and fusion efficiency but also increased the number of GFLOPs. Additionally, the omission of complex weather conditions in the dataset curtails the generalizability and accuracy of the model.
Future efforts will aim to elevate the model’s detection efficacy and processing velocity, focusing on the augmentation of a broader, more inclusive dataset, particularly incorporating data from extreme weather scenarios to boost the model’s versatility across different settings.

5. Conclusions

To avoid harm to pedestrians, especially children, from ICVs in Tianjin, a 1700-image dataset was built by combining the 'w55' and 'w57' categories of the TT100K dataset with two types of traffic signs collected around schools in Tianjin and applying data augmentation, and a pedestrian TSDR network, the CTM-YOLOv8n model, was constructed based on the YOLOv8n framework. The CTM-YOLOv8n model achieved 94.3% mAP@0.50, which is 4.8% higher than YOLOv8n. This did not result in higher computational requirements: the number of parameters is reduced to 0.89 M, which is only 30% of the YOLOv8n model. In the CTM-YOLOv8n model, to extract spatial features more efficiently and make the network faster, the C2f Faster module is constructed to replace the C2f module in the head, applying filters to only a few input channels while leaving the remaining ones untouched. To enhance small-object detection in TSDR, a TOD layer consisting of the 'Upsample', 'Concat', 'C2f Faster', and 'Detect' modules is designed and added to the first C2f layer in the backbone. Meanwhile, the seventh Conv layer, the eighth C2f layer, and the connected detection head are deleted to reduce the number of model parameters. Additionally, the original CIoU loss function is replaced by MPDIoU. Furthermore, the comparative study with several published SOTA approaches further proves the superiority of the CTM-YOLOv8n model. The experimental results show that our proposed model effectively meets the needs of detecting small traffic signs, ensures real-time detection, and significantly improves detection accuracy.
However, there are many difficulties in detecting pedestrian traffic signs at night in Tianjin, such as street lighting and reflected light interference. In the future, the focus should be on collecting datasets in nightscape and adopting advanced techniques.

Author Contributions

Conceptualization, Q.C. and Y.X.; methodology, Z.D. and Q.C.; software, Q.C.; validation, Q.C., Y.X. and Z.D.; formal analysis, Y.X. and Y.G.; investigation, Z.D. and Y.G.; data curation, Q.C.; writing—original draft preparation, Q.C.; writing—review and editing, Q.C., Y.X. and Y.G.; visualization, Q.C. and Y.G.; supervision, Y.X.; project administration, Z.D.; funding acquisition, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Tianjin Education Committee Science and Technology Project (grant number: 2021KJ018).

Data Availability Statement

The data used in this study can be obtained from the link https://drive.google.com/file/d/1YL3Z92-gEwpYQ-Iq5k47ZZ-A4Pofpzk5/view?usp=sharing (accessed on 23 June 2024), and the password can be obtained from the corresponding author.

Acknowledgments

The authors thank the editors and anonymous reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. Yi Xu is an employee of Qingte Group Co., Ltd. The paper reflects the views of the scientists and not the company.

References

  1. Road Traffic Injuries. Available online: https://www.who.int/health-topics/road-safety (accessed on 17 November 2023).
  2. Bai, Y.; Shang, C.; Li, Y.; Shen, L.; Jin, S.; Shen, Q. Transport Object Detection in Street View Imagery Using Decomposed Convolutional Neural Networks. Mathematics 2023, 11, 3839. [Google Scholar] [CrossRef]
  3. Arcos-García, Á.; Álvarez-García, J.A.; Soria-Morillo, L.M. Evaluation of Deep Neural Networks for Traffic Sign Detection Systems. Neurocomputing 2018, 316, 332–344. [Google Scholar] [CrossRef]
  4. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixão, T.M.; Mutz, F.; et al. Self-Driving Cars: A Survey. Expert Syst. Appl. 2021, 165, 113816. [Google Scholar] [CrossRef]
  5. Du, L.; Ji, J.; Pei, Z.; Zheng, H.; Fu, S.; Kong, H.; Chen, W. Improved Detection Method for Traffic Signs in Real Scenes Applied in Intelligent and Connected Vehicles. IET Intell. Transp. Syst. 2020, 14, 1555–1564. [Google Scholar] [CrossRef]
  6. Cao, J.; Song, C.; Peng, S.; Xiao, F.; Song, S. Improved Traffic Sign Detection and Recognition Algorithm for Intelligent Vehicles. Sensors 2019, 19, 4021. [Google Scholar] [CrossRef]
  7. Arcos-García, Á.; Álvarez-García, J.A.; Soria-Morillo, L.M. Deep Neural Network for Traffic Sign Recognition Systems: An Analysis of Spatial Transformers and Stochastic Optimisation Methods. Neural Netw. 2018, 99, 158–165. [Google Scholar] [CrossRef]
  8. Hechri, A.; Mtibaa, A. Lanes and Road Signs Recognition for Driver Assistance System. Int. J. Comput. Sci. Eng. 2011, 8, 402–408. [Google Scholar]
  9. Wang, G.; Ren, G.; Jiang, L.; Quan, T. Hole-Based Traffic Sign Detection Method for Traffic Signs with Red Rim. Vis. Comput. 2014, 30, 539–551. [Google Scholar] [CrossRef]
  10. Yang, Y.; Luo, H.; Xu, H.; Wu, F. Towards Real-Time Traffic Sign Detection and Classification. IEEE Trans. Intell. Transport. Syst. 2016, 17, 2022–2031. [Google Scholar] [CrossRef]
  11. Yuan, Y.; Xiong, Z.; Wang, Q. An Incremental Framework for Video-Based Traffic Sign Detection, Tracking, and Recognition. IEEE Trans. Intell. Transport. Syst. 2017, 18, 1918–1929. [Google Scholar] [CrossRef]
  12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587. [Google Scholar]
  13. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, J.; Xie, Z.; Sun, J.; Zou, X.; Wang, J. A Cascaded R-CNN With Multiscale Attention and Imbalanced Samples for Traffic Sign Detection. IEEE Access 2020, 8, 29742–29754. [Google Scholar] [CrossRef]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. ISBN 978-3-319-46447-3. [Google Scholar]
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  18. Liu, Y.; Peng, J.; Xue, J.-H.; Chen, Y.; Fu, Z.-H. TSingNet: Scale-Aware and Context-Rich Feature Learning for Traffic Sign Detection and Recognition in the Wild. Neurocomputing 2021, 447, 10–22. [Google Scholar] [CrossRef]
  19. Zhang, J.; Huang, M.; Jin, X.; Li, X. A Real-Time Chinese Traffic Sign Detection Algorithm Based on Modified YOLOv2. Algorithms 2017, 10, 127. [Google Scholar] [CrossRef]
  20. Wan, J.; Ding, W.; Zhu, H.; Xia, M.; Huang, Z.; Tian, L.; Zhu, Y.; Wang, H. An Efficient Small Traffic Sign Detection Method Based on YOLOv3. J. Signal Process. Syst. 2021, 93, 899–911. [Google Scholar] [CrossRef]
  21. Gu, Y.; Si, B. A Novel Lightweight Real-Time Traffic Sign Detection Integration Framework Based on YOLOv4. Entropy 2022, 24, 487. [Google Scholar] [CrossRef]
  22. Ren, X.; Zhang, W.; Wu, M.; Li, C.; Wang, X. Meta-YOLO: Meta-Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies. Appl. Sci. 2022, 12, 5543. [Google Scholar] [CrossRef]
  23. Zhang, J.; Zheng, Z.; Xie, X.; Gui, Y.; Kim, G.-J. ReYOLO: A Traffic Sign Detector Based on Network Reparameterization and Features Adaptive Weighting. AIS 2022, 14, 317–334. [Google Scholar] [CrossRef]
  24. Chu, J.; Zhang, C.; Yan, M.; Zhang, H.; Ge, T. TRD-YOLO: A Real-Time, High-Performance Small Traffic Sign Detection Algorithm. Sensors 2023, 23, 3871. [Google Scholar] [CrossRef]
  25. Han, T.; Sun, L.; Dong, Q. An Improved YOLO Model for Traffic Signs Small Target Image Detection. Appl. Sci. 2023, 13, 8754. [Google Scholar] [CrossRef]
  26. Lai, H.; Chen, L.; Liu, W.; Yan, Z.; Ye, S. STC-YOLO: Small Object Detection Network for Traffic Signs in Complex Environments. Sensors 2023, 23, 5307. [Google Scholar] [CrossRef] [PubMed]
  27. Liu, P.; Xie, Z.; Li, T. UCN-YOLOv5: Traffic Sign Object Detection Algorithm Based on Deep Learning. IEEE Access 2023, 11, 110039–110050. [Google Scholar] [CrossRef]
  28. Ren, Z.; Zhang, H.; Li, Z. Improved YOLOv5 Network for Real-Time Object Detection in Vehicle-Mounted Camera Capture Scenarios. Sensors 2023, 23, 4589. [Google Scholar] [CrossRef] [PubMed]
  29. Song, W.; Suandi, S.A. Sign-YOLO: A Novel Lightweight Detection Model for Chinese Traffic Sign. IEEE Access 2023, 11, 113941–113951. [Google Scholar] [CrossRef]
  30. Shen, J.; Zhang, Z.; Luo, J.; Zhang, X. YOLOv5-TS: Detecting Traffic Signs in Real-Time. Front. Phys. 2023, 11, 1297828. [Google Scholar] [CrossRef]
  31. Song, W.; Suandi, S.A. TSR-YOLO: A Chinese Traffic Sign Recognition Algorithm for Intelligent Vehicles in Complex Scenes. Sensors 2023, 23, 749. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, Q.; Li, X.; Lu, M. An Improved Traffic Sign Detection and Recognition Deep Model Based on YOLOv5. IEEE Access 2023, 11, 54679–54691. [Google Scholar] [CrossRef]
  33. Yang, J.; Sun, T.; Zhu, W.; Li, Z. A Lightweight Traffic Sign Recognition Model Based on Improved YOLOv5. IEEE Access 2023, 11, 115998–116010. [Google Scholar] [CrossRef]
  34. Yuan, X.; Kuerban, A.; Chen, Y.; Lin, W. Faster Light Detection Algorithm of Traffic Signs Based on YOLOv5s-A2. IEEE Access 2023, 11, 19395–19404. [Google Scholar] [CrossRef]
  35. Zhang, R.; Zheng, K.; Shi, P.; Mei, Y.; Li, H.; Qiu, T. Traffic Sign Detection Based on the Improved YOLOv5. Appl. Sci. 2023, 13, 9748. [Google Scholar] [CrossRef]
  36. Zhao, L.; Wei, Z.; Li, Y.; Jin, J.; Li, X. SEDG-Yolov5: A Lightweight Traffic Sign Detection Model Based on Knowledge Distillation. Electronics 2023, 12, 305. [Google Scholar] [CrossRef]
  37. Li, S.; Wang, S.; Wang, P. A Small Object Detection Algorithm for Traffic Signs Based on Improved YOLOv7. Sensors 2023, 23, 7145. [Google Scholar] [CrossRef] [PubMed]
  38. She, F.; Hong, Z.; Zeng, Z.; Yu, W. Improved Traffic Sign Detection Model Based on YOLOv7-Tiny. IEEE Access 2023, 11, 126555–126567. [Google Scholar] [CrossRef]
  39. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2110–2118. [Google Scholar]
  40. YOLOv8. Available online: https://docs.ultralytics.com/ (accessed on 7 February 2024).
  41. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 12021–12031. [Google Scholar]
  42. Siliang, M.; Yong, X. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  43. Zhang, L.J.; Fang, J.J.; Liu, Y.X.; Feng Le, H.; Rao, Z.Q.; Zhao, J.X. CR-YOLOv8: Multiscale Object Detection in Traffic Sign Images. IEEE Access 2024, 12, 219–228. [Google Scholar] [CrossRef]
Figure 1. Sample images from the TT100K dataset.
Figure 2. Additional data collection and enhancement.
Figure 3. Traffic sign categories of ‘w55’ and ‘w57’ in TT100K dataset.
Figure 4. Volume of TSDR label data and TSDR label distribution: (a) coordinates of labeled box’s center point; (b) the number of ‘w55’ and ‘w57’; and (c) labeled box’s width and height distribution.
Figure 5. Scheme structure of the YOLOv8n model.
Figure 6. The schematic diagram of the C2f module.
Figure 7. Comparison between PConv and Conv: (a) PConv module; (b) Conv module.
Figure 8. The schematic diagram of the C2f Faster module.
Figure 9. Scheme of the structure of the CTM-YOLOv8n model.
Figure 10. The factors of $L_{MPDIoU}$.
Figure 11. Heat map of network outputs: (a) Input image; (b) YOLOv8n model; (c) CTM-YOLOv8n model.
Figure 12. PR curve for networks: (a) YOLOv8n model; (b) CTM-YOLOv8n model; (c) YOLOv8n model; (d) CTM-YOLOv8n model.
Figure 13. Comparison of the mAP@0.5 with different model weights.
Figure 14. Figures of test results: (a) YOLOv8n model; (b) CTM-YOLOv8n model; (a1,b1) illuminated scene; (a2,b2) under the condition of directional light and occlusion; (a3,b3) under backlight conditions.
Figure 15. Figures of test result: (a) CTM-YOLOv8n model; (b) YOLOv8n model; (a1,b1) under the condition of directional light; (a2,b2) under backlight conditions; (a3,b3) partial occlusion scenarios.
Table 1. Experimental parameter settings.

Item | Parameter
Operating system | Ubuntu 18.04
CUDA version | CUDA 11.1 + CuDNN 8.6.0
Development environment | Python 3.8 + PyTorch 1.12
Momentum | 0.937
Epochs | 200
Batch size | 16
Table 2. Comparison of research on traffic-sign detection before and after optimization.

Network Models | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.95 (%) | F1 Score | Params (M)
YOLOv8n | 64.5 | 68.4 | 69.3 | 51.7 | 0.65 | 3.00
CTM-YOLOv8n | 76.4 | 69.3 | 76.6 | 57.8 | 0.72 | 0.89
Table 3. Comparison of research on pedestrian traffic-sign detection before and after optimization.

Network Models | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.95 (%) | F1 Score | GFLOPs | Params (M)
YOLOv8n | 86.8 | 81.1 | 89.5 | 64.0 | 0.84 | 8.1 | 3.00
CTM-YOLOv8n | 92.0 | 91.9 | 94.3 | 66.2 | 0.91 | 9.7 | 0.89
Table 4. Comparison study of 'w55' and 'w57' before and after optimization.

Network Models | mAP@0.50 (%) w55 | mAP@0.50 (%) w57 | mAP@0.95 (%) w55 | mAP@0.95 (%) w57
YOLOv8n | 86.1 | 92.9 | 63.6 | 64.5
CTM-YOLOv8n | 93.4 | 95.2 | 67.1 | 64.9
Table 5. Ablation experiment findings.

Network Models | Precision (%) | Recall (%) | mAP@0.50 (%) | mAP@0.95 (%) | F1 Score | GFLOPs | Params (M)
YOLOv8n | 86.8 | 81.1 | 89.5 | 64.0 | 0.84 | 8.1 | 3.00
YOLOv8n + C2f Faster | 88.2 | 82.7 | 89.0 | 62.6 | 0.85 | 6.3 | 2.30
YOLOv8n + TOD | 91.5 | 87.9 | 93.9 | 68.2 | 0.90 | 10.4 | 0.98
YOLOv8n + TOD + C2f Faster | 90.1 | 84.3 | 93.4 | 66.0 | 0.88 | 9.7 | 0.89
YOLOv8n + TOD + C2f Faster + MPDIoU (CTM-YOLOv8n) | 92.0 | 91.9 | 94.3 | 66.2 | 0.91 | 9.7 | 0.89
Table 6. Outcomes of a comparative study with state-of-the-art models.

Model | mAP@0.50 (%) | Params (M) | GFLOPs | FPS
Faster R-CNN | 63.4 | 28.3 | - | 83
SSD [16] | 75.1 | 26.1 | 80.9 | 24
STC-YOLO [26] | 88.9 | 6.70 | - | 87
Improved YOLOv5 [35] | 81.9 | 7.51 | 16.8 | 30
SEDG-YOLOv5 [36] | 91.0 | 2.77 | 6.2 | 178
YOLOv7-tiny [38] | 93.4 | 23.29 | - | 67
CR-YOLOv8 [43] | 86.9 | 14.60 | - | 103
CTM-YOLOv8n | 94.3 | 0.89 | 9.7 | 161
Table 7. Test results under different environmental conditions.

Model | Environment Condition | mAP@0.50 (%) | GFLOPs | FPS | Params (M)
YOLOv8n | Occlusion | 84.9 | - | 96 | 16.32
YOLOv8n | Backlight | 86.7 | - | 100 | 14.34
YOLOv8n | Phototropism | 85.6 | - | 98 | 15.21
CTM-YOLOv8n | Occlusion | 93 | 9.8 | 158 | 0.92
CTM-YOLOv8n | Backlight | 96.1 | 9.4 | 164 | 0.85
CTM-YOLOv8n | Phototropism | 93.8 | 9.9 | 161 | 0.90

