Article

Driving Safety Area Classification for Automated Vehicles Based on Data Augmentation Using Generative Models

by
Donghoun Lee
Department of Artificial Intelligence and Data Science, Sejong University, Seoul 05006, Republic of Korea
Sustainability 2024, 16(11), 4337; https://doi.org/10.3390/su16114337
Submission received: 21 April 2024 / Revised: 16 May 2024 / Accepted: 20 May 2024 / Published: 21 May 2024

Abstract

The integration of automated vehicles (AVs) into existing road networks for mobility services presents unique challenges, particularly in discerning the driving safety areas associated with the automation mode of AVs. The assessment of an AV's capability to safely operate in a specific road section is contingent upon the occurrence of disengagement events within that section, which are evaluated against a predefined operational design domain (ODD). However, the process of collecting comprehensive data for all roadway areas is constrained by limited resources. Moreover, accurately classifying whether a new roadway section can be safely operated by AVs is challenging when relying on restricted datasets. This research proposes a novel framework aimed at enhancing the discriminative capability of given classifiers in identifying safe driving areas for AVs, leveraging cutting-edge data augmentation algorithms using generative models, including generative adversarial networks (GANs) and diffusion-based models. The proposed framework is validated using a field test dataset containing disengagement events from expressways in South Korea. Performance evaluations are conducted across various metrics to demonstrate the effectiveness of the data augmentation models. The evaluation study concludes that the proposed framework significantly enhances the discriminative performance of the classifiers, contributing valuable insights into safer AV deployment in diverse road conditions.

1. Introduction

The rapid advancement in automated vehicle (AV) technology is expected to provide substantial improvements in transportation efficiency and safety. Recent initiatives have aimed to explore AV-based mobility services, such as public transportation, freight transport, and shared-ride mobility-on-demand services [1,2,3]. A crucial step in deploying these services involves classifying the road links appropriate for AV operation given a mobility service area. Based on the identification of adequate road networks, mobility service providers can strategically deploy the AV-based mobility services tailored to the specific needs and conditions of the service area [4]. In addition, in the case of road sections identified as being incapable of safely supporting the operation of AVs, the connected and automated vehicle (CAV)-based mobility service is further considered through the implementation of digital infrastructures, such as cooperative intelligent transportation systems (C-ITS) devices and facilities [5,6,7].
To ascertain road suitability for the automation mode of a given AV, an assessment is typically conducted to determine whether the AV can safely operate with a predefined operational design domain (ODD) in the road section [8,9,10,11]. The ODD encompasses physical and environmental aspects of the road, such as road type, geometric road design, speed limit, lighting, and weather conditions [12,13,14].
However, since the extent of experimental driving is limited due to practical constraints such as time and financial resources, only a fraction of the field test data is obtainable. To address the related issues, new road sections in the mobility service network are evaluated based on a machine learning classifier that is trained on the available data. For instance, a previous study analyzed the safety performance of AVs with respect to road geometry using a classification and regression tree (CART) model based on a field test dataset [15], which covers a subset of South Korea’s national expressways. Nevertheless, such datasets often exhibit severe skewness, with instances of disengagement events being much rarer than those of successful operation of the automation mode. Consequently, this imbalance can bias the classifier towards the majority class or lead it to misclassify the minority class as anomalies, severely impairing the classifier’s discriminative accuracy and potentially misidentifying non-drivable areas as drivable with the automation mode.
There have been enormous efforts to develop a novel oversampling-based data augmentation model for dealing with the challenges of data imbalance in the field of transportation engineering, particularly in traffic monitoring and surveillance systems [16,17,18,19]. However, since the previous approaches, including cropping, rotation, mixing, and noise injection, are predominantly applied to unstructured data types like images, the challenge still remains to adapt the data augmentation techniques for tabular datasets such as those containing the disengagement event data.
One of the most commonly used approaches for augmenting tabular data in imbalanced datasets is the synthetic minority over-sampling technique (SMOTE) [20]. This has been used to alleviate class imbalances by synthesizing new examples through linear interpolation, potentially distorting the feature space in datasets with intricate patterns. Several variants of SMOTE have been developed to enhance its effectiveness and address specific problems in various data scenarios, such as Borderline-SMOTE, K-Means SMOTE, and the adaptive synthetic sampling approach (ADASYN) [21,22,23]. In addition, the variational autoencoder (VAE) has also been considered to upgrade the model performance by augmenting tabular data in imbalanced datasets [24]. Furthermore, recent advancements have introduced tabular data augmentation strategies utilizing generative models, including generative adversarial networks (GANs) and diffusion-based models [25,26,27,28,29,30]. These advanced models can be leveraged to synthesize realistic disengagement data, addressing the challenge posed by the scarcity of real-world experimental data.
This research aims to propose a novel framework that integrates cutting-edge data augmentation algorithms, including GAN and diffusion-based models, to improve the discriminative capability of a given machine learning classifier in identifying safe driving areas for AVs. The proposed framework is evaluated with a field test dataset containing the road sections in which AVs experienced disengagement events on the expressways of South Korea. The performance evaluation is conducted by analyzing the effectiveness of the classifiers with different data augmentation models, based on various performance metrics used for machine learning classifiers. This evaluation study demonstrates the effectiveness of the proposed framework in generating synthetic data that enhance the generalizability and reliability of safety assessments. The research findings are expected to contribute valuable insights into the deployment of AVs, ensuring safer integration into existing road networks.
The main contributions of the present study can be highlighted as follows:
  • This study develops a comprehensive framework to accurately identify safe automated driving areas with limited real-world experimental data.
  • This study incorporates oversampling-based tabular data augmentation into the proposed framework to address the class imbalance by generating synthetic instances.
  • This research demonstrates the effectiveness of the proposed framework based on performance evaluations with various metrics.
The rest of this paper is organized as follows. Section 2 provides the descriptions of the proposed framework and explains the detailed methods for tabular data augmentation and machine learning classifiers. Section 3 shows the data description and performance metrics for the evaluation. Section 4 discusses the results and analyses of the evaluation study. Finally, Section 5 describes the conclusions and directions for future research.

2. Methodology

The following subsections detail the proposed framework for accurately identifying roads on which AVs can operate safely, using a machine learning classifier combined with different tabular data augmentation models. The framework, the tabular data augmentation models, and the machine learning classifier used in this study are explained in turn.

2.1. Framework for Improving the Generalized Performance of a Classifier with Tabular Data

In order to more accurately classify driving safety areas for AVs based on the disengagement tabular dataset, this study proposes a detailed process to enhance the generalized performance of a classifier model using tabular data augmentation algorithms, as depicted in Figure 1. The process initiates with the input of tabular data, which is the disengagement data obtained from the field tests with AVs. The input dataset is subsequently partitioned into two subsets: training data and testing data. The former is utilized for training the given classifier model, while the latter is leveraged for measuring its generalized performance based on a test process. Unlike in the hold-out method, which is particularly useful when both training and testing sets are large enough to be representative of the overall data distribution, the training data undergo a K-fold cross validation process. This is an efficient technique utilized to assess the classifier model’s performance and enhance its generalizability, particularly in scenarios where the raw data are scarce [31,32]. In the K-fold cross validation process, the training dataset is divided into K equal-sized subsets. The training and validation of the classifier model are performed K times, each time using a different subset as the validation set and the remaining data as the training set.
In addition, to address the issues related to the highly skewed training data with few positive instances and many negative instances, synthetic data samples are generated using advanced generative models, such as GANs and diffusion-based deep learning models. These algorithms synthesize additional data samples to augment the training dataset, mitigating potential overfitting and improving the classifier’s ability to generalize to new, unseen data [33]. More detailed explanations of the generative models used in this study are provided in Section 2.2.
The output from the data augmentation step with the generative models is a set of synthetic data samples. The synthetic data samples are then amalgamated with the original training data to create a robust and diversified training dataset. This expanded dataset can improve the robustness of the classifier model by providing a more diverse range of examples to learn from. This enriched dataset is used to train the machine learning classifier model. This research herein considers a random forest model as the classifier, although any machine learning classifier model can be used. The details of the classifier model used in this research are provided in Section 2.3.
Following the training phase, the classifier model is subjected to performance monitoring and evaluation using the reserved testing data, which is untouched by the training and augmentation process. This phase is critical for assessing the classifier’s classification accuracy and determining its suitability for deployment in real-world scenarios. The bidirectional arrow between the performance monitoring and model evaluation block, and the machine learning classifier model, signifies an iterative process of model refinement. Based on the performance metrics obtained during evaluation, the model may be adjusted, retrained, or further tuned to enhance its classification performance. This iterative cycle continues until the model achieves satisfactory performance metrics, at which point it can be finalized for deployment.
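To make the workflow above concrete, the following is a minimal sketch of the pipeline in Figure 1 using scikit-learn, assuming a hypothetical `augment` placeholder that stands in for the oversampling-based generative models of Section 2.2; the file name, the `Disengagement` label column, and the fold count are illustrative assumptions rather than the exact settings of this study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def augment(X, y):
    """Placeholder for an oversampling-based generative model (Section 2.2)."""
    return X, y  # e.g., replace with SMOTE, CTGAN, or TabDDPM sampling

df = pd.read_csv("raw_dataset.csv")                             # disengagement tabular data (assumed file)
X, y = df.drop(columns=["Disengagement"]), df["Disengagement"]  # assumed label column

# Hold out an untouched test set, then run K-fold cross validation on the training part.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

aucs = []
for tr_idx, va_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X_tr, y_tr):
    # Augment only the training folds so the validation fold remains real data.
    X_aug, y_aug = augment(X_tr.iloc[tr_idx], y_tr.iloc[tr_idx])
    clf = RandomForestClassifier(n_estimators=300, criterion="entropy", random_state=0)
    clf.fit(X_aug, y_aug)
    aucs.append(roc_auc_score(y_tr.iloc[va_idx], clf.predict_proba(X_tr.iloc[va_idx])[:, 1]))

print(f"mean cross-validation AUC = {np.mean(aucs):.3f}")
```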

2.2. Oversampling-Based Tabular Data Augmentation Model

2.2.1. Synthetic Minority Oversampling Technique (SMOTE)

One of the conventional oversampling approaches for addressing the issue of class imbalance in datasets is the synthetic minority oversampling technique (SMOTE) [20]. SMOTE addresses the imbalance by generating synthetic examples rather than by oversampling with replacement. The core idea behind SMOTE is to form new minority class instances by interpolating between existing ones that lie close in the feature space. This is achieved by first selecting a minority class sample and then identifying its nearest neighbors in the feature space. Synthetic samples are then created by choosing one of the nearest neighbors at random and drawing a line segment between the two in the feature space. Points are sampled at random along this line segment, effectively creating new, synthetic minority class examples. This technique not only augments the minority class but also makes the decision region of the minority class more general by simulating the variability occurring naturally within the class, thereby improving the performance of classifiers trained on such balanced datasets.
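As an illustration, the snippet below applies SMOTE with the imbalanced-learn package to a toy imbalanced dataset whose class ratio loosely mirrors the disengagement data; the generated dataset and the parameter values are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for the disengagement table (illustrative only).
X_train, y_train = make_classification(n_samples=2000, weights=[0.94, 0.06], random_state=0)

# k_neighbors controls how many nearest minority neighbors are used for interpolation.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))  # heavily skewed toward the majority class
print("after: ", Counter(y_res))    # minority class interpolated up to parity
```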

2.2.2. Conditional Tabular Generative Adversarial Network (CTGAN)

GANs involve a set of deep neural network architectures used for generative modeling, which learn to mimic complex distributions of data. GANs consist of two competing networks: a generator that creates samples and a discriminator that evaluates them. During training, these networks engage in a game-theoretic competition to improve the realism of the generated samples [34]. In the realm of tabular data, tabular GANs (TGANs) have been applied to generate synthetic versions of data tables. TGANs leverage the discriminative power of deep learning models to handle discrete and continuous variables, generating high-quality synthetic data that preserve the statistical correlations within the original dataset [35]. In addition, conditional GANs (cGANs) extend the GAN architecture by incorporating condition variables, enabling the generation of data under specific constraints or conditions. This approach is particularly useful for data augmentation, especially in scenarios involving imbalanced datasets. cGANs can generate synthetic samples for minority classes, improving the robustness of machine learning models by providing a more balanced dataset for training [36].
More recently, the conditional tabular GAN (CTGAN) was proposed to address the challenges of modeling tabular data containing a mix of discrete and continuous columns [25]. CTGAN leverages the conditional GAN framework to generate synthetic data that faithfully represent the underlying statistical properties of the original data, providing a more effective solution for handling diverse data types and distributions within tabular datasets. It utilizes a conditional generator to handle the challenges posed by the non-Gaussian and multimodal distributions of continuous columns $C_i$ and the imbalance in discrete columns. A mode-specific normalization technique is adopted to normalize each continuous column by estimating the number of modes in the column using a variational Gaussian mixture model (VGM). Each value $c_{i,j}$ in $C_i$ is then assigned a probability of belonging to each mode, $P_{C_i}(c_{i,j})$, allowing for mode-specific normalization. The equation is formulated as (1):
$$P_{C_i}(c_{i,j}) = \sum_{k=1}^{K} \mu_k \, \mathcal{N}(c_{i,j};\ \eta_k, \phi_k), \quad (1)$$
where $K$ indicates the number of modes estimated by the VGM, each mode is denoted by $\eta_k$, and $\mu_k$ and $\phi_k$ represent the weight and standard deviation of a mode, respectively.
In addition, CTGAN also considers a conditional generator that is trained using a training-by-sampling strategy to handle data imbalances, which evens out the minor categories’ representation during the training process. The samples $\hat{\tau}$ produced by the conditional generator $P_g(\cdot)$ are designed to be conditioned on discrete columns as follows:
$$\hat{\tau} \sim P_g(\mathrm{row} \mid D_h = k_h), \quad (2)$$
where $D_h$ and $k_h$ describe the $h$th discrete column and its element, respectively. This conditional approach helps maintain the underlying distribution of the real data while generating synthetic data.
The discriminator of the CTGAN used in this study also adopts an identical process of the original [25]. It assesses a likelihood, which determines whether a given sample is real or synthetic, to quantify the score of generated new tabular data. The discriminator accomplishes this by assigning a probability score to each sample, indicating the likelihood that the sample belongs to a real data distribution. The score assigned to the generated new tabular data reflects how well it resembles real data according to the discriminator’s learned criteria. Higher scores indicate a closer resemblance to real data, while lower scores suggest that the generated data deviate more from the real data distribution. By quantifying the score of the generated data, the CTGAN’s discriminator provides valuable feedback to the generator, guiding its learning process to generate more realistic synthetic data that closely match the characteristics of the real data.
These equations and methods underpin the CTGAN’s ability to synthesize realistic tabular data, addressing issues like mode collapse and data imbalance that are common in traditional GANs when augmenting complex tabular datasets. The CTGAN incorporates conditional information into the generation process, which enables the generation of diverse samples that satisfy specific constraints [25]. It guides the generation process towards producing samples that adhere to the specified conditions. Apart from leveraging conditional information, regularization techniques are also considered in the CTGAN to stabilize training and promote diversity in the generated samples. Hence, it can reduce the likelihood of mode collapse.
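For reference, a minimal sketch of CTGAN-based augmentation with the open-source ctgan package is given below; the file name, the list of discrete columns, and the conditional sampling on the disengagement label are illustrative assumptions rather than the exact configuration used in this study.

```python
import pandas as pd
from ctgan import CTGAN

train_df = pd.read_csv("raw_dataset.csv")                   # assumed file name
discrete_columns = ["Route", "Direction", "Disengagement"]  # assumed column names

# Conditional generator and discriminator are trained jointly; mode-specific
# normalization of continuous columns is handled internally by the model.
model = CTGAN(epochs=300)
model.fit(train_df, discrete_columns)

# Conditioning on a discrete value steers sampling toward the minority class.
synthetic = model.sample(1000, condition_column="Disengagement", condition_value=1)
```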

2.2.3. Tabular Data with Denoising Diffusion Probabilistic Models (TabDDPMs)

A novel synthetic data generation approach to generative modeling using denoising diffusion probabilistic models (DDPMs) specifically tailored for tabular data was proposed by [29], which is called tabular data with DDPMs (TabDDPMs). It is grounded in the principles of diffusion models, which progressively learn to reverse a diffusion process that gradually adds noise to the data until it becomes a pure noise distribution. By learning to denoise, the model can generate new data points by effectively reversing this noise addition process. Specifically, the TabDDPM modifies the denoising process to handle the discrete nature of tabular data, ensuring that the synthetic data retain statistical properties similar to those of the original dataset.
Similar to other previous diffusion models [37], TabDDPM deals with the dataset based on forward and reverse Markov processes. The former is a process that gradually adds noise to the original data until it approaches a pure noise distribution, while the latter is a process that gradually denoises a latent variable to generate new data samples. It also uses a simplified loss function focusing on minimizing the mean-squared error between the model’s noise prediction $\epsilon_\theta$ and the true noise $\epsilon$ over all timesteps $t$, as shown in (3):
$$L_{\mathrm{simple},\,t} = \mathbb{E}_{x_0, \epsilon, t}\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2, \quad (3)$$
where $x_t$ is the noisy observation at timestep $t$.
In addition, TabDDPM considers the Gaussian diffusion processes for numerical features and multinomial diffusion processes for categorical features. Likewise, this study also adopts Gaussian diffusion and multinomial diffusion processes for continuous and discrete variables, respectively. This means that Gaussian noise is added to the continuous input variables, while uniform noise is added to the discrete input variables.
The forward and reverse processes for continuous variables are described in (4) and (5), respectively:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \quad (4)$$
where $\beta_t$ is a variance term that increases with each diffusion step, making $x_t$ progressively noisier.
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right), \quad (5)$$
where $\mu_\theta(x_t, t)$ is computed using a neural network that predicts the noise $\epsilon_\theta(x_t, t)$ and is expressed as follows:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right), \quad (6)$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t$ is the product of all previous $\alpha_i$ for $i \le t$.
The forward and reverse processes for discrete variables are represented in (7) and (8), respectively:
$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\left(x_t;\ (1-\beta_t)\, x_{t-1} + \frac{\beta_t}{K}\right), \quad (7)$$
where $K$ indicates the number of categories. The equation describes how each categorical feature is corrupted incrementally by noise, with $\beta_t$ controlling the level of noise.
$$q(x_t \mid x_0) = \mathrm{Cat}\left(x_t;\ \bar{\alpha}_t\, x_0 + \frac{1-\bar{\alpha}_t}{K}\right), \quad (8)$$
where $x_0$ represents the original data.
From the equations above, the posterior $q(x_{t-1} \mid x_t, x_0)$ can be formulated as (9):
$$q(x_{t-1} \mid x_t, x_0) = \mathrm{Cat}\left(x_{t-1};\ \frac{\pi}{\sum_{k=1}^{K} \pi_k}\right), \quad (9)$$
where:
$$\pi = \left(\alpha_t\, x_t + \frac{1-\alpha_t}{K}\right) \odot \left(\bar{\alpha}_{t-1}\, x_0 + \frac{1-\bar{\alpha}_{t-1}}{K}\right). \quad (10)$$
The probability vector $\pi$ is computed based on the noisy observation $x_t$ and the original data $x_0$, blending them with the diffusion parameters.
These equations represent the core mathematical framework used in TabDDPM to model both continuous and categorical data in tabular datasets, enabling effective generative modeling while handling the heterogeneous nature of the features typically found in such data.
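To illustrate these equations numerically, the following is a minimal sketch of the Gaussian forward process and the simplified loss in (3)–(6); the linear variance schedule and the `denoiser` placeholder for the network $\epsilon_\theta$ are illustrative assumptions, not the exact TabDDPM configuration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule beta_t (assumed linear)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # bar{alpha}_t = product of alpha_i for i <= t

def forward_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for continuous features."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def simple_loss(denoiser, x0, t, rng):
    """L_simple,t = E || eps - eps_theta(x_t, t) ||^2, cf. Equation (3)."""
    x_t, eps = forward_noise(x0, t, rng)
    return np.mean((eps - denoiser(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 8))                         # 16 rows of 8 standardized features
trivial_denoiser = lambda x_t, t: np.zeros_like(x_t)      # placeholder for eps_theta
print(simple_loss(trivial_denoiser, x0, t=500, rng=rng))  # loss of a trivial predictor
```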

2.3. Classifier Model for Identifying the Driving Safety Road Sections of AVs

The random forest is a robust ensemble learning method [38], which is utilized in this study to classify road segments based on their safety for automated vehicles. This model operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. Each tree in the ensemble is built from a random sample of the training set data, and at each node, a subset of features is randomly chosen to split the node. This randomness helps in making the model more resistant to overfitting, which is crucial for maintaining high classification accuracy in diverse operational environments [39,40].
In the context of identifying the driving safety for AVs’ operation, the random forest classifier offers several advantages. Firstly, its ability to handle high-dimensional data is particularly beneficial [41], considering the complex and multi-faceted nature of road safety data, which encompass factors such as road geometry and traffic conditions. Moreover, the ensemble nature of the random forest allows it to improve classification accuracy by averaging multiple deep decision trees, reducing the risk of erroneous classifications and enhancing the robustness against noise in the dataset.
On the other hand, there are two options for determining the best splits during the decision tree formation within the forest, namely, Gini impurity and information gain (IG). Since the tabular input dataset is highly imbalanced and contains attributes with many unique values, IG could be more appropriate despite its higher computational cost. The process of calculating IG starts by computing entropy for a dataset, which quantifies the amount of uncertainty or impurity in the dataset. Entropy is defined for a dataset $S$ with classes $\{c_1, c_2, \ldots, c_k\}$ as follows:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i, \quad (11)$$
where $p_i$ is the proportion of the examples in $S$ that belong to class $c_i$. This formula calculates the total entropy of the set $S$ before any splitting has occurred, providing a baseline level of disorder.
To calculate the IG of the set $S$ based on the split attribute $A$ with possible values $\{v_1, v_2, \ldots, v_m\}$, entropy is computed for each subset of $S$ created by splitting $S$ based on each value of $A$. IG is then calculated by subtracting the weighted entropies of each subset from the original entropy of $S$, as expressed by (12):
$$\mathrm{Information\ Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{j=1}^{m} \frac{|S_{v_j}|}{|S|} \times \mathrm{Entropy}\left(S_{v_j}\right), \quad (12)$$
where $|S_{v_j}|$ is the number of examples in $S$ where attribute $A$ has value $v_j$, and $|S|$ is the total number of examples in $S$. $\mathrm{Entropy}(S_{v_j})$ is the entropy of each subset after the split.
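As a concrete configuration, the entropy (information gain) splitting criterion can be selected directly in scikit-learn’s random forest, as sketched below; the hyperparameter values and the training variables are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # number of decision trees in the ensemble
    criterion="entropy",   # split on information gain rather than Gini impurity
    max_features="sqrt",   # random feature subset evaluated at each node
    random_state=0,
)
rf.fit(X_aug, y_aug)                       # augmented training data from Section 2.2 (assumed)
y_prob = rf.predict_proba(X_test)[:, 1]    # class-1 scores later used for ROC/PR analysis
```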

3. Data Description and Performance Metrics

This research evaluates the proposed framework based on an experimental dataset, which is used for analyzing the performance of the given classifier with different data augmentation models to identify the road sections that are safe for AVs. The performance is measured with respect to several evaluation metrics in machine learning classifiers. The following subsections provide the detailed descriptions of the experimental dataset and performance metrics.

3.1. Data Description

Figure 2 shows a visual overview of South Korea’s expressway network, selectively color-coded to demarcate a focused study site within the broader transportation grid. The network comprises a total of 40 expressway routes with an aggregate length of approximately 5000 km, forming the backbone of the country’s road transport infrastructure. Within this extensive network, this study concentrates on a subset of 24 routes, denoted by vivid red lines, set against a muted grey backdrop representing the routes that were not included in the study.
The total length of the highlighted red routes is approximately 3000 km, and these routes constitute the geographical context for an in-depth investigation into the availability of automated functions in AVs.
The data used in the present study are exactly identical to the dataset collected by the previous study [15], which entailed driving an AV along those red-colored expressways to gather empirical data on the vehicle’s behavior, specifically focusing on disengagement events—instances where the automated system yielded control to a human driver or it shut down due to a failure or planned deactivation. The occurrences of disengagement events may be associated with safety-critical events such as encroaching into the adjacent lane due to a severely curved road or poorly marked lanes [42,43]. It can be used as a key measure to determine the feasibility of utilizing the automation mode of AVs.
Apart from whether a disengagement event occurred, the dataset also considers road geometry information regarding horizontal and vertical alignments, such as curve radius and slope. Such information is included in the database of road geometric design administrated by the Korea Expressway Corporation’s expressway network. However, since the original dataset of the road geometric design was compiled for the construction, maintenance, management, and operation of expressways, it has necessitated changes in data attributes to ensure homogeneity within each road section for accurate classification.
Following the preprocessing of the data to ensure that individual road sections retain consistent characteristics, the experimental data for the disengagement event are generated, as shown in Table 1. The dataset encompasses 19 variables. Column 1 lists the identification number (ID), which uniquely identifies each event, while Column 2 specifies the route, in this case, the Gyeongbu expressway.
The directionality of the event is recorded in Column 3, where ‘2’ denotes a specific direction along the route. The start (Column 4) and end (Column 5) mileposts are recorded as 132.86 and 133.12, respectively. The distance covered by the disengagement event is merely 0.26 km (Column 6), with a slope (Column 7) of approximately 0.9079%, suggesting mild elevation changes. The curvature of the road is detailed in Columns 8 and 9, with a curve radius of 5000 m, indicative of a gentle bend, and a curve length of 260 m, defining the extent of the road affected. The remaining columns detail traffic and road conditions. Column 10 records the number of lanes as 4, suggesting a multi-lane expressway. Columns 11 through 13, labeled Delta Slope, Delta Radius, and Delta Lanes, reflect changes in these parameters, which are zero for Slope and Lanes, and a −5000 m change for Radius, indicating a transition from a curve to a straightway. The Average Annual Daily Traffic (AADT) is noted in Column 14 as 72,005, which denotes the traffic volume. Maximum and minimum speed limits are recorded in Columns 15 and 16 as 100 and 100.2 km/h, respectively, with the mean speed of travel at 110.03 km/h (Column 17). Column 18 describes the consequence of the disengagement event, which in this sample data is recorded as ‘0’, indicating no disengagement event occurred. Lastly, Column 19 shows the causal factors of each disengagement event, where they are coded from 1 to 9. More detailed statistics of disengagement events and their causal factors are described in Table 2. The whole dataset is available at https://github.com/dhleeGDH/Field_test_dataset_for_AV/blob/main/raw_dataset.csv (accessed on 9 May 2024).
In the experimental dataset encompassing 6102 observations of expressway driving conditions and events, a binary classification of disengagement events is recorded, with ‘1’ indicating the occurrence of a disengagement event and ‘0’ denoting its absence. This shows that there is a distribution of 6.2% instances where a disengagement event is noted, contrasted with 93.8% cases indicating no such event. This significant imbalance in the dataset underscores the rarity of disengagement events along the monitored segments of the expressway. In other words, this skewness poses a challenge for the accurate classification modeling of safe AV operation over different road sections. The rarity of disengagement events suggests that while automated vehicles may operate without incident the majority of the time, the infrequency of these events requires sophisticated augmentation techniques to ensure that the classifier models can reliably identify the conditions under which disengagements are likely to occur.
Figure 3 represents the overview of disengagement data in the K-fold dataset generated by the proposed framework. It is seen that there is a large difference in the proportion of disengagement and non-disengagement event cases in the raw dataset. Based on the oversampling-based augmentation with the different algorithms presented in Section 2.2.1, Section 2.2.2 and Section 2.2.3, synthetic data involving disengagement events are generated to tackle the imbalanced class issue. The newly generated synthetic dataset is iteratively used for training and testing in each K-fold. The training and testing datasets are constructed by random sampling in an 8-to-2 ratio from the disengagement and non-disengagement event cases, respectively. This allows the RF classifier to train its learnable parameters with a balanced class dataset. Moreover, it is worth noting that the testing datasets in each K-fold involve disengagement and non-disengagement event cases obtained from a portion of real road sections that were not used in the training of the classifiers, which are utilized to analyze the effect of the proposed framework in terms of various performance metrics.
The datasets used in the following numerical studies involve high-dimensional explanatory variables for identifying safe driving areas for AVs. One can examine whether the newly generated synthetic datasets produced by the different tabular data augmentation methods are valid based on principal component analysis (PCA). Figure 4 illustrates the loadings of some principal components in both the raw dataset and the newly generated synthetic datasets with different augmentation methods based on the K-fold dataset. The loadings of each principal component in both the raw dataset and the newly generated synthetic datasets can be used to analyze the correlations between the original data and the augmented synthetic data, which allows the augmentation algorithms producing higher-quality synthetic data to be identified. As shown in Figure 4a–c, the similarity of the distributions between the original data and the synthetic data generated by the SMOTE, CTGAN, and TabDDPM models can be assessed in the plane of principal components 1 and 2. The synthetic data produced by the SMOTE model appear to be shifted by a certain amount from the original data, suggesting that the synthetic dataset deviates significantly from the original dataset’s distribution. On the other hand, the data synthesized by the CTGAN model appear to be more realistic, although a certain portion of them deviate considerably from the distribution of the original data. In contrast, it is observed that the TabDDPM model demonstrates superior efficacy in generating synthetic data of exceptional quality. Its data distribution exhibits a remarkable uniformity, closely mirroring the distribution of the original dataset while minimizing deviations. Such trends are also found in other principal component planes, as depicted in Figure 4d–f.
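As a sketch of this validity check, the snippet below fits PCA on the raw numerical features and projects both the raw and the synthetic samples onto the same components, conceptually mirroring the comparison in Figure 4; `X_raw` and `X_synth` are assumed feature matrices.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit the scaler and PCA on the raw data only, then project both datasets.
scaler = StandardScaler().fit(X_raw)
pca = PCA(n_components=2).fit(scaler.transform(X_raw))

Z_raw = pca.transform(scaler.transform(X_raw))
Z_synth = pca.transform(scaler.transform(X_synth))

plt.scatter(Z_raw[:, 0], Z_raw[:, 1], s=5, label="raw")
plt.scatter(Z_synth[:, 0], Z_synth[:, 1], s=5, label="synthetic")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.legend()
plt.show()
```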

3.2. Performance Metrics

This study casts the problem of discerning road sections that are conducive to AV operation as a binary classification task, which is evaluated by a confusion matrix as delineated in Table 3. In the matrix, Positive signifies road segments where the automation mode is untenable due to the disengagement events, indicative of potential hazards or complexities in the driving environment. Conversely, Negative corresponds to the road segments where AVs are deemed capable of navigating safely without the occurrence of disengagement events.
True Positive (TP) encapsulates those instances wherein the model aptly identifies road sections as hazardous, correctly advising against the activation of automated driving modes. False Positive (FP) reflects misclassified safe segments as hazardous, leading to an unwarranted cessation of AV automation. This conservative error may lead to underutilization of AV capabilities but favors safety. On the other hand, False Negative (FN) represents a more critical error type wherein potentially unsafe roads are misclassified as safe, posing a significant risk to AV integrity and occupant safety. Lastly, True Negative (TN) indicates the classifier’s success in correctly identifying road sections where AVs can safely operate in automated driving mode.
Based on the confusion matrix, this study performs receiver operating characteristics (ROC) analysis. In the ROC analysis, the x-axis represents the false positive rate (FPR), which is the proportion of negative instances that were incorrectly classified as positive, while the y-axis represents the true positive rate (TPR), also known as recall or sensitivity, which is the proportion of actual positives correctly identified by the classifier. In the context of automated driving, a high TPR represents that a given classifier correctly recognizes situations where there are hazards or complexities in the driving environment, leading to fewer instances where the automated driving modes of AVs are implemented in potentially risky scenarios. On the other hand, a low TPR suggests that a given classifier struggles to ascertain these problematic road sections accurately, which may be caused by more FN cases. This implies that the classifier model fails to recognize situations where the automated driving mode is untenable, potentially leading to AVs being deployed in risky driving conditions.
In the FPR-TPR plane, TPR is plotted against FPR for different cut-off points of a parameter. An ideal ROC curve would reach the top left corner of the plot, indicating a high TPR and a low FPR. Moreover, the performance of a given classifier can also be quantitatively measured by the area under the ROC curve (AUC) value, which ranges between 0.5 and 1. This implies that an AUC value of 1 is indicative of a classifier’s impeccable ability to identify which road sections the AVs can be safely maneuvered in. Conversely, an AUC of 0.5 or lower reflects a lack of classification performance in the classifier, suggesting that its performance is no better than random chance in discriminating between safe and unsafe road sections for automated driving.
In addition, the evaluation study also considers precision–recall (PR) analysis, which is generally more informative than ROC analysis when focusing on the performance concerning the positive class [44]. This is because the ROC curve can give an overly optimistic view of the model’s performance if the negative class (the majority class) overwhelms the positive class (the minority class). Unlike the ROC curves that are insensitive to changes in class distribution, the PR curves will change with the class distribution, providing a more sensitive measure of the classifier’s performance across different operational conditions. The PR analysis is typically visualized using a precision–recall curve, where precision is plotted on the y-axis and recall on the x-axis. The PR curve shows the trade-off between precision and recall for a classifier. A perfect classifier would create a curve that is closest to the top-right corner. In other words, the closer a curve stays to the top-right corner, the better the classifier’s performance.
Moreover, this study further uses the F-beta score to conduct a performance comparison with a single score by considering both precision and recall of the given binary classifier. It is defined as the weighted harmonic mean of precision and recall, taking into account a beta parameter β which determines the weight of recall in the combined score. The F-beta score Fβ is mathematically expressed as follows:
$$F_\beta = (1+\beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}, \quad (13)$$
where β indicates the importance of recall in the score. A β less than 1 leans the score towards precision, while a β greater than 1 favors recall. The numerical analysis of the present study considers β to be equal to 1. This means that precision and recall are equally important when β = 1. This represents the most commonly used value of the F-beta score, which is called the F1 score. The F1 score ranges from 0 to 1. It reaches its best value at 1 and worst at 0.
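For completeness, the metrics discussed in this subsection can be computed with scikit-learn as sketched below; `y_test` and `y_prob` are assumed to be the test labels and the classifier’s predicted probabilities, and the 0.5 decision threshold for the F1 score is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, fbeta_score

roc_auc = roc_auc_score(y_test, y_prob)                        # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)  # points of the PR curve
pr_auc = auc(recall, precision)                                # area under the PR curve
y_pred = (np.asarray(y_prob) >= 0.5).astype(int)               # assumed decision threshold
f1 = fbeta_score(y_test, y_pred, beta=1)                       # F-beta with beta = 1 (F1)

print(f"AUC = {roc_auc:.3f}, PR-AUC = {pr_auc:.3f}, F1 = {f1:.3f}")
```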

4. Result and Analysis

With the disengagement event data and performance metrics stated in the previous section, the following subsections conduct numerical studies for exploring the effectiveness of different data augmentation techniques, including Raw, SMOTE, GAN, and Diffusion models. The Raw model indicates the approach that identifies the road sections that are available for safe operation of AVs without any data augmentation methods. The other models are leveraged for the tabular data augmentation techniques based on the SMOTE, CTGAN, and TabDDPM approaches to classify roads that are safe for automated driving.
In addition, the numerical studies include a performance review and comparative study. The former is the performance evaluation of the four augmentation techniques based on the hold-out method. The latter is the statistical analyses of the data augmentation models with respect to the performance metrics based on the K-fold cross validation. The detailed descriptions on the numerical analyses are provided in the following subsections.

4.1. Performance Review

Figure 5 describes the ROC curves of different data augmentation models. The ROC curves compare the performances of the four different data augmentation techniques in the context of identifying the safety area of maneuvering AVs based on the confusion matrices. One can easily observe that the Raw model represented by a dashed purple line has the lowest performance, failing to achieve the high TPR at low FPRs compared to the other models. This suggests that the classifier has limited diagnostic ability without any augmentation techniques. In other words, it is practically difficult to precisely identify the road segments where the AVs can be operated safely without any data augmentation strategies given the constraints of minimal experimental driving datasets. Such misclassifications may show the potential risks associated with operational efficiency and service reliability. For instance, incorrectly identifying safe road segments as untenable for automation could lead to unnecessary disengagements and manual interventions by human drivers. This not only undermines the efficiency gains promised by AV technology, but also increases operational costs and disrupts transportation services. Moreover, it could erode public trust in the AV technology. If the AVs consistently fail to operate safely due to misclassifications, it could result in skepticism and reluctance between users and regulators to adopt automated driving systems. This loss of trust could significantly hinder the widespread adoption of AVs.
On the other hand, it is seen that the SMOTE model, shown by a dash-dotted orange line, improves upon the Raw model’s performance substantially, achieving higher TPR values for the same FPRs. Moreover, it is observed that the GAN model, depicted by a dotted green line, shows further improvement in performance over the SMOTE model. Furthermore, the best classifier performance among the four evaluated methods can be found in the Diffusion model. The Diffusion approach, illustrated by a solid blue line, appears to outperform all other augmentation techniques. The blue one approaches closer to the top left corner, indicating a superior balance between TPR and FPR, which suggests that the Diffusion model provides the most significant enhancement in the classifier’s diagnostic accuracy. Unlike the conventional tabular data augmentation approaches such as the SMOTE and GAN models, the Diffusion approach can capture high-dimensional dependencies and correlations by leveraging diffusion processes, which allows it to ensure that the synthetic samples closely resemble the original data. Consequently, the performance of the Diffusion method outperforms that of SMOTE and GAN methods in terms of ROC analysis.
The detailed performances of each model are quantitatively evaluated by the AUC. The AUC achieves its maximum at 1, signifying perfect classification accuracy by the given model in identifying the driving safety areas of AVs. Conversely, an AUC value of 0.5 or lower denotes an absence of discriminative capability in the model. Table 4 shows the AUC values with respect to each model. The classifier using the Raw data exhibits an AUC of 0.734, which serves as a baseline in this study. The application of SMOTE enhances the AUC to 0.886, which shows a substantial improvement over the Raw model. The use of the GAN model results in an AUC of 0.927, showing that the synthetic data generated by GANs can produce a more complex and beneficial feature space for the classifier to learn from. It is obvious that the Diffusion model achieves the highest AUC of 0.987, indicating an exceptional performance that almost reaches the theoretical maximum. This suggests that the diffusion approach for tabular data augmentation results in a highly effective representation of the test driving data, allowing the classifier to distinguish between safe and unsafe driving areas with high accuracy.
To further explore the performances of the binary classifiers with different augmentation techniques in the context of imbalanced datasets where positive cases are more important than negative cases, the PR analysis is conducted, as depicted in Figure 6. The Diffusion method appears to have the best performance since its curve is closest to the top-right corner for the majority of its path, followed by GAN, SMOTE, and then Raw. This research finding suggests that the Diffusion model achieves a good balance between precision and recall for a larger range of threshold settings compared to the other methods. It implies that the Diffusion model might be the most effective method of preprocessing the disengagement event dataset for the particular classification task.
Table 5 presents comparative F1 scores using different data augmentation models. It is seen that the Raw model yields an F1 score of 0.194. One can also observe that the SMOTE model improves the F1 score to 0.694. The GAN model results in an enhanced F1 score of 0.87. Lastly, the Diffusion model achieves the highest F1 score of 0.943. The ascending F1 scores demonstrate the efficacy of sophisticated data augmentation techniques in improving the classifier’s predictive performance, with the Diffusion model standing out as the most effective method in classifying the road sections in which AVs can drive safely.

4.2. Comparative Study

The previous subsection showed that generative model-based data augmentation techniques contributed to improving the accuracy of identifying the safe road sections for AVs. However, those results might be coincidental, so their generality requires further verification. Hence, this study performs further analysis based on K-fold cross validation with K set to 100. To analyze comprehensive measures of the classifier’s performance with different augmentation models, box plots of each model with respect to several performance metrics, namely, the AUC value, precision, recall, and F1 score, are provided in Figure 7.
As shown in Figure 7a, it is observed that the Raw model exhibits a median AUC value around 0.7, with a relatively tight interquartile range (IQR). One can also find that the SMOTE model demonstrates a higher median AUC value with a slightly larger IQR, suggesting an enhanced performance over the Raw model, potentially due to the balancing of classes within the dataset through synthetic oversampling. In addition, a better performance improvement in terms of the AUC value is found in the GAN model. It shows a further increase in the median AUC value and describes a compact IQR, which indicates consistent performance across various folds. This suggests that the GAN-based data augmentation provides a meaningful improvement in the classifier’s robustness compared to the traditional oversampling method. Lastly, one can easily observe that the Diffusion model outperforms all others with the highest median AUC value, approaching the optimal score of 1.0, and exhibits an exceptionally tight IQR. The limited range of the AUC values, combined with the absence of outliers, denotes a high level of consistency in the classifier’s classification accuracy, which underscores the effectiveness of the Diffusion method in enhancing the data quality for clarifying which road sections are safe for AVs.
Such trends can also be observed in the results of other performance metrics such as precision, recall, and the F1 score. As shown in the box plots of Figure 7b–d, the Raw model has the lowest median value, while the models with data augmentation techniques have relatively higher median values, i.e., the SMOTE, GAN, and Diffusion models. On the other hand, it is found that the result of the Raw model tends to exhibit high precision and low recall. It means that the classifier model is characterized by its ability to correctly classify a high percentage of actual positive instances among the cases it predicts as positive, but it fails to identify a significant portion of all actual positive instances present in the dataset. In other words, the classifier without any augmentation techniques is likely to overestimate the level of service for given road sections where AVs easily encounter their disengagement events, often resulting in safety-critical issues such as increased accident risks and operational failures.
It is worth noting that such misclassification can be mitigated based on data augmentation methods. The ascending order of median values of performance metrics across the classifiers trained with different augmentation techniques clearly indicates the positive impact of advanced data augmentation on generalization performance, particularly in the generative model-based augmentation approaches. Notably, it is found that the Diffusion model exhibits the most favorable balance between precision and recall, as indicated by the statistics of its superior F1 score. The result highlights the potential of sophisticated synthetic data generation methods in deploying robust machine learning classifiers with high classification accuracy.
Detailed statistics of mean and standard deviation for the performance metrics in 100-fold cross validation are described in Table 6. The data succinctly corroborate the superior performance of the Diffusion model across all metrics. Moreover, the standard deviations associated with each metric signify the variability in the model’s performance across the folds, with lower values denoting more consistency. Notably, the ‘Diffusion’ model not only leads in mean performance scores but also maintains tight standard deviations, indicating reliable performance across different iterations of cross-validation. Such results indicate that the Diffusion model shows an exceptional ability to distinguish between the classes of interest.
On the other hand, the SMOTE and GAN models show intermediate performance levels, with the GAN model leading the SMOTE model slightly in the AUC and F1 score. The consistent increment in the mean values from the Raw to SMOTE to GAN models, and finally to the Diffusion model, substantiates the previously established ranking of the data augmentation models’ effectiveness. Similar to the findings in Figure 7, the Raw model exhibits notably lower performance figures, with the lowest mean values across all metrics, such as an AUC of 0.699 and an F1 score of 0.18, which signals room for improvement in its classification capabilities using other sophisticated data augmentation methods. In other words, there is sufficient evidence that the proposed framework enhanced the discriminative capability of the given classifiers in identifying safe driving areas for automated vehicles.

5. Conclusions

This study introduced a comprehensive framework that leverages advanced data augmentation techniques to enhance the performance of classifiers in identifying safe road sections for AVs. To evaluate the effectiveness of the proposed framework, this research conducted several numerical studies, including a performance review and comparative study, with the highly skewed tabular dataset for disengagement events of AVs in the expressways of South Korea. Through the performance review and comparative study, it was observed that the random forest-based classifiers trained on synthetic data generated by data augmentation approaches, including the SMOTE, GAN, and Diffusion models, outperformed the classifier without any data augmentation methods. Moreover, it was also found that the advanced generative models improve the discriminative performance of the classifiers, with the Diffusion model, in particular, demonstrating exceptional effectiveness. The research findings suggest that sophisticated data augmentation can mitigate the challenges posed by imbalanced datasets, particularly in domains where the cost of misclassification is high, such as automated driving. The use of generative models, especially those based on the diffusion-based approach, has proven to be exceptionally effective in generating realistic and varied synthetic data, thereby enriching the training sets and enabling the machine learning classifiers to achieve high accuracy and reliability in discerning safe driving zones.
Although this research has successfully verified the effectiveness of the proposed framework, it acknowledges several inherent data limitations that necessitate further research. The dataset used in this study primarily encompasses static information, such as the geometric characteristics of roadways. This focus on static features omits dynamic elements like weather conditions, which could significantly impact the feasibility of AVs’ automation mode in real-world scenarios. Moreover, the disengagement data currently available do not reflect a diverse range of vehicle types. This limitation hinders the generalizability of the findings across different vehicle dynamics and operational capabilities. Lastly, future research will explore the integration of additional data sources to enhance the classifier’s accuracy and adaptability.

Funding

This work was supported by the faculty research fund of Sejong University in 2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available within the article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Choi, S.; Lee, D.; Kim, S.; Tak, S. Framework for connected and automated bus rapid transit with sectionalized speed guidance based on deep reinforcement learning: Field test in Sejong city. Transp. Res. Part C Emerg. Technol. 2023, 148, 104049. [Google Scholar] [CrossRef]
  2. Oh, S.; Lentzakis, A.F.; Seshadri, R.; Ben-Akiva, M. Impacts of Automated Mobility-on-Demand on traffic dynamics, energy and emissions: A case study of Singapore. Simul. Model. Pract. Theory 2021, 110, 102327. [Google Scholar] [CrossRef]
  3. Hyland, M.; Mahmassani, H.S. Operational benefits and challenges of shared-ride automated mobility-on-demand services. Transp. Res. Part A Policy Pract. 2020, 134, 251–270. [Google Scholar] [CrossRef]
  4. Tak, S.; Kim, J.; Lee, D. Study on the Extraction Method of Sub-Network for Optimal Operation of Connected and Automated Vehicle-Based Mobility Service and Its Implication. Sustainability 2022, 14, 3688. [Google Scholar] [CrossRef]
  5. Shladover, S.E. Connected and automated vehicle systems: Introduction and overview. J. Intell. Transp. Syst. 2018, 22, 190–200. [Google Scholar] [CrossRef]
  6. Guanetti, J.; Kim, Y.; Borrelli, F. Control of connected and automated vehicles: State of the art and future challenges. Annu. Rev. Control 2018, 45, 18–40. [Google Scholar] [CrossRef]
  7. Elliott, D.; Keen, W.; Miao, L. Recent advances in connected and automated vehicles. J. Traffic Transp. Eng. Engl. Ed. 2019, 6, 109–131. [Google Scholar] [CrossRef]
  8. Zhao, X.; Robu, V.; Flynn, D.; Salako, K.; Strigini, L. Assessing the safety and reliability of autonomous vehicles from road testing. In Proceedings of the 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), Berlin, Germany, 28–31 October 2019; pp. 13–23. [Google Scholar]
  9. De Gelder, E.; Camp, O.O.D. Procedure for the safety assessment of an autonomous vehicle using real-world scenarios. arXiv 2020, arXiv:2012.00643. [Google Scholar]
  10. Wang, X.; Qin, D.; Cafiso, S.; Liang, K.K.; Zhu, X. Operational design domain of autonomous vehicles at skewed intersection. Accid. Anal. Prev. 2021, 159, 106241. [Google Scholar] [CrossRef]
  11. Gouda, M.; Chowdhury, I.; Weiß, J.; Epp, A.; El-Basyouny, K. Automated assessment of infrastructure preparedness for autonomous vehicles. Autom. Constr. 2021, 129, 103820. [Google Scholar] [CrossRef]
  12. Feng, S.; Feng, Y.; Yu, C.; Zhang, Y.; Liu, H.X. Testing scenario library generation for connected and automated vehicles, part I: Methodology. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1573–1582. [Google Scholar] [CrossRef]
  13. Kim, J.; Kee, S. A Research on the ODD and OEDR Guidelines Based on the Demonstration Case of Autonomous Driving in Sejong City. Trans. Korean Soc. Automot. Eng. 2020, 28, 659–668. [Google Scholar] [CrossRef]
  14. Sun, C.; Deng, Z.; Chu, W.; Li, S.; Cao, D. Acclimatizing the operational design domain for autonomous driving systems. IEEE Intell. Transp. Syst. Mag. 2021, 14, 10–24. [Google Scholar] [CrossRef]
  15. Tak, S.; Kim, S.; Yu, H.; Lee, D. Analysis of relationship between road geometry and automated driving safety for Automated Vehicle-based Mobility Service. Sustainability 2022, 14, 2336. [Google Scholar] [CrossRef]
  16. Zhang, F.; Li, C.; Yang, F. Vehicle detection in urban traffic surveillance images based on convolutional neural networks with feature concatenation. Sensors 2019, 19, 594. [Google Scholar] [CrossRef] [PubMed]
  17. Ji, H.; Gao, Z.; Mei, T.; Li, Y. Improved faster R-CNN with multiscale feature fusion and homography augmentation for vehicle detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1761–1765. [Google Scholar] [CrossRef]
  18. Wang, Z.; Huang, J.; Xiong, N.N.; Zhou, X.; Lin, X.; Ward, T.L. A robust vehicle detection scheme for intelligent traffic surveillance systems in smart cities. IEEE Access 2020, 8, 139299–139312. [Google Scholar] [CrossRef]
  19. Pillai, A.S. Traffic Surveillance Systems through Advanced Detection, Tracking, and Classification Technique. Int. J. Sustain. Infrastruct. Cities Soc. 2023, 8, 11–23. [Google Scholar]
  20. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  21. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  22. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  23. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  24. Islam, Z.; Abdel-Aty, M.; Cai, Q.; Yuan, J. Crash data augmentation using variational autoencoder. Accid. Anal. Prev. 2021, 151, 105950. [Google Scholar] [CrossRef] [PubMed]
  25. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  26. Moon, J.; Jung, S.; Park, S.; Hwang, E. Conditional tabular GAN-based two-stage data generation scheme for short-term load forecasting. IEEE Access 2020, 8, 205327–205339. [Google Scholar] [CrossRef]
  27. Engelmann, J.; Lessmann, S. Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst. Appl. 2021, 174, 114582. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Zaidi, N.A.; Zhou, J.; Li, G. GANBLR: A tabular data generation model. In Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), Auckland, New Zealand, 7–10 December 2021; pp. 181–190. [Google Scholar]
  29. Kotelnikov, A.; Baranchuk, D.; Rubachev, I.; Babenko, A. TabDDPM: Modelling tabular data with diffusion models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 17564–17579. [Google Scholar]
  30. Tao, H. Erasing-inpainting-based data augmentation using denoising diffusion probabilistic models with limited samples for generalized surface defect inspection. Mech. Syst. Signal Process. 2024, 208, 111082. [Google Scholar] [CrossRef]
  31. Ramezan, A.C.; Warner, T.A.; Maxwell, A.E. Evaluation of sampling and cross-validation tuning strategies for regional-scale machine learning classification. Remote Sens. 2019, 11, 185. [Google Scholar] [CrossRef]
  32. Dutschmann, T.M.; Kinzel, L.; Ter Laak, A.; Baumann, K. Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation. J. Cheminform. 2023, 15, 49. [Google Scholar] [CrossRef] [PubMed]
  33. Fonseca, J.; Bacao, F. Tabular and latent space synthetic data generation: A literature review. J. Big Data 2023, 10, 115. [Google Scholar] [CrossRef]
  34. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  35. Xu, L.; Veeramachaneni, K. Synthesizing tabular data using generative adversarial networks. arXiv 2018, arXiv:1811.11264. [Google Scholar]
  36. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471. [Google Scholar] [CrossRef]
  37. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  38. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  39. Shaik, A.B.; Srinivasan, S. A brief survey on random forest ensembles in classification model. In International Conference on Innovative Computing and Communications: Proceedings of ICICC 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 253–260. [Google Scholar]
  40. Hou, Y.; Edara, P.; Sun, C. Situation assessment and decision making for lane change assistance using ensemble learning methods. Expert Syst. Appl. 2015, 42, 3875–3882. [Google Scholar] [CrossRef]
  41. Wang, Q.; Nguyen, T.T.; Huang, J.Z.; Nguyen, T.T. An efficient random forests algorithm for high dimensional data classification. Adv. Data Anal. Classif. 2018, 12, 953–972. [Google Scholar] [CrossRef]
  42. Favarò, F.; Eurich, S.; Nader, N. Autonomous vehicles’ disengagements: Trends, triggers, and regulatory limitations. Accid. Anal. Prev. 2018, 110, 136–148. [Google Scholar] [CrossRef] [PubMed]
  43. Tengilimoglu, O.; Carsten, O.; Wadud, Z. Implications of automated vehicles for physical road environment: A comprehensive review. Transp. Res. Part E Logist. Transp. Rev. 2023, 169, 102989. [Google Scholar] [CrossRef]
  44. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
Figure 1. The process of enhancing the generalized performance of the classifier model using tabular data augmentation algorithms.
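As a companion to Figure 1, the sketch below outlines one way the augmentation-then-classification workflow can be wired together in Python. The `augment` callable, the random forest settings, and the 80/20 split are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the Figure 1 workflow (illustrative only; the augment
# callable and hyperparameters are assumptions, not the author's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_with_augmentation(X, y, augment):
    """Fit a classifier on real plus synthetic minority samples.

    `augment` is any callable returning synthetic (X, y) pairs,
    e.g. a SMOTE, GAN, or diffusion-based tabular generator.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_syn, y_syn = augment(X_tr, y_tr)          # synthetic disengagement records
    X_aug = np.vstack([X_tr, X_syn])
    y_aug = np.concatenate([y_tr, y_syn])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_aug, y_aug)
    # Evaluation always uses the untouched (non-augmented) test split.
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```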
Figure 2. Study site.
Figure 3. Overview of oversampling-based disengagement data augmentation in the K-fold dataset.
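The sketch below illustrates the idea behind Figure 3: synthetic minority samples are generated only from the training folds, so each held-out fold is evaluated on real data only. SMOTE from imbalanced-learn stands in for the generative models here; the fold count, arrays, and hyperparameters are assumptions.

```python
# Sketch of Figure 3: oversampling is applied to the training folds only,
# never to the held-out fold (illustrative; X and y are NumPy arrays).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def kfold_with_oversampling(X, y, n_splits=5, random_state=0):
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in skf.split(X, y):
        # Augment the training portion only; the test fold stays untouched.
        X_res, y_res = SMOTE(random_state=random_state).fit_resample(
            X[train_idx], y[train_idx])
        clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        clf.fit(X_res, y_res)
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(scores), np.std(scores)
```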
Figure 4. Comparison between the raw dataset and synthetic dataset with different augmentation methods based on principal component analysis (PCA).
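A minimal sketch of the comparison in Figure 4: fit PCA on the raw records and project both the raw and synthetic records onto the first two principal components. Scaling and plotting choices are assumptions.

```python
# Sketch of the Figure 4 comparison: overlay raw and synthetic samples in the
# PCA space fitted on the raw data (illustrative only).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_pca_overlay(X_raw, X_syn, label="augmented"):
    scaler = StandardScaler().fit(X_raw)
    pca = PCA(n_components=2).fit(scaler.transform(X_raw))
    raw_2d = pca.transform(scaler.transform(X_raw))
    syn_2d = pca.transform(scaler.transform(X_syn))
    plt.scatter(raw_2d[:, 0], raw_2d[:, 1], s=8, alpha=0.5, label="raw")
    plt.scatter(syn_2d[:, 0], syn_2d[:, 1], s=8, alpha=0.5, label=label)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()
```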
Figure 5. Receiver operating characteristics (ROC) curves of different data augmentation models.
Figure 6. Precision–recall (PR) curves of each model.
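Curves such as those in Figures 5 and 6 can be generated from a fitted classifier's predicted probabilities, as sketched below; the side-by-side layout and labels are assumptions.

```python
# Sketch of ROC (Figure 5) and PR (Figure 6) curve generation from predicted
# probabilities (illustrative plotting code, not the paper's scripts).
import matplotlib.pyplot as plt
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

def plot_roc_pr(y_true, y_score, name="model"):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_true, y_score):.3f})")
    ax1.plot([0, 1], [0, 1], "k--")              # chance-level reference line
    ax1.set_xlabel("False positive rate")
    ax1.set_ylabel("True positive rate")
    ax2.plot(rec, prec, label=f"{name} (AP={average_precision_score(y_true, y_score):.3f})")
    ax2.set_xlabel("Recall")
    ax2.set_ylabel("Precision")
    ax1.legend()
    ax2.legend()
    plt.tight_layout()
    plt.show()
```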
Figure 7. Box plots of each model with respect to several performance metrics: (a) AUC value; (b) precision; (c) recall; (d) F1 score.
Table 1. The sample data for disengagement events.
Column | Column Name | Sample Data | Column | Column Name | Sample Data
1 | ID | 435 | 10 | Num_Lanes | 4
2 | Route | Gyeongbu | 11 | Delta_Slope | 0
3 | Direction | 2 | 12 | Delta_Radius | −5000
4 | Milepost_S | 132.86 | 13 | Delta_Lanes | 0
5 | Milepost_E | 133.12 | 14 | AADT | 72,005
6 | Distance | 0.26 | 15 | Max_Speed | 100
7 | Slope | 0.9079 | 16 | Min_Speed | 100.2
8 | Curve_Radius | 5000 | 17 | Mean_Speed | 110.03
9 | Curve_Length | 260 | 18 | Consequence | 0
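For reference, the sample row in Table 1 can be laid out as a single tabular record, as sketched below. The use of pandas and the chosen dtypes are assumptions, and treating Consequence as the outcome column is an interpretation of the table rather than a statement from the paper.

```python
# One disengagement record using the Table 1 schema (values copied from the
# sample row; the DataFrame layout itself is an illustrative assumption).
import pandas as pd

sample_record = pd.DataFrame([{
    "ID": 435, "Route": "Gyeongbu", "Direction": 2,
    "Milepost_S": 132.86, "Milepost_E": 133.12, "Distance": 0.26,
    "Slope": 0.9079, "Curve_Radius": 5000, "Curve_Length": 260,
    "Num_Lanes": 4, "Delta_Slope": 0, "Delta_Radius": -5000, "Delta_Lanes": 0,
    "AADT": 72005, "Max_Speed": 100, "Min_Speed": 100.2,
    "Mean_Speed": 110.03,
    "Consequence": 0,   # assumed to be the outcome column for this record
}])
```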
Table 2. Statistics of disengagement events and their causal factors.
Causal factor categories: Road Geometry; Lines and Lane Markings, and Road Surface Color; Pavement Condition; In-Vehicle Sensor.
Causal Factor (Code): Leaning to One Side (1); Lane Departure (2); Leaning to One Side (3); Zigzag Driving Maneuver (4); Lane Departure (5); Zigzag Driving Maneuver (6); Lane Departure due to Severe Weather (7); Collision Risk due to Delayed Vehicle Recognition (8); Leaning or Lane Departure due to Shadows (9).
Number of Disengagement Events: 21727963554714
Disengagement Occurrence Frequency (per km): 14158540528105663463467226
Table 3. Confusion matrix.
 | Actual Class: Positive | Actual Class: Negative
Prediction Class: Positive | True Positive (TP) | False Positive (FP)
Prediction Class: Negative | False Negative (FN) | True Negative (TN)
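The entries of Table 3 relate to the reported precision, recall, and F1 score through the standard definitions, sketched below with scikit-learn's confusion matrix; the ravel ordering assumes binary 0/1 labels, and this is a worked illustration rather than the paper's evaluation code.

```python
# Standard metric definitions derived from the Table 3 confusion-matrix cells
# (minimal sketch; assumes binary labels encoded as 0/1).
from sklearn.metrics import confusion_matrix

def summarize(y_true, y_pred):
    # For binary 0/1 labels, ravel() returns tn, fp, fn, tp in this order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)                      # positive predictive value
    recall = tp / (tp + fn)                         # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall, "f1": f1}
```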
Table 4. Area under the ROC curve (AUC) values of each model.
Model | AUC Value
Raw | 0.734
SMOTE | 0.886
GAN | 0.927
Diffusion | 0.987
Table 5. F1 scores of each model.
Model | F1 Score
Raw | 0.194
SMOTE | 0.694
GAN | 0.87
Diffusion | 0.943
Table 6. Statistics of mean and standard deviation for the performance metrics in 100-fold cross-validation.
Model | AUC | Precision | Recall | F1 Score
Raw | 0.699 ± 0.026 | 0.634 ± 0.027 | 0.105 ± 0.011 | 0.18 ± 0.017
SMOTE | 0.886 ± 0.01 | 0.836 ± 0.011 | 0.548 ± 0.024 | 0.662 ± 0.021
GAN | 0.916 ± 0.021 | 0.859 ± 0.015 | 0.726 ± 0.098 | 0.778 ± 0.065
Diffusion | 0.986 ± 0.002 | 0.939 ± 0.004 | 0.946 ± 0.004 | 0.943 ± 0.004
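Each cell in Table 6 is a mean ± standard deviation aggregated over the cross-validation folds. A minimal sketch of that aggregation follows; the use of the sample standard deviation is an assumption, and the example scores are placeholders rather than values from the study.

```python
# Sketch of the mean ± std aggregation behind Table 6 (illustrative only; the
# paper uses 100 folds, and whether it uses sample or population std is assumed).
import numpy as np

def mean_std(per_fold_scores):
    scores = np.asarray(per_fold_scores, dtype=float)
    return f"{scores.mean():.3f} ± {scores.std(ddof=1):.3f}"

# Placeholder per-fold F1 scores for one augmentation model.
print(mean_std([0.940, 0.950, 0.942, 0.938, 0.946]))
```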
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
