Article

ReMAHA–CatBoost: Addressing Imbalanced Data in Traffic Accident Prediction Tasks

1 School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644002, China
2 Sichuan Provincial Engineering Laboratory of Big Data Visual Analysis, Yibin 644002, China
3 School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644002, China
4 Sichuan Key Provincial Research Base of Intelligent Tourism, Zigong 643000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13123; https://doi.org/10.3390/app132413123
Submission received: 31 October 2023 / Revised: 21 November 2023 / Accepted: 7 December 2023 / Published: 9 December 2023
(This article belongs to the Section Transportation and Future Mobility)

Featured Application

ReMAHA–CatBoost is a machine learning model for predicting traffic accident severity. It is constructed in two parts, ReMAHA (a relief–F-weighted Mahalanobis distance oversampling algorithm built on genetic recombination) and CatBoost, offering an innovative solution for imbalanced data classification. Key features and highlights: (1) ReMAHA oversampling: ReMAHA employs the relief–F algorithm for feature selection and combines it with a novel oversampling technique to enhance prediction accuracy for minority classes; (2) feature engineering: the model leverages feature engineering to determine the significance of different attributes, enabling precise predictions of accident severity; and (3) CatBoost integration: ReMAHA–CatBoost incorporates CatBoost, a state-of-the-art gradient-boosting algorithm, to improve predictive performance by mitigating issues such as overfitting and prediction bias. This paper elucidates how oversampling algorithms work in machine learning tasks on imbalanced datasets, specifically addressing the low accuracy stemming from imbalanced data at the data level. The experimental results show that ReMAHA–CatBoost outperforms several other oversampling algorithms and models, especially on the US–Accidents traffic accident dataset, whose class imbalance ratio reaches 91.40, thereby improving the precision of traffic accident severity prediction.

Abstract

Using historical information from traffic accidents to predict accidents has long been an area of active exploration for researchers in the field of transportation. However, predicting only the occurrence of traffic accidents is insufficient for providing comprehensive information to the relevant authorities. Therefore, further classification of predicted traffic accidents is necessary to better identify and prevent potential hazards and the escalation of accidents. Because the occurrence rates of different severity levels of traffic accidents differ greatly, data imbalance becomes a critical issue. To address the challenge of predicting extremely imbalanced traffic accident events, this paper introduces a predictive framework named ReMAHA–CatBoost. To evaluate its effectiveness, we conducted experiments on the US–Accidents traffic accident dataset, where the class imbalance ratio reaches 91.40. The experimental results demonstrate that the proposed model exhibits exceptional predictive performance in the domain of imbalanced traffic accident prediction.

1. Introduction

Road traffic accidents pose significant economic and medical burdens globally, often leading to catastrophic family tragedies and accounting for approximately 1.3 million fatalities annually. However, as highlighted by the World Health Organization (WHO), injuries caused by traffic accidents can be prevented [1,2]. With the rapid advancement of machine learning, the field of traffic accident prediction has demonstrated tremendous potential: it not only enhances the accuracy of accident prediction but also allows for real-time response to traffic conditions, reducing accident risks. Existing research has primarily focused on whether traffic accidents will occur, often neglecting the further classification of potential accidents. Ideally, all potential traffic accidents should receive significant attention, with resources mobilized to prevent their occurrence. However, due to limited resources and the need for multi-agency coordination, it is often challenging to address all traffic accidents promptly. Therefore, it is essential to give additional attention to the possibility of severe traffic accidents and take necessary measures to prevent their occurrence or prepare for their management in advance.
Simultaneously, traffic accidents typically exhibit characteristics of imbalance, such as the far smaller number of accident samples compared to normal traffic samples [3], a significantly higher number of accidents in high-risk areas compared to low-risk areas [4], and an imbalance in the sample proportions of different accident types. In particular, the number of severe events is much lower than that of common events. Machine learning models tend to learn this prior information about the class proportions in the training set, which biases actual prediction toward the majority class (possibly yielding better accuracy for common accidents but poorer accuracy for severe ones). Unfortunately, previous work has often overlooked the probability of occurrence of these minority-class accidents, even though severe traffic accidents are precisely the ones that deserve the most attention.
Because traffic accident datasets contain a considerable number of Boolean features, such as points of interest (POI), streets, wind direction, and weather, feature sparsity remains an issue in traffic accident prediction. To address the aforementioned issues, this paper introduces the ReMAHA–CatBoost traffic accident prediction model, which aims to classify potential traffic accidents and to resolve the prediction inaccuracies induced by imbalanced sample quantities and high-dimensional, sparse features. The model incorporates feature selection and clustering into the oversampling process to mitigate the tendency of traditional genetic algorithms to generate overly divergent data. It also addresses the difficulty of computing the Mahalanobis distance during oversampling, caused by sparse features. Additionally, the weight matrix obtained with the relief–F algorithm not only supports weighted Mahalanobis distance computation, increasing the impact of feature importance on sampling, but also serves for original feature selection to avoid the curse of dimensionality and overfitting. The contributions of this paper can be summarized as follows:
  • We introduce ReMAHA–CatBoost, a traffic accident prediction model based on CatBoost [5]. ReMAHA–CatBoost leverages the relief–F algorithm to obtain feature importance and MAHAKIL to avoid generating data with a centralized distribution, effectively alleviating the imbalance issue in traffic accident data;
  • To address the concern of Mahalanobis distance computation in MAHAKIL, which does not consider feature importance, we employ the relief–F algorithm to obtain a diagonal matrix representing feature importance and perform weighted Mahalanobis distance calculations;
  • Additionally, to tackle the problem of overly divergent generated data and the large overall sample size in the classical MAHAKIL algorithm, we utilize mini-batch K-means to cluster the samples before data generation;
  • Furthermore, we leverage the advantage of CatBoost in reducing gradient bias, enhancing the model’s generalization capabilities;
  • We evaluate ReMAHA–CatBoost against three other sampling algorithms combined with CatBoost, as well as ReMAHA combined with four other prediction models, on the US–Accidents dataset. The results demonstrate that ReMAHA–CatBoost outperforms the other models on imbalanced traffic accident data, highlighting its generalization and effectiveness in the domain of traffic accidents.
The structure of this paper is outlined as follows: in Section 2, we delve into pertinent research concerning traffic accident prediction and approaches addressing data imbalance. Section 3 expounds upon the utilized dataset and delineates the architecture design of ReMAHA–CatBoost. The subsequent section, Section 4, presents a detailed analysis of the experimental results. Section 5 engages in a discussion of the research findings presented herein. Finally, Section 6 summarizes the broad body of work and real-world contributions of this study.

2. Related Work

In the field of traffic accident research, accurate accident prediction contributes to the identification of traffic hazards, optimization of traffic system resource allocation, and timely provision of medical assistance. There have been many excellent research achievements in the domain of traffic accident prediction [6,7,8], and they have played a crucial role in revealing the mechanisms behind traffic accident occurrences [9]. For example, Nur et al. [10] used the least-squares method to analyze the relationship between environmental factors and accidents, identifying correlations between rainfall, temperature, wind speed, and accidents. Li et al. [11] conducted association rule mining and classification studies, revealing that physiological factors such as alcohol consumption are more likely to lead to fatalities in accidents, while the impact of natural environmental factors like weather on the fatality rate is relatively smaller. Wang et al. [12] employed the association-rule-based Apriori algorithm to explore influential factors in traffic accidents and investigate strong association rules among various causal factors.
Through our investigation, we found that existing research often focuses on the causes and underlying factors of accident occurrences, using this information to predict whether traffic accidents will happen. Furthermore, most studies conduct single-dimensional analyses, allowing only shallow data analysis and making it difficult to express the spatiotemporal correlations in traffic accidents [13]. Moreover, these studies do not further classify traffic accidents; offering only the set of conditions that may trigger an accident, such coarse-grained information contributes little to the traffic system.
Simultaneously, due to the phenomenon of imbalanced samples in historical traffic accident data, especially where the number of severe traffic accidents is significantly lower than that of common accidents, traditional prediction models demonstrate poor performance. Currently, there are also many outstanding works addressing the common problem of sample imbalance, which has led to improved accuracy for minority class samples in prediction models. Broadly, their approaches mainly fall into two directions: data-level and algorithm-level [14,15]. Data-level approaches include oversampling, undersampling, and hybrid sampling, while algorithm-level approaches involve ensemble learning and cost-sensitive algorithms. Specifically, they are as follows:
(1) Undersampling Methods: Undersampling methods start from the majority class and balance the dataset by removing samples from the classes with higher quantities. For example, Dai et al. [16] began by eliminating duplicate samples and expanded the detection range of the Tomek-link undersampling algorithm by introducing a global re-labeling index. Wei et al. [17], focusing on data complexity, proposed an undersampling algorithm based on weighted complexity, WCP-UnderSampler, which achieved promising results on defect prediction datasets. However, undersampling can lead to information loss in the majority class and affect classifier generalization.
(2) Oversampling Methods: Oversampling methods introduce new minority class data to create a superset of minority class samples, reducing the imbalance between data categories [18]. For instance, SMOTE is a classical oversampling algorithm for addressing imbalanced data [19]. Gao et al. [20] proposed an improved SMOTE oversampling algorithm based on ant clustering, addressing both inter-cluster and intra-cluster data imbalance. Bennin et al. [21] introduced genetic chromosome theory into the sampling domain, presenting the MAHAKIL algorithm, which ensures that the generated data inherit the features of parent data instances. It outperforms other oversampling methods on multiple datasets. However, oversampling methods can lead to data overlap and distribution marginalization issues, potentially trapping the algorithm in local optima.
(3) Mixed Sampling: Mixed sampling combines both undersampling and oversampling techniques to balance the data. For example, Wang et al. [22] synthesized minority class and majority class samples separately using a generative adversarial network (GAN) and SMOTE. They compared this dual oversampling strategy to results obtained from a single oversampling approach targeting the minority class only, and found that the dual oversampling strategy outperformed the single oversampling method. While mixed sampling can alleviate information loss in undersampling and overfitting in oversampling, research suggests that the order of executing undersampling and oversampling in mixed sampling can influence predictive accuracy [23].
(4) Ensemble Learning: Different from traditional individual learners, ensemble learning combines multiple weak learners to create a strong learner [24]. The two most typical forms of ensemble learning, based on the composition of base learners, are bagging and boosting. Navaneeth et al. [25] implemented a hybrid ensemble learning model that combines a CNN with CatBoost, providing a new approach for non-invasive COVID-19 detection. Yan et al. [26] combined undersampling with ensemble learning and proposed a spatial undersampling model for local pattern learning. Because ensemble methods aim to enhance overall accuracy, they struggle to handle imbalanced classification problems on their own [27] and often need to be used in conjunction with other algorithms.
(5) Cost-Sensitive Learning: Cost-sensitive methods account for error costs by weighting instances and incorporating these costs during model construction. Moral-García et al. [28] proposed a cost-sensitive imprecise credal decision tree based on the nonparametric predictive inference model. In addition to modifying tree generation, fusing active learning with cost-sensitive algorithms can effectively enhance classification performance on imbalanced data [29]. Although combining cost-sensitive learning with classification models effectively improves predictive accuracy, determining misclassification costs still requires substantial effort.
These various types of methods each have their own characteristics and have made significant progress. However, considering the randomness and diversity of traffic accident data [30], as well as the abundance of sparse features in the field, applying them directly to the sample imbalance in traffic accident prediction can lead to suboptimal performance. Given these considerations, this paper addresses the sample imbalance phenomenon in severely imbalanced traffic accident datasets from both the data level and the algorithm level, merging the two to enhance the robustness of predictions.

3. Methodology

This paper aims to further classify the results of traffic accident prediction and address the common issue of sample imbalance in traffic accident data, which can reduce the accuracy of model predictions. To tackle this challenge, we propose a traffic accident prediction model that integrates relief–F, MAHAKIL, mini-batch K-means, and CatBoost. The primary objective is to mitigate the adverse impact of sample imbalance on model performance and more accurately identify the severity of traffic accidents.
To evaluate the effectiveness of the ReMAHA–CatBoost model in the field of traffic accident prediction, we chose the US–Accidents dataset [31,32] as the subject of our study. We initially conducted a series of data preprocessing steps, including the handling of missing values, duplicate values, and outliers, as well as one-hot encoding and feature engineering. These steps provided critical data for predicting the severity of traffic accidents. Subsequently, we developed a hybrid sampling prediction model using the ReMAHA–CatBoost architecture. This model is designed to alleviate the data imbalance issue in traffic accidents and perform the classification task for the severity of traffic accidents.
At the data level, the strong randomness in traffic accident data makes it challenging to generate data using traditional oversampling methods. Therefore, we first perform clustering on minority class samples, enhancing the similarity between samples, and then use MAHAKIL to generate new minority class samples within the clusters. Furthermore, due to the presence of a considerable number of Boolean features among the traffic accident characteristics, feature sparsity can occur. When dealing with large datasets and extremely sparse features, computing the Mahalanobis distance becomes challenging because the covariance matrix may become non-invertible, impeding the traditional Mahalanobis distance computation.
We introduce the relief–F algorithm to process the features after clustering; it can remove near-uniform features within the clusters and fill them back in after the data are generated. This processing prevents sparse features from interfering with the computation of the Mahalanobis distance. In addition, we use the obtained diagonal matrix of feature importance weights to weight the Mahalanobis distance. The purpose of this step is to correct the traditional Mahalanobis distance's tendency to exaggerate the role of low-variance variables and to increase the influence of important features on the distance.
The extensive volume of traffic accident data is highly advantageous for subsequent prediction. However, the internal data distribution is imbalanced, which makes traditional algorithms prone to overfitting and biases predictions toward categories with more samples. We therefore use CatBoost for prediction at the algorithm level to mitigate the overfitting problem in the traffic accident classification task. CatBoost's ordered boosting and its handling of gradient bias enhance the model's predictive capabilities on the imbalanced traffic accident dataset. The following subsections explain the mentioned methods in detail. The overall workflow of the proposed method is outlined in Figure 1 and elaborated in this section.

3.1. Dataset

Given the extensive volume, imbalance, and feature sparsity of traffic accident data, we selected the highly representative US–Accidents dataset as the focal point of our study. It contains traffic accident data for the United States from 2016 to 2022, sourced from both Bing and MapQuest. The dataset comprises over 7 million records, making it large in terms of data quantity, and it covers a wide range of feature dimensions. Based on the feature attributes, the dataset's features fall into three types: numerical, Boolean, and text, as shown in Table 1.
The US–Accidents traffic accident dataset is imbalanced, where the accident severity levels range from 1 to 4 (Table 2). These severity levels indicate increasing degrees of severity, with 1 representing the least impact on traffic (i.e., causing short-term delays) and 4 representing a more significant impact on traffic (i.e., causing long-term delays). In terms of data quantity, there is a substantial difference between the number of events with severity levels 1 and 4 compared to severity levels 2 and 3.
The overall dataset exhibits a pronounced imbalance. Based on the imbalance ratio (IR) equation, it can be concluded that the imbalance ratio of this dataset is as high as 91.40.
$$\mathrm{IR} = \frac{N_{major}}{N_{minor}} \tag{1}$$
where N_major represents the number of samples in the largest class of the dataset and N_minor represents the number of samples in the smallest class.
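As a concrete illustration of Equation (1), the following minimal Python sketch computes the IR from a label column; the toy series stands in for the US–Accidents Severity field, so the values shown are illustrative only.

```python
import pandas as pd

# Minimal sketch of Equation (1): imbalance ratio from class counts.
# The toy labels stand in for the US-Accidents "Severity" column.
labels = pd.Series([2, 2, 2, 2, 2, 3, 3, 3, 1, 4])
counts = labels.value_counts()      # samples per class
ir = counts.max() / counts.min()    # IR = N_major / N_minor
print(f"IR = {ir:.2f}")             # 5.00 here; 91.40 reported for US-Accidents
```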
Data preprocessing stands as a crucial part throughout the entire machine learning process. In order to avoid the “noise” in the dataset from affecting the experiment, we conducted operations to handle missing values, duplicates, and outliers. Additionally, we engineered some new features. Following these data processing steps, we partitioned the dataset into two subsets, the training set and the test set, for model validation purposes.

3.1.1. Missing Value Handling

Table 3 describes the distribution of missing values in the dataset. In the experiment, missing values in natural environmental features were imputed using the median of the same weather station and the same month. Additionally, for spatial environmental features, missing values were imputed using data from geographically close locations. Some features with a significant number of missing values lost their original utility and were subsequently removed.

3.1.2. Duplicate Value Handling

From the perspective of the features in the US–Accidents traffic accident dataset, there are two sources of duplicate values in the dataset:
  • Repetitive reporting of the same traffic accident information by different reporting sources. For the same traffic accident, different reporting sources may report it, and differences in the source can also result in variations in geographic and time information. Based on the different data sources, the source field is categorized into Source 1: reported from Bing, Source 2: reported from MapQuest, and Source 3: reported from both parties;
  • Multiple reports of information caused by the same traffic accident source. This is because the determination of traffic accident types in US–Accidents is based on whether traffic delays occur in a specific geographical location, and there is a strong spatial correlation between adjacent roads. Therefore, closely located traffic congestion may have a chain reaction, leading to multiple reports of traffic congestion on closely located roads caused by the same traffic accident source.
Given the two mentioned reasons, to ensure minimal and accurate duplicate data in the experiment's dataset, while also avoiding data leakage due to duplicate values, we excluded the 2016 data, which had fewer records overall and a substantial disparity in data volume between months. We retained data from 2017 to 2022 for the experiment. Subsequently, we screened the remaining dataset of over six million entries for duplicate values, filtering out entries with a time interval smaller than 10 min and a geographical distance of less than 250 m, assessed by the haversine formula in Equation (2). Here, φa and φb represent the latitudes of points a and b, respectively, λa and λb denote the longitudes of points a and b, and r is the Earth's radius.
$$\mathrm{dist}(a,b) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_a - \varphi_b}{2}\right) + \cos(\varphi_a)\cos(\varphi_b)\sin^2\left(\frac{\lambda_a - \lambda_b}{2}\right)}\right) \tag{2}$$
To provide a more detailed demonstration of the process of eliminating duplicate data based on time and distance, we will illustrate it through the following example (Table 4):
Suppose the four records above are checked for duplicates. First, the data pairs with a time interval of less than 10 min are identified: (X1, X3), (X1, X4), and (X2, X3). Substituting the longitude and latitude values of these pairs into Equation (2) for the distance check yields dist(X1, X3) = 129,461.41 m, dist(X1, X4) = 240.38 m, and dist(X2, X3) = 179,192.03 m. Only (X1, X4) falls within 250 m, so X1 and X4 are treated as duplicates; since X1 occurred earlier than X4, X1 is kept as the record of the place where the accident originated.
In summary, we screened the 2017–2022 data and deleted duplicate records with time intervals of less than 10 min and geographical distances of less than 250 m, ensuring the purity of the data and avoiding data leakage.
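To make the duplicate rule concrete, here is a minimal Python sketch of the check described above, assuming each record carries a timestamp and WGS-84 coordinates; the field names and helper functions are our own illustrative choices, not part of the dataset or the paper's code.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in meters, Equation (2)."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lam = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def is_duplicate(rec_a, rec_b, max_minutes=10, max_meters=250):
    """Flag two records as duplicates if close in both time and space."""
    dt_min = abs((rec_a["time"] - rec_b["time"]).total_seconds()) / 60
    d = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    return dt_min < max_minutes and d < max_meters

# Toy records roughly 240 m and 3 min apart are flagged as duplicates:
x1 = {"time": datetime(2020, 5, 1, 8, 0), "lat": 39.0000, "lon": -84.0000}
x4 = {"time": datetime(2020, 5, 1, 8, 3), "lat": 39.0020, "lon": -84.0010}
print(is_duplicate(x1, x4))  # True
```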

3.1.3. Handling Outliers

In addition, the dataset contains some anomalies caused by malfunctioning environmental measurement instruments and other factors. Left untreated, they would skew the model's predictions. To provide a clear picture of the original data, we conducted a feature analysis; Table 5 presents descriptive statistics for some of the features in the original data.
Table 5 includes the lower quartile (Q1), the upper quartile (Q3), the lower bound (lower), the upper bound (upper), the maximum (max), and the minimum (min) for numerical features such as temperature (F), humidity (%), pressure (in), and wind speed (mph). The interquartile range (IQR) and the bounds are calculated as:
$$IQR = Q_3 - Q_1 \tag{3}$$
$$lower = Q_1 - 1.5 \times IQR \tag{4}$$
$$upper = Q_3 + 1.5 \times IQR \tag{5}$$
The maximum and minimum values are the largest and smallest occurrences of a particular feature in the original dataset. If the maximum exceeds the upper bound or the minimum falls below the lower bound, the feature contains outliers (e.g., Wind_Speed (mph)); when the maximum is below the upper bound and the minimum is above the lower bound, the feature contains no outliers (e.g., humidity (%)). It is worth noting that the maximum and minimum values of some environmental features far exceed the most extreme values observed in a normal natural environment. Since the upper and lower bounds of the IQR effectively represent the distribution of the majority of the data [33,34], and to keep the data close to real environmental conditions, records that fall significantly outside the interquartile range were removed in this experiment.
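A minimal pandas sketch of the IQR filter in Equations (3)-(5) follows; the column name is an illustrative stand-in for the dataset's numerical features.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows inside [Q1 - k*IQR, Q3 + k*IQR], Equations (3)-(5)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy example: the 120 mph reading falls outside the bounds and is dropped.
wind = pd.DataFrame({"Wind_Speed(mph)": [3, 5, 6, 7, 8, 9, 10, 120]})
print(remove_iqr_outliers(wind, "Wind_Speed(mph)"))
```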

3.1.4. Feature Engineering

In order to fully incorporate text features into the model building process, the original text features were one-hot encoded. Here are some examples of the original text features (Table 6):
The processed textual features are shown below (Table 7):
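For illustration, here is a minimal one-hot encoding sketch with pandas; the two column names are hypothetical stand-ins for the dataset's text features, not the paper's exact column list.

```python
import pandas as pd

# Minimal sketch: one-hot encoding of text features with pandas.
# "Weather_Condition" and "Wind_Direction" are illustrative column names.
raw = pd.DataFrame({
    "Weather_Condition": ["Clear", "Rain", "Clear"],
    "Wind_Direction": ["N", "SW", "N"],
})
encoded = pd.get_dummies(raw, columns=["Weather_Condition", "Wind_Direction"])
print(encoded)  # each category becomes its own 0/1 indicator column
```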

3.2. The ReMAHA Algorithm

This study introduces a novel model for predicting the severity of unbalanced traffic accidents (ReMAHA–CatBoost). It aims to address the issue of poor prediction performance caused by data imbalance in the traffic accidents domain. At the data level, we propose a new oversampling algorithm called ReMAHA, which effectively prevents overfitting in the generated data.
The ReMAHA model consists of three main components: feature weight calculation, inter-cluster weighted Mahalanobis distance ranking, and pairwise generation of new data with a genetic algorithm. First, we use the relief–F algorithm to extract significant features, alleviating the issue of excessively sparse features; through relief–F, we calculate the importance weights of the features and represent them as a diagonal matrix. Next, we use a clustering algorithm to cluster the minority class samples, ensuring that the data do not become too scattered when new data are generated pairwise by the genetic algorithm. Once the data are grouped into clusters, we apply weighted Mahalanobis distance ranking to the data within each cluster. Finally, we employ a genetic algorithm to generate new data pairwise, addressing the data overlap and excessive concentration that may arise from oversampling. The flow of the ReMAHA model is shown in Figure 2.

3.2.1. Feature Weight Calculation

In this study, we first employed the relief–F algorithm for feature selection and obtained a weight matrix used to assess feature importance. The choice of relief–F is driven by the following consideration: the sparsity of features and the large volume of traffic accident data both cause the covariance matrix to be non-invertible, and using relief–F to filter out overly sparse and uninformative features alleviates this problem.
Relief–F is an algorithm designed for multi-class problems that calculates feature weights based on the correlation between features and class labels. Its primary objective is to compute a weight for each feature, thereby determining which features are more influential for subsequent classification tasks [35]. Given a feature A, its weight at the t-th iteration is denoted as Wt(A). At iteration t + 1, the relief–F algorithm updates the feature weight as follows:
$$W_{t+1}(A) = W_t(A) - \frac{1}{Tk}\sum_{j=1}^{k} \mathrm{diff}(A, X_i, NH_j) + \frac{1}{Tk}\left[\sum_{C \neq Class(X_i)} \frac{p(C)}{1 - p(Class(X_i))} \times \sum_{j=1}^{k} \mathrm{diff}(A, X_i, NM_j(C))\right] \tag{6}$$
where T represents the number of iterations, k is the number of nearest neighbors chosen, p(c) represents the prior probability of class c, and Class(Xi) indicates the class to which Xi belongs.
In a given dataset, the relief–F algorithm operates as follows: first, a sample Xi is randomly selected; then its k nearest same-class neighbors NH (nearest hits) and, for each other class C, its k nearest different-class neighbors NM(C) (nearest misses) are found in the dataset. The function diff(A, Xi, Xj) calculates the distance between sample points Xi and Xj on feature A, where max(A) and min(A) denote the maximum and minimum values of feature A.
For numerical features, when feature A is numerical:
$$\mathrm{diff}(A, X_i, NH_i) = \frac{|X_{i,A} - NH_{i,A}|}{|\max(A) - \min(A)|} \tag{7}$$
For discrete features, when feature A is discrete:
$$\mathrm{diff}(A, X_i, NH_i) = \begin{cases} 0, & X_{i,A} = NH_{i,A} \\ 1, & X_{i,A} \neq NH_{i,A} \end{cases} \tag{8}$$
After T iterations, each feature will obtain a feature weight wi, forming a feature weight matrix W = [w1, w2, …, wm]. The larger wi is, the better the feature is for classification.
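For clarity, below is a minimal NumPy sketch of the relief–F update in Equations (6)-(8), restricted to numerical features so that diff() reduces to a range-normalized absolute difference; it is our illustrative re-implementation under that assumption, not the authors' code.

```python
import numpy as np

def relief_f_weights(X, y, n_iter=100, k=5, seed=0):
    """Sketch of the relief-F weight update, Equations (6)-(8),
    for numerical features only."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                        # guard constant features
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        diffs = np.abs(X - X[i]) / span          # per-feature diff(), Eq. (7)
        dist = diffs.sum(axis=1)
        hits = np.where(y == y[i])[0]
        hits = hits[hits != i]                   # exclude the sample itself
        nh = hits[np.argsort(dist[hits])[:k]]    # k nearest hits
        if len(nh):
            W -= diffs[nh].mean(axis=0) / n_iter
        for c in classes:                        # k nearest misses per class
            if c == y[i]:
                continue
            miss = np.where(y == c)[0]
            nm = miss[np.argsort(dist[miss])[:k]]
            W += (prior[c] / (1 - prior[y[i]])) * diffs[nm].mean(axis=0) / n_iter
    return W

# The weights give the diagonal matrix used in Section 3.2.2: B = np.diag(W)
```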

3.2.2. Inter-Cluster Weighted Mahalanobis Distance Sorting

Directly generating new data using a genetic algorithm can lead to excessive divergence in the generated data, potentially causing a loss of original features [36]. Given the large volume of traffic accident data, this study first employs mini-batch K-means to cluster the minority class samples. This process groups data with similar features into clusters, which are then used for sorting the sample data within each cluster.
Additionally, the traditional Mahalanobis distance, although it addresses the issue of incommensurate feature scales that Euclidean distance leaves unresolved, neglects the varying influence of features on the labels. Applying feature weighting leads to more precise distance calculations between sample points. Here, the samples used for distance calculation are assumed to originate from the same cluster, with the cluster center represented as C = (c1, c2, …, cn) and the point for the Mahalanobis distance calculation as X = (x1, x2, …, xn). The feature importance diagonal matrix B is generated using the relief–F algorithm. The weighted Mahalanobis distance is then represented as follows:
$$D_M(X, C) = \sqrt{(X - C)^T B \Sigma^{-1} B^T (X - C)} \tag{9}$$
Σ represents the covariance matrix, and the formula for calculating the covariance is as follows:
$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1} \tag{10}$$
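A small NumPy sketch of Equation (9) follows; using the pseudoinverse in place of Σ⁻¹ is our pragmatic guard against the near-singular covariance matrices discussed earlier, not a step stated in the paper.

```python
import numpy as np

def weighted_mahalanobis(x, center, cov, B):
    """Weighted Mahalanobis distance of Equation (9)."""
    d = x - center
    # pinv() guards against a (near-)singular covariance matrix
    return float(np.sqrt(d @ B @ np.linalg.pinv(cov) @ B.T @ d))

# Illustrative use inside one cluster (all numbers are synthetic):
rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 3))
center = cluster.mean(axis=0)
cov = np.cov(cluster, rowvar=False)
B = np.diag([0.6, 0.3, 0.1])   # hypothetical relief-F importance weights
print(weighted_mahalanobis(cluster[0], center, cov, B))
```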

3.2.3. Genetic Algorithm for Generating New Data

The genetic theory of chromosome inheritance posits that during the formation of chromosomes, each parent contributes an equal half of the total genes to their offspring [37]. The new offspring inherits 50% of their genes from each parent, making them similar to their parents while also possessing unique traits. In this study, sample data are treated as chromosomes, and new data are constructed from these samples. This approach ensures that the newly generated data retain common characteristics while also exhibiting the uniqueness of the original two datasets. Using genetic algorithms guarantees the diversity of generated data and effectively addresses the issues of marginalization of data distribution and data overlap that traditional oversampling methods encounter.
The formula for generating data with the genetic algorithm, shown in Equation (11), incorporates an influence factor β to preserve the uniqueness of samples. In this study, we simply set β to 0.5:
$$x = \beta a + (1 - \beta) b \tag{11}$$
Here, a and b represent the parent samples used for generating data, while x represents the newly generated sample. Since β equals 0.5 in this experiment, both parent samples contribute equally to the newly generated sample.
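As a minimal sketch of Equation (11):

```python
import numpy as np

def generate_offspring(a: np.ndarray, b: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Equation (11): the child is a beta-weighted blend of the two parents."""
    return beta * a + (1.0 - beta) * b

# With beta = 0.5 the child is the midpoint of its parents:
print(generate_offspring(np.array([1.0, 4.0]), np.array([3.0, 0.0])))  # [2. 2.]
```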

3.2.4. ReMAHA

The ReMAHA algorithm, formed by combining the above-mentioned methods, can be summarized into three main steps:
Step 1: Utilize the relief–F feature selection algorithm on the entire dataset to assess the influence of features on accident severity and acquire a weight diagonal matrix;
Step 2: Perform clustering on the minority class samples in the dataset, ensuring that data with similar features are clustered together in the same clusters. Calculate the distance from each point to the cluster center within each cluster and rank them based on the Mahalanobis distance weighted using the weight diagonal matrix;
Step 3: Generate new data using genetic algorithms with β equal to 0.5 based on the sorted distances between data points, and then re-sort the newly generated data. To prevent excessive dispersion, iterate up to 5 times at most for each cluster.
The schematic of ReMAHA-generated data is shown in Figure 3 and the algorithm is shown in Algorithm 1.
Algorithm 1: ReMAHA
Input: Imbalanced dataset D, the number of clusters K.
Output: Oversampled dataset DM
1: Split D into the minority class D+ and the largest majority class D−
2: b = Relief–F_Feature_Selection(D)
3: B = diag(b)
4: Initialize dataset DM
5: Obtain the clusters D+1, …, D+K through Mini_Batch_Kmeans(D+)
6: FOR i in 1, 2, …, K do:
7:   T = len(D+i)/len(D+)
8:   iter = 0
9:   WHILE T > 0 and iter < 5:
10:    D+sorted = sorted_by_weightedMAHA(D+i, B)
11:    Initialize D+temp = {}
12:    iter++
13:    FOR x1, x2 in D+sorted:
14:      xnew = Generate_by_GA(x1, x2)
15:      IF T > 0:
16:        D+temp.append(xnew)
17:      ELSE BREAK
18:      T--
19:    END FOR
20:    D+i.append(D+temp)
21:  END WHILE
22:  DM.append(D+i)
23: END FOR
24: RETURN DM
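For readers who prefer runnable code, below is a hedged Python sketch of Algorithm 1 for a single minority class. The helper names are ours, mini-batch K-means comes from scikit-learn, and pairing adjacent ranked samples is a simplification of MAHAKIL's bin-based parent pairing, so this should be read as an approximation of the procedure above rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def remaha_oversample(X_min, weights, n_clusters=5, max_iters=5, beta=0.5):
    """Approximate sketch of Algorithm 1 for one minority class.
    weights: relief-F feature importances (the diagonal of B)."""
    B = np.diag(weights)
    km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
    labels = km.fit_predict(X_min)
    synthetic = []
    for c in range(n_clusters):
        cluster = X_min[labels == c]
        if len(cluster) < 2:
            continue
        center = cluster.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(cluster, rowvar=False))
        for _ in range(max_iters):
            # rank cluster members by weighted Mahalanobis distance, Eq. (9)
            d = cluster - center
            dist2 = np.sum((d @ B @ cov_inv @ B.T) * d, axis=1)
            ranked = cluster[np.argsort(dist2)]
            # blend neighboring ranked samples pairwise, Eq. (11)
            children = beta * ranked[:-1:2] + (1 - beta) * ranked[1::2]
            synthetic.append(children)
            cluster = np.vstack([cluster, children])
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```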

3.3. CatBoost

On the algorithm level, we have chosen to employ CatBoost, a variant of gradient-boosting algorithms [38,39]. Similar to other gradient-boosting algorithms, CatBoost aims to iteratively learn from errors by combining multiple weak learners.
In gradient-boosting models, a commonly used node-splitting method is greedy target-based statistics, which utilizes the average label value as the node-splitting criterion. However, when extreme values exist in the dataset, or when the data distributions of the training and test sets differ, using the mean value to split nodes can cause conditional shift, reducing prediction accuracy. CatBoost mitigates conditional shift by introducing a prior term, represented by P, and a weight coefficient, represented by a. The formula is as follows:
$$\hat{x}_k^i = \frac{\sum_{j=1}^{p-1} \mathbb{1}_{[x_{\sigma_j,k} = x_{\sigma_p,k}]} \, Y_{\sigma_j} + a \times P}{\sum_{j=1}^{p-1} \mathbb{1}_{[x_{\sigma_j,k} = x_{\sigma_p,k}]} + a} \tag{12}$$
Furthermore, since gradient-boosting models use the same dataset for training in each iteration, a prediction shift can arise between the gradient estimates and the true distribution [40]. CatBoost addresses this issue with an ordered boosting approach, training a separate model Mi for each sample i so that Mi is never trained on sample i itself. CatBoost's handling of conditional and prediction shift makes it more suitable for situations involving imbalanced data.
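As an illustrative sketch of the CatBoost usage (the stand-in data and the reduced iteration count are our choices to keep the toy run fast; the experiments in Section 4 use default parameters):

```python
import numpy as np
from catboost import CatBoostClassifier

# Minimal sketch: multi-class CatBoost on stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))        # placeholder features
y = rng.integers(1, 5, size=500)     # severity labels 1-4
model = CatBoostClassifier(loss_function="MultiClass", iterations=200,
                           verbose=False)
model.fit(X, y)
print(model.predict(X[:3]).ravel())  # predicted severity levels
```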

4. Experiments

When evaluating the performance of the proposed models, we addressed the following research questions:
(1) Does varying the number of clusters affect the generation of new data?
(2) How does the quality of the newly generated data using our proposed method compare with other sampling techniques?
(3) How does this method perform on traffic accident datasets compared to other baseline methods?

4.1. Experimental Setup

The experimental platform utilized in this study includes 64 GB of RAM, an 11th Gen Intel(R) Core(TM) i7-11700K CPU clocked at 3.60 GHz, and an NVIDIA GeForce RTX 3070 Ti GPU, all running on CentOS 7.9.

4.2. Evaluation Metrics

The confusion matrix is the most commonly used method for evaluating the performance of classification problems. The definition of a confusion matrix is shown in Table 8.
Based on the definition of the confusion matrix, various other data classification evaluation metrics have been derived.
Precision represents the proportion of correctly predicted samples for a specific class to the total number of samples predicted as that class:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$
Recall represents the proportion of correctly predicted samples for a specific class to the total number of samples for that class:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$
Typically, precision and recall are trade-offs, and in many cases, it is necessary to consider both precision and recall simultaneously. The Fβ score is used to provide a weighted harmonic mean of precision and recall [41]. Its calculation is as follows:
$$F_\beta = \frac{1}{C}\sum_{i=1}^{C} \frac{(1 + \beta^2) \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\beta^2 \times \mathrm{Precision}_i + \mathrm{Recall}_i} \tag{15}$$
Here, C denotes the number of classes, so the score is macro-averaged across classes. The F1–Score is the most commonly used Fβ score, with β equal to 1, balancing precision and recall equally.
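These metrics can be computed directly with scikit-learn, as in the minimal sketch below; the labels are toy values, not experimental results.

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Toy labels standing in for true and predicted severity levels.
y_true = [1, 2, 2, 3, 4, 4, 2, 1]
y_pred = [1, 2, 3, 3, 4, 2, 2, 4]
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 2, 3, 4], zero_division=0)
print(prec, rec, f1)                              # Eqs. (13)-(14) per class
print(f1_score(y_true, y_pred, average="macro"))  # Eq. (15) with beta = 1
```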

4.3. The Impact of Clustering on Generated Data

The model proposed in this paper involves selecting the number of clusters. To demonstrate the effectiveness of clustering in traffic accident prediction, we conducted experiments to investigate the impact of varying cluster numbers on the results. Given that oversampling algorithms primarily affect the minority class in imbalanced datasets, this experiment focused on the effect of the number of clusters on labels 1 and 4 in the dataset.
Clustering is one of the key steps of the proposed model, and Figure 4 depicts the relationship between the number of clusters and the F1–Score. We varied the number of clusters from 1 to 9. Given the relatively large amount of data, we chose mini-batch K-means as the clustering method.
In Figure 4, the relationship between the number of clusters and the F1–Score variation is illustrated. When the number of clusters is set to 1 (i.e., the entire category forms a single cluster without clustering), the performance is relatively poor compared to cases where clustering is applied. Within a smaller range, the F1–Score improves as the number of clusters increases. However, when the number of clusters becomes excessive, it may lead to highly consistent features within clusters, potentially resulting in unstable quality in the generated data.

4.4. Quality Comparison of Data Generated by Sampling Algorithms

In the previous section, we assessed the impact of internal parameters on model performance. In this section, we compare ReMAHA with different sampling algorithms. At the data level, we compared three oversampling algorithms, namely, SMOTE [19], ADASYN [42], and random oversampling, with the proposed ReMAHA. The parameters used in the sampling models during the experiments are as follows (Table 9):
We used three oversampling algorithms, SMOTE, ADASYN, and random oversampler, and compared them with the proposed ReMAHA method. The predictive models in all cases were CatBoost. The experimental results are shown in Table 10.
Figure 5 shows that the addition of oversampling improves the predictive performance, with a significant increase in the number of predictions for minority class samples. This is mainly reflected in the improvement of recall, where higher recall indicates better identification of minority class samples. However, as recall increases, some precision values decrease. This is because with the increase in minority class samples predicted, the precision of predictions may decrease. As shown in Table 10, ReMAHA achieves a precision of 71.41 and 76.31, recall of 75.60 and 53.23, and an F1–Score of 73.44 and 62.71 for labels 1 and 4, respectively, with oversampling performance relatively superior to other models.
Figure 6a illustrates the ranking of feature importance in the original data. Figure 7 shows the effects of the oversampling algorithms ReMAHA, SMOTE, ADASYN, and ROS on the correlation structure of the data. It is noticeable that, compared to the original data, ReMAHA and ROS produce the closest effects. However, ROS directly duplicates existing data, leading to overly sensitive identification of certain classes during prediction while yielding poor identification of others.
Figure 6b demonstrates the confusion matrix before sampling, while Figure 8 presents the confusion matrices for various oversampling algorithms. Manifested in the confusion matrices, SMOTE, ROS, and ADASYN all exhibited varying degrees of overfitting, primarily by misclassifying label 2 and label 3 as label 1 or label 4. This occurrence is primarily attributed to insufficient data quality in the generated samples. For instance, the process of generating new data might not account for correlations or result in excessive overlap among generated data, rendering correct categorization challenging.

4.5. Predictive Model Comparison Experiment

To evaluate the effectiveness of ReMAHA–CatBoost, we conducted experiments involving eight different models (Table 11). These eight models were divided into two groups for comparison: prediction model comparison experiments, and prediction + sampling (hybrid) model comparison experiments.
At the algorithm level, to ensure the accuracy of the experiments, we chose boosting algorithms, including AdaBoost [43], GBDT [44], XGBoost [45], LightGBM [46], and CatBoost.
To ensure fairness in the experiments, default parameters were used for all models.
First, we compared the prediction models without any sampling. Table 12 shows the results of the five algorithms on the unsampled US–Accidents dataset. It can be observed that AdaBoost performs significantly worse on the minority class label 1 compared to the other models. While it achieves a high recall for the minority class label 4, the low F1–Score indicates that AdaBoost identifies more instances of label 4 but with poor precision. This is because adaptive models focus more on misclassified samples in each iteration, and with a highly imbalanced dataset, the adaptive model tends to favor the majority class during prediction.
CatBoost, XGBoost, and LightGBM, as emerging variants of gradient-boosting algorithms, have similar performance. CatBoost achieves an F1–Score of 72.53 for class label 1 on this dataset, slightly higher than the other two. When considering both minority class labels, CatBoost performs better overall. CatBoost’s superior performance in identifying minority classes in this experiment is mainly attributed to its built-in symmetric trees used as split nodes, making it more suitable for imbalanced datasets.
Table 13 displays the performance of hybrid models. We combined ReMAHA with five predictive algorithms to assess the impact of sampling models on predictive models.
As shown in Figure 9, all models combined with ReMAHA oversampling outperform their counterparts without it. With the introduction of ReMAHA oversampling, the models show notable improvements in F1–Score, ranging from 0.3% to 9.32%, especially for severity levels 1 and 4. While oversampling increases the number of minority class predictions, misclassifications can lower precision. Nevertheless, ReMAHA improves the F1–Score, precision, and recall for minority classes 1 and 4. From the perspective of both algorithm and sampling, the combination of ReMAHA and CatBoost shows the most stable predictive performance; in particular, its F1–Score and recall for minority classes 1 and 4 are the highest, indicating that this hybrid model accurately identifies minority class samples.
To highlight the contrast before and after sampling, we computed, for each prediction algorithm (CatBoost, AdaBoost, GBDT, XGBoost, and LightGBM), the differences between the model with ReMAHA sampling and the model without it, as depicted in Figure 10. After processing with the ReMAHA oversampling algorithm, the weights of minority class samples increase, which can enhance recall at some cost to precision. Overall, however, after incorporating ReMAHA oversampling, the majority of performance metrics of the prediction models improve significantly, especially for label 4.

5. Discussion

ReMAHA–CatBoost's comparisons at different levels confirm its excellent performance in both oversampling and prediction. These experimental results indicate that ReMAHA–CatBoost is effective in recognizing different severity levels of traffic accidents. At the same time, data generation did not consume excessive computing resources: generating new data with the ReMAHA oversampling algorithm consumed 2594.3 MB of memory, less than the SMOTE, ROS, and ADASYN algorithms.
It is worth noting that the addition of oversampling algorithms can increase the number of predictions for minority class samples. However, with increased data generation, the risks also grow: if the generated data are too concentrated, they yield no significant performance improvement, while overly dispersed generated data blur the boundaries between categories. ReMAHA, as proposed in this paper, balances the generated data well through its clustering and data generation process.
Due to CatBoost’s improvements in handling prediction shift and conditional shift, it reduces overfitting. Combined with the characteristics of symmetric trees, it performs exceptionally well on datasets like traffic accidents, which are characterized by sparse features and extreme class imbalance. Recent research has already demonstrated the effectiveness of hybrid approaches combining sampling and prediction models [47]. Long [48] proposed a hybrid sampling model combined with the GNN model, validating the effectiveness of hybrid samplers on public datasets. Wang et al. [49] introduced the MatFind model, which used the K-nearest neighbors algorithm to balance the dataset and employed SVM for prediction, leading to improved performance in miRNA identification. These successful cases suggest that hybrid models combining sampling and prediction are feasible when dealing with imbalanced and large datasets.
However, based on existing work, it is clear that the problems caused by data imbalance in machine learning tasks are far from completely resolved. Data imbalance is a common phenomenon in real-life datasets, and current achievements are not yet enough to fully trust the models. In particular, the US–Accidents traffic accident dataset used in this study has a class imbalance ratio of 91.40; such a massive difference prevents many oversampling algorithms and prediction models from achieving ideal results. Therefore, future work will continue to focus on improving the credibility of models and enhancing their generalization capabilities, primarily in three aspects: exploring correlations among data within the same category from multiple dimensions to generate more representative data; leveraging additional traffic accident data to enhance the model's generalization ability; and further refining feature engineering to improve the model's computational efficiency.

6. Conclusions

In this study, we proposed the ReMAHA–CatBoost model tailored for the imbalanced domain of traffic accident prediction. Our key design comprised three parts: utilizing the relief–F algorithm for feature selection and acquiring feature weight matrices in the feature selection step; integrating the feature weight matrices into the distance calculation process of sample points to enhance the reliability of oversampled data; and, finally, training the CatBoost model using the oversampled dataset.
These designs ensured the model’s capability to identify different features’ impact on accident severity while enhancing the recognition of minority class samples. Experimental results demonstrated the effectiveness of ReMAHA–CatBoost in predicting accident severity, especially its superior performance in identifying minority class samples compared to other oversampling algorithms (SMOTE, ADASYN, and ROS). ReMAHA–CatBoost enables us to accurately classify potential traffic accidents, assisting traffic management authorities in efficiently allocating limited resources for potential severe accidents. Moreover, the proposed model aids relevant authorities in taking necessary preventive measures to mitigate the economic and healthcare burdens caused by traffic accidents.

7. Patents

Patent application number: CN202211527542.5.

Author Contributions

Conceptualization, G.L. and Y.W.; methodology, G.L.; validation, G.L., Y.W., and Y.B.; formal analysis, W.Z.; investigation, G.L.; data curation, Y.B.; writing—original draft preparation, G.L.; writing—review and editing, W.Z.; visualization, Y.B.; supervision, Y.W.; project administration, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan University of Science and Engineering Graduate Student Innovation Fund (y2022179), the Sichuan Provincial Science and Technology Department Project (2023YFG0307) and the Sichuan Province Intelligent Tourism Research Base Project (ZHYJ22-02).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study used the US–Accidents dataset, which is publicly available. It can be found at US Accidents (2016–2023) (https://www.kaggle.com/) (accessed on 31 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. World Health Organization. Road Traffic Injuries. Available online: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries (accessed on 31 October 2023).
  2. World Health Organization. Seize the Moment to Tackle Road Crash Deaths and Build a Safe and Sustainable Future. Available online: https://www.who.int/news/item/25-06-2023-seize-the-moment-to-tackle-road-crash-deaths-and-build-a-safe-and-sustainable-future (accessed on 31 October 2023).
  3. Swathi, B.; Tiwari, H. Integrated Pairwise Testing based Genetic Algorithm for Test Optimization. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 144–150. [Google Scholar] [CrossRef]
  4. Zheng, J.; Wang, J.; Lai, Z.; Wang, C.; Zhang, H. A deep spatiotemporal network for forecasting the risk of traffic accidents in low-risk regions. Neural Comput. Appl. 2023, 35, 5207–5220. [Google Scholar] [CrossRef]
  5. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 6639–6649. [Google Scholar]
  6. Yang, Y.; Yuan, Z.Z.; Meng, R. Exploring Traffic Crash Occurrence Mechanism toward Cross-Area Freeways via an Improved Data Mining Approach. J. Transp. Eng. Part A-Syst. 2022, 148, 04022052. [Google Scholar] [CrossRef]
  7. Zhou, Z.; Dong, X.; Li, Z.; Yu, K.; Ding, C.; Yang, Y. Spatio-Temporal Feature Encoding for Traffic Accident Detection in VANET Environment. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19772–19781. [Google Scholar] [CrossRef]
  8. Guru, J.; Devi, N. Road Traffic Accidents Analysis Using Data Mining Techniques. JITA-J. Inf. Technol. Appl.-APEIRON 2018, 14. [Google Scholar] [CrossRef]
  9. Santos, K.; Dias, J.P.; Amado, C. A literature review of machine learning algorithms for crash injury severity prediction. J. Saf. Res. 2022, 80, 254–269. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Methodology process diagram.
Figure 2. Flowchart of the ReMAHA algorithm.
Figure 3. ReMAHA schematic diagram.
Figure 4. The impact of cluster variations on the F1–Score.
Figure 5. Predictive performance of different oversampling models.
Figure 6. Performance of the original data: (a) feature correlation; and (b) confusion matrix.
Figure 7. Feature correlation: (a) ReMAHA; (b) SMOTE; (c) ADASYN; and (d) ROS.
Figure 8. Confusion matrix: (a) ReMAHA; (b) SMOTE; (c) ADASYN; and (d) ROS.
Figure 9. Performance on different models: (a) performance on the prediction model; and (b) performance on the hybrid model.
Figure 10. Comparison of effects before and after sampling.
Table 1. Feature description.
Feature Type | Number of Features | Main Content
Numeric | 15 | Latitude and longitude, severity, and others
Boolean | 13 | Nearby points of interest and others
Text | 33 | Accident description, street, and others
Table 2. Label description.
Accident Severity | Number of Samples | Percentage (%)
1 | 67,366 | 0.87
2 | 6,156,981 | 79.67
3 | 1,299,337 | 16.81
4 | 204,710 | 2.65
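For reference, the class imbalance of the dataset follows directly from Table 2: the majority class (severity 2, 6,156,981 samples) outnumbers the minority class (severity 1, 67,366 samples) by 6,156,981 / 67,366 ≈ 91.40, i.e., roughly 91 severity-2 records for every severity-1 record.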
Table 3. Distribution of missing values (%).
Feature | Missing | Feature | Missing | Feature | Missing
End_Lat | 44.03 | Weather_Timestamp | 1.56 | Precipitation (in) | 28.51
End_Lng | 44.03 | Temperature (F) | 2.12 | Weather_Condition | 2.24
Description | 0.00 | Wind_Chill (F) | 25.87 | Sunrise_Sunset | 0.30
Street | 0.14 | Humidity (%) | 2.25 | Civil_Twilight | 0.30
City | 0.00 | Pressure (in) | 1.82 | Nautical_Twilight | 0.30
Zipcode | 0.03 | Visibility (mi) | 2.29 | Astronomical_Twilight | 0.30
Timezone | 0.10 | Wind_Direction | 2.27 | |
Airport_Code | 0.29 | Wind_Speed (mph) | 7.39 | |
Table 4. Example of Duplicate Data Processing.
Record | Start_Time | Start_Lat | Start_Lng
X1 | 23 August 2019 19:00:21 | 33.775450 | −117.847790
X2 | 23 August 2019 19:15:40 | 33.992460 | −118.403020
X3 | 23 August 2019 19:09:30 | 32.766960 | −117.148060
X4 | 23 August 2019 19:02:37 | 33.773290 | −117.848001
Table 5. Quartile Range Description.
Feature | Q1 | Q3 | Upper | Lower | Max | Min
Temperature (F) | 49.5 | 76.0 | 115.75 | 9.75 | 203.0 | −89.015
Humidity (%) | 48.0 | 84.0 | 138.0 | −6.0 | 100.0 | 1.0
Pressure (in) | 29.35 | 30.03 | 31.05 | 28.33 | 58.63 | 0.0
Wind_Speed (mph) | 4.6 | 10.4 | 19.1 | −4.10 | 1087.0 | 0.033
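The Upper and Lower columns in Table 5 are the standard 1.5 × IQR outlier fences computed from Q1 and Q3. A minimal sketch of that computation (the helper name iqr_bounds is illustrative, not from the paper):

```python
import pandas as pd

def iqr_bounds(series: pd.Series) -> tuple[float, float]:
    """Lower and upper outlier fences from the 1.5 * IQR rule (Table 5)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Check against Table 5's Temperature row: Q1 = 49.5, Q3 = 76.0, IQR = 26.5,
# so Lower = 49.5 - 39.75 = 9.75 and Upper = 76.0 + 39.75 = 115.75.
```

Values falling outside these fences, such as the 1087.0 mph Wind_Speed maximum, can then be flagged as outliers.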
Table 6. Example of Text Features.
 | Street | Wind_Direction | Weather_Condition
Ex.1 | I-580 W | Calm | Light rain
Ex.2 | N Main St | WSW | Overcast
Ex.3 | Bayview Ave | S | Mostly cloudy
Table 7. Example of Processed Textual Features.
 | Rd | Dr | St | Ave | Hwy | WindE | WindN | WindS | WindW | Clear | Cloud | Rain | Snow
Ex.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
Ex.2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0
Ex.3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0
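Table 7 shows the outcome of mapping each text feature onto a small set of binary keyword indicators. A hedged sketch of one way to produce such an encoding (the keyword lists and the encode_row helper are illustrative assumptions; the paper does not publish its exact vocabulary):

```python
STREET_KEYS = ["Rd", "Dr", "St", "Ave", "Hwy"]
WIND_COLS = {"WindE": "E", "WindN": "N", "WindS": "S", "WindW": "W"}
WEATHER_GROUPS = {
    "Clear": ["clear", "fair"],
    "Cloud": ["cloud", "overcast"],
    "Rain": ["rain", "drizzle"],
    "Snow": ["snow"],
}

def encode_row(street: str, wind: str, weather: str) -> dict:
    """Binary indicators matching the column layout of Table 7."""
    row = {key: int(key in street.split()) for key in STREET_KEYS}
    # Compass letters compose: "WSW" sets both WindW and WindS; "Calm" sets none.
    row.update({col: int(ch in wind) for col, ch in WIND_COLS.items()})
    cond = weather.lower()
    row.update({col: int(any(term in cond for term in terms))
                for col, terms in WEATHER_GROUPS.items()})
    return row

# encode_row("N Main St", "WSW", "Overcast") reproduces Table 7's Ex.2:
# St = 1, WindS = 1, WindW = 1, Cloud = 1, all other columns 0.
```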
Table 8. Confusion matrix.
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
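The per-class precision, recall, and F1 scores reported in Tables 10, 12 and 13 derive from these four cells in the usual way; a minimal sketch:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-class metrics from the confusion-matrix cells of Table 8."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```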
Table 9. Sampling Parameter Settings.
Model | Settings
SMOTE | sampling_strategy = ‘auto’, k_neighbors = 5
ReMAHA | k = 4
ADASYN | sampling_strategy = ‘auto’, n_neighbors = 5
ROS | sampling_strategy = ‘auto’
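The SMOTE, ADASYN, and ROS settings in Table 9 match the constructor arguments of the corresponding imbalanced-learn samplers, so the baselines can plausibly be set up as below; ReMAHA is the paper's own method with no off-the-shelf implementation, and random_state is an added assumption for reproducibility:

```python
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

samplers = {
    "SMOTE": SMOTE(sampling_strategy="auto", k_neighbors=5, random_state=42),
    "ADASYN": ADASYN(sampling_strategy="auto", n_neighbors=5, random_state=42),
    "ROS": RandomOverSampler(sampling_strategy="auto", random_state=42),
}

# With X (features) and y (severity labels) prepared as above:
# X_res, y_res = samplers["SMOTE"].fit_resample(X, y)
```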
Table 10. Performance of each sampling model (%).
Class | Metric | SMOTE | ADASYN | ROS | ReMAHA
Severity 1 | Precision | 52.91 | 43.05 | 29.21 | 71.41
Severity 1 | Recall | 85.12 | 81.96 | 97.46 | 75.60
Severity 1 | F1–Score | 65.26 | 56.45 | 44.95 | 73.44
Severity 2 | Precision | 93.26 | 91.65 | 90.81 | 91.51
Severity 2 | Recall | 93.82 | 91.34 | 93.54 | 94.57
Severity 2 | F1–Score | 93.54 | 91.49 | 92.15 | 93.02
Severity 3 | Precision | 78.01 | 69.02 | 73.93 | 73.43
Severity 3 | Recall | 69.96 | 61.52 | 57.90 | 64.63
Severity 3 | F1–Score | 73.76 | 65.06 | 64.94 | 68.75
Severity 4 | Precision | 40.28 | 33.47 | 73.05 | 76.31
Severity 4 | Recall | 53.90 | 53.46 | 33.56 | 53.23
Severity 4 | F1–Score | 46.11 | 41.17 | 45.99 | 62.71
Table 11. Predictive Model Parameter Settings.
Model | Settings
CatBoost | iterations = 500, learning_rate = 0.03, depth = 6
AdaBoost | n_estimators = 50, learning_rate = 1.0
GBDT | learning_rate = 0.1, n_estimators = 100, max_depth = 3
XGBoost | learning_rate = 0.3, max_depth = 3
LightGBM | num_leaves = 31, learning_rate = 0.1, max_depth = −1, n_estimators = 100
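The settings in Table 11 map one-to-one onto the constructor arguments of the standard Python implementations. A sketch of how the five predictors might be instantiated (verbose=0 is an added assumption to silence CatBoost's per-iteration log; everything else mirrors the table):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

models = {
    "CatBoost": CatBoostClassifier(iterations=500, learning_rate=0.03,
                                   depth=6, verbose=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
    "GBDT": GradientBoostingClassifier(learning_rate=0.1, n_estimators=100,
                                       max_depth=3),
    "XGBoost": XGBClassifier(learning_rate=0.3, max_depth=3),
    "LightGBM": LGBMClassifier(num_leaves=31, learning_rate=0.1,
                               max_depth=-1, n_estimators=100),
}
```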
Table 12. Performance of each prediction model (%).
Class | Metric | CatBoost | AdaBoost | GBDT | XGBoost | LightGBM
Severity 1 | Precision | 71.82 | 53.71 | 64.99 | 72.70 | 69.25
Severity 1 | Recall | 73.27 | 13.09 | 44.52 | 65.37 | 61.55
Severity 1 | F1–Score | 72.53 | 21.05 | 52.84 | 68.84 | 65.17
Severity 2 | Precision | 91.02 | 84.04 | 86.52 | 90.63 | 89.64
Severity 2 | Recall | 94.96 | 95.34 | 95.78 | 94.91 | 94.93
Severity 2 | F1–Score | 92.95 | 89.33 | 90.91 | 92.73 | 92.21
Severity 3 | Precision | 73.92 | 58.32 | 68.94 | 72.70 | 71.42
Severity 3 | Recall | 62.82 | 29.19 | 41.91 | 61.27 | 56.58
Severity 3 | F1–Score | 67.92 | 38.90 | 52.13 | 66.50 | 63.14
Severity 4 | Precision | 73.97 | 77.58 | 80.56 | 77.17 | 75.63
Severity 4 | Recall | 44.35 | 37.46 | 31.12 | 44.08 | 41.89
Severity 4 | F1–Score | 55.46 | 50.53 | 44.89 | 56.11 | 53.91
Table 13. Performance of each hybrid model (%).
Class | Metric | CatBoost | AdaBoost | GBDT | XGBoost | LightGBM
Severity 1 | Precision | 71.41 | 53.96 | 65.26 | 74.40 | 69.65
Severity 1 | Recall | 75.60 | 13.65 | 44.82 | 69.02 | 61.19
Severity 1 | F1–Score | 73.44 | 21.78 | 53.15 | 71.61 | 65.15
Severity 2 | Precision | 91.51 | 84.55 | 87.06 | 91.66 | 89.69
Severity 2 | Recall | 94.57 | 95.38 | 95.53 | 94.88 | 95.07
Severity 2 | F1–Score | 93.02 | 89.64 | 91.10 | 93.24 | 92.30
Severity 3 | Precision | 73.43 | 60.42 | 68.92 | 74.29 | 71.92
Severity 3 | Recall | 64.63 | 31.49 | 44.23 | 66.55 | 55.96
Severity 3 | F1–Score | 68.75 | 41.40 | 53.88 | 70.20 | 62.94
Severity 4 | Precision | 76.31 | 80.68 | 84.32 | 78.17 | 78.93
Severity 4 | Recall | 53.23 | 44.31 | 39.95 | 44.64 | 51.17
Severity 4 | F1–Score | 62.71 | 57.20 | 54.21 | 56.83 | 62.08