1. Introduction
Severe weather, such as high winds, hurricanes, thunderstorms, and tornadoes, frequently induces tree failures in proximity to power lines, resulting in power outages and damage to utility infrastructure. These weather-related power outages cost the United States of America (USA) $25–75 billion annually [1]. This cost is expected to rise as climate change increases the frequency and intensity of storms. During severe weather, tree failures are responsible for most power outages [2]. Tree failures not only cause catastrophic damage to power lines but also pose risks to public safety and infrastructure. The costs of restoring power lines and the anticipated increase in storm activity due to climate change have driven utilities to enhance grid resilience by identifying vegetation risks and reinforcing infrastructure.
Accurate modeling of tree-related power outage risks along distribution power lines is essential for effectively implementing grid resiliency programs ahead of storm events. Numerous scientific investigations have focused on predicting power outages due to vegetation and identifying the underlying factors contributing to these outages in power distribution systems. For instance, Guikema et al. [3] introduced a statistical framework to predict tree-related outages under normal operational conditions. Their investigation specifically focused on the impact of tree-trimming practices on the incidence of vegetation-related outages, utilizing a dataset comprising historical outages, geographical data, and tree-trimming records. In a separate study, Radmer et al. [4] proposed a methodology to predict the rate of tree-related outages due to annual vegetation growth, measured as the number of outages per mile-year. Key inputs to their models included historical outage data and climatic variables known to influence vegetation growth. Wanik et al. [5] assessed the effects of various factors on predicting vegetation-related outages during hurricane events. By leveraging LiDAR tree height data alongside information on vegetation management practices and system infrastructure, they developed an ensemble machine learning algorithm to predict the likelihood of such outages. Doostan et al. [6] proposed a data-driven methodology to predict the number of vegetation-related outages in power distribution systems using time series and nonlinear machine learning regression models.
Over the past two decades, much research has explored the applicability of various parametric and non-parametric models to power outage modeling problems. Initially, several parametric statistical models, such as the negative binomial regression model and Generalized Linear Models (GLMs), were used to predict the number of hurricane-related [4,7] and ice-storm-related outages [7]. Guikema et al. [8] compared multiple models, including parametric GLMs and semi-/non-parametric models such as Generalized Additive Models (GAMs), Bayesian Additive Regression Trees (BART), and Classification and Regression Trees (CART), to predict post-hurricane damage to the electrical overhead distribution network (i.e., utility poles). They observed higher accuracy rates with semi- and non-parametric models than with parametric ones. Nateghi et al. [9] modeled outage duration using both parametric models (regression methods) and non-parametric models (BART, Multivariate Adaptive Regression Splines (MARS), and CART) and demonstrated the applicability of the BART model through its higher predictive accuracy and lower prediction error compared to the other methods.
Recently, many studies have adopted machine learning (ML) algorithms for power outage-related research and demonstrated the benefits of using ML models to predict power outages [5,10,11,12]. ML models have become more popular due to numerous advantages, including flexibility, adaptability, and the ability to analyze diverse data types [13]. Additionally, they are particularly effective in handling large volumes of data at high speed, can continuously improve as more data become available, and can make predictions without explicit programming [14]. Furthermore, they can solve complex real-world problems and provide automatic problem-solving approaches. Non-parametric machine learning algorithms have recently attracted significant attention in utility infrastructure risk modeling. Konstantakopoulos et al. [15] used non-parametric methods such as bootstrapping, bagging, and gradient boosting to improve prediction performance in utility learning frameworks. Imam et al. [16] reviewed the application of parametric and non-parametric machine learning techniques to power system reliability, highlighting the predictive capabilities of non-parametric algorithms in maintenance-related aspects. Ajayi et al. [17] further emphasized the importance of non-parametric methods in predicting health and safety hazards in power infrastructure operations, achieving near-perfect predictions.
When examining the literature, it becomes apparent that most studies on vegetation-related power outage modeling have been conducted at coarser resolutions, and little research has examined the probability of tree-related outage risks on distribution power lines at a finer spatial granularity. Furthermore, the majority of these assessments have depended on the random forest (RF) algorithm [18,19]. To the best of our knowledge, no previous studies have predicted tree-related outage risks at a finer spatial granularity using a wide range of non-parametric machine learning algorithms. Non-parametric machine learning modeling is crucial in this context, as it allows for the exploration of complex, nonlinear relationships within the data, which are often inherent in vegetation-related outage risk factors, thus enabling more accurate predictions at a finer spatial scale. Accurate identification and localization of vegetation-risk-prone areas are essential for improving grid reliability by aiding utility professionals in making informed decisions to implement appropriate tree-trimming and grid-hardening practices.
Therefore, the central objective of our study is to systematically evaluate the effectiveness of decision tree (DT), support vector machine (SVM), extreme gradient boosting (XGBoost), random forest (RF), and k-Nearest Neighbor (k-NN) algorithms in identifying the risk of tree-related power outages to distribution power lines at a finer spatial resolution. This work is an extension of our previous study [19], in which we developed a vegetation risk model to assess the impact of local environmental variables on the outage probability along distribution power lines using the RF algorithm.
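For illustration only, the following Python sketch outlines a five-model comparison of the kind described above. It is not the study's actual implementation: the feature matrix X (predictor variables per device exposure zone), the binary outage label y, and all hyperparameter values are hypothetical placeholders.

```python
# Minimal sketch of a five-model comparison (hypothetical data and settings).
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=0),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=10),
}

# X: predictor variables per device exposure zone; y: 1 if a tree-related
# outage occurred, 0 otherwise (both assumed to be prepared elsewhere).
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC-ROC = {auc:.3f}")
```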
4. Discussion
In outage risk modeling, assessing the performance of different machine learning models is crucial to determining the model with the best predictive power. Many recent outage modeling studies have utilized various machine learning techniques to predict the number of outages or estimate the probability of outage risk. These methods include logistic regression, classification and regression trees, decision trees, multivariate adaptive regression splines, artificial neural networks, naïve Bayes regression, random forests, boosting, and an ensemble model of boosting and random forests [5,11,18,46]. Non-parametric machine learning algorithms have recently gained more attention than parametric models in outage modeling problems [8] due to their ability to capture complex data relationships and make fewer assumptions about the data distribution [47].
Little of the existing literature demonstrates how support vector machines (SVMs), k-Nearest Neighbor (k-NN), and extreme gradient boosting (XGBoost), in addition to decision trees (DTs) and random forests (RFs), can be used to estimate the probability of tree failure. Despite differences in the machine learning algorithms used, we can compare our findings with previous studies on the applicability of different machine learning techniques for predicting the probability of tree-related outage risk during storms. The quality of the prediction is crucial, especially when dealing with issues related to electric power lines. Confidence intervals (CIs) have been utilized in many machine learning works to quantify the reliability or uncertainty of machine learning interpretations [48,49].
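As one common way to obtain such intervals, the sketch below computes a percentile bootstrap CI for the AUC-ROC from held-out predictions. The 1000-resample count and 95% level are illustrative choices and are not taken from this study; y_true and y_score are assumed to be NumPy arrays of test labels and predicted probabilities.

```python
# Percentile bootstrap CI for AUC-ROC (illustrative resample count and level).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample test cases with replacement
        if len(np.unique(y_true[idx])) < 2:  # skip resamples with one class only
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```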
It is important to compare how the five ML models performed, since several of them achieved similar scores on some performance metrics. Moreover, even a 0.1% improvement is considered significant given the scale of the study, encompassing approximately 49,000 device exposure zones (DEZs), and the substantial impact that certain DEZ locations have on the economic and power security aspects of the electric grid. Based on the model performance evaluation and ranking scheme, the random forest (RF) algorithm emerged as the best ML algorithm according to the AUC-ROC, accuracy, precision, and F1-score metrics for assessing the tree-related outage probability on distribution power lines, implying superior performance compared to the other algorithms. The RF and the SVM showed similar results for the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric, implying that both models have a similar ability to separate the positive class from the negative class. k-NN and the SVM outperformed the RF on recall (sensitivity). Recall and precision have important implications for the operational value of machine learning in risk assessment. A high recall score shows that the model detects the majority of DEZs with outages while limiting the number of DEZs falsely predicted as having no outages (false negatives). Such false negatives may leave some sensitive DEZs in the power grid unidentified and lead to serious consequences when trees fall onto power lines during severe weather events. In contrast, the RF and XGBoost models reported higher precision values than the other models, indicating that the SVM, k-NN, and the DT tend to falsely flag outages in DEZs (false positives) more frequently than the RF and XGBoost. This matters because high precision spares utilities from investing capital in DEZs that do not pose a risk to power lines. According to the ranking scheme based on the performance metrics and confidence interval values, the random forest was identified as the best machine learning model for assessing the probability of tree failure. Numerous previous studies have demonstrated the effectiveness of RFs in assessing the likelihood of tree failure [46] and predicting the number of power outages during storm events [50], highlighting their applicability in vegetation risk assessment. The DT ranked as the least performant model based on the AUC-ROC, accuracy values, and the ranking scheme. DTs can perform poorly due to a variety of factors, such as the limitations of pruning algorithms, and the algorithm can be seriously affected by the curse of dimensionality [51,52]. In tree-related outage risk assessment, it is crucial to have a model that can accurately identify risk areas, because misclassifications can result in missed opportunities for intervention to improve resilience, which could lead to higher outage risks in future storm conditions.
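For concreteness, the metrics discussed above can be computed from held-out predictions as in the following sketch. This is a generic illustration rather than our exact evaluation code; the 0.5 classification threshold is an assumption, and y_true and y_prob are assumed NumPy arrays of test labels and predicted probabilities.

```python
# Computing the evaluation metrics discussed above from held-out predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):  # 0.5 cutoff is an assumption
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "AUC-ROC": roc_auc_score(y_true, y_prob),      # class separability, threshold-free
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),  # penalizes false positives
        "Recall": recall_score(y_true, y_pred),        # penalizes false negatives
        "F1-score": f1_score(y_true, y_pred),
    }
```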
Recording the computational runtimes of different ML models is important in data science, as runtime provides insights into algorithm optimization opportunities and resource planning. XGBoost and the RF consumed comparatively more runtime (in seconds) for hyperparameter optimization than the other three models. The most probable reason is that XGBoost and RFs have a large parameter space for hyperparameter tuning. The computational runtime of hyperparameter optimization and model development may vary from several minutes to days depending on the scale of the data, the available computational resources, and the model complexity [53]. Moreover, the number of hyperparameters considered makes it time-consuming to search through all possible combinations to find the optimal set. Since XGBoost and RFs are computationally intensive algorithms, hyperparameter optimization becomes more time-consuming because the training process must be repeated many times to tune the hyperparameters. Additionally, certain hyperparameters and their values have a direct effect on the execution time, such as the number of trees in RFs and XGBoost and the number of neighbors in k-NN [54]. The decision tree (DT) algorithm is comparatively simple, consisting of only one decision tree [22] and requiring fewer parameters; therefore, it consumes less time than the more complex algorithms.
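As an illustration of how such runtimes can be recorded, the sketch below times a cross-validated grid search per model. The grids shown are hypothetical and far smaller than a realistic tuning space, and X and y are again assumed to be prepared elsewhere.

```python
# Timing hyperparameter optimization per model (hypothetical, reduced grids).
import time
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

grids = {
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 500], "max_depth": [None, 10, 20]}),
    "DT": (DecisionTreeClassifier(random_state=0),
           {"max_depth": [None, 5, 10]}),
}

for name, (model, grid) in grids.items():
    start = time.perf_counter()
    GridSearchCV(model, grid, cv=5, scoring="roc_auc").fit(X, y)
    print(f"{name}: tuning took {time.perf_counter() - start:.1f} s")
```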
This study offers significant insights into vegetation risk assessment across Connecticut, utilizing a distinct dataset and employing various non-parametric machine learning models. The findings are particularly relevant for utility companies and arborists in regions with environmental conditions and vegetation dynamics similar to those of the northeastern United States. While this study was initially designed for Connecticut, the methodologies and principles can be adapted and extended to other regions. However, it is important to recognize potential limitations, such as regional differences in vegetation types, which might require adjustments for the best results.
5. Conclusions
This study sheds new light on the effectiveness of non-parametric machine learning algorithms in localizing storm-induced, tree-related outage risk at the device exposure zone level, where faults are detected and handled by utilities. A total of 15 predictor variables were analyzed using five non-parametric machine learning algorithms, which were evaluated based on their performance metrics and confidence interval values. The RF emerged as the best model according to the accuracy, AUC-ROC, precision, and F1-score metrics. Both the RF and the SVM showed superior performance according to the AUC-ROC metric when identifying DEZs with tree-related outage risk. The RF and XGBoost demonstrated higher precision values, indicating that the other models more often flagged outage-free DEZs as at risk (false positives). Conversely, the SVM and k-NN reported higher recall values, indicating their ability to identify outage presence areas while minimizing falsely identified outage absence areas (false negatives). When a model produces a higher number of false positives, utilities must spend extra capital and labor on resiliency programs; conversely, a risk model that produces more false negatives misguides utilities and puts both people and infrastructure at risk. Accurate modeling of tree-related outage probability enables efficient resource allocation, prevents damage to grid infrastructure, and lowers the cost of vegetation management.
While this study has reported desirable outcomes, we believe it is important to acknowledge and address the limitations encountered, as these provide valuable insights for future research. Tree health is a critical determinant of tree failures during adverse weather events. Additionally, resistance to storm conditions varies widely across species and with the physical structure of the trees. Therefore, further development of vegetation risk models requires information on tree health and tree species. Further research is also needed to address model uncertainty and to optimize performance by employing different modeling techniques, including ensemble machine learning approaches.