Article

Claim Prediction and Premium Pricing for Telematics Auto Insurance Data Using Poisson Regression with Lasso Regularisation

1 School of Mathematics and Statistics, The University of Sydney, Camperdown, NSW 2050, Australia
2 Department of Statistics, University of Haifa, Haifa 3103301, Israel
3 Transdisciplinary School, University of Technology Sydney, Ultimo, NSW 2007, Australia
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Risks 2024, 12(9), 137; https://doi.org/10.3390/risks12090137
Submission received: 23 July 2024 / Revised: 15 August 2024 / Accepted: 22 August 2024 / Published: 28 August 2024

Abstract

We leverage telematics data on driving behavior variables to assess driver risk and predict future insurance claims in a case study utilising a representative telematics sample. In the study, we aim to categorise drivers according to their driving habits and establish premiums that accurately reflect their driving risk. To accomplish our goal, we employ the two-stage Poisson model, the Poisson mixture model, and the Zero-Inflated Poisson model to analyse the telematics data. These models are further enhanced by incorporating regularisation techniques such as lasso, adaptive lasso, elastic net, and adaptive elastic net. Our empirical findings demonstrate that the Poisson mixture model with the adaptive lasso regularisation outperforms other models. Based on predicted claim frequencies and drivers’ risk groups, we introduce a novel usage-based experience rating premium pricing method. This method enables more frequent premium updates based on recent driving behaviour, providing instant rewards and incentivising responsible driving practices. Consequently, it helps to alleviate cross-subsidization among risky drivers and improves the accuracy of loss reserving for auto insurance companies.

1. Introduction

Traditional auto insurance premiums have been based on driver-related risk (demographic) factors, such as age, gender, marital status, claim history, credit risk and living district, and vehicle-related risk factors, such as vehicle year/make/model, which represent the residual value of an insured vehicle. Although these traditional variables or factors indicate claim frequency and size, they do not reflect true driving risk and often lead to lower-risk drivers cross-subsidising higher-risk drivers to balance the claim cost. These premiums have been criticised as inefficient and socially unfair because they neither penalise aggressive driving nor encourage prudent driving. Chassagnon and Chiappori (1997) reported that accident risk depends not only on demographic variables but also on driver behaviour, which reflects how cautiously drivers drive to reduce accident risk.
Usage-based insurance (UBI) relies on telematic data, often augmented by global positioning systems (GPSs), to gather vehicle information. UBI encompasses two primary models: Pay As You Drive (PAYD) and Pay How You Drive (PHYD). PAYD operates on a drive-less-pay-less principle, taking into account driving habits and travel details such as route choices, travel time, and mileage. This model represents a significant advancement over traditional auto insurance approaches. For instance, Ayuso et al. (2019) utilised a Poisson regression model to analyse a combination of seven traditional and six travel-related variables. However, Kantor and Stárek (2014) highlighted limitations in PAYD policies, notably their sole focus on kilometres driven, neglecting crucial driver behaviour aspects.
By integrating a telematics device into the vehicle, PHYD extends the principles of PAYD to encompass the monitoring of driving behaviour profiles over a specified policy period. Driving behaviour, encompassing operational choices such as speeding, harsh braking, hard acceleration, or sharp cornering in varying road types, traffic conditions, and weather, serves as a defining aspect of drivers’ styles (Tselentis et al. 2016; Winlaw et al. 2019). These collected driving data offer valuable insights into assessing true driving risks, enabling the calculation of the subsequent UBI experience rating premium. This advancement over traditional premiums incorporates both historical claim experiences and current driving risks. The UBI premium can undergo regular updates to provide feedback to drivers, incentivising improvements in driving skills through premium reductions. Research by Soleymanian et al. (2019) indicated that individuals drive less and safer when incentivised by UBI premiums. Moreover, Bolderdijk et al. (2011) demonstrated that monitoring driving behaviours can effectively reduce speeding and accidents by promoting drivers’ awareness and behavioural changes. Wouters and Bos (2000) showed that monitoring of driving resulted in a 20% reduction in accidents. The monitoring system enables early intervention for risky drivers, potentially saving lives (Hurley et al. 2015). Finally, Ellison et al. (2015) concluded that personalised feedback coupled with financial incentives yields the most significant changes in driving behaviour, emphasising the importance of a multifaceted approach to risk reduction.
The popularity of PHYD policies has surged in recent years, driven by the promise of lower premiums for safe driving behaviour. QBE Australia (Q for Queensland Insurance, B for Bankers’ and Traders’ and E for The Equitable Probate and General Insurance Company), renowned for its innovative approaches, introduced a product called the Insurance Box, a PHYD policy featuring in-vehicle telematics. This product not only offers lower premiums to good drivers but also delivers risk scores as actionable feedback on driving performance. These risk scores directly influence the calculation of insurance premiums. In essence, PHYD policies epitomise personalised insurance (Barry and Charpentier 2020), nurturing a culture of traffic safety while concurrently reducing congestion and environmental impact by curbing oil demand and pollutant emissions.
To assess UBI premiums, extensive driving data are initially gathered via telematics technology. Subsequently, a comprehensive set of driving behaviour variables, termed driving variables (DVs), is generated. These variables encompass four main categories: driver-related, vehicle-related, driving habits, and driving behaviours. These DVs are then analysed through regression against insurance claims data to unveil correlations between driving habits and associated risks, which is a process commonly referred to as knowledge discovery (Murphy 2012). Stipancic et al. (2018) determined drivers’ risk by analysing the correlations of accident frequency and accident severity with specific driving behaviours such as hard braking and acceleration.
To forecast and model accident frequencies, Guillen et al. (2021) proposed utilising Poisson regression models applied to both traditional variables (related to drivers, vehicles, and driving habits) and critical incidents, which encompass risky driving behaviours. Through this approach, the study delineates insurance premiums into a baseline component and supplemental charges tailored to near-miss events—defined as critical incidents like abrupt braking, acceleration, and smartphone usage while driving—which have the potential to precipitate accidents. Building on this, Guillen et al. (2020) employed negative binomial (NB) regression, regressing seven traditional variables, five travel-related factors, and three DVs to the frequency of near-miss events attributable to acceleration, braking, and cornering. Notably, the study suggests that these supplementary charges stemming from near-miss events could be dynamically updated on a weekly basis.
To comprehensively assess the potential nonlinear impacts of DVs on claim frequencies, Verbelen et al. (2018) utilised Poisson and negative binomial regression models within generalised additive models (GAMs). They focused on traditional variables, as well as telematic risk exposure DVs, such as total distance driven, yearly distance, number of trips, distance per trip, distance segmented by road type (urban, other, motorways, and abroad), time slot, and weekday/weekend. While these exposure-centric DVs can serve as offsets in regression models, they fail to capture the subtle details of actual driving behaviour. To attain a deeper understanding of driving behaviour, it becomes essential to extract a broader array of DVs that can discern between safe and risky driving practices while also delineating claim risk. For instance, rather than merely registering a braking event, a more comprehensive approach involves constructing a detailed 'braking story' that accounts for various factors such as road characteristics (location, lanes, angles, etc.), braking style (abrupt, continuous, repeated, intensity, etc.), braking time (time of day, day of the week, etc.), road type (speed limit, normative speed, etc.), weather conditions, preceding actions (turning, lane changing, etc.), and more. Furthermore, the inclusion of additional environmental and traffic variables obtained through GPS enhances the richness of available information, facilitating a more thorough analysis of driving behaviour and associated risk factors.
As the number of variables describing driving behaviour increases, the data can become voluminous, volatile, and noisy. Managing this influx of variables is crucial to mitigate computational costs and address the issue of multicollinearity among them. Multicollinearity arises due to significant overlap in the predictive power of certain variables. For instance, a driver residing in an area with numerous traffic lights might engage in more forceful braking, or an elderly driver might tend to drive more frequently during midday rather than during typical rush hours or late nights. Consequently, it is possible for confusion to arise between factors such as location and aggressive braking or age and preferred driving times. Multicollinearity can lead to overfitting and instability in predictive models, diminishing their effectiveness. Thus, streamlining the variables to a manageable number is not only essential for computational efficiency but also critical for addressing multicollinearity and enhancing the reliability of predictive models.
Machine learning, employing statistical algorithms, holds remarkable potential in mitigating overfitting and bolstering the stability and predictability of various predictive models. These algorithms are typically categorised as supervised or unsupervised. In the realm of unsupervised learning, Osafune et al. (2017) conducted an analysis wherein they developed classifiers to distinguish between safe and risky drivers based on acceleration, deceleration, and left-acceleration frequencies gleaned from smartphone-equipped sensor data from over 800 drivers. By labelling drivers with at least 20 years of driving experience and no accident records as safe and those with more than two accident records as risky, they achieved a validation accuracy of 70%. Wüthrich (2017) introduced pattern recognition techniques utilising two-dimensional velocity and acceleration (VA) heat maps via K-means clustering. However, it is worth noting that neither study offers predictions related to insurance claims.
With claim risk information such as claim making, claim frequency, and claim size, supervised machine learning models embedded within generalised linear models (GLMs) can be constructed to unfold the hidden patterns in big data and predict future claims for premium pricing. Various machine learning techniques are widely utilised in predictive modelling, including clustering, decision trees, random forests, gradient boosting, and neural networks. Gao et al. (2019) investigated the effectiveness of Poisson GAMs, integrating two-dimensional speed–acceleration heat maps alongside traditional risk factors for predicting claim frequencies. They employed feature extraction methods outlined in their previous work (Gao and Wüthrich 2018), such as K-medoids clustering to group drivers with similar heatmaps and principal component analysis (PCA) to reduce the dimensionality of the design matrix, thereby enhancing computational efficiency. Furthermore, Gao et al. (2019) conducted an extensive analysis focusing on the predictive power of additional driving style and habit covariates using Poisson GAMs. In a similar vein, Makov and Weiss (2016) integrated decision trees into Poisson predictive models, expanding the repertoire of predictive algorithms in insurance claim forecasting.
In assessing various machine learning techniques, Paefgen et al. (2013) discovered that neural networks outperformed logistic regression and decision tree classifiers when analysing claim events using 15 travel-related variables. Ma et al. (2018) employed logistic regression to explore accident probabilities based on four traditional variables and 13 DVs, linking these probabilities to insurance premium ratings. Weerasinghe and Wijegunasekara (2016) categorised claim frequencies as low, fair, and high and compared neural networks, decision trees, and multinomial logistic regression models. Their findings indicated that neural networks achieved the best predictive performance, although logistic regression was recommended for its interpretability. Additionally, Huang and Meng (2019) utilised logistic regression for claim probability and Poisson regression for claim frequency, incorporating support vector machine, random forest, advanced gradient boosting, and neural networks. They examined seven traditional variables and 30 DVs grouped by travel habits, driving behaviour, and critical incidents, employing stepwise feature selection and providing an overview of UBI pricing models integrated with machine learning techniques.
However, machine learning techniques often encounter challenges with overfitting. One strategy to address both multicollinearity and overfitting is to regularise the loss function by penalising the likelihood based on the number of predictors. While ridge regression primarily offers shrinkage properties, it does not inherently select an optimal set of predictors to capture the best driving behaviours. Tibshirani (1996) introduced the Least Absolute Shrinkage and Selection Operator (lasso) regression, incorporating an L1 penalty for the predictors. Subsequently, the lasso regression framework underwent enhancements to improve model fitting and variable selection processes. For instance, Zou and Hastie (2005) proposed the elastic net, which combines the L1 and L2 penalties of lasso and ridge methods linearly. Zou (2006) introduced the adaptive lasso, employing adaptive weights to penalise different predictor coefficients in the L1 penalty. Moreover, Park and Casella (2008) presented the Bayesian implementation of lasso regression, wherein lasso estimates can be interpreted as Bayesian posterior mode estimates under the assumption of independent double-exponential (Laplace) distributions as priors on the regression parameters. This approach allows for the derivation of Bayesian credible intervals of parameters to guide variable selection. Jeong and Valdez (2018) expanded upon the Bayesian lasso framework proposed by Park and Casella (2008) by introducing conjugate hyperprior distributional assumptions. This extension led to the development of a new penalty function known as log-adjusted absolute deviation, enabling variable selection while ensuring the consistency of the estimator. While the Bayesian approach is applicable, running MCMC is often time-consuming.
When modelling claim frequencies, a common issue arises from an abundance of zero claims, which Poisson or negative binomial regression models may not effectively capture. These zero claims do not necessarily signify an absence of accidents during policy terms but rather indicate that some policyholders, particularly those pursuing no-claim discounts, may refrain from reporting accidents. To identify factors influencing zero and nonzero claims, Winlaw et al. (2019) employed logistic regression with lasso regularisation on a case–control study, assessing the impact of 24 DVs on acceleration, braking, speeding, and cornering. Their findings highlighted speeding as the most significant driver behaviour linked to accident risk. In a different approach, Guillen et al. (2019) and Deng et al. (2024) utilised zero-inflated Poisson (ZIP) regression models to model claim frequencies directly and Tang et al. (2014) further integrated the model with the EM algorithm and adaptive lasso penalty. However, Tang et al. (2014) observed suboptimal variable selection results for the zero-inflation component, suggesting a lower signal-to-noise ratio compared to the Poisson component. Banerjee et al. (2018) proposed a multicollinearity-adjusted adaptive lasso approach employing ZIP regression. They explored two data-adaptive weighting schemes: inverse of maximum likelihood estimates and inverse estimates divided by their standard errors. For a comprehensive overview of various modelling approaches in UBI, refer to Table A1 in Eling and Kraft (2020).
Numerous studies in UBI have employed a limited number of DVs to characterise a broad spectrum of driver behaviours. For instance, Jeong (2022) analysed synthetic telematic data sourced from So et al. (2021), encompassing 10 traditional variables and 39 DVs, including metrics like sudden acceleration and abrupt braking. While Jeong (2022) utilised PCA to reduce dimensionality and enhance predictive model stability, the interpretability of the principal components derived from PCA remains constrained. Regularisation provides a promising alternative for dimension reduction while addressing the challenge of overfitting. The literature on UBI predictive models employing GLMs with machine learning techniques, such as lasso regularisation to mitigate overfitting, is still relatively sparse, particularly concerning forecasting claim frequencies and addressing challenges like excessive zero claims and overdispersion arising from heterogeneous driving behaviours.
Our main objective is to propose predictive models to capture the impact of driving behaviours (safe or risky) on claim frequencies, aiming to enhance prediction accuracy, identify relevant DVs, and classify drivers based on their driving behaviours. This segmentation will enable the application of differential UBI premiums for safe and risky drivers. More importantly, we advocate for the regular updating of these UBI premiums to provide ongoing feedback to drivers through the relevant DVs and encourage safer driving habits.
We demonstrate the applicability of the proposed predictive models through a case study using a representative telematics dataset comprising 65 DVs. The proposed predictive models include two-stage threshold Poisson (TP), Poisson mixture (PM), and ZIP regression models with lasso regularisation. We extend the regularisation technique to include adaptive lasso and elastic net, facilitating the identification of distinct sets of DVs that differentiate safe and risky behaviours. In the initial stage of the regularised TP models, drivers are classified into the risky (safe) group if their annual predicted claim frequencies, estimated by a single-component Poisson model, exceed (do not exceed) predefined thresholds. Subsequently, in stage two, regularised Poisson regression models are refitted to each driver subgroup (exceeding thresholds or not) using different sets of selected DVs in each group. Alternatively, PM models simultaneously estimate parameters and classify drivers. Our findings reveal that PM models offer greater efficiency compared to TP models, providing added flexibility and capturing overdispersion akin to NB distributions.
In ZIP models, we observe that the structural zero component may not necessarily indicate safe drivers, as safe drivers may claim less frequently but not necessarily abstain from claims altogether, while risky drivers may avoid claims due to luck or incentives like bonus rewards. So et al. (2021) investigated the cost-sensitive multiclass adaptive boosting method, defining classes based on zero claims, one claim, and two or more claims, differing from our proposed safe and risky driver classes. We argue that the level of accident risk may not solely correlate with the number of claims but rather with driving behaviours. Hence, the regularised PM model proves more efficient in tracking the impact of DVs on claim frequencies, allowing for divergent effects between safe and risky drivers. This proposed PM model constitutes the primary contribution of this paper, addressing a critical research gap in telematics data analysis.
Our second contribution is to bolster the robustness of our approach and mitigate overfitting by incorporating resampling and cross-validation (CV) apart from lasso regularisation. These techniques help us attain more stable and reliable results. Additionally, we utilise the area under the curve (AUC) of the receiver operating characteristic (ROC) curve as one of the performance metrics; it evaluates classification accuracy, highlighting the contribution of the predictive models in classifying drivers.
Our third contribution involves introducing an innovative UBI experience rating premium method. This method extends the traditional experience rating premium method by integrating classified claim groups and predicted claim frequencies derived from the best-trained model. This dynamic pricing approach also enables more frequent updates of premiums to incentivise safer and reduced driving. Moreover, averaged and individual driving scores from the identified relevant DVs can inform each driver about their driving behaviour, possibly with warnings, and encourage skill improvement. By leveraging these advanced premium pricing models, we can improve loss reserving practices and even evaluate the legitimacy of reported accidents based on driving behaviours.
Lastly, we highlight a recent paper (Duval et al. 2023) with aims similar to this paper's. They applied logistic regression with elastic net regularisation to predict the probability of claims, whereas this paper considers two-group PM regression instead of logistic regression to predict claim frequency and allows different DVs to capture the distinct safe and risky driving behaviours. For predictive variables, they used driving habits information (when, where, and how much the insured drives) from telematics, as well as traditional risk factors such as gender and vehicle age, whereas this paper focuses on driving behaviour/style (how the insured drives). To avoid handcrafting of telematics information, they proposed measures using the Mahalanobis method, Local Outlier Factor, and Isolation Forest to summarise trip information into local/routine anomaly scores by trips and global/peculiar anomaly scores by drivers, which were used as features. This is an innovative idea in the literature. On the other hand, this paper uses common handcrafting practices to summarise driving behaviour by drivers, using both driving habits (where and when) and driving styles (how) information by defining driving events such as braking and turning while considering the time, location, and severity of events. Duval et al. (2023) demonstrated that the improvement in classification using lower global/peculiar Mahalanobis anomaly scores enables a more precise pure premium (product of the claim probability from logistic regression and the insured amount) calculation. As stated above, this paper provides differential contributions by classifying drivers into safe and risky groups, predicting claims for drivers in their groups using regularised PM models (among regularised TP and ZIP models), which is pioneering in the UBI literature, and calculating premiums using the proposed innovative UBI experience rating premium based on drivers' classifications (safe/risky) and predicted annual claims.
The paper is structured as follows: Section 2 outlines the GLMs, including Poisson, PM, and ZIP regression models, alongside lasso regularisation and its extensions. Section 3 presents the telematics data and conducts an extensive empirical analysis of the two-stage TP, PM, and ZIP models. Section 4 introduces the proposed UBI experience rating premium method. Lastly, Section 5 offers concluding remarks and implementation guidelines and explores potential avenues for future research.

2. Methodologies

We derive predictive models for the claim rate of any driver using GLMs with lasso regularisation, where the true model is assumed to have a sparse representation in terms of the 65 DVs. This section also considers some model performance measures, including the AUC.

2.1. Regression Models

2.1.1. Poisson Regression Model

The Poisson regression model is commonly applied to count data, like claims. It is defined as
$$Y_i \sim \mathrm{Poisson}\big(\mu_i(\boldsymbol{\beta})\big), \qquad \mu_i(\boldsymbol{\beta}) = n_i a_i = n_i \exp(\mathbf{x}_i^\top \boldsymbol{\beta}) = \exp\big(\mathbf{x}_i^\top \boldsymbol{\beta} + \log n_i\big), \tag{1}$$
where $\mathbf{Y} = Y_{1:N}$, $\boldsymbol{\beta} = \beta_{0:J}$, and $J$ is the number of selected DVs in the model; $a_i = \exp(\mathbf{x}_i^\top \boldsymbol{\beta})$ estimates the number of claims per year for driver $i$; and $\log(n_i)$ is the offset in the regression model. Poisson regression assumes equidispersion and is applied in the two-stage TP model of Section 3.4. For overdispersed data, the negative binomial (NB) distribution, arising as a Poisson–Gamma mixture, provides extra dispersion. With NB regression, the term "$\mathrm{Poisson}(\mu_i)$" in (1) is replaced with the NB distribution $\mathrm{NB}(\nu, q_i) = \mathrm{NB}\big(\nu, \nu/(\nu + \mu_i)\big)$, where $\nu$ is the shape parameter and $q_i$ is the success probability of each trial. The NB distribution converges to the Poisson distribution as $\nu$ tends to infinity.
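To make the model concrete, the following minimal R sketch fits a Poisson regression with a log-exposure offset and an NB alternative. The data frame dat and the DV names dv1 and dv2 are hypothetical stand-ins for the telematics DVs, simulated here purely for illustration.

# Simulated stand-in data: claims y_i, exposure n_i, and two DVs
set.seed(1)
dat <- data.frame(claims   = rpois(500, 0.08),
                  exposure = runif(500, 0.5, 1.5),
                  dv1 = rnorm(500), dv2 = rnorm(500))

# Poisson regression with offset log(n_i), as in (1)
fit_pois <- glm(claims ~ dv1 + dv2 + offset(log(exposure)),
                family = poisson, data = dat)
summary(fit_pois)

# NB alternative for overdispersed data (Poisson-Gamma mixture)
library(MASS)
fit_nb <- glm.nb(claims ~ dv1 + dv2 + offset(log(exposure)), data = dat)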

2.1.2. Poisson Mixture Model

The finite mixture Poisson model is another popular model for modelling unobserved heterogeneity. The model assumes $G$ unobserved groups, each with probability $\pi_g$, $0 < \pi_g < 1$, $g = 1, \dots, G$, and $\sum_{g=1}^{G} \pi_g = 1$. Focusing on classifying safe and risky drivers, we model $G = 2$ groups. Assume that the claim frequency $Y_i$ for driver $i$ in group $g$ follows a Poisson distribution with mean $\mu_{ig}$, that is, $Y_i \sim \mathrm{Poisson}(\mu_{ig})$ with probability $\pi_g$, $g = 1, \dots, G$. The probability mass function (pmf) of the Poisson mixture model is
$$f\big(y_i \mid \pi, \mu_{i1}(\boldsymbol{\beta}_1), \mu_{i2}(\boldsymbol{\beta}_2)\big) = \pi f_1\big(y_i \mid \mu_{i1}(\boldsymbol{\beta}_1)\big) + (1 - \pi) f_2\big(y_i \mid \mu_{i2}(\boldsymbol{\beta}_2)\big), \tag{2}$$
where $\pi_1 = \pi$ and $\pi_2 = 1 - \pi$, $\boldsymbol{\beta}_g = (\beta_{0:J,g})$, $\boldsymbol{\theta} = (\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_G, \pi_1, \dots, \pi_{G-1})$ is the model parameter vector, and $f_g\big(y_i \mid \mu_{ig}(\boldsymbol{\beta}_g)\big)$ is the Poisson pmf with mean function $\mu_{ig}(\boldsymbol{\beta}_g) = \exp\big(\mathbf{x}_i^\top \boldsymbol{\beta}_g + \log n_i\big)$.
The expectation–maximisation (EM) algorithm is often used to estimate the parameters $\boldsymbol{\theta}$. In the E step, the posterior group membership for driver $i$ is estimated by
$$z_{ig} = \frac{\pi_g f_g\big(y_i \mid \mu_{ig}(\boldsymbol{\beta}_g)\big)}{\sum_{g'=1}^{G} \pi_{g'} f_{g'}\big(y_i \mid \mu_{ig'}(\boldsymbol{\beta}_{g'})\big)}. \tag{3}$$
The marginal predicted claim is
$$\hat{y}_i = \hat{z}_{i1}\, \mu_{i1}(\boldsymbol{\beta}_1) + (1 - \hat{z}_{i1})\, \mu_{i2}(\boldsymbol{\beta}_2). \tag{4}$$
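As an illustration of (2)–(4), the following R sketch implements the EM algorithm for a two-group Poisson mixture with a log-exposure offset. Here X, y, and n are an assumed DV matrix, claim counts, and exposures, and the weighted glm calls in the M step are one simple (unpenalised) implementation choice, not the regularised estimator used later in the paper.

# EM for a two-group Poisson mixture (unpenalised sketch)
em_pm <- function(X, y, n, max_iter = 200, tol = 1e-8) {
  pi1 <- 0.5
  b1 <- rep(0, ncol(X) + 1); b2 <- rep(0.1, ncol(X) + 1)  # break symmetry
  ll_old <- -Inf
  for (it in seq_len(max_iter)) {
    mu1 <- as.vector(exp(cbind(1, X) %*% b1)) * n
    mu2 <- as.vector(exp(cbind(1, X) %*% b2)) * n
    d1 <- pi1 * dpois(y, mu1); d2 <- (1 - pi1) * dpois(y, mu2)
    ll <- sum(log(d1 + d2))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
    z1 <- d1 / (d1 + d2)                         # E step, Equation (3)
    # M step: weighted Poisson regressions and mixing proportion
    b1 <- coef(glm(y ~ X, family = poisson, weights = z1, offset = log(n)))
    b2 <- coef(glm(y ~ X, family = poisson, weights = 1 - z1, offset = log(n)))
    pi1 <- mean(z1)
  }
  yhat <- z1 * mu1 + (1 - z1) * mu2              # marginal prediction, Equation (4)
  list(pi1 = pi1, beta1 = b1, beta2 = b2, z1 = z1, yhat = yhat, loglik = ll)
}
# Usage with the hypothetical data from the earlier sketch:
# fit <- em_pm(as.matrix(dat[, c("dv1", "dv2")]), dat$claims, dat$exposure)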
If there is a high proportion of zero claims, the ZIP model (Lambert 1992) may be suitable to capture the excessive zeros. The model is a special case of a two-group mixture model that combines a zero point mass in group 1 with a Poisson distribution in group 2. The zeros may come from the point mass (structural zero) or the zero count (natural zero) in a Poisson distribution. The model is given by
$$\Pr(Y_i = 0) = \pi_i + (1 - \pi_i)\exp(-\mu_i), \quad \text{and} \quad \Pr(Y_i = y_i) = (1 - \pi_i)\,\frac{\mu_i^{y_i}\exp(-\mu_i)}{y_i!}, \;\; y_i \geq 1, \tag{5}$$
where the two regression models (called the zero and count models) for the probability $\pi_i$ of a structural zero and the expected counts (including nonstructural zeros) $\mu_i$, respectively, are given by
$$\pi_i(\boldsymbol{\theta}) = \frac{\exp\big(\mathbf{x}_i^\top \boldsymbol{\psi} + \log n_i\big)}{1 + \exp\big(\mathbf{x}_i^\top \boldsymbol{\psi} + \log n_i\big)}, \quad \text{and} \quad \mu_i(\boldsymbol{\beta}) = \exp\big(\mathbf{x}_i^\top \boldsymbol{\beta} + \log n_i\big), \tag{6}$$
and the logistic regression parameters $\boldsymbol{\psi} = (\psi_0, \dots, \psi_{J_\psi})$ define a vector of $J_\psi$ selected DVs to estimate the probability of extra zeros; the vector of model parameters is $\boldsymbol{\theta} = (\boldsymbol{\psi}, \boldsymbol{\beta}, \boldsymbol{\pi})$.
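A hedged R sketch of the ZIP model in (5) and (6) using the pscl package, with the count model to the left of the bar and the zero (logistic) model to its right; dat reuses the hypothetical data frame from the earlier sketch.

library(pscl)
# ZIP with log-exposure offsets in both the count and zero components
fit_zip <- zeroinfl(claims ~ dv1 + dv2 + offset(log(exposure)) |
                      dv1 + dv2 + offset(log(exposure)),
                    data = dat, dist = "poisson")
summary(fit_zip)
p0 <- predict(fit_zip, type = "zero")    # probability of a structural zero, pi_i
mu <- predict(fit_zip, type = "count")   # expected Poisson counts, mu_i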

2.2. Regularisation Techniques

The stepwise procedure to search for a good subset of DVs often suffers from high variability, a local optimum, and ignorance of uncertainty in the searching procedures (Fan and Li 2001). Lasso (L) regularisation offers an alternative approach to select variables for parsimonious models. It is further extended to adaptive lasso (A), elastic net (E), and adaptive elastic net (N). These regularisations with an L1 penalty provide a simple way to enforce sparsity in variable selection by shrinking some coefficients $\beta_j$ to zero. This aligns with our aim to select important DVs, that is, those with coefficients not shrunk to zero.
To implement these regularisation techniques, we consider the penalised log likelihood (PLL) (Banerjee et al. 2018; Bhattacharya and McNicholas 2014). For the most general case of adaptive elastic net regularisation, the coefficients $\boldsymbol{\beta}_{\lambda,w,\alpha,N}$ of the Poisson regression in (1), estimated by minimising the penalised log likelihood, are given by
$$\mathrm{LOSS}_{\lambda,\alpha,w}(\boldsymbol{\beta}) = -\sum_{i=1}^{N} \log f\big(y_i; \mu_i(\boldsymbol{\beta})\big) + \lambda \left[ \frac{1-\alpha}{2} \sum_{j=1}^{J} w_j \beta_j^2 + \alpha \sum_{j=1}^{J} w_j |\beta_j| \right], \tag{7}$$
where the first term is the negative log likelihood (NLL), the second term is the penalty, $f\big(y_i; \mu_i(\boldsymbol{\beta})\big)$ is the pmf of the Poisson model, and $w_j$ are the data-driven adaptive weights. Equation (7) includes the special cases $\alpha = 1$, $w_j = 1$ for lasso; $\alpha = 1$ for adaptive lasso; and $w_j = 1$ for elastic net.
The development of (7) starts with the basic lasso regularisation with $\alpha = 1$ and $w_j = 1$. The parameter estimates conditional on $\lambda$ are given by
$$\hat{\beta}_{j,\lambda} = \beta_{j,\mathrm{NLL}} \, \max\!\left(0,\; 1 - \frac{N\lambda}{|\beta_{j,\mathrm{NLL}}|}\right), \tag{8}$$
where $\boldsymbol{\beta}_{\mathrm{NLL}} = (\beta_{1,\mathrm{NLL}}, \dots, \beta_{J,\mathrm{NLL}})$ minimises the NLL when $\lambda = 0$. As $\lambda$ increases, the term $1 - N\lambda/|\beta_{j,\mathrm{NLL}}|$ becomes negative, and so $\hat{\beta}_{j,\lambda}$ shrinks to zero. The penalty term $\lambda \sum_{j=1}^{J} |\hat{\beta}_{j,\lambda}|$ then drops as more $\hat{\beta}_{j,\lambda} = 0$, but the NLL increases as $\boldsymbol{\beta}_\lambda = (\hat{\beta}_{1,\lambda}, \dots, \hat{\beta}_{J,\lambda})$ moves further away from $\boldsymbol{\beta}_{\mathrm{NLL}}$, so one can choose a $\lambda_{\min}$ that minimises the PLL to obtain $\boldsymbol{\beta}_{\lambda,L}$. Alternatively, one can perform a $K$-fold CV and choose the $\lambda_{\min}$ that provides the best overall model fit across all $K$ validation samples. Different criteria may suggest different optimal $\lambda_{\min}$ and hence different estimates $\boldsymbol{\beta}_{\lambda_{\min}}$. Details are provided in points 1–2 of Appendix A.
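As one possible implementation, the K-fold CV choice of $\lambda_{\min}$ for lasso-regularised Poisson regression can be sketched in R with the glmnet package; X is an assumed (normalised) DV matrix, built here from the two hypothetical DVs of the earlier sketch.

library(glmnet)
X <- as.matrix(dat[, c("dv1", "dv2")])   # stand-in DV matrix
cv_lasso <- cv.glmnet(X, dat$claims, family = "poisson",
                      offset = log(dat$exposure),
                      alpha = 1, nfolds = 10)      # lasso: alpha = 1, w_j = 1
coef(cv_lasso, s = "lambda.min")   # nonzero coefficients = selected DVs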
However, Meinshausen and Bühlmann (2006) showed the conflict between optimal prediction and consistent variable selection in lasso regression. Moreover, whether lasso regression has an oracle procedure is debatable. An estimating procedure is an oracle if it can identify the right subset of variables and has an optimal estimation rate, so that estimates are unbiased and asymptotically normal. Städler et al. (2010) also raised these issues and addressed some bias problems of the (one-stage) lasso, which may shrink important variables too strongly. Zou (2006) introduced the two-stage adaptive lasso as a modification of lasso in which each coefficient $\beta_j$ is given its own weight $w_j$ to control the rate at which it is shrunk towards 0.
Adaptive lasso deals with three issues, namely, inconsistent selection of coefficients, lack of the oracle property, and unstable parameter estimation when working with high-dimensional data. As smaller coefficients $\beta_{j,\mathrm{NLL}}$ in (8) leave the model faster than larger coefficients, Zou (2006) suggested the weights $w_j = |\hat{\beta}_{j,R}|^{-\gamma}$ in (7), where the tuning parameter $\gamma > 0$ ensures that the adaptive lasso has oracle properties and $\hat{\beta}_{j,R}$ is an initial estimate from ridge regression. The weights are rescaled so that their sum equals the number of DVs. Städler et al. (2010) suggested the tuning parameter $\gamma = 1$ for a low-claim threshold and $\gamma = 2$ for a high-claim threshold. We adopted $\gamma = 2$ as the best tuning parameter to estimate the weights $w_j$ in the subsequent adaptive lasso models.
Zou and Zhang (2009) argued that the L1 penalty can perform poorly under multicollinearity, which is common in high-dimensional data and severely degrades the performance of lasso. They proposed the elastic net, which takes a weighted average of two penalties: ridge ($\alpha = 0$) and lasso ($\alpha = 1$). The mixing parameter $\alpha \in (0, 1)$ in (7) balances the two penalties, with $\alpha > 1/3$ indicating a heavier lasso penalty.
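A sketch of the adaptive elastic net in (7) under the same glmnet setup: ridge estimates supply the adaptive weights $w_j = |\hat{\beta}_{j,R}|^{-\gamma}$ with $\gamma = 2$, passed via penalty.factor, and the mixing value $\alpha = 0.775$ is chosen purely for illustration.

# Initial ridge estimates for the adaptive weights
ridge <- cv.glmnet(X, dat$claims, family = "poisson",
                   offset = log(dat$exposure), alpha = 0)
b_ridge <- as.vector(coef(ridge, s = "lambda.min"))[-1]   # drop intercept
w <- abs(b_ridge)^(-2)              # gamma = 2; zero estimates would get
w <- w * length(w) / sum(w)         # infinite penalty; rescale to sum to J
fit_aen <- cv.glmnet(X, dat$claims, family = "poisson",
                     offset = log(dat$exposure),
                     alpha = 0.775, penalty.factor = w)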
When regularisation is applied to the ZIP and PM models, the penalised log likelihood in (7) is extended to
$$\mathrm{LOSS}_{\lambda,\alpha,w}(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) = -\sum_{i=1}^{N} \log f\big(y_i; \pi, \mu_{i1}(\boldsymbol{\beta}_1), \mu_{i2}(\boldsymbol{\beta}_2)\big) + \lambda_1 \left[ \frac{1-\alpha}{2} \sum_{j=1}^{J} \beta_{j1}^2 + \alpha \sum_{j=1}^{J} w_{j1} |\beta_{j1}| \right] + \lambda_2 \left[ \frac{1-\alpha}{2} \sum_{j=1}^{J} \beta_{j2}^2 + \alpha \sum_{j=1}^{J} w_{j2} |\beta_{j2}| \right], \tag{9}$$
where $f\big(y_i; \pi, \mu_{i1}(\boldsymbol{\beta}_1), \mu_{i2}(\boldsymbol{\beta}_2)\big)$ is given by (2).
The optimal $\alpha$ is identified by a grid search, choosing the value with the lowest mean square error (MSE) or root MSE (RMSE) or the best R-squared. We searched for $\alpha$ in (7) and (9) for the different models summarised in Table 1. For example, model TPAL-2 refers to a stage 2 TP model where adaptive lasso regularisation is applied in stage 1 and lasso in stage 2. We considered different stage 1 TP models; stage 2 TP models (under TPL-1 and TPA-1) with thresholds $\tau$ to split the predicted annual claim frequencies $a_i = y_i/n_i$ into low and high groups; PM models; and ZIP models. We ran each model over five $\alpha$ values (0.100, 0.325, 0.550, 0.775, 1.000) and identified the best $\alpha$, which gives the lowest RMSE under $K = 10$ fold CV. To ensure the search is robust, results were repeated $R = 100$ times for each model based on $R = 100$ 70% subsamples $S_{1:R}$. Results show that a low $\alpha = 0.1$ should be adopted for stage 1 TP models, the low group of most stage 2 TP models, the PME model, and the PMN model, whereas a higher $\alpha = 0.775$ or $1$ should be adopted for the high group of stage 2 TP models. See point 1 in Appendix B for the implementation of all lasso regularisation procedures under Poisson regression and point 2 in Appendix B for the implementation details using the caret package in R.
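The $\alpha$ grid search with repeated 70% subsampling can be sketched as a plain R loop (the number of repeats is reduced to 10 here for brevity; the paper uses 100). The RMSE from each CV fit is averaged over repeats, and the lowest-RMSE $\alpha$ is retained.

set.seed(2)
alphas <- c(0.100, 0.325, 0.550, 0.775, 1.000)
R <- 10
rmse <- matrix(NA, R, length(alphas))
for (r in seq_len(R)) {
  idx <- sample(nrow(X), size = floor(0.7 * nrow(X)))   # 70% subsample S_r
  for (a in seq_along(alphas)) {
    cv <- cv.glmnet(X[idx, ], dat$claims[idx], family = "poisson",
                    offset = log(dat$exposure[idx]),
                    alpha = alphas[a], nfolds = 10, type.measure = "mse")
    rmse[r, a] <- sqrt(min(cv$cvm))    # best CV RMSE at lambda_min
  }
}
alphas[which.min(colMeans(rmse))]      # alpha with the lowest average RMSE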

2.3. Model Performance Measures

Model performance can be evaluated from different aspects depending on the aims and model assumptions. The goodness of model fit, prediction accuracy, and classification of drivers are the main types of criteria that are linked to different metrics.
Firstly, the Bayesian information criterion (BIC) is a popular model fit measure that contains a deviance and a parameter penalty term using the log of sample size as the model complexity penalty weight. The Akaike information criterion (AIC) can also be used when the parameter penalty term uses 2 as the weight. Deviance (without parameter penalty) is also used by some packages to select models.
Secondly, for prediction accuracy, we adopted the popular mean square error $\mathrm{MSE} = \sum_{i=1}^{N} (y_i - \hat{\mu}_i)^2 / N$ and mean absolute error $\mathrm{MAE} = \sum_{i=1}^{N} |y_i - \hat{\mu}_i| / N$. The third measure we considered is the correlation $\rho$ between observed and predicted annual claim frequencies (instead of the claim frequencies used in the MSE). A higher correlation shows better performance.
Lastly, to quantify classification performance, the difference between observed and predicted group memberships should be quantified. In machine learning, the AUC of the ROC curve (Fawcett 2006) is a measure of model classification power. It constructs confusion matrices conditional on the cutoff of a classifier (e.g., $a_i$ in (1) for TP-2 and $z_{i1}$ in (3) for PM), calculates the true positive rate (TPR, sensitivity) and the false positive rate (FPR, 1 − specificity), and plots the TPR against the FPR as the discrimination cutoff varies to obtain the ROC curve. The AUC is the probability that a randomly chosen member of the positive class has a lower estimated probability of belonging to the negative class than a randomly chosen member of the negative class. See point 3 in Appendix B for the implementation. For the claim data, we let the binary classes be the low-claim (safe driver) and high-claim (risky driver) groups. However, the group membership of each driver is not observed, so it is approximated using K-means clustering, which minimises the total within-cluster variation using the selected DVs for each model. These four types of measures, namely BIC, MSE, $\rho$, and AUC, assessing different performance perspectives, were applied to assess the performance of a set of models $\mathcal{M}$.
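A short R sketch of the AUC computation with the pROC package; proxy_group is a hypothetical 0/1 vector standing in for the K-means cluster labels, and the classifier score is the predicted annual claim frequency from the earlier Poisson fit.

library(pROC)
proxy_group <- rbinom(nrow(dat), 1, 0.3)    # stand-in cluster labels
score <- predict(fit_pois, type = "response") / dat$exposure   # a_hat_i
roc_obj <- roc(proxy_group, as.vector(score))
auc(roc_obj)     # area under the ROC curve
plot(roc_obj)    # TPR against FPR as the cutoff varies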
Although these four measures are popular in statistical and machine learning models, they are not specifically built for count models. Czado et al. (2009) and Verbelen et al. (2018) proposed six scores for claim count models based on the idea of the probability integral transform (PIT) or, equivalently, the predictive CDF. The six scores are defined as
$$\begin{aligned}
\text{Logarithmic:} \quad & \mathrm{Log}(F, y) = -\log(f_y) \\
\text{Quadratic (Quad):} \quad & \mathrm{Quad}(F, y) = -2 f_y + \|f\|^2 \\
\text{Spherical:} \quad & \mathrm{Spher}(F, y) = -f_y / \|f\| \\
\text{Ranked Probability:} \quad & \mathrm{RankProb}(F, y) = \sum_{k=0}^{\infty} \big[F_k - \mathbf{1}(y \leq k)\big]^2 \\
\text{Dawid--Sebastiani:} \quad & \mathrm{Dawid}(F, y) = \left(\frac{y - \mu_F}{\sigma_F}\right)^2 + 2 \log(\sigma_F) \\
\text{Squared error:} \quad & \mathrm{SqErr}(F, y) = (y - \mu_F)^2
\end{aligned}$$
where $Y \sim$ Poisson, $f_y = \Pr(Y = y)$, $F_y = \Pr(Y \leq y)$, $\mu_F = \mathrm{E}(Y)$, $\sigma_F^2 = \mathrm{Var}(Y)$, and $\|f\| = \sqrt{\sum_{k=0}^{\infty} f_k^2}$. For PM models, $f_y = \pi_1 f_{y1} + (1 - \pi_1) f_{y2}$, $F_y$ and $\mu_F$ are similarly defined, and $\sigma_F^2 = \sum_{k=0}^{\infty} (k - \mu_F)^2 f_k$. To accommodate the effect of driver classification, the prior probability $\pi_g$ is replaced by the posterior probabilities $z_{ig}$ in (3). These scores are averaged over drivers, and lower scores indicate better predictions. The Logarithmic score is the common NLL, which is a model-fit measure. The Quadratic and Spherical scores are similar to the Logarithmic score in assessing model fit using different functional forms. The Dawid–Sebastiani and Squared error (MSE) scores measure prediction accuracy. For the Dawid–Sebastiani score, the term $2\log(\sigma_F)$ adjusts for the fact that the first term decays to zero as $\sigma_F$ tends to infinity. The Ranked Probability score calculates a sum of squares to summarise the PP plot, which plots the fitted cumulative probability $F_k$ against the observed proportion $\sum_{i=1}^{N} \mathbf{1}(y_i \leq k)/N$. These six measures are used to select the final model, enriching the versatility of our model selection criteria.
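The six scores can be computed directly for a Poisson predictive distribution, as in this hedged R sketch that truncates the infinite sums at K terms; lower average scores indicate better predictions.

count_scores <- function(y, mu, K = 50) {
  k <- 0:K
  t(vapply(seq_along(y), function(i) {
    f  <- dpois(k, mu[i])               # predictive pmf f_k
    Fk <- ppois(k, mu[i])               # predictive cdf F_k
    fy <- dpois(y[i], mu[i])
    s  <- sqrt(mu[i])                   # sigma_F for the Poisson case
    c(Log      = -log(fy),
      Quad     = -2 * fy + sum(f^2),
      Spher    = -fy / sqrt(sum(f^2)),
      RankProb = sum((Fk - (y[i] <= k))^2),
      Dawid    = ((y[i] - mu[i]) / s)^2 + 2 * log(s),
      SqErr    = (y[i] - mu[i])^2), numeric(6))
  }))
}
colMeans(count_scores(dat$claims, fitted(fit_pois)))   # averaged over drivers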
To facilitate model selection, we ranked each model $M$ in the model class $\mathcal{M}$ in descending order of preference for each performance measure $m_{M,1:4}$ for (BIC, MSE, $\rho$, AUC) and $m_{M,1:6}$ for (Log, Quad, Spher, RankProb, Dawid, SqErr) to obtain the ranks $R_{M,l} = \mathrm{rank}_{M \in \mathcal{M}}(m_{M,l})$ and the sum of ranks
$$R_M = \sum_{l=1}^{l^*} R_{M,l}, \qquad l^* = 4, 6, \tag{10}$$
to reflect the performance of each model M .

3. Empirical Studies

3.1. Data Description

The dataset originates from cars driven in the US in which special UBI sensors were installed. The University of Haifa Actuarial Research Center provided the data, on which UBI modelling was analysed (Chan et al. 2022). It contains two column vectors: claim frequencies $\mathbf{y} = y_{1:N}$ and policy duration or exposure $\mathbf{n} = n_{1:N}$ in years. Ninety-two percent of $\mathbf{y}$ are zero. Figure 1 displays three histograms for $\mathbf{y}$, $\mathbf{n}$, and the annual claim frequencies $\mathbf{a}$ ($a_i = y_i / n_i$), respectively. The dataset also contains $J_0 = 65$ numerical DVs constructed from the information collected via telematics and GPS for $N = 14{,}157$ drivers. Figure A2 in Appendix D.1 visualises the DVs by plotting $x_{ij}$ across drivers $i$, with colours indicating the number of claims $y_i = 0, 1, 2+$. We remark that the DVs are labelled up to 77 with some numbers skipped; for example, DVs 6, 11, 12, etc. do not appear in Figure A2. Each DV, describing a specific event (details in the next section), has been aggregated over time to obtain an incidence rate (per km or hour of driving) and scaled to normalise its range for better interpretability of its coefficient in the predictive models. These procedures transformed the multidimensional longitudinal DVs into a single row for each driver, which is the unit of analysis. All DVs are presented as column vectors $\mathbf{x}_j$, $j = 1, \dots, J_0$.

3.2. Data Cleaning and DVs Setting

Telematics sensors installed by car manufacturers provide much cleaner signals. Standard data cleaning techniques, including the removal of outliers, were then applied. External environmental information from GPS was utilised to minimise false signals, recognising that driving behaviours are often responsive to varying conditions. The DVs can then be defined to indicate specific driving events, which can be associated with certain driving risks. Context matters, however: while rapid acceleration is typically undesirable, it may be necessary when merging onto a busy highway. To accurately process and analyse telematics and GIS data, roads were categorised into specific types such as highways, junctions, roundabouts, and others. This segmentation enables a more precise assessment of driving behaviours across different contexts, improving safety measures and performance evaluations. Given the complexity of telematics data, including metrics like acceleration and braking, the definitions of events like rapid acceleration or hard braking were adapted to account for varying road conditions and times. Hence, the DVs were defined for a range of driving events as combinations of event types (e.g., accelerating, braking, left/right turning), environmental conditions (e.g., interchange, junction), and time (e.g., the morning rush from 6 am to 9 am). Then, rates of the events (over a standardised period or mileage) were evaluated and normalised. Appendix C provides the labels and interpretation of these DVs.

3.3. Exploratory Data Analyses

To summarise the variables, their averages are $\bar{y} = 0.083$, $\bar{n} = 1.146$, and $\bar{a} = 0.075$. We split the drivers into three classes $C_b$, $b = 0, 1, 2+$, with 0, 1, and at least 2 claims, respectively, and class sizes $N_b$. Their proportions $p_b = N_b / N$ came out to (0.92, 0.07, 0.005), averaged exposures $\bar{n}_b = \sum_{i \in C_b} n_i / N_b$ to (1.13, 1.38, 1.64), and averaged annual claim frequencies $\bar{a}_b = \sum_{i \in C_b} a_i / N_b$ to (0, 0.92, 1.71). The average claim frequency for $C_{2+}$ was 2.11. Regressing the claim frequencies $y_i$ on the exposures $n_i$ gave an $R^2$ of only 0.014, showing that the linear effect of exposure on claim frequency is weak and insignificant. Hence, it is possible that other effects, such as driving behaviour as measured by the DVs $\mathbf{x}_j$, may impact the claim frequency $y_i$. Section 3.4, Section 3.5 and Section 3.6 analyse such effects of the DVs on claim frequencies.
Section 2.1.1 introduced Poisson and NB regression for equidispersed and overdispersed data, respectively. To assess the level of dispersion, we used the sample variance $\widehat{\mathrm{Var}}(\mathbf{y}) = 0.089$, which, being close to the mean $\bar{y} = 0.083$, suggests equidispersion, possibly due to the large proportion of zeros. We also tested the equidispersion assumption with the null hypothesis $H_0: \mathrm{Var}(Y_i) = \mu_i$ against the alternative $H_1: \mathrm{Var}(Y_i) = \mu_i + \psi g(\mu_i)$, where $g(\cdot) > 0$ is a transformation function (Cameron and Trivedi 1990) and $\psi > 0$ ($\psi < 0$) indicates overdispersion (underdispersion). See point 4 in Appendix B for the implementation. For model TPL-1, $\psi = 0.0369$ ($p = 0.0482$), and for model TPA-1, $\psi = 0.0369$ ($p = 0.0477$), which are marginally significant outcomes. Moreover, the TP, PM, and ZIP models can capture some overdispersion through their threshold splits and mixture components. Hence, we focused on Poisson regression for all subsequent analyses.
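The Cameron–Trivedi test is available in the AER package. A minimal sketch applied to the earlier hypothetical Poisson fit, where trafo = 1 corresponds to the linear specification $\mathrm{Var}(Y_i) = \mu_i + \psi \mu_i$:

library(AER)
dispersiontest(fit_pois, trafo = 1)   # psi > 0 would indicate overdispersion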
Moreover, noninformative DVs can lead to unstable models. In Figure A2, seven DVs (1, 2, 7, 8, 10, 14, and 28) are sparse, with at most 13 nonzero values each. Hence, we explored the information content of each DV. Firstly, the nonsparsity $S_j$, defined as the proportion of nonzero data for each DV, is reported. A refined measure is Shannon's entropy (Shannon 2001) $H_j$, which measures the degree of disorder/information of each DV. While $H_j$ provides no information about the relationship with $\mathbf{y}$, the information gain $IG_j$ evaluates the additional information that the $j$th DV provides about the claims $\mathbf{y}$ with respect to the three classes $C_b$, $b = 0, 1, 2+$.
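The entropy and information gain measures can be sketched in R as follows, assuming each DV is first discretised into bins; cls holds the claim classes $C_b$, and the data again reuse the hypothetical dat.

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

info_gain <- function(x, cls, bins = 10) {
  b <- cut(x, breaks = bins)                     # discretise the DV
  grp <- split(cls, b, drop = TRUE)
  w <- sapply(grp, length) / length(cls)         # bin proportions
  H_cls  <- entropy(prop.table(table(cls)))      # class entropy
  H_cond <- sum(w * sapply(grp, function(s) entropy(prop.table(table(s)))))
  H_cls - H_cond                                 # IG_j > 0: DV informative
}

cls <- cut(dat$claims, c(-Inf, 0, 1, Inf), labels = c("0", "1", "2+"))
info_gain(dat$dv1, cls)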
Apart from the information content of the DVs, the multicollinearity between DVs also affects the stability of a regression model. Figure A3a in Appendix D.2 plots the correlation matrix $\mathrm{Corr}(\mathbf{x}_{1:J}, \mathbf{y}, \mathbf{a})$, where the correlations of the DVs with $\mathbf{y}$ are denoted by $\rho_j = \mathrm{Corr}(\mathbf{x}_j, \mathbf{y})$. The correlation matrix shows that the DVs up to 16 (except 9) are nearly uncorrelated with each other, those up to DV 39 are mildly correlated, and the rest are moderately correlated, reflecting some pattern in these DVs. However, they are only weakly correlated with $\mathbf{y}$ and $\mathbf{a}$, showing the low signal content of each DV in predicting $\mathbf{y}$ and $\mathbf{a}$.
Table 2 reports $\rho_j$, $H_j$, $S_j$, and $IG_j$ to quantify the information content of the 65 DVs, flags a DV as low-information when $H_j < 1$ and $S_j < 1\%$, and highlights $IG_j$ in boldface when $IG_j > 0$ indicates an information gain. Asterisks are added to the DVs' IDs to indicate these two levels of information content (the flag and the boldface, respectively). Twenty DVs flagged as low-information were classified as having low information content. Including them in the more complicated PM and ZIP models led to unstable results. Thus, we dropped them and considered $J_1 = 65 - 20 = 45$ DVs in the PM and ZIP models, but we considered all $J_0 = 65$ DVs in the TP models. All the DVs were normalised before the analyses to ensure efficient modelling.
Figure A3 in Appendix D.2 plots the Euclidean distance $d(j, j') = \sqrt{\sum_{i=1}^{N} (x_{ij} - x_{ij'})^2}$ between each pair $(j, j')$ of DVs and demonstrates the hierarchical clustering based on $d(j, j')$. The results show one major cluster of size 54 and two smaller clusters ($49^{**}$, $36^*$, $43^{**}$, $72^*$, $56^*$, $63^*$) and ($55^*$, $59^*$, $71^*$, $27^{**}$, $58^*$), with increasing pairwise distance from the major cluster. All sparse DVs labelled as noninformative are in the major cluster. These DV features guided our interpretation of the selected DVs in the subsequent analyses. Refer to Table A1 and Appendix C for the interpretation of these DVs.
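For reference, the pairwise DV distances and the hierarchical clustering can be reproduced in R in a few lines, assuming X is the N × J matrix of normalised DVs:

d_dv <- dist(t(X))      # Euclidean d(j, j') between DV columns
hc <- hclust(d_dv)      # hierarchical clustering of the DVs
plot(hc)                # dendrogram of DV clusters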

3.4. Two-Stage Threshold Poisson Model

The TP model fits the Poisson regression of Section 2.1.1 twice, at stages 1 and 2, with the aim of classifying drivers into safe and risky groups at stage 1 and determining predictive DVs for each group at stage 2 using a single-component Poisson model. The DVs for the TP models were selected from $J = J_0 = 65$ DVs.
At stage 1, lasso-regularised Poisson regression models were trained through resampling and applied to predict claim frequencies for all drivers. To ensure model robustness and reduce overfitting, we repeated the regularised Poisson regression $R = 100$ times with 70% subsamples of size $N_r = 9910$ each and selected DVs for each repeat. For each repeat $r$, the optimal $\lambda_{\min,r}$ was selected with $K = 10$ (default setting) fold CV. Then, the DVs most frequently selected were identified using a weighted count $I_j$ in (A3) based on the RMSE. There were $J^{T1} = 52$ selected DVs for the TPL-1 model and $J^{T1} = 39$ DVs for the TPA-1 model. The details are provided in Appendix A. Then, Poisson regression models with the selected DVs were refitted to all drivers. Parameter estimates $\boldsymbol{\beta}^{T1}$ are reported in Table A2 in Appendix E under $\beta_j^{T1}$. To visualise these coefficients, Figure 2a plots the heat map of $\boldsymbol{\beta}^{T1}$ for all TP-1 models. Let $J_S^{T1}$ be the number of significant $\beta_j^{T1}$ with a $p$ value < 0.05. For the TPA-1 model, Table A2 shows that $J_S^{T1} = 13$ out of the $J^{T1} = 44$ selected DVs are significant. Table 3a shows that TPL-1 and TPE-1 selected a similar number $J^{T1}$ of DVs. The same applies to models TPA-1 and TPN-1 with adaptive weights. As feature selection is important, we selected the best model in each group, and they are highlighted in Table 3a.
Table 3. Model performance measures for stage 1 TP and ZIP models, stage 2 TP and PM models, and final selection.
At stage 2, predicted claims $\hat{y}_i = \hat{\mu}_i$ were calculated using the fitted means in (1) and $\boldsymbol{\beta}^{T1}$. Then, the predicted annual claim frequencies $\hat{a}_i = \hat{y}_i / n_i$ were calculated, and drivers were classified into low- and high-claim groups according to $\hat{a}_i < \tau_h$ and $\hat{a}_i \geq \tau_h$, respectively. We considered four thresholds $\boldsymbol{\tau} = \{\tau_h\} = (0.08, 0.09, 0.10, 0.11)$, and the proportion $P_h$ of drivers classified into the low-claim group out of all drivers was (0.70, 0.79, 0.85, 0.90), respectively. Figure 3 shows how the drivers were classified into low- and high-claim groups according to the four thresholds $\boldsymbol{\tau}$ using $\hat{a}_i$ from TPA-1 and visualises the relationship between the observed $a_i$ and predicted $\hat{a}_i$. We attribute the nonlinear pattern in these scatter plots partially to the impact of driving behaviour revealed by the DVs.
To improve model robustness and reduce overfitting, subsamples $S_{1:R}$ of size $N_r = 0.7N$ were drawn again, and each $S_r$ was further split into two groups
$$G_{L,rh}^{T2} = \{y_i \in S_r : \hat{a}_i < \tau_h\} \quad \text{and} \quad G_{H,rh}^{T2} = \{y_i \in S_r : \hat{a}_i \geq \tau_h\}, \qquad h = 1, \dots, 4,$$
where T2 indicates stage 2 of the TP model. Then, regularised Poisson regression was applied to each $G_{L,rh}^{T2}$ and $G_{H,rh}^{T2}$. Let the index sets $I_{Lh}^{\beta}$ and $I_{Hh}^{\beta}$ for nonzero coefficients (that is, those selected at least once from $S_r$) be defined similarly to $I^{\beta}$ in (A2) for $h = 1, \dots, 4$. Then, $\boldsymbol{\beta}_{Lh} = (\beta_{Lh,j}, j \in I_{Lh}^{\beta})$ and $\boldsymbol{\beta}_{Hh} = (\beta_{Hh,j}, j \in I_{Hh}^{\beta})$ are the averaged parameter estimates for the low- and high-claim groups, respectively, obtained in a similar manner to $\boldsymbol{\beta}$ defined in (A1), and $I_{Lh}^{T2}$ and $I_{Hh}^{T2}$ are importance measures based on $\mathrm{RMSE}_{r,Lh}$ and $\mathrm{RMSE}_{r,Hh}$, defined similarly to $I$ in (A3). $J_{Lh}^{T2}$ and $J_{Hh}^{T2}$ are the numbers of frequently selected DVs out of $\boldsymbol{\beta}_{Lh}$ and $\boldsymbol{\beta}_{Hh}$, with $I_{Lh,j} > 43$ ($\approx 62 \times 0.70$, where $R_1 = 0.70$ for $\tau_1 = 0.08$ and $\max(I_j, j \in I^{\beta}) = 62$ is the lower threshold of $I_j$), $49$ ($\tau_2 = 0.09$), $53$ ($\tau_3 = 0.10$), and $56$ ($\tau_4 = 0.11$) for the low-claim group, and $I_{Hh,j} > 19$ ($\approx 62 \times 0.30$), $13$, $9$, and $6$, respectively, for the high-claim group (DVs below these counts were dropped). Poisson regression models with the various selected DVs were refitted to the low- and high-claim groups for each $h$. Table A3 in Appendix E reports the parameter estimates $\boldsymbol{\beta}_{Lh}^{T2}$, $\boldsymbol{\beta}_{Hh}^{T2}$ of the best model (TPLA-2 for $\tau = 0.08, 0.09$ and TPLN-2 for $\tau = 0.10, 0.11$) when the stage 1 model was TPL-1 (from $J^{T1} = 52$ selected DVs) or TPA-1 ($J^{T1} = 39$). Table 3b reports the numbers $J_{Lh}^{T2}$, $J_{Hh}^{T2}$ of selected $\beta_{Lh,j}^{T2}$ and $\beta_{Hh,j}^{T2}$ and the numbers $J_{LSh}^{T2}$, $J_{HSh}^{T2}$ of significant $\beta_{Lh,j}^{T2}$ and $\beta_{Hh,j}^{T2}$ with $p$ values < 0.05.
To visualise these coefficients, Figure 2b presents the heat map for the models in the low- and high-claim groups. It shows that the DVs mostly selected and significant in the low-claim groups are 4, $9^*$, $18^{**}$, 24, 32, $37^*$, $47^*$, 57, $67^{**}$, $73^*$, and 74, while the least selected are 2, 7, 15, 16, $23^*$, $46^*$, and 53. For the high-claim group, DVs 4, $29^*$, $52^*$, and $67^{**}$ were mostly selected. The information content of each DV is indicated by the asterisks. See Table 2 for details and Table A1 for the interpretation of these DVs. We observed that more DVs were selected for the low-claim group with thresholds $\tau_{2:4} = 0.09, 0.10, 0.11$, and the selected DVs are relatively more informative. Two DVs, 4 and 67, were selected in both the low- and high-claim groups but with differential effects: DV 4 was negative for the low-claim group and positive for the high-claim group, whereas DV 67 had a consistent negative effect in both groups.
For model selection, Table 3b summarises the model performance measures BIC, MSE, $\rho$, and AUC (see Section 2.3) using all data for 32 models, with 8 models under each threshold. The two criteria BIC and AUC in Table 3b were averaged over the two groups using the ratio $R_h$ in Table A3. For each threshold, the top-ranked measures and the sum of ranks $R_M$ in (10) are boldfaced and highlighted in yellow. We first dropped those models with $J_h^{T2}, J_{Sh}^{T2} = 0$ for either group and chose the best model $M$ with the top $R_M$. The best stage 2 model is TPLA-2 for $\tau = 0.08, 0.09$ and TPLN-2 for $\tau = 0.10, 0.11$. These results will be compared with the PM and ZIP models in Section 3.7.

3.5. Poisson Mixture Model

To facilitate driver classification, we considered lasso-regularised PM models. To robustify our results, we again performed 70% resampling to obtain $R = 100$ subsamples of size $N_r = 9910$. The parameters were selected from the $J = J_1 = 45$ more informative DVs to provide stable results (see Section 3.3). In each subsample $S_r$, the regularised PM model was estimated using $K = 10$ fold CV. Then, the drivers were classified into $G_{L,r}^{M}$ and $G_{H,r}^{M}$ according to $\hat{z}_{ig} \geq 0.5$ or $< 0.5$, respectively, where $\hat{z}_{ig}$ was defined in (3). Let the index sets $I_L^{\beta}$ and $I_H^{\beta}$ for nonzero coefficients (that is, those selected at least once from $S_r$) be defined similarly to $I^{\beta}$ in (A2). Then, $\boldsymbol{\beta}_L = (\beta_{L,j}, j \in I_L^{\beta})$ and $\boldsymbol{\beta}_H = (\beta_{H,j}, j \in I_H^{\beta})$ are the averaged parameter estimates for the low- and high-claim groups, respectively, obtained in a similar manner to $\boldsymbol{\beta}$ defined in (A1); $I_L^{M}$ and $I_H^{M}$ are importance measures based on $\mathrm{RMSE}_{r,L}^{M}$ and $\mathrm{RMSE}_{r,H}^{M}$, defined similarly to $I$ in (A3); $R^{M}$ is the average ratio of the low-group size over the $R = 100$ subsamples; and $J_L^{M}$ and $J_H^{M}$ are the numbers of selected DVs out of $\boldsymbol{\beta}_L$ and $\boldsymbol{\beta}_H$, with $I_{L,j}^{M} > 43$ ($\approx 62 \times 0.69$, with $R^{M} = 0.69$ for PML) or $45$ ($\approx 62 \times 0.73$ for PMA) and $I_{H,j}^{M} > 19$ ($\approx 62 \times 0.31$ for PML) or $17$ ($\approx 62 \times 0.27$ for PMA), similarly to $J^{T1}$. We note that some subsamples had too small a sample size (<200) for the high-claim group, or too small a difference (<0.005) in observed annual claim frequencies between the two groups, or both. Both criteria indicate ineffective grouping, so such subsamples were eliminated. Consequently, 172 and 113 subsamples were drawn for the PML and PMA models, respectively, in order to collect 100 effective subsamples.
To obtain the overall parameter estimates $\boldsymbol{\beta}_L^{M}$ and $\boldsymbol{\beta}_H^{M}$, the selected DVs were refitted to the PM model again. Table A4 in Appendix E reports $\boldsymbol{\beta}_L^{M}$ and $\boldsymbol{\beta}_H^{M}$ of the PML and PMA models, together with $I_L^{M}$ and $I_H^{M}$. Table 3b reports the numbers of DVs selected, $J_L^{M}$ and $J_H^{M}$ ($\beta_{L,j}^{M}, \beta_{H,j}^{M} \neq 0$ and $I_{L,j}^{M}, I_{H,j}^{M} > 62$), and the numbers of significant selected DVs, $J_{LS}^{M}$ and $J_{HS}^{M}$, for each PM model. We note that the PM models generally had more selected and significant DVs for the high-claim group than the TP models. Figure 2c plots the heat map of the parameter estimates of the two groups for the four PM models. Across all four models, the two sets of mostly selected and significant variables are ($18^{**}$, $20^*$, $26^*$, $29^*$, $37^*$, $45^*$, $58^*$, $61^*$, $67^{**}$, 75) and ($18^{**}$, $29^*$, $31^*$, $36^*$, $59^*$, $67^{**}$, $73^*$, $75^*$, $77^*$) for the low- and high-claim groups, respectively. These two sets of significant DVs are quite different from those of the TP models, as only two DVs from each group (in boldface) were also selected by the TP models. Again, DV 67 was selected in both groups, as in the case of the TP models. To select the best PM model, Table 3b reports the performance measures BIC, MSE, $\rho$, and AUC. According to the sum of ranks $R_M$ in (10), the PMA model was selected. For the selected PMA model, Table A4 shows that $J_L^{M} = 39$ DVs were selected for the low-claim group and $J_H^{M} = 40$ DVs for the high-claim group, of which six (18, 20, 45, 58, 61, 75) and five (29, 36, 59, 67, 73) DVs, respectively, are significant. See point 5 in Appendix B for the implementation of the PM models, Appendix C for the interpretation of the significant selected DVs, and Section 3.7 for the implications of these DVs for risky driving.

3.6. Zero-Inflated Poisson Model

Since 92% of the claims are zero, we applied the ZIP model in (5) and (6) (Lambert 1992; Zeileis et al. 2008) to capture the structural zero portion of the claims and to test whether a structural zero claim group should be included in modelling claims. As with the TP and PM models, we drew subsamples $S_{1:100}$, each with $N_r = 9910$ drivers, and lasso-regularised ZIP regression was applied to each $S_r$ to robustify the selection of DVs. The procedures were similar to those for the TP and PM models. As with the PM model, the DVs were selected from the $J = J_1 = 45$ more informative DVs to provide stable results.
Let $I_0^{\beta}$ and $I_c^{\beta}$ be the index sets of nonzero parameter estimates (that is, those selected at least once from $S_r$) in the zero and count models, respectively, defined in a similar manner to $I^{\beta}$ in (A2). Then, the averaged parameter estimates $\boldsymbol{\beta}_0 = (\beta_{0,j}, j \in I_0^{\beta})$ for the zero model and $\boldsymbol{\beta}_c = (\beta_{c,j}, j \in I_c^{\beta})$ for the count model are obtained in a similar manner to $\boldsymbol{\beta}$ in (A1); $I_0^{Z}$ and $I_c^{Z}$ are importance measures based on $\mathrm{RMSE}_{0,r}^{Z}$ and $\mathrm{RMSE}_{c,r}^{Z}$, respectively, defined similarly to $I$ in (A3); and $J_0^{Z}$ and $J_c^{Z}$ are the numbers of selected DVs out of $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_c$, with $I_{0,j}^{Z} > 62$ and $I_{c,j}^{Z} > 62$, similarly to $J^{T1}$. Next, the $J_0^{Z}$ and $J_c^{Z}$ selected DVs were refitted to the ZIP model on all data to obtain the overall parameter estimates $\boldsymbol{\beta}_0^{Z}$ and $\boldsymbol{\beta}_c^{Z}$. The parameters $\boldsymbol{\beta}_0$, $\boldsymbol{\beta}_c$ were averaged before the refit, the parameters $\boldsymbol{\beta}_0^{Z}$, $\boldsymbol{\beta}_c^{Z}$ were averaged after the refit, and the importance measures $I_0^{Z}$, $I_c^{Z}$ are reported in Table A4. Table 3a reports the numbers of selected DVs $J_0^{Z}$ and $J_c^{Z}$ and the performance measures AIC, BIC, and MSE for the ZIPL and ZIPA models, following the regularisation choices of the two chosen TPL-1 and TPA-1 models. Between the two ZIP models, the ZIPA model was chosen because it selected nonzero DVs for the zero component and had a lower BIC. However, its MSE was the highest among all the models shown in Table 3b, indicating low predictive power. More importantly, the zero model estimates the probability of a structural zero among all zeros, but it does not guide the classification of safe and risky drivers, because safe drivers may claim less but not necessarily nothing, and risky drivers may claim nothing by luck or for a no-claim bonus. Hence, drivers classified into the structural zero claim group are not necessarily safe drivers. As a result, the ZIPA model was excluded from the model comparison and selection. See point 6 in Appendix B for the implementation of the ZIP models.

3.7. Model Comparison and Selection

We compared the performance of the TP and PM models in terms of claim prediction, predictive DV selection, and driver classification. Table 3b displays the performance of all 32 TP models and four PM models. The best TP model for each threshold and the best PM model were selected according to R^M in (10) using four measures. Among the selected TPLA-2 (τ = 0.08, 0.09), TPLN-2 (τ = 0.10, 0.11), and PMA models, Table 3c shows that the PMA model emerged from the final selection using six count model scores, confirming the superiority of the PM model in many respects and its preferability over the Poisson model. Interpreting the significant selected DVs (29, 36, 59, 67, 73; all with positive coefficients except 59) of the PMA model, risky driving is associated with more frequent severe braking to slow down on weekday and weekend nights, as well as more frequent severe right turns at junctions on weekday nights and during the Friday rush.
Apart from the numerical measures, we also visualised performance using ROC curves. Section 2.3 introduces the AUC according to three classes of drivers with y_i = 0, 1, ≥ 2 or an overall class of all drivers. For each of these classes, the ROC curve was drawn for the binary classifier of low- and high-claim groups, comparing the predicted group with the proxy observed group of each driver estimated by K-means clustering, using the J^{T1} = 52 selected DVs (from the TPL-1 model) for all stage 2 TP models and the J_1 = 45 informative DVs for the PM model. Figure 4 plots the two clusters using the first two principal components (PCs), which explain 22.11% and 17.44% of the variance, respectively, using the selected DVs for the two cases. The results show that the first PC separates the two clusters well in both cases. For the stage 2 TP (PM) model, N_1 = 9450 (9372) drivers were assigned to the low-claim cluster, accounting for 67% (66%) of drivers using K-means clustering. These two proportions of the low-claim group are close to the estimated proportions P_h = 0.7 to 0.89 in Table A3 using the TPL-1 model, as well as P^M = 0.73 using the PMA model.
Figure 5a–e plots the ROC curves with AUC values using the classifier ŷ_i/n_i for the best stage 2 TP models under each threshold and the classifier ẑ_{i1} in (3) for the best PMA model. The AUCs are calculated for the subgroups y_i = 0 (red), 1 (blue), ≥ 2 (green) in Figure 5a–e and for all the data in Figure 5f. The zigzag patterns of the green lines indicate small sample sizes for drivers with y_i ≥ 2. Table 3b reports the overall AUC values, which are the weighted averages of the three AUC values in each of Figure 5a–e; they show that model TPLN-2 with τ = 0.10 is the best classifier of low- and high-claim groups, while the PMA model ranks third in classifying power. However, the accuracy of these results depends on how well K-means clustering estimates the true latent groups.
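To illustrate how such an overall AUC can be assembled, the sketch below computes the subgroup AUCs with pROC and weights them by subgroup size; the vectors y (observed claims), grp (proxy observed group from K-means), and score (classifier value, e.g., ŷ_i/n_i or ẑ_{i1}) are assumed inputs, and each subgroup is assumed to contain drivers from both groups.

    library(pROC)

    ## Weighted-average AUC over the claim subgroups y = 0, 1, >= 2;
    ## y, grp (0/1), and score are hypothetical vectors, one entry per driver.
    overall_auc <- function(y, grp, score) {
      subgroups <- list(y == 0, y == 1, y >= 2)
      aucs <- sapply(subgroups, function(idx)
        as.numeric(auc(roc(grp[idx], score[idx], quiet = TRUE))))
      wts <- sapply(subgroups, sum) / length(y)   # subgroup proportions
      sum(wts * aucs)                             # weighted-average AUC
    }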
Since zero-claim drivers are not necessarily safe drivers once the DVs are taken into account, we expect some zero-claim drivers to be risky and some non-zero-claim drivers to be safe. This disagreement reflects the impact of the DVs on assessing driving risk beyond the claim information. Zero disagreement (ZD) is defined as the proportion, out of all drivers, of zero-claim drivers classified as risky plus non-zero-claim drivers classified as safe. The ZD is 3.2% for the best PMA model. This small ZD is consistent with the low MSE, showing agreement between the predicted and observed claims.
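Computationally, the ZD is a one-line summary; a minimal sketch, assuming y holds the observed claim counts and risky the model's 0/1 risk classification:

    ## Zero disagreement: zero-claim drivers classified risky plus
    ## non-zero-claim drivers classified safe, as a share of all drivers.
    zd <- mean((y == 0 & risky == 1) | (y > 0 & risky == 0))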

4. UBI Experience Rating Premium

In the context of general insurance, a common approach for assessing risk in a typical short-tail portfolio is to multiply the predicted claim frequency by the claim severity to determine the risk premium. This risk premium is then combined with the profit margin and operating expenses to determine the final premium charged to customers. This paper centres on claim frequency; in the premium calculation discussed herein, we assume that claim severity remains constant. Consequently, the premium calculation relies on predicting claim frequency.
The traditional experience rating method prices premiums using historical claims and offers the same rate to drivers within the same risk group (low/high or safe/risky). If individual claim histories are available, premiums can be calculated using individual claims relative to overall claims, both from historical records. However, although this extended historical experience rating method can capture individual differences of risk within a group, it still fails to reflect drivers' recent driving risk. The integration of telematic data enables us to tailor pricing to current individual risks; we call this enhanced method the UBI experience rating method and leverage it as a strategic refinement of our premium pricing methodology.
Suppose that a new driver i is classified to claim group g with index set C_g of all drivers in this group, i ∈ C_g. Let P_{it} be his premium for year t, L_{i,t−1} his historical claim/loss in year t−1, L_{t−1}^g = Σ_{i∈C_g} L_{i,t−1} the total claim/loss of the claim group g to which driver i is classified, and P_{t−1}^g = Σ_{i∈C_g} P_{i,t−1} the total premium of claim group g. Moreover, suppose that the best PMA model was trained using the sample of drivers. Let x_{i,t} be the observed DVs for driver i at time t; let g = 1 (safe group) be the classified group if the group indicator ẑ_{i1} > 0.5 and g = 2 (risky group) otherwise; let ŷ_{i,t} in (4) be the predicted claim frequency given x_{i,t}; let ŷ_t^g = (Σ_{i∈C_g} ŷ_{i,t})/N_g be the average predicted claim frequency of the claim group g to which driver i is classified; and let N_g be the size of group g.
Using the proposed UBI experience rating method, the premium P_{it} for driver i in year t is given by
$$P_{it,\varkappa} = \left(1 + R_{i,t}^{\Delta}\right) \bar{P}_t^g \, E_{it} \, F + \bar{P}_t^* \, E_{it} \, (1 - F),$$
where P̄_t^g is the group average annual premium in period t from the group data, P̄_t^* is the average annual premium from all data or some other data source, F is the credibility factor (Dean 1997), E_{it} is the exposure of driver i, and R_{i,t}^Δ is the individual adjustment factor to the overall group loss ratio given by
$$R_{i,t}^{\Delta} = R_{i,t-1}^{\Delta,H} + \varkappa \, R_{i,t}^{\Delta,\mathrm{UB}},$$
which is the sum of the historical loss rate change adjustment R_{i,t−1}^{Δ,H} and the weighted UBI predicted loss rate change adjustment R_{i,t}^{Δ,UB}; ϰ ∈ [0, 1] is the UBI policy parameter determining how much UBI adjustment R_{i,t}^{Δ,UB} enters R_{i,t}^Δ when updating the premium to account for current driving behaviour. The historical loss rate change R_{i,t−1}^{Δ,H}, historical individual loss ratio R_{i,t−1}, and historical group loss ratio R_{t−1}^g are, respectively,
$$R_{i,t-1}^{\Delta,H} = \frac{R_{i,t-1} - R_{t-1}^g}{R_{t-1}^g}, \qquad R_{i,t-1} = \frac{L_{i,t-1}}{P_{i,t-1}}, \qquad R_{t-1}^g = \frac{L_{t-1}^g}{P_{t-1}^g}.$$
The UBI predicted loss rate change R_{i,t}^{Δ,UB}, UBI predicted individual loss ratio R_{i,t}^y, and UBI predicted group loss ratio R_t^{y,g} are, respectively,
$$R_{i,t}^{\Delta,\mathrm{UB}} = \frac{R_{i,t}^y - R_t^{y,g}}{R_t^{y,g}}, \qquad R_{i,t}^y = \frac{\hat{y}_{i,t}}{P_{i,t}}, \qquad R_t^{y,g} = \frac{\hat{y}_t^g}{P_t^g}.$$
The credibility factor F is the weight of the best linear combination of the premium estimate (1 + R_{i,t}^Δ) P̄_t^g from the sample data and the premium estimate P̄_t^* from all data or from another source, improving the reliability of the premium estimate P_{it}. The credibility factor increases with the business size and, hence, the number of drivers in the sample. Dean (1997) provided some methods to estimate F and suggested full credibility, F = 1, when the sample size N is large enough, such as above 10,000 in an example. As this requirement is fulfilled for the telematic data with size N = 14,157, and all data are used to estimate the chosen PMA model, full credibility F = 1 was applied. Where fewer insured vehicles are in the sample, the credibility factor F may vary, and external data sources may be used to improve the reliability of the premium estimate. Moreover, as the selected PMA model can classify drivers, the premium calculation can focus on the classified driver group to provide a more precise premium.
We give an example to demonstrate the experience rating method and its extension to UBI. Suppose that driver i is classified as a safe driver (g = 1) in a driving test and wants to buy auto insurance for the next period (E_{it} = 1). As summarised in Table 4, the annual premium is P̄_t^1 = $300 for the safe group and P̄_t^2 = $500 for the risky group. Driver i previously recorded an annual claim frequency of L_{i,t−1} = 0.2 and paid an annual premium of P_{i,t−1} = $500. The safe group recorded an average annual claim frequency of L_{t−1}^1 = 0.1 and paid an average annual premium of P_{t−1}^1 = $310 per driver, while the risky group recorded an average of L_{t−1}^2 = 0.3 claims/loss and paid P_{t−1}^2 = $510 in annual premium per driver. Driver i thus has more claims than the safe-group average. According to these historical claim frequencies, driver i is expected to be relatively riskier than the average safe-group driver, so he should pay more.
To illustrate the UBI experience rating method, additional assumptions about the predicted annual claim frequencies are added in the last row of Table 4. Assume that driver i has a predicted annual claim frequency of ŷ_{i,t−1} = 0.15; the corresponding figure is ŷ_{t−1}^1 = 0.105 for the safe group and ŷ_{t−1}^2 = 0.305 for the risky group. This suggests that driver i now operates his vehicle more safely than his historical claims indicate.
Taking the policy parameter ϰ = 1, the UBI experience rating premium is
$$P_{it,1} = \left(1 + R_{i,t}^{\Delta}\right) \bar{P}_t^1 \, E_{it} \, F = (1 + 0.1260) \times 300 \times 1 \times 1 = \$337.80,$$
where
$$R_{i,t}^{\Delta} = R_{i,t-1}^{\Delta,H} + 1 \times R_{i,t}^{\Delta,\mathrm{UB}} = 0.2403 - 0.1143 = 0.1260,$$
and the historical loss rate change R_{i,t−1}^{Δ,H}, the historical loss ratio R_{i,t−1} for driver i, the historical loss ratio R_{t−1}^1 for the safe group, the UBI predicted loss rate change R_{i,t}^{Δ,UB}, the UBI predicted loss ratio R_{i,t}^y for driver i, and the UBI predicted loss ratio R_t^{y,1} for the safe group are, respectively,
$$R_{i,t-1}^{\Delta,H} = \frac{R_{i,t-1} - R_{t-1}^1}{R_{t-1}^1} = \frac{0.4 - 0.3225}{0.3225} = 0.2403, \qquad R_{i,t-1} = \frac{L_{i,t-1}}{P_{i,t-1}} = \frac{0.2}{0.5} = 0.4, \qquad R_{t-1}^1 = \frac{L_{t-1}^1}{P_{t-1}^1} = \frac{0.1}{0.31} = 0.3225,$$
$$R_{i,t}^{\Delta,\mathrm{UB}} = \frac{R_{i,t}^y - R_t^{y,1}}{R_t^{y,1}} = \frac{0.3 - 0.3387}{0.3387} = -0.1143, \qquad R_{i,t}^y = \frac{\hat{y}_{i,t}}{P_{i,t-1}} = \frac{0.15}{0.5} = 0.3, \qquad R_t^{y,1} = \frac{\hat{y}_t^1}{P_t^1} = \frac{0.105}{0.31} = 0.3387,$$
using (12). Thus, the premium for driver i under the UBI experience rating method is $337.80. This premium is higher than the safe-group premium P̄_t^1 = $300 because driver i's loss ratio is high relative to the overall safe-group ratio based on historical claims. However, his current loss ratio, reflecting current safe driving, reduces the adverse effect of the higher historical claims.
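To make the calculation reproducible, the sketch below implements the premium formula (12) with the adjustment (13); the function name and argument names are ours, not the paper's. Carrying the exact ratios gives $337.71; the $337.80 above reflects rounding the intermediate loss ratios to four decimal places.

    ## UBI experience rating premium: a sketch of (12) and (13).
    ubi_premium <- function(L_prev, P_prev,     # driver's claims and premium, year t-1
                            Lg_prev, Pg_prev,   # group averages, year t-1
                            y_hat, yg_hat, Pg,  # predicted frequencies, group premium
                            P_bar_g,            # group average annual premium
                            P_bar_all = P_bar_g,
                            kappa = 1, E = 1, F = 1) {
      R_H  <- (L_prev / P_prev) / (Lg_prev / Pg_prev) - 1   # historical adjustment
      R_UB <- (y_hat / P_prev) / (yg_hat / Pg) - 1          # UBI adjustment
      R    <- R_H + kappa * R_UB
      (1 + R) * P_bar_g * E * F + P_bar_all * E * (1 - F)
    }

    ## Table 4 assumptions: premiums in dollars, claims as frequencies.
    ubi_premium(L_prev = 0.2,  P_prev  = 0.5,
                Lg_prev = 0.1, Pg_prev = 0.31,
                y_hat = 0.15,  yg_hat  = 0.105, Pg = 0.31,
                P_bar_g = 300)   # ~ 337.71

Setting kappa = 0 in the same call recovers the historical experience rating premium discussed below ($372 with exact ratios, $372.09 with the rounded ones).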
Nevertheless, we recognise that not all insured vehicles are equipped with telematic devices, introducing potential gaps in the telematics data. In response, the UBI policy parameter ϰ in (13) can be set to 0. This adaptation of the UBI pricing model in (12) also allows application to newly insured drivers with only historical records (traditional demographic variables). This premium, called the historical experience rating premium, for driver i during period t is
$$P_{it,0} = \left(1 + R_{i,t-1}^{\Delta,H}\right) \bar{P}_t^1 \, E_{it} \, F = (1 + 0.2403) \times 300 \times 1 \times 1 = \$372.09,$$
where the historical loss rate change R_{i,t−1}^{Δ,H} is given by (16). This loss rate change can capture individual differences within a claim group using historical claims but fails to reflect recent driving risk. Hence, this premium is higher than the UBI experience rating premium calculated using both historical and current driving experience. Thus, the historical experience rating method is unable to provide immediate compensation/reward for safe driving.
Moreover, the UBI premium can track driving behaviour more frequently and closely using a regularly updated claim class and annual claim frequency prediction ŷ_{i,t}. The updating period can be reduced to monthly or even weekly to provide more instant feedback using live telematic data. In summary, the proposed UBI experience rating premium corrects the loss rate change R_{i,t}^Δ of the experience-rating-only premium using the sum of the historical loss rate change R_{i,t−1}^{Δ,H} and the UBI predicted loss rate change R_{i,t}^{Δ,UB}. The proposed PMA model can predict the annual claim frequencies ŷ_{i,t} more instantly using live telematic data, so the UBI premium can be updated more frequently to provide incentives for safe driving. The proposed UBI experience rating premium is an incremental innovation to business processes, allowing a company to transition gradually to the new UBI regime by adjusting the UBI policy factor ϰ in (13): ϰ can increase gradually from 0 to 1 if driver i wants his premium to progressively account for his current driving.
We remark that our analyses make a few assumptions. Firstly, we assume that the annual premium P̄_t^g covers the total cost with possibly some profit, and that the expectations of the loss rate changes R_{i,t−1}^{Δ,H} and R_{i,t}^{Δ,UB} across drivers i in group g are around zero. To assess the validity of these assumptions, one can obtain the distributions of R_{i,t−1}^{Δ,H} and R_{i,t}^{Δ,UB} based on the most recent data. If their means m_g^{Δ,H}, m_g^{Δ,UB} are not zero, the overall loss rate change R_{i,t}^Δ in (13) can be adjusted as
$$R_{i,t}^{\Delta} = \left(R_{i,t-1}^{\Delta,H} - m_g^{\Delta,H}\right) + \varkappa \left(R_{i,t}^{\Delta,\mathrm{UB}} - m_g^{\Delta,\mathrm{UB}}\right)$$
for group g. For conservative purposes, the means m_g^{Δ,H}, m_g^{Δ,UB} can be replaced by, say, the 75% quantiles q_{g,0.75}^{Δ,H}, q_{g,0.75}^{Δ,UB} of the distributions. Secondly, the method implicitly assumes perfect or near-perfect monitoring. However, the advent of monitoring technologies reduces the extent of asymmetric information between insureds and insurers and reduces moral hazard costs.
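A short sketch of this adjustment, following the mean-centred reading of the formula above; R_H and R_UB are assumed vectors of the two adjustment factors over the drivers in group g:

    ## Centre the adjustment factors by their group means or, for a more
    ## conservative premium, by the 75% quantiles of their distributions.
    m_H  <- mean(R_H);            m_UB <- mean(R_UB)
    q_H  <- quantile(R_H, 0.75);  q_UB <- quantile(R_UB, 0.75)

    kappa <- 1
    R_adj <- (R_H - m_H) + kappa * (R_UB - m_UB)   # mean-adjusted
    R_con <- (R_H - q_H) + kappa * (R_UB - q_UB)   # quantile-adjusted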

5. Conclusions

In summary, our study, based on claim data from 14,157 drivers exhibiting equidispersion and a substantial 92% of zero claims, introduces a novel approach using two-stage TP, PM, and ZIP regressions. Employing regularisation techniques such as lasso, elastic net, adaptive lasso, and adaptive elastic net, we aimed to predict annual claim frequencies, identify significant DVs, and categorise drivers into low-claim (safe driver) and high-claim (risky driver) groups. To ensure the robustness of our findings, we performed 100 resampling iterations, each comprising 70% of the drivers, for all TP, PM, and ZIP models. Our empirical results show that the PMA model, with adaptive lasso regularisation, displayed the best performance in this study. This finding provides relevant guidance for practitioners and researchers, as the analysis is based on a sound representative telematics sample. Moreover, the PMA model is highly favoured in Table 3c, and its implementation is more straightforward than that of the TP models.
Furthermore, we proposed utilising the best-performing PMA model to implement a UBI experience rating method, aiming to enhance the efficiency of premium pricing strategies. This approach shifts the focus from traditional claim history to recent driving behaviour, offering a nuanced assessment of drivers' risk profiles. Notably, our proposed UBI premium pricing method departs from the annual premium revision characteristic of traditional methods and instead allows more frequent updates based on recent driving performance, providing instant rewards for safe driving practices and feedback against risky driving using scores of the selected significant DVs for the high-claim group. This dynamic pricing approach not only incentivises responsible and less frequent driving but also minimises the cross-subsidisation of risky drivers. By enabling a more accurate and timely reflection of driver risk, UBI contributes to improved loss reserving practices for the auto insurance industry. In essence, our findings support the adoption of UBI experience rating methods as a progressive and effective means of enhancing both driver behaviour and the overall operational efficiency of auto insurance companies.
To implement the PMA models for premium pricing, Section 3.5 provides the modelling details and Appendix B.5 the technical application. If this proves challenging, some data analytics companies are experienced in handling telematics data, running PMA models, predicting drivers' claims, and revising the UBI experience rating premiums. The updating frequency for models, claim predictions, and premiums depends on the available resources and the type of policies. As a suggestion, the PMA models can be updated annually to reflect changes in road conditions, transport policies, etc., and the drivers' predicted annual claims can be updated fortnightly or monthly depending on drivers' mileage. When predicted annual claims are updated, the premium can also be updated to provide an incentive for good driving. Averaged and individual driving scores for the selected significant DVs (e.g., 29, 36, 59, 67, 73 for the high-claim group of the PMA model) can be sent, possibly with warnings, to inform drivers of their driving behaviour and encourage skill improvement. These selected significant DVs are associated with more frequent severe braking to slow down on weekday and weekend nights, as well as more frequent severe right turns at junctions on weekday nights and during the Friday rush.
In the context of future research within this domain, expanding the classification of driver groups to three or more holds the potential to encompass a wider range of driving styles, ultimately leading to more accurate predictions of claim liability. Introducing an intermediary driver group, distinct from the existing safe and risky classifications, offers an avenue to capture unique driving behaviours and potentially enhances the predictive power of our models. This extension not only enables a closer examination of different driving behaviours but also poses challenges in terms of identifying and interpreting these additional groups. While the application of similar mixture models and regularisation techniques for modelling multiple components remains viable, unravelling the intricacies of distinct groups within the expanded framework introduces interpretative complexities. Determining whether the third group is a composite of the existing two or represents a genuinely distinct category presents additional challenges. Moreover, handling the label switching problem becomes more intricate when dealing with mixture models featuring multiple groups.
A parallel trajectory for future exploration centers around the integration of neural networks as an alternative modelling approach. In contrast to the selection of key driving variables, neural networks employ hidden layers to capture intricate dynamics, incorporating diverse weights and interaction terms. This modelling paradigm allows for the application of network models to trip data without temporal aggregation, as exemplified by Ma et al. (2018), facilitating a more detailed analysis of driving behaviours in conjunction with real-time information on surrounding traffic conditions.

Author Contributions

Conceptualisation, J.S.K.C. and U.E.M.; methodology, J.S.K.C.; software, F.U. and A.X.D.D.; validation, F.U., J.S.K.C. and A.X.D.D.; formal analysis, F.U. and Y.W.; investigation, F.U. and Y.W.; resources, U.E.M.; data curation, F.U.; writing—original draft preparation, F.U., J.S.K.C. and Y.W.; writing—review and editing, J.S.K.C.; visualisation, F.U.; supervision, J.S.K.C.; project administration, J.S.K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
A      Adaptive lasso
AIC    Akaike information criterion
BIC    Bayesian information criterion
DVs    Driver behaviour variables
E      Elastic net
GLM    Generalized linear model
GPS    Global positioning system
IG     Information gain
L      Lasso
MSE    Mean squared error
N      Adaptive elastic net
NB     Negative binomial
PAYD   Pay As You Drive
PHYD   Pay How You Drive
PM     Poisson mixture
RMSE   Root mean squared error
ROC    Receiver operating characteristic curve
TP     Two-stage threshold Poisson
UBI    Usage-based auto insurance
ZIP    Zero-inflated Poisson

Appendix A. Details of Stage 1 TP Model Procedures

  • Draw subsamples S_r = {(x_i, n_i, y_i), i ∈ I_r}, r = 1, …, R, each containing N_r = 9910 drivers, where the index set I_r contains all sampled i. K-fold CV (K = 10) further splits S_r into 10 nonoverlapping, equal-sized (N_k = 991) CV sets
$$S_{rk} = \{(\mathbf{x}_i, n_i, y_i),\ i \in I_{rk}\}, \quad k = 1, \ldots, K, \text{ with index set } I_{rk},$$
and the training sets are S_{rk}^T = S_r ∖ S_{rk}, with index set I_{rk}^T = I_r ∖ I_{rk}. Set λ = (λ_1, …, λ_M), for some M, to be the list of potential λ values.
  • Estimate β_{λ_m,rk} = argmin_β LOSS_{λ,α,w}(β) in (7) for each λ_m ∈ λ and training set S_{rk}^T at repeat r and CV fold k. Find the optimal λ_m that minimises some regularised CV test statistic such as MSE, MAE, or deviance (Dev). Taking Dev as an example,
$$\lambda_{r,\min} = \operatorname*{argmin}_{\lambda_m \in \boldsymbol{\lambda}} \mathrm{Dev}_r(\lambda_m) = \operatorname*{argmin}_{\lambda_m \in \boldsymbol{\lambda}} \frac{1}{N_r} \sum_{k=1}^{K} \sum_{i \in I_{rk}} -2 \log f\!\left(y_{rki};\, \mu_{rki}(\boldsymbol{\beta}_{\lambda_m,rk})\right),$$
where the mean μ_{rki,λ_m} = exp(x_i′ β_{λ_m,rk} + log n_i). Among the MSE, MAE, and Dev statistics, the optimal λ_{r,min} using Dev is selected according to the RMSE of predicted claims over all subsamples. Using λ_{r,min}, β_r = (β_{r1}, …, β_{rJ}) is re-estimated on the subsample S_r. Figure A1a plots the Poisson deviance with SE against log(λ_m), showing how it drops to λ_{r,min} for the first subsample (r = 1). Figure A1b shows how β_{rj} shrinks to zero as λ increases.
  • Average the nonzero coefficients (selected at least once) over repeats as
$$\beta_j = \frac{\sum_{r=1}^{R} \beta_{rj}\, I(\beta_{rj} \neq 0)}{\sum_{r=1}^{R} I(\beta_{rj} \neq 0)}, \quad j \in I_{\beta}, \tag{A1}$$
where I(A) is the indicator function of event A and the index set
$$I_{\beta} = \{\, j : \beta_{rj} \neq 0 \text{ for some } r = 1, \ldots, R \,\} \tag{A2}$$
contains those DVs selected at least once over the R subsamples in stage 1. The averaged coefficients β_j, j ∈ I_β (based on Dev), are reported in Table A2 for the TPL-1 and TPA-1 models using the optimal λ_min. For example, DV 10 is not selected even once for the TPL-1 model.
  • Further select DVs that are frequently (not rarely) selected according to the weighted selection frequency measure
$$I_j = \sum_{r=1}^{R} \frac{1}{\mathrm{RMSE}_r}\, I(\beta_{rj} \neq 0), \quad j \in I_{\beta}, \tag{A3}$$
which weights inversely by RMSE_r. Superscripts T1, T2, M, and Z are added to I_j when applied to the stage 1 TP, stage 2 TP, PM, and ZIP models, respectively. These weighted selection counts I = (I_j, j ∈ I_β) using MSE, MAE, and deviance are also reported in Table A2 for the TPL-1 and TPA-1 models. Table 3a shows that the TPL-1 and TPA-1 models were selected according to the model performance measures AIC, BIC, and MSE. The results for the TPL-1 model in Table A2 show that the 12 DVs in deep grey highlight, with I_j < 0.2 max_{j∈I_β}(I_j) = 62, were dropped as rarely selected, resulting in J^{T1} = 65 − 1 − 12 = 52 DVs. These J^{T1} DVs can be interpreted as frequently selected DVs or simply selected DVs. A condensed sketch of this stage 1 procedure in R appears after this list.
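Assuming X is the driver-by-DV design matrix, y the vector of claim counts, and n the exposures (hypothetical names; the implementation details used in this study are in Appendix B), a condensed glmnet sketch of the steps above is:

    library(glmnet)

    set.seed(1)
    R  <- 100                                  # subsamples
    N  <- nrow(X); Nr <- round(0.7 * N)        # each with 70% of drivers
    B  <- matrix(0, R, ncol(X))                # stage 1 coefficients
    rmse <- numeric(R)

    for (r in seq_len(R)) {
      idx   <- sample(N, Nr)
      cvfit <- cv.glmnet(X[idx, ], y[idx], family = "poisson",
                         offset = log(n[idx]), nfolds = 10,
                         type.measure = "deviance")             # lambda_min by Dev
      B[r, ]  <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop intercept
      y_hat   <- predict(cvfit, newx = X[idx, ], newoffset = log(n[idx]),
                         s = "lambda.min", type = "response")
      rmse[r] <- sqrt(mean((y[idx] - y_hat)^2))
    }

    ## (A1): average the nonzero coefficients over repeats.
    beta_bar <- colSums(B) / pmax(colSums(B != 0), 1)
    ## (A3): RMSE-weighted selection counts; drop rarely selected DVs.
    I_j      <- colSums((B != 0) / rmse)
    selected <- which(I_j > 0.2 * max(I_j))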
Figure A1. (a) The Poisson deviance CV criterion across log λ_m to find λ_min. (b) Coefficients β_j across log λ for the stage 1 TP model using lasso regularisation based on the first subsample (r = 1).

Appendix B. Some Technical Details of Model Implementation

  • This study utilises the R command glm to fit Poisson regression and glmnet to fit Poisson regression with lasso regularisation (Zeileis et al. 2008). The latter begins by building a sparse design matrix with the R function sparse.model.matrix, as in
    data_feature <- sparse.model.matrix(~ ., dt_feature)
We use the argument penalty.factor in cv.glmnet for the adaptive lasso; a short sketch of this weighting appears after this list. We remark that the glmnet package does not provide p values, so we extract the p values for the selected DVs by refitting the model using the glm procedure.
  • We use the 100 simulated datasets in stages 1 and 2 of the TP and PM models to explore optimal α values in the elastic net. We first set up our 10-fold CV strategy. Using the caret package in R, we use train() with method = "glmnet" to fit the elastic net.
    XX  <- model.matrix(Claims ~ . - EXP - 1, data = stage1)   # design matrix
    YY  <- stage1$Claims                                       # claim counts
    OFF <- log(stage1$EXP)                                     # exposure offset

    Fit_stage1 <- caret::train(
      x          = cbind(XX, OFF),   # offset entered as a column, since
      y          = YY,               # caret::train() has no offset argument
      method     = "glmnet",
      family     = "poisson",
      tuneLength = 10,
      trControl  = caret::trainControl(method = "repeatedcv",
                                       number = 10, repeats = 100)
    )
  • We use roc() in the pROC package to calculate the AUC; the latex2exp package is used for the mathematical annotation in the ROC plots.
  • We implement the AER package in R using the built-in command dispersiontest(), which assesses the alternative hypothesis H_1: Var(Y_i) = μ_i + Ψ · trafo(μ_i), where the transformation function trafo(μ_i) = μ_i (the default, trafo = NULL) corresponds to the Poisson model with Var(Y_i) = (1 + Ψ) μ_i. A dispersion 1 + Ψ greater than 1 indicates overdispersion.
  • The PM regression model is estimated using
    FLXMRglmnet(formula = . ~ ., family = c("gaussian", "binomial", "poisson"),
                adaptive = TRUE, select = TRUE, offset = NULL, ...)
in the R package flexmix (Leisch 2004) to fit mixtures of GLMs with lasso regularisation. Setting adaptive = TRUE for the adaptive lasso triggers a two-step process: initially, an unpenalised model is fitted to obtain the preliminary coefficient estimates β̂_j for the penalty weights w_j = 1/|β̂_j|; the w_j values are then applied to each coefficient in the subsequent model fitting. With the selected DVs for the low- and high-claim groups, FLXMRglmfix() refits the model, provides the significance of the coefficients, predicts claims, supports CV values, and evaluates various goodness-of-fit measures.
  • The ZIP regression model is estimated using the zipath() function for lasso and elastic net regularisation and the ALasso() function for adaptive lasso regularisation from the mpath and AMAZonn packages. The optimal minimum lambda is searched via 10-fold cross-validation with cv.zipath() and applied to both fitted models, ZIPL and ZIPA, for R = 100 subsamples, each with 70% of the data. The full data are then refitted to the ZIP model based on the selected DVs using the zeroinfl() function.
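As referenced in the first point above, a minimal sketch of the adaptive lasso weighting via penalty.factor (our illustration, with hypothetical names X, y, and n for the design matrix, claim counts, and exposures):

    library(glmnet)

    ## Step 1: unpenalised Poisson fit gives preliminary estimates.
    pre <- glm(y ~ X, family = poisson, offset = log(n))
    w   <- 1 / abs(coef(pre)[-1])             # weights w_j = 1/|beta_j|

    ## Step 2: weighted (adaptive) lasso via penalty.factor.
    afit <- cv.glmnet(X, y, family = "poisson", offset = log(n),
                      penalty.factor = w, nfolds = 10)
    coef(afit, s = "lambda.min")              # adaptive lasso estimates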

Appendix C. Driving Variable Description

Event type
ACC  Acceleration Event: Accelerating / From full stop
  C1  Smooth acceleration (acceleration to 30 MPH in more than 12 s)
  C2  Moderate acceleration (acceleration to 30 MPH in 5–11 s)
BRK  Braking Event: Full stop / Slow down
  C1  Smooth, even slowing down (up to about 7 mph/s)
  C2  Mild to sharp brakes with adequate visibility and road grip (7–10 mph/s)
LFT  Left turning Event: None (interchange, curved road, overtaking) / At junction
  C1  Smooth, even cornering within the posted speed and according to the road and visibility conditions
  C2  Moderate cornering slightly above the posted speed (cornering with light disturbance to passengers)
RHT  Right turning Event: None (interchange, curved road, overtaking) / At junction
  C1 and C2 are the same as for LFT
Time type
T1  Weekday late evening, night, midnight, early morning
T2  Weekday morning rush, noon, afternoon rush
T3  Weekday morning, afternoon, no rush
T4  Friday rush
T5  Weekend night
T6  Weekend day
Table A1. Driving variable labels.
DV1   ACC_ACCELERATING_T3_C1    DV19  BRK_FULLSTOP_T1_C1    DV39  LFT_NONE_T1_C1          DV57  RHT_NONE_T1_C1
DV2   ACC_ACCELERATING_T3_C2    DV20  BRK_FULLSTOP_T1_C2    DV43  LFT_NONE_T6_C1          DV58  RHT_NONE_T1_C2
DV3   ACC_ACCELERATING_T4_C1    DV22  BRK_FULLSTOP_T2_C2    DV44  LFT_NONE_T6_C2          DV59  RHT_NONE_T4_C1
DV4   ACC_ACCELERATING_T4_C2    DV23  BRK_FULLSTOP_T3_C1    DV45  LFT_ATJUNCTION_T1_C1    DV60  RHT_NONE_T4_C2
DV5   ACC_ACCELERATING_T5_C1    DV24  BRK_FULLSTOP_T3_C2    DV46  LFT_ATJUNCTION_T1_C2    DV61  RHT_NONE_T5_C1
DV7   ACC_ACCELERATING_T5_C2    DV25  BRK_FULLSTOP_T4_C1    DV47  LFT_ATJUNCTION_T2_C1    DV63  RHT_NONE_T5_C2
DV8   ACC_FROMFULLSTOP_T1_C1    DV26  BRK_FULLSTOP_T4_C2    DV49  LFT_ATJUNCTION_T3_C1    DV64  RHT_NONE_T6_C1
DV9   ACC_FROMFULLSTOP_T1_C2    DV27  BRK_FULLSTOP_T6_C1    DV50  LFT_ATJUNCTION_T3_C2    DV65  RHT_NONE_T6_C2
DV10  ACC_FROMFULLSTOP_T2_C1    DV28  BRK_FULLSTOP_T6_C2    DV51  LFT_ATJUNCTION_T4_C1    DV66  RHT_ATJUNCTION_T1_C1
DV13  ACC_FROMFULLSTOP_T3_C2    DV29  BRK_SLOWDOWN_T1_C1    DV52  LFT_ATJUNCTION_T4_C2    DV67  RHT_ATJUNCTION_T1_C2
DV14  ACC_FROMFULLSTOP_T4_C1    DV31  BRK_SLOWDOWN_T2_C1    DV53  LFT_ATJUNCTION_T5_C1    DV68  RHT_ATJUNCTION_T2_C1
DV15  ACC_FROMFULLSTOP_T4_C2    DV32  BRK_SLOWDOWN_T2_C2    DV54  LFT_ATJUNCTION_T5_C2    DV69  RHT_ATJUNCTION_T2_C2
DV16  ACC_FROMFULLSTOP_T5_C1    DV33  BRK_SLOWDOWN_T4_C1    DV55  LFT_ATJUNCTION_T6_C1    DV71  RHT_ATJUNCTION_T3_C2
DV18  ACC_FROMFULLSTOP_T5_C2    DV34  BRK_SLOWDOWN_T4_C2    DV56  LFT_ATJUNCTION_T6_C2    DV72  RHT_ATJUNCTION_T4_C1
DV35  BRK_SLOWDOWN_T5_C1        DV73  RHT_ATJUNCTION_T4_C2
DV36  BRK_SLOWDOWN_T5_C2        DV74  RHT_ATJUNCTION_T5_C1
DV37  BRK_SLOWDOWN_T6_C1        DV75  RHT_ATJUNCTION_T5_C2
DV38  BRK_SLOWDOWN_T6_C2        DV76  RHT_ATJUNCTION_T6_C1
DV77  RHT_ATJUNCTION_T6_C2

Appendix D. Visualisation of Driver Variables

Appendix D.1. Driving Variables by Claim Frequency

Figure A2. Value against driver ID with colours showing claim frequency y_i = 0, 1, 2 for 65 DVs.

Appendix D.2. Correlation Matrix and Hierarchical Clustering of Driving Variables

Figure A3. Relationship between variables using correlation matrix and hierarchical clustering.

Appendix E. Parameter Estimates of All Models

Table A2. Parameter estimates β in (A1) for the stage 1 TP models before refit, with R = 100 subsamples of 70% of the data using Poisson glmnet; β^{T1} after refitting to the full data using Poisson glm on the selected DVs with I_j^{T1} > 62 (otherwise dropped, as indicated in grey highlight); and selection criteria I^{T1} in (A3). There are J^{T1} = 52 DVs selected for TPL-1 and J^{T1} = 39 DVs selected for TPA-1 under the columns β_j^{T1}. Values in bold with yellow highlight under β_j^{T1} are significant.
TPL-1
glmnet with 100 Repeatsglm glmnet with 100 Repeatsglm glmnet with 100 Repeatsglm
MeasuresMSEMAEDeviancePoissonMeasuresMSEMAEDeviancePoissonMeasuresMSEMAEDeviancePoisson
DVs I j T 1 β j β j T 1 DVs I j T 1 β j β j T 1 DVs I j T 1 β j β j T 1
1-347−0.0029-28312661−0.0031-55-119680.00820.0134
2892322280.01760.0159292272762790.02790.036056371361230.01490.0061
31803173370.04020.0619312513243370.04090.053557139310334−0.0446−0.1696
4140307334−0.0409−0.0987323023273410.04740.051358241471090.01150.0095
53109610.0085-331332732660.02560.039359682522290.02820.0448
7-136116−0.0021−0.083034-8520−0.0073-6095198191−0.0243−0.0091
879537−0.0011-351462452420.02220.0168612553173200.04260.0626
92723103200.04170.0546362623203340.05760.079763982422350.02190.0346
10-17---372923273340.04240.05186430194160−0.0264−0.0348
131014072−0.0154−0.0253381842902890.02380.026365-9941−0.0084-
14-10514−0.0031-39712322280.01220.016066411641330.01640.0173
151411968−0.0190−0.00654378204204−0.0381−0.050567329330341−0.1400−0.1706
163113650.0001−0.000444311241−0.0113-681710261−0.0135-
183163273410.09690.125445378480.0172-6917188140−0.0166−0.0320
19-85200.0021-46311941640.02100.036171782312240.01720.0185
201733103230.03630.056347177314330−0.0611−0.0918721873032970.03190.0418
22412051770.01330.03094995245252−0.0341−0.0448733023273410.05870.0743
23-13375−0.0051−0.023550722282180.01570.0205741022422420.02160.0350
241362722620.02130.023651116289306−0.0517−0.0944753363303410.06210.0659
25-11958−0.0087-52150307324−0.0397−0.06237648239235−0.0236−0.0565
26581601390.01290.002453651741570.01560.0107771573073240.03240.0549
2717160129−0.0205−0.0402541632622720.02090.0208
TPA-1
glmnet with 100 Repeatsglm glmnet with 100 Repeatsglm glmnet with 100 Repeatsglm
MeasuresMSEMAEDeviancePoissonMeasuresMSEMAEDeviancePoissonMeasuresMSEMAEDeviancePoisson
DVs I j T 1 β j β j T 1 DVs I j T 1 β j β j T 1 DVs I j T 1 β j β j T 1
134124−0.0712-28-79---55-41200.0030-
26195680.03600.0149292282792760.03110.0357561499610.0311-
31603173100.05120.0608312173163130.05140.05365778306327−0.0797−0.1726
489296310−0.0630−0.1035323193373400.05520.050358-3830.0297-
5-61100.0482-33782482140.03520.036359312041390.03740.0478
7-24---34-3817−0.0201-6058143126−0.0288−0.0093
8-3413−0.0067-351021901770.02580.0199611632832820.05950.0702
91873002890.05170.0579362483343300.07030.077563411941500.03580.0422
10-----372853343300.05130.0530642714389−0.0551−0.0317
13-4820−0.0144-381602312180.02980.027165-1710−0.0100-
14-62---397116710.01320.01636617109550.0221-
1538631−0.0200-4348235204−0.0577−0.051067336340340−0.1752−0.1686
16-143−0.0088-44-447−0.0294-68-8555−0.0201-
183333403400.12120.120545372440.0293-6979934−0.0334-
191065170.0146-4610130510.0410-71371291020.02910.0204
201122862720.04260.056747170327327−0.0773−0.0913721562692480.04790.0443
2220143850.03000.02824958194187−0.0470−0.0367732893403370.07100.0733
23-5514−0.0171-5027129920.02590.020674512281670.03010.0341
24581881470.02370.02305151282262−0.0718−0.0918753163403330.07480.0709
25-343−0.0843-52136306303−0.0493−0.06187641225184−0.0446−0.0565
262168380.0201-531071540.0182-771162902730.04570.0554
2710153105−0.0349−0.0359541091761830.02590.0256
Table A3. Parameter estimates β_{Lh}^{T2}, β_{Hh}^{T2} for the stage 2 TP models with R = 100 subsamples of 70% of the data. Parameters are based on the J^{T1} = 52 DVs from stage 1, and J^{T2} refers to the number of frequently selected DVs with I_{Lhj} > 43 (τ = 0.08), 49 (τ = 0.09), 53 (τ = 0.10), 56 (τ = 0.11) and I_{Hhj} > 19, 13, 9, and 6, respectively, which differ across thresholds. Significant β_{Lhj}^{T2}, β_{Hhj}^{T2} are in boldface with yellow highlight.
τ 0.08 : TPLA-2 τ 0.09 : TPLA-2 τ 0.10 : TPLN-2 τ 0.11 : TPLN-2
GroupsLow High Low High Low High Low High
R h 0.700.300.790.210.850.150.900.10
I h j 43194913539566
J 2 T 1722381442234328
DVs I L h j β L h j T 2 DVs I H h j β H h j T 2 DVs I L h j β L h j T 2 DVs I H h j β H h j T 2 DVs I L h j β L h j T 2 DVs I H h j β H h j T 2 DVs I L h j β L h j T 2 DVs I H h j β H h j T 2
2--2420.02772--2190.01692--2180.01082--250.0250
380.00933310.03563520.03313--31160.0515350.023531180.04033--
449−0.05674420.03664254−0.07144430.04664356−0.07484760.08644357−0.083341290.1443
7--7--7--7--7--7--7--7--
9950.063693−0.008193500.097693−0.041693480.0953910−0.042193570.0859915−0.0528
134−0.088213480.04101366−0.064813870.082413243−0.051313910.081413278−0.054513830.0667
1511−0.0010156−0.02051526−0.019315--1544−0.064315--1522−0.02861580.0355
1611−0.058416170.02521633−0.035316680.04241622−0.007716550.04261611−0.009716150.0219
18460.082018360.0608182100.08271830.0867183230.076418--183430.103118--
2015−0.005620450.035120400.042820--202070.043720--201710.03632030.0768
2272−0.063322890.04072215−0.04692280.02722236−0.01382250.01742247−0.021822180.0432
23110.00602360.019623330.0084233−0.0091231010.0292238−0.0442231390.02802340−0.0761
24350.064524110.0316241950.07272430.0456243480.090424--243390.084424--
261130.0577266−0.0608261990.05162611−0.0282261850.0377263−0.0094261540.0270268−0.0363
2761−0.050227340.04762795−0.0422275−0.026427214−0.04782710−0.03202775−0.02872733−0.0863
2912−0.074929950.02112948−0.049729320.0179291630.04762911−0.0494291570.035829130.0148
31120.056331170.021931740.04953130.0463312870.05553150.0323312860.054931--
32460.061432610.0320323320.111532--323450.0893338−0.0312323320.080132--
33230.049233--331470.052333--333090.056933--332640.0492335−0.0615
35--35250.034335850.027735--351270.026535--35610.02273550.0954
36690.05753630.0776361370.041136--362640.053636--363180.0591363−0.0304
372430.104737140.0276373540.133737--373550.123937--373570.115337--
3838−0.050238220.03043878−0.059638160.040138127−0.043738390.03863825−0.023738180.0239
3987−0.065239--39251−0.081939--39327−0.10153930.014839132−0.148239--
4327−0.03804360.01504359−0.05014313−0.03754365−0.02654326−0.08684382−0.02894383−0.0808
464−0.031046670.031046150.00884650.037646260.018846--46360.034546100.0868
4760−0.0564476−0.010347225−0.0711473−0.074547341−0.080447--47321−0.071347--
4915−0.037549--4959−0.03264911−0.05644954−0.02194975−0.075949193−0.03814926−0.0744
5015−0.039450110.03205029−0.054350--5087−0.030850130.02915064−0.024450220.0360
51152−0.0853511500.062551206−0.07695180.062351214−0.055551--51314−0.07765170.0969
524−0.00135245−0.082652151−0.0387528−0.076352268−0.04165216−0.020252293−0.0462527−0.0014
531520.073353--53950.04075330.01675351−0.029753100.0507531100.03805350.0094
54340.04245460.021054560.02025430.004754470.02565430.0332542320.039254--
55110.034055--551580.04835516−0.0460551520.03545513−0.0540552280.04915533−0.1115
56490.0443563−0.0516561220.0412563−0.0275561960.03275613−0.0398561320.04015620−0.0636
5715−0.062657--57214−0.067257920.074057337−0.06735749−0.197457346−0.07395745−0.2011
58490.04325878−0.054558740.02905857−0.054858910.03495896−0.0789581000.02805898−0.0941
59650.057159--592070.05665933−0.0480592940.05575991−0.0799592850.05525950−0.0947
6050−0.04136080.03456055−0.044260--60239−0.051460100.061560221−0.051460100.0957
6180.046561310.025861260.013561320.0445611490.048561--611460.05046180.0182
638−0.037863670.0422634−0.005463--6315−0.006863160.067563500.02196330.0811
6426−0.062964--6488−0.059364--64142−0.048964--64211−0.045864350.0833
66380.049966--662250.049666--662540.041666--662110.0400663−0.0610
6731−0.042067176−0.140167310−0.091267187−0.146167363−0.127567112−0.097767357−0.14186797−0.1485
6919−0.040769760.050869121−0.052369--6998−0.035369--6946−0.0232695−0.0503
71300.0396716−0.0682711920.0464718−0.0405711750.0328713−0.0374711000.0228712−0.0005
7280.032672110.026772410.026572--72760.023672--721040.021972100.0867
73420.059473500.0344731850.07477350.0291732680.07097350.0354733210.07097320.0015
74--74170.0248741220.04587430.0242742760.061174--741880.044674--
75150.0488751070.0456752540.070775160.0436753010.07307530.0451753320.084475--
7627−0.0662766−0.034376151−0.0588765−0.003576276−0.054976100.077976271−0.049176270.0867
7770.040077140.020977260.03737750.033077360.007577240.0574771000.03917750.0448
Table A4. Parameter estimates β_0, β_c for the ZIP model before refit; β_L^M, β_H^M for the PM models and β_0^Z, β_c^Z for the ZIP models after refitting to all data based on the selected DVs; and selection criteria I_L^M, I_H^M, I_0^Z, I_c^Z, with R = 100 subsamples of 70% of the data. For the PM models, J_L^M, J_H^M refer to the numbers of frequently selected DVs with I_{Lj}^M > 43 (PML), 45 (PMA) and I_{Hj}^M > 19 (PML), 17 (PMA); otherwise, they are dropped, as in grey highlight. For the ZIP model, J_0^Z, J_c^Z refer to the numbers of frequently selected DVs with I_{0j}^Z > 62 and I_{cj}^Z > 62; otherwise, β_{0j} and β_{cj} are excluded, as in grey highlight. Significant parameters β_{Lj}^M, β_{Hj}^M, β_{0j}^Z, β_{cj}^Z are boldfaced and yellow highlighted.
PMLPMAZIPA
I j M 4319 I j M 4517 I j Z 6262
J M 4518 J M 3940 J Z 445
DVsLowHighDVsLowHighDVsZeroCount
I L j M β L j M I H j M β H j M I L j M β L j M I H j M β H j M I 0 j Z β 0 j β 0 j Z I c j Z β c j β c j Z
31180.0182470.09833710.03801260.1182344−0.0022-2910.04040.0514
988−0.0003200.0275927−0.0226160.0106920−0.0047-1720.01700.0450
183240.0636270.0477182060.08771490.107818---880.00440.1352
193110.049130.1058192000.0877820.08211910−0.0003-1360.0050−0.0263
2078−0.0197370.09192071−0.044327−0.05322034−0.0053-2910.05520.0717
221620.0098540.06332291−0.04841010.08392247−0.0104-2170.03330.0441
23335−0.207624−0.263423219−0.2801159−0.273223---84−0.0025−0.0247
262500.0259980.0370262120.0375810.01192631.91 × 10−5-1250.00690.0014
273380.045030.0068272510.0755600.057627100.0004-339−0.1900−0.0495
293240.0355170.0633291720.0663790.07222924−0.0005-2670.01820.0363
312940.0352140.0390311650.0637870.04963120−0.0024-2340.02650.0515
33138−0.0150--3350−0.051613−0.02653355−0.0123-2130.02430.0426
342870.036970.1254341270.0586570.0533343−0.0001-88−0.0022−0.0234
353310.057870.0130352990.1089430.07353520−0.0029-1320.00990.0232
363350.0501310.0450362140.06901200.06423630−0.0103-2810.04640.0752
373040.0284130.0522371830.0416510.05673710−0.0009-2980.04290.0524
38230−0.051670.000238121−0.1553400.00173898−0.0265-39.06181210.0076−0.0065
4388−0.02673−0.02384336−0.029644−0.070343170.0008-173−0.0397−0.0471
44570.0078--44540.0575370.093244---155−0.0099−0.0287
452740.0541240.0395452490.1177790.09824510−0.0005-1530.00900.0188
46338−0.095714−0.044246259−0.166578−0.10814630−0.0042-2640.05690.0362
47338−0.164420−0.046947292−0.291477−0.14934730.0000-322−0.0843−0.0940
492490.0241100.023549910.0430230.0378491650.0366−0.0193251−0.1070−0.0597
50331−0.089824−0.044750197−0.1691126−0.141950---1220.00770.0145
51335−0.098114−0.038151225−0.1602119−0.137851170.0004-301−0.0865−0.0868
523140.0265370.026552930.0447810.053852100.0023-311−0.0738−0.0660
54840.0134--54400.011170.04965417−0.0013-2130.01970.0182
551120.018370.019655330.002714−0.00535570.0001-880.00160.0109
561420.0193440.063656330.04781140.0780563−0.0003-750.00120.0002
582400.0298170.0632581540.0718620.0692583−0.0001-1280.00870.0157
59294−0.043317−0.081759115−0.143173−0.15435924−0.0035-2000.01890.0409
603180.0574240.0934601830.11421620.094560130.0024-231−0.0259−0.0053
612230.029730.0193611670.0773370.05146120−0.0032-2440.04030.0560
63189−0.065220−0.003163162−0.1524125−0.05836337−0.0057-980.00880.0362
641620.013270.034664710.023170.064164---139−0.0220−0.0409
66338−0.16897−0.053866297−0.305464−0.1943667−0.0001-810.00500.0175
6788−0.0263--6737−0.0730400.041667200.0044-339−0.1312−0.1510
68210−0.02657−0.01026890−0.094052−0.13976870.0001-67−0.0013−0.0041
691990.0210170.061169950.0537260.04206970.0005-180−0.0190−0.0286
712700.0388140.0725711410.0869760.06867134−0.0040-1450.00980.0114
723180.04731830.1172722170.07741940.09507230−0.0040-2680.03380.0391
733350.0631880.1193732420.10721090.10647375−0.0129−0.13232880.03910.0617
75321−0.054527−0.062275142−0.1177136−0.16457520−0.0022-3010.05110.0634
762570.0361100.0327761610.0722830.08277670.0009-244−0.0380−0.0656
772840.0324170.0370771700.0740740.06157798−0.0189−34.97261490.0132−0.0313

References

  1. Ayuso, Mercedes, Montserrat Guillen, and Jens Perch Nielsen. 2019. Improving automobile insurance ratemaking using telematics: Incorporating mileage and driver behaviour data. Transportation 46: 735–52. [Google Scholar] [CrossRef]
  2. Banerjee, Prithish, Broti Garai, Himel Mallick, Shrabanti Chowdhury, and Saptarshi Chatterjee. 2018. A note on the adaptive lasso for zero-inflated Poisson regression. Journal of Probability and Statistics 2018: 2834183. [Google Scholar] [CrossRef]
  3. Barry, Laurence, and Arthur Charpentier. 2020. Personalization as a promise: Can big data change the practice of insurance? Big Data & Society 7: 2053951720935143. [Google Scholar]
  4. Bhattacharya, Sakyajit, and Paul D. McNicholas. 2014. An adaptive lasso-penalized BIC. arXiv arXiv:1406.1332. [Google Scholar]
  5. Bolderdijk, Jan Willem, Jasper Knockaert, E. M. Steg, and Erik T. Verhoef. 2011. Effects of Pay-As-You-Drive vehicle insurance on young drivers’ speed choice: Results of a Dutch field experiment. Accident Analysis & Prevention 43: 1181–86. [Google Scholar]
  6. Cameron, A. Colin, and Pravin K. Trivedi. 1990. Regression-based tests for overdispersion in the Poisson model. Journal of Econometrics 46: 347–64. [Google Scholar] [CrossRef]
  7. Chan, Jennifer S. K., S. T. Boris Choy, Udi Makov, Ariel Shamir, and Vered Shapovalov. 2022. Variable selection algorithm for a mixture of Poisson regression for handling overdispersion in claims frequency modeling using telematics car driving data. Risks 10: 83. [Google Scholar] [CrossRef]
  8. Chassagnon, Arnold, and Pierre-André Chiappori. 1997. Insurance under Moral Hazard and Adverse Selection: The Case of Pure Competition. Delta-CREST Document. Available online: https://econpapers.repec.org/paper/fthlavale/28.htm (accessed on 1 August 2024).
  9. Czado, Claudia, Tilmann Gneiting, and Leonhard Held. 2009. Predictive model assessment for count data. Biometrics 65: 1254–61. [Google Scholar] [CrossRef]
  10. Dean, Curtis Gary. 1997. An introduction to credibility. In Casualty Actuary Forum. Arlington: Casualty Actuarial Society, pp. 55–66. Available online: https://www.casact.org/sites/default/files/database/forum_97wforum_97wf055.pdf (accessed on 1 August 2024).
  11. Deng, Min, Mostafa S. Aminzadeh, and Banghee So. 2024. Inference for the parameters of a zero-inflated Poisson predictive model. Risks 12: 104. [Google Scholar] [CrossRef]
  12. Duval, Francis, Jean-Philippe Boucher, and Mathieu Pigeon. 2023. Enhancing claim classification with feature extraction from anomaly-detection-derived routine and peculiarity profiles. Journal of Risk and Insurance 90: 421–58. [Google Scholar] [CrossRef]
  13. Eling, Martin, and Mirko Kraft. 2020. The impact of telematics on the insurability of risks. The Journal of Risk Finance 21: 77–109. [Google Scholar] [CrossRef]
  14. Ellison, Adrian B., Michiel C. J. Bliemer, and Stephen P. Greaves. 2015. Evaluating changes in driver behaviour: A risk profiling approach. Accident Analysis & Prevention 75: 298–309. [Google Scholar]
  15. Fan, Jianqing, and Runze Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96: 1348–60. [Google Scholar] [CrossRef]
  16. Fawcett, Tom. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27: 861–74. [Google Scholar]
  17. Gao, Guangyuan, and Mario V. Wüthrich. 2018. Feature extraction from telematics car driving heatmaps. European Actuarial Journal 8: 383–406. [Google Scholar] [CrossRef]
  18. Gao, Guangyuan, Mario V. Wüthrich, and Hanfang Yang. 2019. Evaluation of driving risk at different speeds. Insurance: Mathematics and Economics 88: 108–19. [Google Scholar] [CrossRef]
  19. Gao, Guangyuan, Shengwang Meng, and Mario V. Wüthrich. 2019. Claims frequency modeling using telematics car driving data. Scandinavian Actuarial Journal 2019: 143–62. [Google Scholar] [CrossRef]
  20. Guillen, Montserrat, Jens Perch Nielsen, Ana M. Pérez-Marín, and Valandis Elpidorou. 2020. Can automobile insurance telematics predict the risk of near-miss events? North American Actuarial Journal 24: 141–52. [Google Scholar] [CrossRef]
  21. Guillen, Montserrat, Jens Perch Nielsen, and Ana M. Pérez-Marín. 2021. Near-miss telematics in motor insurance. Journal of Risk and Insurance 88: 569–89. [Google Scholar] [CrossRef]
  22. Guillen, Montserrat, Jens Perch Nielsen, Mercedes Ayuso, and Ana M. Pérez-Marín. 2019. The use of telematics devices to improve automobile insurance rates. Risk Analysis 39: 662–72. [Google Scholar] [CrossRef]
  23. Huang, Yifan, and Shengwang Meng. 2019. Automobile insurance classification ratemaking based on telematics driving data. Decision Support Systems 127: 113156. [Google Scholar] [CrossRef]
  24. Hurley, Rich, Peter Evans, and Arun Menon. 2015. Insurance Disrupted: General Insurance in a Connected World. London: The Creative Studio, Deloitte. [Google Scholar]
  25. Jeong, Himchan. 2022. Dimension reduction techniques for summarized telematics data. The Journal of Risk Management 33: 1–24. [Google Scholar] [CrossRef]
  26. Jeong, Himchan, and Emiliano A. Valdez. 2018. Ratemaking Application of Bayesian LASSO with Conjugate Hyperprior. Available online: https://ssrn.com/abstract=3251623 (accessed on 1 December 2018).
  27. Kantor, S., and Tomas Stárek. 2014. Design of algorithms for payment telematics systems evaluating driver’s driving style. Transactions on Transport Sciences 7: 9. [Google Scholar] [CrossRef]
  28. Lambert, Diane. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14. [Google Scholar] [CrossRef]
  29. Leisch, Friedrich. 2004. FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11: 1–18. [Google Scholar] [CrossRef]
  30. Ma, Yu-Luen, Xiaoyu Zhu, Xianbiao Hu, and Yi-Chang Chiu. 2018. The use of context-sensitive insurance telematics data in auto insurance rate making. Transportation Research Part A: Policy and Practice 113: 243–58. [Google Scholar] [CrossRef]
  31. Makov, Udi, and Jim Weiss. 2016. Predictive modeling for usage-based auto insurance. Predictive Modeling Applications in Actuarial Science 2: 290. [Google Scholar]
  32. Meinshausen, Nicolai, and Peter Bühlmann. 2006. Variable selection and high-dimensional graphs with the lasso. Annals of Statistics 34: 1436–62. [Google Scholar] [CrossRef]
  33. Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press. [Google Scholar]
  34. Osafune, Tatsuaki, Toshimitsu Takahashi, Noboru Kiyama, Tsuneo Sobue, Hirozumi Yamaguchi, and Teruo Higashino. 2017. Analysis of accident risks from driving behaviors. International Journal of Intelligent Transportation Systems Research 5: 192–202. [Google Scholar] [CrossRef]
  35. Paefgen, Johannes, Thorsten Staake, and Frédéric Thiesse. 2013. Evaluation and aggregation of Pay-As-You-Drive insurance rate factors: A classification analysis approach. Decision Support Systems 56: 192–201. [Google Scholar] [CrossRef]
  36. Park, Trevor, and George Casella. 2008. The Bayesian lasso. Journal of the American Statistical Association 103: 681–86. [Google Scholar] [CrossRef]
  37. Shannon, Claude Elwood. 2001. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5: 3–55. [Google Scholar] [CrossRef]
  38. So, Banghee, Jean-Philippe Boucher, and Emiliano A. Valdez. 2021. Cost-sensitive multi-class adaboost for understanding driving behavior based on telematics. ASTIN Bulletin: The Journal of the IAA 51: 719–51. [Google Scholar] [CrossRef]
  39. Soleymanian, Miremad, Charles B. Weinberg, and Ting Zhu. 2019. Sensor data and behavioral tracking: Does usage-based auto insurance benefit drivers? Marketing Science 38: 21–43. [Google Scholar] [CrossRef]
  40. Städler, Nicolas, Peter Bühlmann, and Sara Van De Geer. 2010. L1-penalization for mixture regression models. TEST: An Official Journal of the Spanish Society of Statistics and Operations Research 19: 209–56. [Google Scholar] [CrossRef]
  41. Stipancic, Joshua, Luis Miranda-Moreno, and Nicolas Saunier. 2018. Vehicle manoeuvers as surrogate safety measures: Extracting data from the GPS-enabled smartphones of regular drivers. Accident Analysis & Prevention 115: 160–69. [Google Scholar]
  42. Tang, Yanlin, Liya Xiang, and Zhongyi Zhu. 2014. Risk factor selection in rate making: EM adaptive lasso for zero-inflated Poisson regression models. Risk Analysis 34: 1112–27. [Google Scholar] [CrossRef]
  43. Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58: 267–88. [Google Scholar] [CrossRef]
  44. Tselentis, Dimitrios I., George Yannis, and Eleni I. Vlahogianni. 2016. Innovative insurance schemes: Pay As/How You Drive. Transportation Research Procedia 14: 362–71. [Google Scholar] [CrossRef]
  45. Verbelen, Roel, Katrien Antonio, and Gerda Claeskens. 2018. Unravelling the predictive power of telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C (Applied Statistics) 67: 1275–304. [Google Scholar] [CrossRef]
  46. Weerasinghe, K. P. M. L., and M. C. Wijegunasekara. 2016. A comparative study of data mining algorithms in the prediction of auto insurance claims. European International Journal of Science and Technology 5: 47–54. [Google Scholar]
  47. Winlaw, Manda, Stefan H. Steiner, R. Jock MacKay, and Allaa R. Hilal. 2019. Using telematics data to find risky driver behaviour. Accident Analysis & Prevention 131: 131–36. [Google Scholar]
  48. Wouters, Peter I. J., and John M. J. Bos. 2000. Traffic accident reduction by monitoring driver behaviour with in-car data recorders. Accident Analysis & Prevention 32: 643–50. [Google Scholar]
  49. Wüthrich, Mario V. 2017. Covariate selection from telematics car driving data. European Actuarial Journal 7: 89–108. [Google Scholar] [CrossRef]
  50. Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. Regression models for count data in R. Journal of Statistical Software 27: 1–25. [Google Scholar] [CrossRef]
  51. Zou, Hui. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101: 1418–29. [Google Scholar] [CrossRef]
  52. Zou, Hui, and Hao Helen Zhang. 2009. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics 37: 1733. [Google Scholar] [CrossRef]
  53. Zou, Hui, and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67: 301–20. [Google Scholar] [CrossRef]
Figure 1. Histogram of claims (left), exposure (mid), and claims per exposure (right).
Figure 2. Heat map of coefficients, with significant values denoted by “S”.
Figure 3. Scatter plots of observed annual claim frequencies a_i against predicted annual claim frequencies â_i using the TPA-1 model, cross-classified into low- and high-claim groups by the four thresholds, with colour indicating claims y_i = 0, 1, 2.
Figure 4. K-means clustering analysis segmenting drivers into the low-claim cluster (red, with blue shaded ellipse) and the high-claim cluster (blue, with red shaded ellipse) for the TP and PM models.
Figure 5. ROC curves and AUC values for (a–d) the four best stage 2 TP models; (e) the PMA model; and (f) the four best stage 2 TP models and one PM model.
Table 1. Model names for TP, PM, and ZIP models with different lasso regularisations.

| Regularisation | Stage 1 Threshold Poisson | Stage 2 Threshold Poisson | Poisson Mixture | Zero-Inflated Poisson |
|---|---|---|---|---|
| Lasso | TPL-1 | TPLL-2, TPAL-2 | PML | ZIPL |
| Elastic net | TPE-1 | TPLE-2, TPAE-2 | PME | — |
| Adaptive lasso | TPA-1 | TPLA-2, TPAA-2 | PMA | ZIPA |
| Adaptive elastic net | TPN-1 | TPLN-2, TPAN-2 | PMN | — |
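To make the model families in Table 1 concrete, the sketch below fits a lasso-penalised Poisson regression with exposure offsets, then an adaptive lasso via the standard column-rescaling trick. It is a minimal illustration on simulated data; statsmodels is a convenient stand-in here, not the tooling or tuning used in the study.

```python
import numpy as np
import statsmodels.api as sm

# Stand-in design matrix, exposures, and claim counts (not the study's data).
rng = np.random.default_rng(3)
n, p = 1000, 20
X = rng.normal(size=(n, p))
exposure = rng.uniform(0.1, 1.0, size=n)
beta = np.zeros(p); beta[:3] = [0.4, -0.3, 0.2]   # sparse true coefficients
y = rng.poisson(exposure * np.exp(-2.5 + X @ beta))

# Lasso-penalised Poisson regression: L1_wt=1 gives the pure lasso,
# 0 < L1_wt < 1 gives an elastic net. (This sketch also penalises the
# intercept, which a careful implementation would avoid.)
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson(),
               exposure=exposure)
fit_lasso = model.fit_regularized(method="elastic_net", alpha=0.01, L1_wt=1.0)

# Adaptive lasso: weight each coefficient's penalty by an initial
# unpenalised estimate, implemented by rescaling columns before refitting.
b0 = model.fit().params[1:]
w = 1.0 / (np.abs(b0) + 1e-8)
fit_ada = sm.GLM(y, sm.add_constant(X / w), family=sm.families.Poisson(),
                 exposure=exposure).fit_regularized(
                     method="elastic_net", alpha=0.01, L1_wt=1.0)
beta_ada = fit_ada.params[1:] / w                 # undo the rescaling
print("selected coefficients:", np.flatnonzero(np.abs(beta_ada) > 1e-8))
```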
Table 2. Identification of informative DVs. DVs with one asterisk have $H_j \geq 1$ and $S_j \geq 1\%$. DVs with two asterisks have $IG_j > 0$, indicating information gain.

| DV | $\rho$ | $H_j$ | $S_j$ (%) | $IG_j$ | DV | $\rho$ | $H_j$ | $S_j$ (%) | $IG_j$ | DV | $\rho$ | $H_j$ | $S_j$ (%) | $IG_j$ | DV | $\rho$ | $H_j$ | $S_j$ (%) | $IG_j$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | −0.002 | 0.08 | 0.012 | 0 | 22 * | 0.003 | 89.45 | 12.991 | 0 | 39 | 0.008 | 0.42 | 0.076 | 0 | 59 * | 0.002 | 45.89 | 8.312 | 0 |
| 2 | 0.012 | 0.02 | 0.002 | 0 | 23 * | −0.001 | 87.75 | 12.871 | 0 | 43 ** | −0.053 | 99.99 | 13.789 | 0.002 | 60 ** | −0.041 | 99.93 | 13.787 | 0.001 |
| 3 * | 0.018 | 7.04 | 1.231 | 0 | 24 | 0.017 | 1.02 | 0.192 | 0 | 44 * | −0.004 | 19.69 | 3.795 | 0 | 61 * | 0.018 | 35.71 | 6.472 | 0 |
| 4 | −0.011 | 1.91 | 0.310 | 0 | 25 | −0.002 | 0.30 | 0.046 | 0 | 45 * | 0.002 | 18.61 | 3.678 | 0 | 63 * | 0.006 | 32.91 | 6.462 | 0 |
| 5 | 0.003 | 0.79 | 0.119 | 0 | 26 * | 0.005 | 17.50 | 3.990 | 0 | 46 * | −0.003 | 90.51 | 13.008 | 0 | 64 * | −0.021 | 61.78 | 10.149 | 0 |
| 7 | −0.002 | 0.01 | 0.001 | 0 | 27 ** | −0.060 | 99.69 | 13.773 | 0.002 | 47 * | −0.035 | 92.68 | 13.246 | 0 | 65 | 0.0005 | 1.22 | 0.295 | 0 |
| 8 | 0.004 | 0.10 | 0.014 | 0 | 28 | −0.004 | 0.03 | 0.003 | 0 | 49 ** | −0.061 | 99.98 | 13.789 | 0.002 | 66 * | 0.008 | 4.41 | 1.339 | 0 |
| 9 * | 0.010 | 28.69 | 5.666 | 0 | 29 * | 0.023 | 4.41 | 1.288 | 0 | 50 * | 0.012 | 6.65 | 1.247 | 0 | 67 ** | −0.060 | 99.54 | 13.766 | 0.002 |
| 10 | −0.002 | 0.01 | 0.001 | 0 | 31 * | 0.014 | 15.93 | 3.698 | 0 | 51 * | −0.025 | 67.41 | 10.718 | 0 | 68 * | −0.019 | 76.17 | 11.953 | 0 |
| 13 | −0.003 | 0.45 | 0.069 | 0 | 32 | 0.024 | 74.41 | 1.229 | 0 | 52 * | −0.039 | 94.18 | 13.357 | 0 | 69 * | −0.007 | 7.83 | 1.895 | 0 |
| 14 | −0.006 | 0.06 | 0.009 | 0 | 33 * | 0.011 | 39.01 | 7.957 | 0 | 53 | 0.015 | 3.00 | 0.645 | 0 | 71 * | 0.006 | 32.11 | 6.585 | 0 |
| 15 | −0.0001 | 0.50 | 0.076 | 0 | 34 * | −0.001 | 21.44 | 5.114 | 0 | 54 * | 0.023 | 5.03 | 1.161 | 0 | 72 * | 0.007 | 41.24 | 7.861 | 0 |
| 16 | 0.006 | 0.24 | 0.036 | 0 | 35 * | 0.010 | 35.54 | 7.257 | 0 | 55 * | −0.002 | 21.21 | 4.424 | 0 | 73 * | 0.023 | 11.09 | 2.775 | 0 |
| 18 ** | −0.010 | 99.90 | 13.785 | 0.001 | 36 * | 0.009 | 54.80 | 9.701 | 0 | 56 * | 0.001 | 34.25 | 6.654 | 0 | 74 | 0.013 | 1.03 | 0.222 | 0 |
| 19 * | 0.003 | 77.88 | 12.129 | 0 | 37 * | 0.024 | 2.40 | 0.856 | 0 | 57 | −0.012 | 1.23 | 0.229 | 0 | 75 * | 0.029 | 10.85 | 2.669 | 0 |
| 20 * | 0.006 | 67.10 | 10.980 | 0 | 38 * | 0.022 | 3.84 | 1.355 | 0 | 58 * | −0.008 | 61.74 | 10.378 | 0 | 76 * | −0.021 | 35.50 | 7.043 | 0 |
| 77 * | 0.011 | 13.61 | 3.354 | 0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
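The flagging rule in the caption of Table 2 is mechanical once the summary statistics are available. The sketch below applies it to a few stand-in values of $H_j$, $S_j$, and $IG_j$; computing those statistics themselves depends on the screening procedure described in the paper and is not shown here.

```python
import numpy as np

# Stand-in per-DV summary statistics (illustrative values only).
H = np.array([0.08, 89.45, 7.04, 99.99])     # H_j
S = np.array([0.012, 12.991, 1.231, 13.789]) # S_j in percent
IG = np.array([0.0, 0.0, 0.0, 0.002])        # information gain IG_j

one_star = (H >= 1) & (S >= 1)   # informative DVs per the caption's rule
two_star = IG > 0                # DVs with positive information gain
flags = np.where(two_star, "**", np.where(one_star, "*", ""))
print(flags)  # ['' '*' '*' '**']
```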
Table 4. Summary of assumptions in a case study, in thousand dollars.

| | Driver $i$ (Safe) | Safe Group | Risky Group |
|---|---|---|---|
| Average annual premium $\bar{P}_t^g$ | – | 0.3 | 0.5 |
| Historical annual premium $P_{i,t-1}$, $P_{t-1}^g$ | 0.5 | 0.31 | 0.51 |
| Historical annual claims $L_{i,t-1}$, $L_{t-1}^g$ | 0.2 | 0.1 | 0.3 |
| Predicted annual claim frequencies $\hat{y}_{i,t-1}$, $\hat{y}_{t-1}^g$ | 0.15 | 0.105 | 0.305 |
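To make Table 4 concrete, the sketch below works through one plausible experience-rating update using the tabulated assumptions. The relativity formula shown is a deliberately simplified illustration, not the pricing method proposed in the paper.

```python
# Quantities taken from Table 4 (in thousand dollars / per-year frequencies).
P_bar_safe = 0.3       # average annual premium, safe group
y_hat_driver = 0.15    # predicted annual claim frequency, driver i
y_hat_safe = 0.105     # predicted annual claim frequency, safe group

# Illustrative relativity-based update (an assumption, not the paper's
# formula): scale the group premium by the driver's predicted claim
# frequency relative to the group average.
relativity = y_hat_driver / y_hat_safe          # ≈ 1.429
premium = P_bar_safe * relativity
print(f"updated premium: {premium:.3f} thousand dollars")  # ≈ 0.429
```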