1. Introduction
Statistical learning has been widely used in actuarial science since the 1980s. In the pricing domain, actuaries rapidly embraced linear models and generalized linear models (GLMs), which became the standard approach for pricing models. Meanwhile, advancements in statistical learning and computer science have led to the development of more sophisticated machine learning techniques.
A seminal work encompassing the use of several machine learning algorithms for insurance ratemaking was put forward by Dugas et al. (2003), where the authors compare linear regression, generalized linear models, tree-based models, neural networks, and support vector machines. As noted in Blier-Wong et al. (2021), Dugas et al. (2003) concluded with the hope that their work would encourage actuaries to adopt neural networks as a modeling tool for ratemaking. Just over a decade later, machine learning algorithms—especially neural networks—have become increasingly common in actuarial research and practice. For instance, Guelman (2012) uses gradient boosting for auto insurance cost modeling, while Spedicato et al. (2018) explores insurance pricing optimization with multiple machine learning models. Ghahari et al. (2019) applies deep learning to agricultural risk management, and Lee and Lin (2018) introduces a boosting machine for general insurance. Similarly, Henckaerts et al. (2021) leverages tree-based techniques like GBM for car insurance tariffs, and Schelldorfer and Wüthrich (2019) employs neural networks to enhance GLM performance in non-life insurance. Additionally, Wüthrich (2020) suggests methods to address bias in neural network models for insurance portfolios. Machine learning has also been extensively used in insurance claim reserving. Gabrielli et al. (2020) improves the performance of the over-dispersed Poisson model for general insurance claim reserving using neural network embeddings, while Wüthrich (2018) explores various machine learning techniques to assess individual claim reserving.
Neural networks have also been applied in actuarial research for various tasks. Chapados et al. (2002) compared different statistical learning models for estimating the pure premium, while Rajitha and Sakthivel (2017) used neural networks to estimate a posteriori claim frequency. The renewed interest in neural networks for actuarial pricing is largely driven by M. Wüthrich and the Swiss actuarial community. Wüthrich and Merz (2019) first introduced a Poisson neural network to estimate claim frequencies in a car insurance dataset. Later, Schelldorfer and Wüthrich (2019) proposed the Combined Actuarial Neural Network (CANN), an innovative approach that integrates a GLM with a neural network to capture nonlinear relationships. Wüthrich (2019) further explored the CANN methodology, while Wüthrich (2020) addressed bias regularization in neural network models. Additionally, Lorentzen and Mayer (2020) introduced a set of model-agnostic tools to extract interpretable insights from neural networks. For a comprehensive review of machine learning applications in actuarial sciences, see Blier-Wong et al. (2021) and Richman (2021a, 2021b).
While neural networks have been explored in insurance pricing, most research has focused on property and casualty insurance, with limited attention to health insurance applications. Some works in the literature address health insurance pricing using tree-based techniques, such as XGBoost, random forests, and decision trees, or other machine learning methods (k-nearest neighbours, support vector machines).
Duncan et al. (2016) test various regression models, including random forests, decision trees, and boosted trees, directly modeling total allowed health care costs. ul Hassan et al. (2021) use machine learning techniques to predict medical insurance costs, without accounting for the claim frequency. Orji and Ukwandu (2024) leverage ensemble machine learning methods—Extreme Gradient Boosting, Gradient Boosting Machine, and Random Forest—to predict medical insurance costs. Kaushik et al. (2022) use neural networks to predict health insurance premiums and costs, but do not consider the claim frequency. Our work aims to bridge the gap in applying neural networks to health insurance by demonstrating how they can be used for pricing coverage within the framework of a classical frequency-severity model.
Neural networks offer significant advantages over traditional machine learning models in health insurance pricing. Health insurance data are characterized by complex interdependencies among factors such as medical history, demographics, and claims information. Neural networks excel at capturing these nonlinear relationships, ensuring a more precise representation of risk and cost patterns than traditional machine learning algorithms (Talaei Khoei et al. 2023). They eliminate the need for extensive manual feature engineering, as they automatically extract relevant features from raw data. In contrast, traditional machine learning algorithms often require significant manual feature selection. While such algorithms can perform well with smaller datasets and are generally more interpretable, they may not capture intricate, nonlinear relationships within large datasets.
Accuracy in predictive modeling is crucial for health insurers, particularly in pricing, underwriting, and assessing healthcare costs (Drewe-Boss et al. 2022; Kaushik et al. 2022). Neural networks have demonstrated superior predictive performance in insurance pricing, leveraging large-scale datasets to refine risk assessments (Holvoet et al. 2025). Their adaptability further allows them to process heterogeneous data sources effectively.
While neural networks have often been regarded as “black-box” models, explainable AI (XAI) addresses this limitation by identifying the key factors driving predictions, improving transparency and building trust, which are essential for ensuring regulatory compliance and facilitating business adoption in the insurance industry.
Health insurance contracts often cover a wide range of correlated medical events, such as medical visits and diagnostic tests. For example, a diagnostic test often requires a referral from a prior medical visit. To account for these dependencies, we adopt a multivariate approach to model medical claims. We introduce a negative multinomial neural network to model the frequency of correlated medical claims jointly. This approach is novel in insurance pricing, as most existing studies rely on a univariate Poisson distribution, which is more suited to car insurance claims. While Jose et al. (2022) previously proposed a negative binomial model, their approach remains limited to the univariate case. We compare the performance of the proposed model against the estimates produced by a negative multinomial regression (see Zhang et al. 2017). We then use a Gamma neural network to estimate the expected claim severity, i.e., the average cost of a given claim, already introduced in a different insurance domain by Delong et al. (2021). Using a neural network approach appears to be particularly appealing in health insurance, since the number of claims (and thus the size of the data) is usually larger than in other insurance branches. Moreover, we deepen the understanding of our neural network models by applying a set of model-agnostic tools (XAI) proposed in Lorentzen and Mayer (2020), which allow us to shed light on the data representation learned by the models. The premiums estimated by neural networks are then compared to those provided by the simpler regression models through the methods proposed in Denuit et al. (2019). Our analysis is carried out on a health claim dataset provided by a primary Italian health insurance company.
In summary, our study provides the following contributions:
Neural network implementation within a classical frequency-severity framework: A key contribution of this study is the integration of neural networks into the traditional frequency-severity structure, maintaining a model structure that is familiar and interpretable from an actuarial perspective.
Use of negative multinomial neural networks for correlated claims modeling: The paper introduces a negative multinomial neural network to model the frequency of correlated medical claims (e.g., visits and diagnostics) jointly, offering a more realistic representation of health insurance claims processes compared to the widely used univariate Poisson models.
Accounting for modeling claim dependencies: Using a multivariate approach, the model explicitly captures dependencies among different types of medical claims, which, to our knowledge, has not been previously explored in health insurance.
Gamma neural network for claim severity: We propose Gamma neural networks to model claim severity, complementing the frequency model and enabling a full end-to-end neural network-based pricing model.
Empirical assessment on real-world health insurance data: The proposed models are validated on a real-world dataset, showing superior performance over traditional regression-based methods.
Model interpretability through XAI tools: We also deepen the understanding of the models’ internal representations by applying a set of XAI tools, allowing us to investigate the data representation learned by the neural networks and identify key drivers of model predictions.
The remainder of the paper is structured as follows: In Section 2, we briefly review the frequency-severity approach and detail the proposed models based on neural networks; Section 3 is devoted to data description; Section 4 reports the results obtained using the proposed models; Section 5 is devoted to ratemaking; Section 6 concludes the paper.
2. Frequency-Severity Approach and Proposed Models
Setting the pure premium of an insurance policy mainly consists of evaluating the cost associated with the risk coverage provided by the insurance contract. Therefore, the insurer has to predict the expected total claim amount $E[S_i]$ for each policyholder through a predictive model $\mu(\cdot)$ mapping the policyholder risk factors $\boldsymbol{x}_i$ to the predicted loss cost: $\mu(\boldsymbol{x}_i) = E[S_i \mid \boldsymbol{x}_i]$. A popular method for health insurance cost modeling is to consider frequency-severity modeling (see Frees et al. 2011), which is the primary statistical approach for modeling non-life insurance claims. This approach splits the total claim amount for a given policyholder $i$ into a compound sum that accounts for the number of filed claims and determines the individual medical claim sizes. Thus, the total claim amount $S_i$ is represented by a compound random variable, with $N_i$ describing the number of claims that occur over one year to the generic policyholder $i$ and $Y_{i,1}, \ldots, Y_{i,N_i}$ describing the i.i.d. individual claim sizes, defined as claim severities. More formally:
$$S_i = \sum_{k=1}^{N_i} Y_{i,k}. \tag{1}$$
Assuming the independence between $N_i$ and the $Y_{i,k}$, we have that:
$$E[S_i] = E[N_i]\, E[Y_i], \tag{2}$$
where $Y_i$ is the cost for a generic claim filed by policyholder $i$ (see Klugman et al. 2012 for further details). Health insurance pricing, defined in terms of the pure premium, involves estimating the expected total claim amount $E[S_i]$ for an insurance policy over the course of one year, based on a set of risk factors $\boldsymbol{x}_i$. In the following, we illustrate how to model the claim frequency $N_i$ and the claim severity $Y_i$ for a set of different claim types using neural networks, comparing the results with those of other regression models.
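As a quick numerical illustration of the frequency-severity decomposition, the following Python sketch (ours, not the authors' R implementation; the distributions and parameter values are arbitrary) simulates compound totals and checks that their sample mean is close to E[N]·E[Y]:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_totals(n_policies, mean_freq, mean_sev):
    """Simulate S_i = sum_{k=1}^{N_i} Y_ik with Poisson counts and Gamma severities.

    N_i ~ Poisson(mean_freq); Y_ik ~ Gamma(shape=2, scale=mean_sev/2), so that
    E[Y] = mean_sev. Both choices are illustrative, not fitted to any data.
    """
    counts = rng.poisson(mean_freq, size=n_policies)
    totals = np.array([rng.gamma(2.0, mean_sev / 2.0, size=n).sum() for n in counts])
    return totals

totals = simulate_totals(n_policies=200_000, mean_freq=1.8, mean_sev=120.0)
# Under independence of N and Y: E[S] = E[N] * E[Y] = 1.8 * 120 = 216.
print(round(float(totals.mean()), 1))
```

With 200,000 simulated policies the Monte Carlo average lands within a fraction of a percent of the theoretical pure premium.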
2.1. Claim Frequency Model
Here, we describe the neural network we propose for modeling the claim frequency of medical visits, dental care treatments, and diagnostic testing. A typical approach involves fitting a univariate negative binomial regression to estimate the claim frequency of each claim type. The negative binomial is a common distributional assumption for health insurance claim counts since it has the advantage of capturing the overdispersion characterizing such claims (see, for instance, Ismail and Zamani (2013) and Frees et al. (2011)). However, using a univariate technique would neglect possible correlations between the occurrence of the different claim types, which are often observed in health insurance (Erhardt and Czado 2012).
For this reason, we adopt a multivariate approach by proposing negative multinomial neural networks to estimate the claim frequency for different claim types. The advantage of this technique is twofold. First, it models the claim frequency of various claim types via a neural network, accounting for nonlinearities, covariate interactions, and overdispersion. Second, the multivariate approach provided by the negative multinomial distribution allows us to capture possible correlations between different perils.
2.1.1. Negative Multinomial Distribution and Negative Multinomial Regression
A negative multinomial distribution, extensively discussed in Sibuya et al. (1964), provides a model for positively correlated multivariate count data characterized by overdispersion, i.e., where the variance is greater than the mean. Regression models relying on this distributional assumption have already been implemented in different fields, such as genomics (Kim et al. 2018) and medical statistics (Waller and Zelterman 1997).
More formally, let us consider an $r$-dimensional vector of counts $\boldsymbol{N} = (N_1, \ldots, N_r)$. The probability mass for $\boldsymbol{n} = (n_1, \ldots, n_r)$ under a negative multinomial distribution with parameters $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_r)$ and shape parameter $\phi > 0$, where $0 < \pi_j < 1$ and $\sum_{j=1}^{r} \pi_j < 1$, is:
$$\Pr(\boldsymbol{N} = \boldsymbol{n}) = \frac{\Gamma\big(\phi + \sum_{j=1}^{r} n_j\big)}{\Gamma(\phi)\, \prod_{j=1}^{r} n_j!}\, \Big(1 - \sum_{j=1}^{r} \pi_j\Big)^{\phi}\, \prod_{j=1}^{r} \pi_j^{n_j}, \tag{3}$$
where the vector of parameters $\boldsymbol{\pi}$ is defined as follows:
$$\pi_j = \frac{\mu_j}{\phi + \sum_{k=1}^{r} \mu_k}, \qquad j = 1, \ldots, r, \tag{4}$$
with $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_r)$ as the mean parameter vector.
For ease of discussion, we set $\mu_\bullet = \sum_{j=1}^{r} \mu_j$ and $n_\bullet = \sum_{j=1}^{r} n_j$, and rearrange the probability mass function in Equation (3) as follows:
$$\Pr(\boldsymbol{N} = \boldsymbol{n}) = \frac{\Gamma(\phi + n_\bullet)}{\Gamma(\phi)\, \prod_{j=1}^{r} n_j!}\, \Big(\frac{\phi}{\phi + \mu_\bullet}\Big)^{\phi}\, \prod_{j=1}^{r} \Big(\frac{\mu_j}{\phi + \mu_\bullet}\Big)^{n_j}. \tag{5}$$
As shown in Waller and Zelterman (1997), the expectation of the count random variable $\boldsymbol{N}$ characterized by a negative multinomial distribution is defined as
$$E[\boldsymbol{N}] = \boldsymbol{\mu}, \tag{6}$$
and its covariance matrix is
$$\operatorname{Cov}(\boldsymbol{N}) = \operatorname{diag}(\boldsymbol{\mu}) + \frac{1}{\phi}\, \boldsymbol{\mu}\boldsymbol{\mu}^\top. \tag{7}$$
Fitting the distribution involves estimating the parameters $\boldsymbol{\mu}$ and $\phi$ presented in Equation (5), which are usually obtained via MLE.
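The expectation and covariance structure described above can be illustrated by sampling. A standard construction of the negative multinomial draws a common Gamma frailty and then conditionally independent Poisson counts; the Python sketch below (illustrative, with arbitrary parameters) reproduces the overdispersed variances and positive covariances:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def sample_neg_multinomial(mu, phi, size):
    """Sample negative multinomial counts via a shared Gamma-Poisson mixture.

    Draw a frailty G ~ Gamma(shape=phi, scale=1/phi), so E[G] = 1, then
    N_j | G ~ Poisson(mu_j * G) independently across j. The shared frailty
    induces overdispersion and positive correlation between the components.
    """
    mu = np.asarray(mu, dtype=float)
    g = rng.gamma(shape=phi, scale=1.0 / phi, size=size)  # shared frailty per row
    return rng.poisson(np.outer(g, mu))                   # (size, r) count matrix

counts = sample_neg_multinomial(mu=[2.0, 0.5], phi=1.5, size=500_000)
# Theory: E[N_j] = mu_j; Var(N_j) = mu_j + mu_j^2/phi; Cov(N_j, N_k) = mu_j*mu_k/phi.
print(counts.mean(axis=0))     # approx [2.0, 0.5]
print(counts[:, 0].var())      # approx 2.0 + 4.0/1.5 = 4.67
print(np.cov(counts.T)[0, 1])  # approx 1.0/1.5 = 0.67
```

The sample moments match the stated mean and covariance formulas, confirming that a shared frailty is what generates the positive cross-correlation between claim types.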
The general framework of a regression approach with a negative multinomial distribution was first introduced by Kim et al. (2018). This approach models the relationship between the multivariate response count variable and a set of covariates, capturing the inner correlation structure between counts. Namely, we have:
$$\log \boldsymbol{\mu}_i = B^\top \boldsymbol{x}_i, \qquad i = 1, \ldots, I, \tag{8}$$
where $B$ is the $J \times r$ regression parameters matrix with $J$ regressors and $I$ is the number of policyholders. Then, from Equations (4) and (6), we have:
$$E[\boldsymbol{N}_i] = \boldsymbol{\mu}_i = \exp\!\big(B^\top \boldsymbol{x}_i\big), \tag{9}$$
and the vector $\boldsymbol{\pi}_i$ in Equation (3) is given by:
$$\pi_{ij} = \frac{\exp\!\big(\boldsymbol{x}_i^\top \boldsymbol{\beta}_j\big)}{\phi + \sum_{k=1}^{r} \exp\!\big(\boldsymbol{x}_i^\top \boldsymbol{\beta}_k\big)}, \qquad j = 1, \ldots, r, \tag{10}$$
where $\boldsymbol{\beta}_j$ denotes the $j$-th column of $B$, and $\{B, \phi\}$ is the set of parameters to be estimated via maximum likelihood. Then, the covariance matrix is obtained by feeding back Equation (10) in Equation (7). The set of parameters $B$ and $\phi$ is obtained maximizing the following log-likelihood:
$$\ell(B, \phi) = \sum_{i=1}^{I} \left[ \log \Gamma(\phi + n_{i\bullet}) - \log \Gamma(\phi) - \sum_{j=1}^{r} \log n_{ij}! + \phi \log \frac{\phi}{\phi + \mu_{i\bullet}} + \sum_{j=1}^{r} n_{ij} \log \frac{\mu_{ij}}{\phi + \mu_{i\bullet}} \right], \tag{11}$$
where $n_{i\bullet} = \sum_{j=1}^{r} n_{ij}$ and $\mu_{i\bullet} = \sum_{j=1}^{r} \mu_{ij}$.
Note that maximizing $\ell$ in Equation (11) is equivalent to minimizing the deviance $D$:
$$D = 2 \sum_{i=1}^{I} \left[ (\phi + n_{i\bullet}) \log \frac{\phi + \mu_{i\bullet}}{\phi + n_{i\bullet}} + \sum_{j=1}^{r} n_{ij} \log \frac{n_{ij}}{\mu_{ij}} \right], \tag{12}$$
with the convention $0 \log 0 = 0$. The same deviance is used to train the neural network presented in the next subsection.
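To make the frequency loss concrete, here is an explicit numpy version of one form of the negative multinomial deviance (an illustrative sketch, not the authors' code): it drops the terms that cancel between the saturated and fitted log-likelihoods, equals zero at a perfect fit, and is positive otherwise.

```python
import numpy as np

def nm_deviance(n, mu, phi):
    """Negative multinomial deviance 2 * [ll(saturated) - ll(mu)].

    n, mu: (n_obs, r) arrays of observed counts and fitted means (mu > 0);
    phi: scalar shape parameter. Uses the convention 0 * log(0) = 0.
    """
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n_tot = n.sum(axis=1)
    mu_tot = mu.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(n > 0, n * np.log(n / mu), 0.0).sum(axis=1)
    per_obs = (phi + n_tot) * np.log((phi + mu_tot) / (phi + n_tot)) + term
    return 2.0 * per_obs.sum()

n = np.array([[2.0, 0.0], [1.0, 3.0]])
dev_perfect = nm_deviance(n, np.maximum(n, 1e-12), phi=1.5)  # ~ 0 at mu = n
dev_off = nm_deviance(n, n + 0.5, phi=1.5)                   # > 0 for any other fit
print(dev_perfect, dev_off)
```

Because the saturated fit (mu = n) maximizes the likelihood, any other vector of fitted means yields a strictly larger deviance, which is the property exploited when the same quantity is used as a training loss.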
2.1.2. Negative Multinomial Neural Networks (NM-NNs)
The neural network approach we propose consists of a feed-forward neural network with three output layers to model the multivariate claim count response with respect to the feature set $\boldsymbol{x}_i$. The higher flexibility of neural networks, provided by their intricate inner structure, should account for the dependence that characterizes the different types of claim counts.
Given our set of covariates $\boldsymbol{x}_i$ and considering a network of depth $K$, the output layer for the neural network is defined as follows:
$$\boldsymbol{\mu}(\boldsymbol{x}_i) = g\!\left( \boldsymbol{b} + \sum_{l=1}^{q_K} \boldsymbol{w}_l\, z_l^{(K)}(\boldsymbol{x}_i) \right), \tag{13}$$
where $\boldsymbol{\mu}(\boldsymbol{x}_i)$ is the output produced by the network for the $i$-th string of observations and $z_l^{(K)}(\boldsymbol{x}_i)$ is the activation of the $l$-th of the $q_K$ neurons in the last hidden layer. Note that in this specific formulation $\boldsymbol{b}$ is a tridimensional vector of output biases and $\boldsymbol{w}_l$ is the tridimensional vector of weights connecting the $l$-th neuron in the last hidden layer to the output layer. The function $g(\cdot)$ is the activation function applied to the input of the output layer. For a visual representation of the network, see Figure 1.
The network has to be trained to minimize a given loss function. Since our goal is to model multivariate correlated count data, a logical choice for the loss function is the negative multinomial deviance. Rearranging Equation (12), we have the following deviance defined with respect to the network parameters $\boldsymbol{\theta}$:
$$D(\boldsymbol{\theta}) = 2 \sum_{i=1}^{I} \left[ (\phi + n_{i\bullet}) \log \frac{\phi + \mu_\bullet(\boldsymbol{x}_i; \boldsymbol{\theta})}{\phi + n_{i\bullet}} + \sum_{j=1}^{r} n_{ij} \log \frac{n_{ij}}{\mu_j(\boldsymbol{x}_i; \boldsymbol{\theta})} \right], \tag{14}$$
where $\mu_\bullet(\boldsymbol{x}_i; \boldsymbol{\theta}) = \sum_{j=1}^{r} \mu_j(\boldsymbol{x}_i; \boldsymbol{\theta})$. More specifically, the optimal set of parameters $\hat{\boldsymbol{\theta}}$ is obtained by minimizing the negative multinomial deviance via stochastic gradient descent. Note that in the estimation process, the scale parameter $\phi$ is considered as given and is obtained by preemptively performing a negative multinomial regression on the same set of data. Estimating the $\phi$ parameter using an ad hoc network structure might represent possible future research. Given the optimal set of parameters $\hat{\boldsymbol{\theta}}$, we can compute the expected number of claims as follows:
$$\hat{E}^{\,NN}[N_{ij}] = \mu_j(\boldsymbol{x}_i; \hat{\boldsymbol{\theta}}), \tag{15}$$
for $j = 1, \ldots, r$, where the superscript ‘$NN$’ denotes that the expectation is obtained via a neural network.
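A minimal TensorFlow sketch of such a network is given below (an illustration in Python; the paper's implementation is in R with TensorFlow, and the feature count, layer sizes, and shape-parameter value here are placeholders). The loss keeps only the terms of the negative multinomial log-likelihood that depend on the fitted means, so minimizing it is equivalent to minimizing the deviance:

```python
import numpy as np
import tensorflow as tf

N_FEATURES = 5   # hypothetical number of preprocessed covariates
PHI = 1.5        # shape parameter, fixed from a preliminary NM regression

def nm_loss(y_true, y_pred):
    """mu-dependent part of the negative multinomial negative log-likelihood.

    Minimizing it over the network weights is equivalent to minimizing the
    negative multinomial deviance, since the dropped terms do not depend
    on the fitted means y_pred.
    """
    n_tot = tf.reduce_sum(y_true, axis=1)
    mu_tot = tf.reduce_sum(y_pred, axis=1)
    nll = (PHI + n_tot) * tf.math.log(PHI + mu_tot) - tf.reduce_sum(
        y_true * tf.math.log(y_pred), axis=1
    )
    return tf.reduce_mean(nll)

# A small feed-forward network with a three-dimensional positive output
# (one fitted mean per claim type); the layer sizes are placeholders.
inputs = tf.keras.Input(shape=(N_FEATURES,))
x = inputs
for units in (64, 32, 16):
    x = tf.keras.layers.Dense(units, activation="relu")(x)
outputs = tf.keras.layers.Dense(3, activation="exponential")(x)  # mu > 0
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=nm_loss)

mu = model.predict(np.random.rand(4, N_FEATURES).astype("float32"), verbose=0)
print(mu.shape)  # one expected claim count per claim type
```

The exponential output activation guarantees strictly positive fitted means, which the logarithms in the loss require.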
2.1.3. Gamma Neural Network (Gamma-NN)
The Gamma-NN is a feed-forward neural network with a one-dimensional output layer. Considering the typical set of covariates
and a network of depth
K, keeping in mind the notation of
Section 2.1.2, we can define the output layer of the network as follows:
The network architecture is shown in
Figure 2.
To estimate the set of parameters
, we have to define an appropriate loss function to be minimized by the network model. We train the model on Gamma deviance:
where
is the number of claims submitted by a given insured
i. In particular, we employ stochastic gradient descent to obtain the estimate for the set of parameters
. We can compute the expected claim severity for the
i-th policyholder as follows:
where the superscript ‘
’ shows that the expectation is obtained via a neural network.
Note that in the following sections, we will fit a separate Gamma neural network for each type of peril considered in our application.
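The severity loss can likewise be written in a few lines of numpy (an illustrative sketch with made-up numbers, using the claim counts as case weights):

```python
import numpy as np

def gamma_deviance(y, mu, weights):
    """Weighted Gamma deviance: 2 * sum_i w_i * ((y_i - mu_i)/mu_i - log(y_i/mu_i)).

    y: observed average claim severities; mu: fitted severities;
    weights: the claim counts n_i, so insureds with more claims weigh more.
    """
    y, mu, w = (np.asarray(a, dtype=float) for a in (y, mu, weights))
    return 2.0 * np.sum(w * ((y - mu) / mu - np.log(y / mu)))

y = np.array([100.0, 250.0, 80.0])    # made-up average severities
w = np.array([2.0, 1.0, 3.0])         # made-up claim counts
print(gamma_deviance(y, y, w))        # 0.0 at a perfect fit
print(gamma_deviance(y, 1.2 * y, w))  # positive for any other fit
```

Weighting by the claim count gives more credibility to averages computed over many claims, which is the standard practice when the response is an average severity.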
3. The Data
In this paper, the dataset considered stems from an Italian insurer and contains the claims collected under a general health insurance plan during 2018. The insurance plan is designed for managers and retired managers affiliated with a specific industry in Italy, making it an employer-based health insurance plan. The dataset comprises 132,499 policyholders, including both current and former employees. Additionally, policyholders have the option to enroll relatives, such as parents, spouses, and children below 25 years of age, in the insurance coverage. In total, 273,950 insured individuals are included in the dataset.
The dataset covers three classes of claims:
Medical visits with a wide range of specialist doctors, such as cardiologists, pediatricians, neurologists, and many more.
Dental care treatments, including, among others, dental braces, implants, and oral surgery.
Diagnostic tests, e.g., magnetic resonance imaging, blood tests, and electrocardiogram.
For each insured, the dataset reports the following information: number of claims filed during the year (split between Visits, Dentalcare, and Diagnostic), severity for such claims, age, gender, regional area, firm dimension, family member type, years of permanence in the coverage, and ID code. Table 1 offers an overview of the data contained in the dataset. The response variables for the frequency models discussed in Section 2.1.2 are the claim counts for the three claim types, while the responses for the models proposed in Section 2.1.3 are the corresponding claim severities for Visits, Dentalcare, and Diagnostic.
The dataset consists of 273,950 observations and six covariates. There are 205,625 claimants, with the total monetary value of submitted claims amounting to approximately 233 million euros. Below, we explore the dataset via summary and descriptive statistics.
3.1. Covariates
Figure 3 presents the histograms and frequency tables for the covariates included in the dataset. The age variable (AG) is heavily skewed toward older individuals, reflecting a predominantly senior population. Notably, there is a marked ‘dip’ in the distribution between ages 25 and 40. This gap stems from the specific subscription policies of the Italian insurer: since policyholders are typically firm managers, it is uncommon for them to be under 40 years old.1 Additionally, managers are not permitted to extend insurance coverage to their children over the age of 25. These factors together explain the limited number of insured individuals in the 25–40 age range. The gender variable (GE) shows a relatively balanced distribution between males and females. However, although the overall population shows a balanced gender distribution, this balance is not maintained within the subset of policyholders: approximately 80% of the policyholders are male. This skew is due to the specific demographic characteristics of the firm managers who are eligible for the insurance policy, since managerial positions are predominantly held by men. The permanence (PE) is an integer variable reporting the years of permanence in the insurance plan; its minimum is 1 (for newcomers), and its maximum is 41 (for early adopters). The histogram of this variable shows a decreasing trend. However, there is a strong peak at 38 years, connected to the subscription of the health coverage by a large number of firms whose employees (or pensioners) are still enrolled in the insurance plan. As for the region (RE), we observe a strong concentration of insured individuals in two of the 21 Italian regions, ‘Lombardia’ and ‘Lazio’, which account for the majority of policyholders. The dimension variable (DM) is a proxy for the dimension of the firm to which the policyholder belongs. More specifically, the variable reports the number of managers working in the firm, and its value is the same across all the insureds belonging to the same family. It ranges from 1 to 1500, with a strong concentration below 100, representing the small and medium-sized firms that are characteristic of the Italian economy. The family member type (FA) is a categorical variable indicating which kind of family member the insured is. From Figure 3, we observe that the most relevant classes are Policyholder, Spouse, and Children. In contrast, Parents and Ex-Spouse are almost negligible since they cover, on aggregate, fewer than 200 insureds in the dataset.
Table 2 presents the measures of association between the various covariates. The results indicate a strong positive relationship between family member type (FA) and age (AG), with a value of 0.739, which is intuitively reasonable. Similarly, PE and AG exhibit a moderate correlation of 0.676, reflecting the fact that individuals with longer durations of coverage are more likely to be older. Associations between categorical and numerical variables are generally weak, with most values close to zero (e.g., between GE and PE). On the other hand, the associations between categorical variables appear to be relatively strong, as highlighted by the statistic reported in the table.
3.2. Response Variable
In Table 3, we give a general overview of the claim counts of the three claim types in the dataset by reporting their summary statistics.
From Table 3, we notice that a considerable portion of insureds submit at least one claim (last row in the table), which is peculiar to health insurance, where events have a higher frequency than in other non-life insurance lines, e.g., property insurance. More specifically, 50% of the insureds request at least a medical visit or a diagnostic test during the year, while 25% undergo some sort of dental treatment. Table 3 also reports the average claim frequency per insured for each claim type; the frequency for diagnostic tests appears to be exceptionally high, which is also due to the approach used by the insurer to record claims when an insured undergoes a diagnostic test.2
These claims exhibit overdispersion, as shown in Table 3, where the variance significantly exceeds the mean. To choose a suitable discrete distribution for the marginals of the claim counts, we compare the Poisson distribution with a negative binomial distribution by considering two goodness-of-fit (GoF) measures: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The Poisson is a typical distributional assumption when it comes to claim frequency modeling. However, a different distribution may be more appropriate when the data are characterized by overdispersion. The results in Table 4 indicate that the negative binomial distribution better fits the claim counts, as reflected by its lower AIC and BIC values. Although the SSE and MAE are identical for both distributions, the lower AIC and BIC suggest the negative binomial provides a more accurate fit.
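This kind of AIC/BIC comparison is easy to reproduce on synthetic data. In the Python sketch below (the counts are simulated, since the real data are proprietary), both distributions are fitted by maximum likelihood and the information criteria computed; on overdispersed counts the negative binomial wins:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(seed=0)
# Synthetic overdispersed counts: mean 1.8, variance 4.5 (> mean).
counts = rng.negative_binomial(n=1.2, p=0.4, size=20_000)

# Poisson: the MLE of the rate is the sample mean (one free parameter).
ll_pois = stats.poisson.logpmf(counts, counts.mean()).sum()

# Negative binomial: MLE of (r, p) via numerical optimization (two parameters).
def nb_negll(theta):
    r = np.exp(theta[0])                 # keep r > 0
    p = 1.0 / (1.0 + np.exp(-theta[1]))  # keep 0 < p < 1
    return -stats.nbinom.logpmf(counts, r, p).sum()

ll_nb = -minimize(nb_negll, x0=[0.0, 0.0], method="Nelder-Mead").fun

def aic(ll, k): return 2 * k - 2 * ll
def bic(ll, k): return k * np.log(counts.size) - 2 * ll

print(aic(ll_pois, 1), aic(ll_nb, 2))  # lower is better: NB beats Poisson here
print(bic(ll_pois, 1), bic(ll_nb, 2))
```

Even though the negative binomial spends one extra parameter, the likelihood gain on overdispersed counts dwarfs the AIC/BIC penalty, mirroring the pattern reported in Table 4.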
Given the nature of the claim types, it is interesting to evaluate the correlation between their occurrence. In Figure 4, we report the Spearman correlogram between the claim counts. The results show that diagnostic exams are strongly correlated with visits; this is hardly surprising, since the referral given by a medical visit is frequently an essential requirement to undergo an in-depth diagnostic test. Therefore, we choose to jointly model the different claim counts to capture such correlation using a multivariate approach via the negative multinomial distribution. Given the low correlation between dental care claims and the other types of claims, an argument could be made for modeling the dental claims alone while using a multivariate approach for visits and diagnostic tests. However, for simplicity, we still decide to model the three perils jointly.
These claim frequencies are modeled with the negative multinomial regression framework introduced in Section 2.1.
3.3. Claim Severity
Here, we provide some descriptive statistics to characterize the claim costs. In Table 5, we display some summary statistics. The distributions of the different claim severities are right-skewed, a feature commonly well described through a Gamma distribution. In particular, we notice that Diagnostic and Dentalcare claims seem far more skewed than medical visits.
Note3 that we do not distinguish between regular and large claims in this study, and we consider all of them together. However, a case could be made for using two different approaches when modeling small and large claims, as in Denuit and Lang (2004) and Albrecher et al. (2017).
4. Results
To evaluate the general performance of the NM-NN with respect to the benchmark NM-Regression (NMR), we test the model over the dataset presented above. The results are obtained through five-fold cross-validation, where the dataset is divided into five folds, and at each iteration, three of the five folds are used as a training set for the models (NM-NN and NMR), one as a validation set, and one as a testing set.
The network is trained for up to 2000 epochs with early stopping based on the validation loss, using a patience of 200 epochs to ensure a balance between convergence and overfitting prevention. The best-performing weights on the validation set are retained. The model adopts a five-hidden-layer structure with ReLU as the activation function.4 As for the variables presented in Table 1: AG, PE,5 and DM are min-max scaled, RE and FA are treated using an embedding layer, and GE is dummy encoded. In the multinomial regression model, AG, PE, and DM are treated as continuous variables, while RE, FA, and GE are dummy encoded.
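The preprocessing steps can be sketched as follows (illustrative Python with toy values; variable names follow Table 1, and the embedding itself would live inside the network):

```python
import numpy as np

# Toy covariate values mimicking Table 1 (the actual data are proprietary):
age = np.array([55.0, 62.0, 34.0, 71.0])                      # AG, continuous
gender = np.array(["M", "F", "M", "M"])                       # GE, binary
region = np.array(["Lombardia", "Lazio", "Lazio", "Puglia"])  # RE, many levels

# Min-max scaling maps a continuous covariate into [0, 1]:
age_scaled = (age - age.min()) / (age.max() - age.min())

# Dummy encoding for a binary covariate:
gender_dummy = (gender == "M").astype(float)

# High-cardinality categoricals (RE, FA) are mapped to integer codes and fed
# to an embedding layer inside the network instead of being one-hot encoded:
levels, region_codes = np.unique(region, return_inverse=True)
print(age_scaled, gender_dummy, region_codes)
```

Embeddings are preferred over dummy encoding for high-cardinality variables because they keep the input dimension small and let the network learn a dense representation of the levels.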
We fit a different Gamma GAMLSS (Generalized Additive Model for Location, Scale, and Shape) and a Gamma-NN for each claim type (Visits, Dentalcare, and Diagnostic). In the GAMLSS, continuous variables are treated via cubic splines, while categorical variables are dummy encoded. For the Gamma-NN, the models are trained over 1000 epochs using early stopping on the validation set to prevent overfitting, with a patience of 200 epochs. Each network adopts a standard three-hidden-layer structure (Schelldorfer and Wüthrich 2019) with a ReLU activation function. Data are preprocessed as follows: AG, PE, and DM are min-max scaled, RE and FA are treated using an embedding layer, and GE is dummy encoded. The results discussed below stem from a five-fold cross-validation. We implemented the models in R (version 4.2.1) using TensorFlow (version 2.11.0).
4.1. Models Performance
The performance of the frequency models is measured in terms of negative multinomial deviance (see Equations (12) and (14)), where a lower deviance signals a better model. For both models, we estimate the claim frequencies for medical visits, dental treatments, and diagnostic tests. Figure 5 compares the in-sample (left panel) and the out-of-sample (right panel) performance for the NMR and the NM-NN. In particular, we report the negative multinomial deviance over the five data folds. The results are very stable over the different folds, both in-sample and out-of-sample. The neural network model consistently achieves a better performance than the NMR, since it returns a lower deviance. In Appendix A, we report a comparison between the models discussed above and more traditional univariate approaches, such as GLMs with Poisson and negative binomial distributions. The comparison is based on both in-sample and out-of-sample performance, evaluated using SSE and MAE metrics. Overall, the NM-NN model consistently outperforms the benchmark models across all measures.
As for the Gamma-NN, we evaluate the performance of the different models using the Gamma deviance, where the lower the deviance, the better the model. Given the set of features $\boldsymbol{x}_i$, each model returns an estimate for the expected claim severity of a medical visit, dental treatment, or diagnostic test. Figure 6 compares the in-sample (left panels) and the out-of-sample (right panels) performance for the Gamma-GAMLSS and the Gamma-NN. In each plot, we report the Gamma deviance over the five data folds to evaluate the stability of our results. The outcomes appear relatively robust for each claim severity model across the five folds, both in-sample and out-of-sample. Neural network models consistently outperform classical regression models, always returning a lower deviance. In Appendix B, we also compare the Gamma-NN with a traditional Gamma GLM using non-penalized performance metrics such as SSE and MAE. The results confirm that the neural network model achieves superior performance in both cases.
Neural networks are often celebrated for their outstanding predictive performance, since they easily learn a good representation of the training data that generalizes well to new data. Despite their effectiveness, these models frequently encounter a meaningful drawback: their lack of explainability. Neural networks, in particular, consist of many parameters and a deeply layered architecture, making it challenging for modelers to interpret the outcomes. To address these limitations, recent years have seen a surge in research dedicated to model-agnostic techniques (see Friedman and Popescu 2008a, as well as actuarial case studies in Lorentzen and Mayer 2020 and Henckaerts et al. 2021), all designed to enhance the interpretability of machine learning models. Therefore, in the remainder of this section, we present the information retrieved using such model-agnostic tools. We investigate the variables’ importance, their main effects, and the possible presence of interactions.
4.2. Variable Importance
In
Figure 7, we report the Variable Importance metric (see
Friedman and Popescu 2008a) to find the most relevant variables in our claim frequency models. The variables are ranked from left to right, from the most important to the less relevant. For both models, the two most relevant variables are age (
AG) and region (
RE). However, their ranking differs: the age variable is by far the most important for the NM-NN, while it only ranks second in the NMR. In particular, the increase in deviance is much higher in the network model than in the NMR, signaling that the multinomial regression is probably missing some information when modeling the relationship between the age variable and the claim frequencies. As for the regional variable, even though it ranks first in the NMR and second in the NM-NN, it has almost the same importance in the two models. In both models, the remaining variables are far less relevant; however, some of them appear to hold slightly greater significance in the neural network model than in the NMR (
GE and
PE), except for the family member variable (
FA), which is more relevant in the regression than in the neural network.
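The deviance-based importance used above can be sketched in a model-agnostic way: permute one covariate at a time and record the resulting increase in deviance. The following Python sketch is illustrative only — the function names and the Poisson deviance stand-in are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Mean Poisson deviance, a stand-in for the frequency deviance used in the paper."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)  # avoid log(0); the y factor zeroes those terms
    return 2.0 * np.mean(y * np.log(safe_y / mu) - (y - mu))

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Importance of each column of X: average increase in deviance
    when that column is randomly shuffled before predicting."""
    rng = np.random.default_rng(seed)
    base = poisson_deviance(y, predict(X))
    importance = {}
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            deltas.append(poisson_deviance(y, predict(Xp)) - base)
        importance[j] = float(np.mean(deltas))
    return importance
```

A covariate that the model ignores leaves the deviance unchanged when shuffled, so its score is zero — mirroring the near-irrelevant variables in the plots above.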
To learn which variables are more relevant in claim severity prediction, in
Figure 8, we compare the variable importance for the different regression techniques. The covariates are ranked from left to right, starting with the most important one. The top plots report the variable importance for GAMLSSs, while the bottom plots display the variable importance for neural networks. Claim severity models for
Visits (
Figure 8a) show some radical differences when it comes to variable importance. The only variable that seems to be relevant for the GAMLSS is the region (
RE), while the NN has several important covariates—age (
AG) and region (
RE) above all. In particular, the region's importance is comparable between the two models, whereas age is where the real difference arises: the variable is the most important in the NN, while it is almost irrelevant for the GAMLSS. Further differences between the GAMLSS and the NN involve the
FA and
PE variables, which are relevant only in the latter model. In contrast,
Dentalcare models (
Figure 8b) show a similar variable importance plot. Both models strongly rely on the age variable for their predictions, with minor but still relevant importance for the region. Even though the plots are roughly the same, we notice a slightly higher importance for variables in the NN model. In a similar fashion to
Visits, the claim severity models for
Diagnostic display two different variable importance plots. Indeed, except for region (
RE), the variables entering the GAMLSS are deemed irrelevant. The neural network extracts important information also from the age (
AG), the permanence (
PE) and the family member type (
FA).
In the following, we further investigate our models, looking at main effects.
4.3. Main Effects
Partial Dependence (PD) profiles, as introduced by
Friedman and Popescu (
2008a), are known to be a powerful tool for analyzing the marginal effect of a covariate on the model’s response. They provide insight into the main effect of covariate
j by averaging its influence across all observations. In essence, PD profiles capture the overall impact of variable
j on the model’s predictions, revealing whether its relationship with the response is linear, monotonic, or exhibits more complex patterns.
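Computationally, a PD profile is just an average of predictions with one column clamped to a grid of values. A minimal illustrative sketch (function and variable names are hypothetical):

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """PD profile of feature j: the model prediction averaged over all
    observations, with column j clamped to each value of `grid` in turn."""
    profile = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v          # fix covariate j for every observation
        profile.append(predict(Xv).mean())
    return np.array(profile)
```

Plotting `profile` against `grid` reveals whether the fitted main effect of covariate j is linear, monotonic, or more complex.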
In the following lines, we analyze the marginal effect for the different covariates considered in the NMR and the NM-NN. In particular, we will discuss the marginal effect of some selected variables on the claim frequencies of medical visits, dental treatments, and diagnostic tests.
In particular, we observe:
Visits: Figure 9 displays the partial dependence plots produced by the neural network and the NMR for the
Visits claim frequency, for (
AG) and (
RE). We notice a first significant difference when comparing the PD plot for the age variable (
AG). The marginal effect captured by the NMR (in red) is exponential: it starts at a claim frequency of 0 and caps at 5. In contrast, the behavior captured by the NM-NN (in blue) is more complex. Its PD plot starts at 2, decreases to a minimum frequency of about 1 around 15 years of age, then increases slowly, followed by a substantial rise after the age of 50, reaching a maximum of 3.3 at around 80 years. The marginal effect produced by the network thus captures specific features, such as the higher frequency of medical visits at younger ages, connected to pediatric visits, and the steady claim frequency at older ages. The region effect also differs between the two models: the NMR captures a strong claim frequency for ’Lazio’, which the neural network does not.
Dentalcare: Figure 10 reports the PD plots for
Dentalcare claim frequency. This claim type also shows differing behaviors for the age PD plot. The NMR PD plot shows an exponential trend, while the marginal effect produced by the neural network is almost parabolic, peaking at 75 years of age, with a slight bump around 15 years of age, connected to dental braces and teenage oral surgery. Looking at the PD plots for the
FA, we notice that the effect of this variable is almost flat for the NM-NN, while the NMR effects change across the different types of family members. As for the
Visits claim frequency, the NMR seems to use this variable to capture effects not registered by its
AG PD plot.
Diagnostic: Figure 11 reports the PD plots for the Diagnostic claim frequency. The age (
AG) has a quasi-linear trend in the neural network with two small bumps around the thirties and eighties, while the trend is exponential in the NMR. The neural network PD plot for permanence (
PE) shows a substantial effect on the claim frequency for high values of permanence, which the NMR fails to capture.
As shown via the different PD plots, the major difference between the two models is related to the age
AG main effect. This difference is primarily due to the NMR structural form, which lacks the flexibility to capture the shape of this main effect. In contrast, the NM-NN seems to capture all the information provided by the age variable. Therefore, the performance gap in
Figure 5 is probably due, in the first place, to the poor modeling of the
AG main effect by the NMR. This issue could be addressed using a polynomial term or a spline. However, this is only part of the story, since the different performances may also be associated with possible interactions between variables, which the PD plot cannot detect.
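As a hedged illustration of the spline fix suggested above, a truncated-power cubic basis for the age variable could be built by hand and supplied as additional design-matrix columns to a GLM-type regression; the knot locations below are purely illustrative:

```python
import numpy as np

def age_spline_basis(age, knots=(30.0, 60.0)):
    """Truncated-power cubic spline basis for the age variable.
    Replacing the single linear AG term with these columns lets the
    fitted main effect bend at the (illustrative) knot locations."""
    age = np.asarray(age, float)
    cols = [age, age ** 2, age ** 3]
    cols += [np.clip(age - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)
```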
To complete the analysis, we study the main effects for the claim severity models (Gamma GAMLSS and Gamma NN) and the different perils. For the sake of brevity, in
Figure 12, we report only the Partial Dependence plot for age obtained for the Dentalcare peril, which appears particularly interesting. This PD plot shows a relevant difference at younger ages: the GAMLSS shows a unitary claim cost decreasing between 0 and 35 years of age, while the NN main effect starts at a low value and then increases, reaching a peak at 15 years, capturing the high cost associated with the dental braces that characterize teenagers. In this case, the neural network seems to find a better fit for this specific effect.
4.4. Interaction Effects
After examining the main effects, we now conduct a detailed analysis of potential interaction effects among covariates as captured by the neural network model. A model-agnostic way to measure the interaction between two variables is based on partial dependence profiles introduced by
Friedman and Popescu (
2008a). To evaluate the existence of interaction effects, we employ the H-statistic proposed by
Friedman and Popescu (
2008a), which quantifies the interaction strength between two covariates by determining the proportion of prediction variance attributable to their interaction.
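A minimal sketch of the (squared) H-statistic computation, assuming a generic `predict` function and a numeric feature matrix; the helper names below are our own, not from the original reference:

```python
import numpy as np

def _pd_at_points(predict, X, cols, points):
    """Partial dependence of `predict` on the features in `cols`, evaluated
    at the observed `points` and averaged over the remaining columns."""
    out = np.empty(len(points))
    for i, p in enumerate(points):
        Xv = X.copy()
        Xv[:, cols] = p
        out[i] = predict(Xv).mean()
    return out - out.mean()  # centered, as in Friedman & Popescu

def h_statistic(predict, X, j, k):
    """Squared H-statistic for the pairwise interaction of features j and k:
    share of the joint PD variance not explained by the two main effects."""
    pd_jk = _pd_at_points(predict, X, [j, k], X[:, [j, k]])
    pd_j = _pd_at_points(predict, X, [j], X[:, [j]])
    pd_k = _pd_at_points(predict, X, [k], X[:, [k]])
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```

For a purely additive model the statistic is zero, while a strong multiplicative interaction pushes it toward one.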
In
Figure 13, we plot the values of the H-statistic for the NM-NN and each possible pairwise interaction between covariates. We do not report the plot for the NMR since the model is not designed to capture pairwise interactions between variables. Each pane in
Figure 13 reports the interactions detected for the different claim types: for
Visits the age
AG has a weak interaction with both permanence
PE and gender
GE;
Dentalcare claim frequency shows two strong interactions for the permanence
PE variable with dimension
DM and gender
GE;
Diagnostic claims have two relevant pairwise interactions for the gender
GE variable with permanence
PE and dimension
DM; moreover, we also notice a small interaction between age
AG and region
RE.
In
Figure 14, we report the H-statistic for each possible pairwise interaction between variables entering the Gamma Neural Networks.
Namely, interactions captured by the Gamma-NN for medical visits are presented in the left pane of
Figure 14. The most relevant interaction is the one between permanence
PE and age
AG. This interaction is quite peculiar, since its H-statistic exceeds 1; this happens when the variance of the joint interaction between the variables is greater than that of their two-dimensional PD plot. Other relevant interactions are observed between gender and family member type, firm dimension and gender, and permanence and dimension.
From the central pane of
Figure 14, we notice that all the relevant interactions for the dental treatment Gamma-NN model lean on the company dimension variable (
DM). In particular, this variable interacts with permanence (
PE), gender (
GE), and family member type (
FA). It is curious to observe such relevance for the dimensional variable since this covariate has a low importance (
Figure 8b) and an almost flat PD plot (
Figure 12). The grouped PD plots discussed in the second part of this subsection will help us understand whether such interactions are relevant.
The right plot of
Figure 14 reports variable interactions for the
Diagnostic Gamma-NN model. The plot reports numerous relevant interactions; among the strongest are those between permanence and gender, dimension and gender, permanence and age, and gender and family member type.
To gain insight into the behavior of such interaction effects, in the following lines we use grouped PD plots to visualize the effect of these interactions on claim frequencies. A grouped PD plot represents the main effect of a variable conditioned on the different values of another variable. Therefore, the plot reports v PD curves, where v is the number of possible values of the conditioning covariate. The interaction is considered meaningful if the curves exhibit distinct patterns across the values of the conditioning variable; in particular, we expect the different PD curves to be non-parallel.
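The grouped PD plot described above can be sketched by computing an ordinary PD profile separately within each level of the conditioning covariate (function and variable names are illustrative):

```python
import numpy as np

def grouped_partial_dependence(predict, X, j, grid, g):
    """PD profile of feature j computed separately within each level of the
    conditioning feature g; non-parallel curves suggest an interaction."""
    curves = {}
    for level in np.unique(X[:, g]):
        Xg = X[X[:, g] == level]      # keep only observations at this level
        curve = []
        for v in grid:
            Xv = Xg.copy()
            Xv[:, j] = v
            curve.append(predict(Xv).mean())
        curves[level] = np.array(curve)
    return curves
```

Plotting one curve per level and checking whether they stay parallel reproduces the visual test used in this subsection.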
As an example, with reference to the interaction spotted by the NM-NN for Visits, in
Figure 15, we study the interaction between gender
GE and age
AG. Such a figure displays a different behavior for the age PD plot conditioned on females. In particular, we notice a higher claim frequency for female insureds between 20 and 60 years of age. This increased frequency is probably associated with gynecologic visits and pregnancy-related visits.
From the results discussed in this section, the proposed neural network models emerge as a clear winner over the benchmark regressions, performing better both in-sample and out-of-sample. These results are achieved thanks to the neural network's flexibility, which is not restricted by a multiplicative structural form. The network architecture offers sufficient complexity to reflect nonlinearities in the explanatory variables and interactions between them. It would also be possible to further improve the regressions by manually adding the pairwise interactions spotted by the NM-NN or the Gamma-NN. In this sense, neural networks can also serve as a complementary tool for the regressions: in a first step, the modeler uses a neural network to spot the weaknesses of the simpler regression model, such as missing interactions or main effects; in a second step, the modeler improves the simpler regression model by explicitly enriching its functional form.
However, this is not enough to determine whether it is worth choosing neural networks over NMR and GAMLSS. When determining the price (or the potential cost) of a risk coverage, it is also vital to consider business-related metrics. Therefore, in the next section, we combine the frequency models discussed and the claim severity model in a pricing model and compare the different tariff structures using practical economic metrics relevant for an insurance company.
5. Ratemaking
Now that we have extensively discussed claim frequency models (
Section 2.1.2) and claim severity models (
Section 2.1.3), we can combine them to complete the frequency-severity approach displayed in
Section 2 devoted to pure premium evaluation. For instance, it is possible to compute the pure premium for a set of insureds according to their characteristics by combining the NM-NN and the Gamma-NN. In our specific case, to obtain the pure premium for the health insurance plan presented in this work, we must tweak Equation (
2) to account for the different claim types. In particular, the pure premium obtained via the neural network models is defined as follows:

$$P_i^{\mathrm{NN}} = \sum_{j=1}^{3} \hat{\lambda}_{i,j}^{\mathrm{NN}}\,\hat{\mu}_{i,j}^{\mathrm{NN}}, \tag{19}$$

for $i = 1, \dots, n$, where $\hat{\lambda}_{i,j}^{\mathrm{NN}}$ is the expected claim frequency of insured $i$ for claim type $j$, defined as in Equation (15), and $\hat{\mu}_{i,j}^{\mathrm{NN}}$ is the corresponding expected claim severity, obtained as in Equation (18). While using the NMR and the GAMLSS, we have:

$$P_i^{\mathrm{REG}} = \sum_{j=1}^{3} \hat{\lambda}_{i,j}^{\mathrm{REG}}\,\hat{\mu}_{i,j}^{\mathrm{REG}}, \tag{20}$$

for $i = 1, \dots, n$.
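In code, the frequency-severity combination above reduces to an element-wise product of expected frequencies and severities, summed over the claim types (a minimal sketch with hypothetical prediction arrays):

```python
import numpy as np

def pure_premium(freq, sev):
    """Pure premium per insured from a frequency-severity model.

    freq, sev : arrays of shape (n_insureds, n_claim_types) holding the
    expected claim counts (e.g. from the NM-NN) and the expected claim
    severities (e.g. from the Gamma-NN) for each claim type."""
    freq, sev = np.asarray(freq, float), np.asarray(sev, float)
    return (freq * sev).sum(axis=1)  # sum expected costs over claim types
```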
As is common in actuarial pricing, the claim frequency and claim severity models discussed in this work are estimated by optimizing a goodness-of-fit measure. Thus, until now, when comparing neural networks to simpler regressions, we have mainly focused on the statistical performance of the models. However, even if neural networks have outperformed regressions from a statistical standpoint, it is crucial to understand whether the premiums produced by such models add value to the business in which they are to be implemented. It is therefore important to also consider an economic criterion, going beyond the classical deviance metric, when deciding whether a given model is worth implementing in insurance applications. That is where model lift metrics are useful. In summary, a model's lift reflects its capacity to mitigate adverse selection in pricing: it measures how well the model assigns actuarially fair rates to policyholders, reducing the risk of losing policyholders to competitors offering more refined pricing structures.
In this section, we employ two model lift methods proposed by
Denuit et al. (
2019) to evaluate the performance of a set of candidate premiums. The metrics proposed by the authors aim at assessing the following two aspects of a given premium: the variability of the resulting premium amounts, as larger premium differentiation induces greater lift, and the ability of the premium amount to match the actual total claim amount
S for increasing risk profiles. The first objective is tackled using Lorenz curves (LC), while the second point is assessed considering concentration curves (CC). For an extensive discussion on this kind of curve, see
Denuit et al. (
2019). Given an insurance portfolio, if we consider the subset of insureds comprising a certain percentage of the policies with the smallest premiums (i.e., those insureds that are likely to be lost to a potential competitor because of their low-risk profile), the concentration curve compares the premium amounts of this group with its aggregate losses. The relative positioning of the Lorenz and concentration curves enables the actuary to accurately evaluate the effectiveness of the premium under analysis.
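A minimal sketch of the two curves, with policies sorted by increasing premium (the ABC and ICC summaries of Denuit et al. (2019) are areas derived from these curves; we refer to the original paper for their exact definitions):

```python
import numpy as np

def lorenz_and_concentration(premium, losses):
    """Sort policies by increasing premium, then accumulate:
    the Lorenz curve tracks the cumulative share of total premium,
    the concentration curve the cumulative share of the actual
    losses S over the same ordering."""
    premium, losses = np.asarray(premium, float), np.asarray(losses, float)
    order = np.argsort(premium)
    lorenz = np.cumsum(premium[order]) / premium.sum()
    concentration = np.cumsum(losses[order]) / losses.sum()
    return lorenz, concentration
```

When the premium perfectly ranks the risks, the two curves coincide; the gap between them widens as the tariff misprices the low-premium segment.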
Thus, we employ the two lift metrics discussed by
Denuit et al. (
2019) (ABC and ICC) to compare the premiums obtained combining the NMR and the GAMLSS
(Equation (
20)) and neural networks
(Equation (
19)). We use such metrics also to compare the results obtained by our novel approach with a set of premiums
produced using neural networks without taking into account the dependence between the different health perils. More specifically, we have combined the claim frequencies obtained via three different Negative Binomial Neural Networks, one for each claim type, with the claim severities stemming from the Gamma-NN discussed in
Section 2.1.3. This comparison highlights the added value of the multinomial approach, which accounts for the dependence between different claim types, compared to a ratemaking approach that does not consider such dependencies.
In
Table 6, we report the ABC and ICC metrics for the three sets of premiums. All premiums are computed on the out-of-sample data, more specifically on the first fold of the 5-fold cross-validation. We notice that the premiums issued by the neural network models proposed in this work return both a lower ABC and a lower ICC, signaling that these models produce a better lift than the regression benchmarks. In other words, the lower ABC registered by the NN signals that the premium produced by this model is closer to the actual risk present in the insurance portfolio, while the lower ICC means that such premiums cover the expected share of true premiums in the portfolio. The same can be argued by comparing the premiums
obtained using the NM-NN and the Gamma-NN with the benchmark premium
stemming from the so-called independent approach. We observe that the model introduced in this work produces lower values for both ABC and ICC.
Thus, even from a business-metric standpoint, the neural network models have proven to add value compared to the NMR and the GAMLSS, since their greater precision translates into better premiums. In
Appendix C, we further explore the comparison between the set of premiums generated by neural networks and those obtained through more traditional approaches, providing additional insights using both graphical tools and standard evaluation metrics.
To further improve the discussed premiums, it would be necessary to complement the informative set presented in
Section 3, including, for instance, additional covariates such as policyholders’ yearly income and level of education, which are generally good drivers of health expenditure.
6. Conclusions
Neural networks are a powerful tool for performing multi-dimensional, nonlinear regressions in insurance ratemaking, enabling more precise risk assessment and pricing by effectively capturing complex relationships within large-scale insurance datasets.
In this paper, we propose an innovative application of neural network models within the context of health insurance pricing, with a specific focus on their actuarial relevance and interpretability. We first introduce a neural network with a multivariate output structure designed to model potentially correlated health claim counts, namely, medical visits, dental care treatments, and diagnostic exams. This model minimizes a Negative Multinomial deviance and is coupled with Gamma neural networks to estimate claim severities, ultimately providing an estimate of the pure premium for each insured individual.
The performance of the proposed models has been benchmarked against more traditional regression-based methods, such as Negative Multinomial Regression (NMR) and GAMLSS. Our results indicate that neural networks offer clear advantages in terms of predictive accuracy, as confirmed by lower deviance, SSE, and MAE values, as well as in terms of risk segmentation, as demonstrated by model lift metrics discussed in
Section 5. These findings suggest that neural network models not only provide a statistically superior fit, but also lead to better risk diversification and more efficient premium structures. Moreover, we enhance the understanding of the models’ internal representations through model-agnostic XAI tools, improving the transparency and trustworthiness of machine learning applications in actuarial practice, which is essential for regulatory and business adoption.