1. Introduction
Aggregating is a statistical technique that helps to reduce noise and uncertainty in predictors. Typically, it is applied to an i.i.d. sequence of predictors and it is justified theoretically using the law of large numbers. In many practical applications one is not in the comfortable situation of having an i.i.d. sequence of predictors to which aggregating could be applied. For this reason,
Breiman (1996) combined bootstrapping and aggregating, called bagging: bootstrapping is used to generate a sequence of randomized predictors, and aggregating them yields an averaged bootstrap predictor. The title of this paper has been inspired by Breiman (1996); in other words, it is not meant in the sense of grumbling or the like, but we are going to combine networks and aggregating to obtain the nagging predictor. Thereby, we benefit from the fact that neural network regression models typically have infinitely many equally good predictors which are determined by gradient descent algorithms. These equally good predictors depend on the (random) starting point of the gradient descent algorithm; thus, starting the algorithm in i.i.d. seeds for different runs results in an infinite sequence of (equally good) neural network predictors. Moreover, other common neural network training techniques may introduce even more randomness into neural network results, for example, dropout, which relies on randomly setting parts of the network to zero during training to improve performance at test time, see Srivastava et al. (2014) for more details, or the random selection of data used to implement stochastic or mini-batch gradient descent methods. On the one hand, this creates difficulties for using neural network models within an insurance pricing context, since the predictive performance of the models measured at portfolio level will vary with each run, and the predictions for individual policies will vary even more, leading to uncertainty about the prices that should ultimately be charged to individuals. On the other hand, having multiple network predictors puts us in the same situation as Breiman (1996) after having received the bootstrap samples, and we can aggregate them, leading to more stable results and enhanced predictive performance. In this paper, we explore statistical properties of nagging predictors at a portfolio and at a policy level, in the context of insurance pricing.
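The nagging procedure described above (run network training from i.i.d. seeds, then average the resulting predictors) can be sketched in a few lines. The following toy example is entirely our own illustration: the hypothetical `train_and_predict` stands in for one full network training run, with the seed-dependent scatter of the fitted model mimicked by unbiased multiplicative noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one full network training run: in this toy
# example the seed-dependent scatter of the fitted network is mimicked
# by unbiased multiplicative log-normal noise around the true means mu.
mu = np.array([0.10, 0.05, 0.20])   # "true" expected claims of three policies

def train_and_predict(rng):
    noise = rng.lognormal(mean=-0.02, sigma=0.2, size=mu.shape)  # E[noise] = 1
    return mu * noise

M = 400                                   # number of training runs (seeds)
preds = np.stack([train_and_predict(rng) for _ in range(M)])

nagging = preds.mean(axis=0)              # nagging predictor: average over runs
cv = preds.std(axis=0) / nagging          # per-policy coefficient of variation
print(nagging.round(3), cv.round(3))
```

The per-policy coefficient of variation computed here anticipates the empirical analysis of Section 5: it quantifies how strongly individual predictions scatter across training runs before aggregation.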
We give a short review of related literature. Aggregation of predictors is also known as ensembling.
Dietterich (2000a, 2000b) discusses ensembling within various machine learning methods such as decision trees and neural networks. As stated in Dietterich (2000b), successful ensembling crucially relies on the fact that one has to be able to construct multiple suitable predictors. This is usually achieved by injecting randomness either into the data or into the fitting algorithm. In essence, this is what we do in this study; however, our main objective is to reduce randomness in the results of the fitting algorithms because, from a practical perspective, we need uniqueness in insurance pricing. In particular, we prove rates of convergence at which the random element can be diminished. Zhou (2012) and Zhou et al. (2002) study optimal ensembling of multiple predictors. Their proposal leads to a quadratic optimization problem that can be solved with the method of Lagrange. This is related to our task below of finding an optimal meta model using individual uncertainties, which we discuss in Section 5.8, below. Related empirical studies on financial data are given in Di Persio and Honchar (2016), du Jardin (2016) and Wang et al. (2011). These studies conclude that ensembling is very beneficial for improving predictive performance which, indeed, is also one of our main findings.
Organization of manuscript. In the next section we introduce the framework of neural network regression models, and we discuss their calibration within Tweedie's compound Poisson models. In Section 3 we give the theoretical foundation of aggregating predictors; in particular, we prove that aggregating leads to more stability in prediction in terms of a central limit theorem. In Section 4 we implement aggregation within the framework of neural network predictors, providing the nagging predictor. In Section 5 we empirically demonstrate the effectiveness of nagging predictors based on a French car insurance data set. Since nagging predictors involve simultaneously dealing with multiple networks, we also provide a more practical meta network that approximates the nagging predictor at a minimal loss of accuracy.
2. Feed-Forward Neural Network Regression Models
These days neural networks are state-of-the-art for performing complex regression and classification tasks, particularly on unstructured data such as images or text. For a general introduction to neural networks we refer to LeCun et al. (2015), Goodfellow et al. (2016) and the references therein. In the present work, we follow the terminology and notation of Wüthrich (2019). We design feed-forward neural network regression models, referred to in short as networks, to predict insurance claims. In the spirit of Wüthrich (2019), we understand these networks as extensions of generalized linear models (GLMs), the latter being introduced by Nelder and Wedderburn (1972).
2.1. Generic Definition of Feed-Forward Neural Networks
Assume we have independent observations $(Y_i, \boldsymbol{x}_i, v_i)$, $1 \le i \le n$; the variables $Y_i$ describe the responses (here: insurance claims), $\boldsymbol{x}_i \in \mathcal{X} \subseteq \mathbb{R}^{q_0}$ describe the real-valued feature information (also known as covariates, explanatory variables, independent variables or predictors), and $v_i > 0$ are known exposures. The main goal is to find a regression functional $\mu: \mathcal{X} \to \mathbb{R}$ that appropriately describes the expected insurance claims $\mathbb{E}[Y_i]$ as a function of the feature information $\boldsymbol{x}_i$, i.e.,
$$\mathbb{E}[Y_i] = \mu(\boldsymbol{x}_i).$$
Since, typically, the true regression function $\mu$ is unknown, we approximate it by a network regression function. Under a suitable link function choice $g$, we assume that $\mu$ can be described by the following network of depth $d \in \mathbb{N}$,
$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = g^{-1} \left( \beta_0 + \left\langle \boldsymbol{\beta}, \left( z^{(d)} \circ \cdots \circ z^{(1)} \right)(\boldsymbol{x}) \right\rangle \right), \tag{1}$$
where $\langle \cdot, \cdot \rangle$ is the scalar product in Euclidean space $\mathbb{R}^{q_d}$, the operation $\circ$ gives a composition of hidden network layers $z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}$, $1 \le m \le d$, of dimensions $q_m \in \mathbb{N}$, and $(\beta_0, \boldsymbol{\beta}) \in \mathbb{R}^{q_d + 1}$ is the readout parameter. For a given non-linear activation function $\phi: \mathbb{R} \to \mathbb{R}$, the $m$-th hidden network layer is a mapping
$$\boldsymbol{z} \mapsto z^{(m)}(\boldsymbol{z}) = \left( z_1^{(m)}(\boldsymbol{z}), \ldots, z_{q_m}^{(m)}(\boldsymbol{z}) \right),$$
with hidden neurons $z_j^{(m)}(\boldsymbol{z})$, $1 \le j \le q_m$, being described by ridge functions
$$z_j^{(m)}(\boldsymbol{z}) = \phi \left( w_{j,0}^{(m)} + \left\langle \boldsymbol{w}_j^{(m)}, \boldsymbol{z} \right\rangle \right),$$
for network parameters $(w_{j,0}^{(m)}, \boldsymbol{w}_j^{(m)}) \in \mathbb{R}^{q_{m-1}+1}$. This network regression function has a network parameter $\vartheta$ (collecting all weights $w_{j,0}^{(m)}, \boldsymbol{w}_j^{(m)}$ and the readout $\beta_0, \boldsymbol{\beta}$) of dimension $r = \sum_{m=1}^{d} q_m (q_{m-1} + 1) + q_d + 1$.
Based on so-called universality theorems, see, for example, Cybenko (1989) and Hornik et al. (1989), we know that networks are very flexible and can approximate any continuous and compactly supported regression function arbitrarily well if one allows for a sufficiently complex network architecture. In practical applications this means that one has to choose sufficiently many hidden layers with sufficiently many hidden neurons to be assured of adequate approximation capacity, and then one can work with this network architecture as a replacement of the unknown regression function. This complex network is calibrated to (learning) data using the gradient descent algorithm. To prevent the network from over-fitting to the learning data, one exercises early stopping of this calibration algorithm by applying a given stopping rule. Stopping the algorithm early, before convergence, typically implies that one cannot expect to receive a "unique best" model; in fact, the early stopped calibration depends on the starting point of the gradient descent algorithm and, since each run usually has a different starting point, one receives a different solution each time early-stopped gradient descent is run. As described in Section 3.3.1 of Wüthrich (2019), this results exactly in the problem of having infinitely many equally good models for a fixed stopping rule, and it is not clear which particular solution should be chosen, say, for insurance pricing. This is the main point on which we elaborate in this work, and we come back to this discussion after Equation (7), below.
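To make the generic definition concrete, the following minimal numpy sketch (our own illustration, not the architecture calibrated in this paper) evaluates a feed-forward network of depth $d = 2$ with tanh activation and an exponential readout, i.e., a log-link choice for $g$ that guarantees positive mean predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_layer(z, W, b, phi=np.tanh):
    # one hidden layer: ridge functions phi(bias + <w_j, z>) for each neuron j
    return phi(z @ W + b)

def network(x, layers, readout):
    # composition of hidden layers followed by the readout under a log-link
    z = x
    for W, b in layers:
        z = hidden_layer(z, W, b)
    beta0, beta = readout
    return np.exp(beta0 + z @ beta)   # exp = g^{-1} keeps the means positive

# a depth d = 2 network with layer dimensions (q0, q1, q2) = (4, 8, 5)
q0, q1, q2 = 4, 8, 5
layers = [(rng.normal(size=(q0, q1)), rng.normal(size=q1)),
          (rng.normal(size=(q1, q2)), rng.normal(size=q2))]
readout = (0.0, 0.1 * rng.normal(size=q2))

x = rng.normal(size=(3, q0))          # feature vectors of three policies
mu_hat = network(x, layers, readout)  # positive mean predictions
print(mu_hat)
```

Here the randomly drawn `layers` and `readout` play the role of one particular (seed-dependent) network parameter; two different draws give two different, equally structured regression functions.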
2.2. Tweedie’s Compound Poisson Model
We assume that the response $Y_i$ in Equation (1) belongs to the family of Tweedie's compound Poisson (CP) models, see Tweedie (1984), having a density of the form
$$Y_i \sim f(y_i; \theta_i, v_i/\varphi) = \exp \left\{ \frac{y_i \theta_i - \kappa_p(\theta_i)}{\varphi / v_i} + a(y_i; v_i/\varphi) \right\}, \tag{2}$$
with $p \in [1,2]$ being the power variance (hyper-)parameter, $\theta_i \in \Theta$ the canonical parameter, $\varphi > 0$ the dispersion parameter, and $a(\cdot;\cdot)$ the normalization. Tweedie's CP models belong to the exponential dispersion family (EDF), the latter being more general because it allows for more general cumulant functions $\kappa$; we refer to Jørgensen (1986, 1987). We focus here on Tweedie's CP family and not on the entire EDF because for certain calculations we need explicit properties of cumulant functions. However, similar results can be derived for other members of the EDF. We mention that the effective domain $\Theta \subseteq \mathbb{R}$ is a convex set, and that the cumulant function $\kappa_p$ is smooth and strictly convex in the interior of the effective domain, having the following explicit form
$$\kappa_p(\theta) = \frac{1}{2-p} \big( (1-p)\, \theta \big)^{\frac{2-p}{1-p}}, \qquad \text{for } p \in (1,2). \tag{3}$$
The boundary case $p = 1$ is the Poisson model (with cumulant function $\kappa_1(\theta) = e^\theta$), $p = 2$ is the gamma model, and for $p \in (1,2)$ we receive compound Poisson models with i.i.d. gamma claim sizes having shape parameter $(2-p)/(p-1)$, see Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002). The first two moments are given by
$$\mathbb{E}[Y_i] = \kappa_p'(\theta_i) = \mu_i \qquad \text{and} \qquad \mathrm{Var}(Y_i) = \frac{\varphi}{v_i}\, \kappa_p''(\theta_i) = \frac{\varphi}{v_i}\, V(\mu_i), \tag{4}$$
with power variance function $\mu \mapsto V(\mu) = \mu^p$ for $p \in [1,2]$.
Remark 1. In this paper we work under Tweedie's CP models characterized by cumulant functions of the form Equation (3) because we need certain (explicit) properties of $\kappa_p$ in the proofs of the statements below. For regression modeling one usually starts from a bigger class of models, namely, the EDF, which only requires that the cumulant function is smooth and strictly convex on the interior of the corresponding effective domain $\Theta$, see Jørgensen (1986, 1987). A sub-family of the EDF is the so-called Tweedie's distributions which have a cumulant function that allows for a power mean-variance relationship Equation (4) for any $p \notin (0,1)$, see Table 1 in Jørgensen (1987); for instance, $p = 0$ is the Gaussian model, $p = 1$ is the Poisson model, $p = 2$ is the gamma model and $p = 3$ is the inverse Gaussian model. As mentioned, the models generated by cumulant function Equation (3) cover the interval $p \in (1,2)$ and correspond to compound Poisson models with i.i.d. gamma claim sizes; we also refer to Delong et al. (2020). Relating Equation (3) to network regression function Equation (1) implies that we aim at modeling the canonical parameter
$\theta_i$ of policy $i$ by
$$\theta_i = h\big( \mu(\boldsymbol{x}_i) \big) = h \circ g^{-1} \left( \beta_0 + \left\langle \boldsymbol{\beta}, \left( z^{(d)} \circ \cdots \circ z^{(1)} \right)(\boldsymbol{x}_i) \right\rangle \right), \tag{5}$$
where $h = (\kappa_p')^{-1}$ is the canonical link, in contrast to the general link function $g$. If we choose the canonical link for $g$ then $h \circ g^{-1}$ provides the identity function in Equation (5). Network parameter $\vartheta$ is estimated with maximum likelihood estimation (MLE). This is achieved either by maximizing the log-likelihood function or by minimizing the corresponding deviance loss function. We use the framework of the deviance loss function here because it is more closely related to the philosophy of minimizing an objective function in gradient descent methods. The average deviance loss for independent random variables $Y_1, \ldots, Y_n$ is under Tweedie's CP model assumption given by
$$\mathcal{L}(\boldsymbol{Y}, \boldsymbol{\mu}) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, d(Y_i, \mu_i), \tag{6}$$
for given data $(Y_i, \boldsymbol{x}_i, v_i)$, $1 \le i \le n$. The canonical parameter is given by $\theta_i = h(\mu_i)$ if we understand it as a function defined through Equation (5). Naturally we have for any mean parameter $\mu_i$ of any policy $i$
$$d(Y_i, \mu_i) = 2 \Big( Y_i\, h(Y_i) - \kappa_p\big( h(Y_i) \big) - Y_i\, h(\mu_i) + \kappa_p\big( h(\mu_i) \big) \Big) \;\ge\; 0, \tag{7}$$
because these terms subtract twice the log-likelihood of the $\mu_i$-parametrized model from its saturated counterpart. $d(Y_i, \mu_i)$ is called the unit deviance of $Y_i$ w.r.t. mean parameter $\mu_i$.
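For $p \in (1,2)$, inserting the canonical link $h(\mu) = \mu^{1-p}/(1-p)$ and cumulant function Equation (3) gives the unit deviance and the average deviance loss in closed form. The following sketch is our own illustration, with exposures $v_i$ and dispersion $\varphi$ set to one.

```python
import numpy as np

def tweedie_unit_deviance(y, mu, p):
    """Unit deviance d(y, mu) for Tweedie's CP models with power variance
    parameter 1 < p < 2 (the Poisson p = 1 and gamma p = 2 boundary cases
    are limits of this formula and need their own expressions)."""
    return 2.0 * (np.maximum(y, 0.0) ** (2 - p) / ((1 - p) * (2 - p))
                  - y * mu ** (1 - p) / (1 - p)
                  + mu ** (2 - p) / (2 - p))

def average_deviance_loss(y, mu, p):
    # portfolio objective minimized by gradient descent
    # (exposures v_i and dispersion set to one for this illustration)
    return np.mean(tweedie_unit_deviance(y, mu, p))

y = np.array([0.0, 0.5, 2.0])    # observed claims (a point mass in zero is allowed)
mu = np.array([0.3, 0.6, 1.5])   # candidate mean predictions
print(average_deviance_loss(y, mu, p=1.5))
```

The unit deviance vanishes exactly at $y = \mu$ and is strictly positive otherwise, which is the non-negativity property stated above.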
The gradient descent algorithm now tries to make the objective function small through optimizing network parameter $\vartheta$. This is done globally, i.e., simultaneously on the entire portfolio $1 \le i \le n$. Naturally, two different parameters $\vartheta_1 \ne \vartheta_2$ with similar average deviance losses may provide very different models on an individual policy level $i$. This is exactly the point raised in the previous section, namely, that early stopping in model calibration w.r.t. a global objective function on observations $1 \le i \le n$ may provide equally good models on that portfolio level, but they may be very different on an individual policy level (if the gradient descent algorithm has not converged to the same extremal point of the loss function). The goal of this paper is to study such differences on an individual policy level.
3. Aggregating Predictors
We introduce aggregating in this section. This can be described on one single policy $i$. Assume that $\hat{\mu}_i$ is a predictor for response $Y_i$ (and an estimator for mean parameter $\mu_i$). In general, we assume that predictor $\hat{\mu}_i$ and response $Y_i$ are independent. This independence reflects that we perform an out-of-sample analysis, meaning that the mean parameter has been estimated on a learning data set that is disjoint from $Y_i$, and we aim at performing a generalization analysis by considering the average loss described by the expected unit deviance (subject to existence)
$$\mathbb{E}\big[ d(Y_i, \hat{\mu}_i) \big]. \tag{8}$$
We typically assume that the predictor is chosen such that this expected generalization loss is finite. In the following statements we drop the policy index $i$ to simplify notation.
Proposition 1. Choose response $Y$ with power variance parameter $p \in [1,2]$ and canonical parameter $\theta \in \Theta$. Assume $\hat{\mu}$ is an unbiased estimator for the mean parameter $\mu = \kappa_p'(\theta)$, being independent of $Y$, and additionally satisfying $c \le \hat{\mu} \le \frac{p}{p-1}\, \mu$, a.s., for some $c > 0$ (the upper bound being read as $+\infty$ in the Poisson case $p = 1$). We have expected generalization loss
$$\mathbb{E}\big[ d(Y, \hat{\mu}) \big] \;\ge\; \mathbb{E}\big[ d(Y, \mu) \big].$$
We remark that the bounds on $\hat{\mu}$ ensure that Equation (8) is finite; moreover, the upper bound is needed to ensure that the predictors lie in the domain on which the expected unit deviances are convex functions in $m$. This is needed in the following proofs.
Proof. We calculate the expected generalization loss received by the mean of the unit deviance
$$\mathbb{E}\big[ d(Y, \hat{\mu}) \big] = \mathbb{E}\Big[ \mathbb{E}\big[ d(Y, \hat{\mu}) \,\big|\, \hat{\mu} \big] \Big] = 2\, \mathbb{E}\big[ Y h(Y) - \kappa_p(h(Y)) \big] - 2\, \mathbb{E}\big[ h_p(\hat{\mu}) \big],$$
where in the last step we have used independence between $Y$ and $\hat{\mu}$, and where we use function
$$m \mapsto h_p(m) = \mu\, h(m) - \kappa_p\big( h(m) \big) = \mu\, \frac{m^{1-p}}{1-p} - \frac{m^{2-p}}{2-p}, \tag{9}$$
with the usual limits understood in the Poisson case $p = 1$. We calculate the second derivative of this function. For $p \in [1,2]$, it is given by
$$h_p''(m) = -p\, \mu\, m^{-p-1} + (p-1)\, m^{-p} = -m^{-p-1} \big[ p\, \mu - (p-1)\, m \big] \;\le\; 0,$$
where for the last inequality we have used that the square bracket is non-negative under our assumptions. This implies that $h_p$ is a concave function, and applying Jensen's inequality we obtain
$$\mathbb{E}\big[ d(Y, \hat{\mu}) \big] \;\ge\; 2\, \mathbb{E}\big[ Y h(Y) - \kappa_p(h(Y)) \big] - 2\, h_p\big( \mathbb{E}[\hat{\mu}] \big) = \mathbb{E}\big[ d(Y, \mu) \big],$$
where in the last step we have used unbiasedness of estimator $\hat{\mu}$. This finishes the proof. □
Proposition 1 tells us that the estimated model $\hat{\mu}$ has an expected generalization loss Equation (8) which is bounded below by the one of the true model mean $\mu$ of $Y$. Using aggregating we now try to come as close as possible to this lower bound. Breiman (1996) has analyzed this question in terms of the square loss function, and further results under the square loss function are given in Bühlmann and Yu (2002). We prefer to work with the deviance loss function here because this is the objective function used for fitting in the gradient descent algorithm. For this reason we prove the subsequent results; however, in their deeper nature these results are equivalent to the ones in Breiman (1996) and Bühlmann and Yu (2002).
Assume that $\hat{\mu}^{(1)}, \hat{\mu}^{(2)}, \ldots$ are i.i.d. copies of unbiased predictor $\hat{\mu}$. We define the aggregated predictor, for $M \ge 1$,
$$\bar{\mu}^{(M)} = \frac{1}{M} \sum_{j=1}^{M} \hat{\mu}^{(j)}. \tag{10}$$
Proposition 2. Assume that $\hat{\mu}^{(j)}$, $j \ge 1$, are i.i.d. copies of $\hat{\mu}$ satisfying the assumptions of Proposition 1, and being all independent from $Y$. We have for all $M \ge 1$
$$\mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M+1)} \big) \big] \;\le\; \mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M)} \big) \big] \qquad \text{and} \qquad \mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M)} \big) \big] \;\ge\; \mathbb{E}\big[ d(Y, \mu) \big].$$
Proof. The last bound is immediately clear because the aggregated predictors themselves fulfill the assumptions of Proposition 1, and hence the corresponding statement. Thus, we focus on the first inequality for $M \ge 1$. Consider the decomposition of the aggregated predictor, for $1 \le j \le M+1$,
$$\bar{\mu}^{(M+1)} = \frac{1}{M+1} \sum_{j=1}^{M+1} \bar{\mu}^{(M,-j)}, \qquad \text{with } \bar{\mu}^{(M,-j)} = \frac{1}{M} \sum_{k \ne j} \hat{\mu}^{(k)}.$$
The predictors $\bar{\mu}^{(M,-j)}$, $1 \le j \le M+1$, are copies of $\bar{\mu}^{(M)}$, though not independent ones. We have, using function $h_p$ defined in Equation (9),
$$\mathbb{E}\Big[ h_p\big( \bar{\mu}^{(M+1)} \big) \Big] = \mathbb{E}\left[ h_p\left( \frac{1}{M+1} \sum_{j=1}^{M+1} \bar{\mu}^{(M,-j)} \right) \right] \;\ge\; \frac{1}{M+1} \sum_{j=1}^{M+1} \mathbb{E}\Big[ h_p\big( \bar{\mu}^{(M,-j)} \big) \Big] = \mathbb{E}\Big[ h_p\big( \bar{\mu}^{(M)} \big) \Big],$$
where the inequality follows from applying Jensen's inequality to the concave function $h_p$, and the last identity follows from the fact that $\bar{\mu}^{(M,-j)}$, $1 \le j \le M+1$, are copies of $\bar{\mu}^{(M)}$. This finishes the proof. □
Proposition 2 says that aggregation works, i.e., aggregating i.i.d. predictors Equation (10) leads to a monotonically decreasing expected generalization loss. Moreover, notice that the i.i.d. assumption can be relaxed: indeed, it is sufficient that every $\bar{\mu}^{(M,-j)}$ in the above proof has the same distribution as $\bar{\mu}^{(M)}$. This does not require independence between the predictors $\hat{\mu}^{(j)}$, $j \ge 1$; exchangeability is sufficient.
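Propositions 1 and 2 can be checked numerically. The following sketch is our own toy illustration with $p = 3/2$ and true mean $\mu = 1$: for this choice the $m$-dependent part of the expected unit deviance is $-2 h_p(m) = 4(\mu/\sqrt{m} + \sqrt{m})$, and averaging more i.i.d. unbiased predictors pushes the Monte Carlo generalization loss monotonically down towards the loss of the true mean.

```python
import numpy as np

rng = np.random.default_rng(2)

MU = 1.0   # true mean parameter of the single policy under consideration

def expected_loss(m):
    # m-dependent part of E[d(Y, m)] under Tweedie with p = 3/2 and true
    # mean MU; the part of the expected deviance not depending on m is
    # dropped, so the minimal value is expected_loss(MU)
    return 4.0 * (MU / np.sqrt(m) + np.sqrt(m))

n_sim = 100_000
# i.i.d. unbiased predictors, bounded away from 0 as in Proposition 1
preds = rng.uniform(0.5, 1.5, size=(n_sim, 10))

losses = []
for M in (1, 2, 5, 10):
    agg = preds[:, :M].mean(axis=1)          # aggregated predictor
    losses.append(expected_loss(agg).mean()) # Monte Carlo generalization loss
    print(M, round(losses[-1], 4))

print("lower bound:", expected_loss(MU))     # loss of the true mean
```

The printed losses decrease in $M$ but never drop below `expected_loss(MU)`, which is exactly the content of Proposition 2.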
Proposition 3. Assume that $\hat{\mu}^{(j)}$, $j \ge 1$, are i.i.d. copies of $\hat{\mu}$ satisfying the assumptions of Proposition 1, and being all independent from $Y$. In the Poisson case we additionally assume that the sequence of aggregated predictors Equation (10) has a uniformly integrable upper bound. We have
$$\lim_{M \to \infty} \mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M)} \big) \big] = \mathbb{E}\big[ d(Y, \mu) \big].$$
Proof. We have the identity
$$\mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M)} \big) \big] = 2\, \mathbb{E}\big[ Y h(Y) - \kappa_p(h(Y)) \big] - 2\, \mathbb{E}\Big[ h_p\big( \bar{\mu}^{(M)} \big) \Big];$$
thus, it suffices to consider the last term. The law of large numbers implies a.s. convergence
$$\lim_{M \to \infty} h_p\big( \bar{\mu}^{(M)} \big) = h_p(\mu),$$
because we have i.i.d. unbiased predictors $\hat{\mu}^{(j)}$, $j \ge 1$ (in particular, we have consistency). Thus, it suffices to provide a uniformly integrable bound and then the claim follows from Lebesgue's dominated convergence theorem. Note that by assumption we have uniform bounds $c \le \bar{\mu}^{(M)} \le \frac{p}{p-1}\, \mu$, a.s., which proves the claim for $p \in (1,2]$. Thus, there remains the Poisson case $p = 1$. In the Poisson case we have $h_1(m) = \mu \log m - m$. The leading term of this function is linear for $m \to \infty$; hence, the uniformly integrable upper bound assumption on the sequence Equation (10) provides the proof. □
The previous statement is based on the law of large numbers. Of course, we can also study a central limit theorem (CLT) that provides asymptotic normality and the rate of convergence. For the aggregated predictors we have convergence in distribution, as $M \to \infty$,
$$\sqrt{M}\, \frac{\bar{\mu}^{(M)} - \mu}{\sigma} \;\Rightarrow\; \mathcal{N}(0, 1), \qquad \text{with } \sigma^2 = \mathrm{Var}(\hat{\mu}), \tag{11}$$
noting that, in the Poisson case $p = 1$, we need to assume, in addition to the assumptions of Proposition 1, that the second moment of $\hat{\mu}$ is finite. Consider the expected generalization loss function, for a given mean estimate $m$ independent of $Y$,
$$m \mapsto \mathcal{E}(m) = \mathbb{E}\big[ d(Y, m) \big] = 2\, \mathbb{E}\big[ Y h(Y) - \kappa_p(h(Y)) \big] - 2\, h_p(m),$$
where function $h_p$ was defined in Equation (9). Thus, for a CLT of the expected generalization loss it suffices to understand the asymptotic behavior of $h_p(\bar{\mu}^{(M)})$, because we have, using the tower property for conditional expectation and using independence between $Y$ and $\bar{\mu}^{(M)}$,
$$\mathbb{E}\big[ d\big( Y, \bar{\mu}^{(M)} \big) \big] = \mathbb{E}\Big[ \mathcal{E}\big( \bar{\mu}^{(M)} \big) \Big].$$
Proposition 4. Assume that $\hat{\mu}^{(j)}$, $j \ge 1$, are i.i.d. copies of $\hat{\mu}$ satisfying the assumptions of Proposition 1, and all being independent from $Y$. In the Poisson case we additionally assume that $\hat{\mu}$ has finite second moment. We have, as $M \to \infty$,
$$M \left( h_p\big( \bar{\mu}^{(M)} \big) - h_p(\mu) \right) \;\Rightarrow\; \frac{h_p''(\mu)\, \sigma^2}{2}\, \chi^2_1,$$
where $\chi^2_1$ denotes a chi-squared distributed random variable with one degree of freedom. This shows how the speed of convergence of aggregated predictors translates to the speed of convergence of deviance loss functions; in particular, from Taylor's expansion we receive a first order term of order $1/M$. Basically, this is Theorem 5.2 in Lehmann (1983); we provide a proof because it is instructive.
Proof. We have Taylor expansion (using the Lagrange form for the remainder, and using $h_p'(\mu) = 0$)
$$h_p\big( \bar{\mu}^{(M)} \big) = h_p(\mu) + \frac{h_p''(m)}{2} \big( \bar{\mu}^{(M)} - \mu \big)^2,$$
for some $m$ between $\mu$ and $\bar{\mu}^{(M)}$. This provides
$$M \left( h_p\big( \bar{\mu}^{(M)} \big) - h_p(\mu) \right) = \frac{h_p''(\mu)}{2} \Big( \sqrt{M} \big( \bar{\mu}^{(M)} - \mu \big) \Big)^2 + \frac{h_p''(m) - h_p''(\mu)}{2} \Big( \sqrt{M} \big( \bar{\mu}^{(M)} - \mu \big) \Big)^2.$$
The first term on the right-hand side converges in distribution to $\frac{h_p''(\mu) \sigma^2}{2} \chi^2_1$ as $M \to \infty$, because $\sqrt{M}(\bar{\mu}^{(M)} - \mu)/\sigma$ converges in distribution to the standard Gaussian distribution by the CLT Equation (11), and squaring is continuous. Therefore, the claim follows by proving that the last term converges in probability to zero. Consider the event $A_M(\delta) = \big\{ | \bar{\mu}^{(M)} - \mu | \le \delta \big\}$ for $\delta > 0$. CLT Equation (11) implies $\lim_{M \to \infty} \mathbb{P}[A_M(\delta)] = 1$ for any $\delta > 0$. Choose $\varepsilon > 0$ and, by continuity of $h_p''$, a $\delta > 0$ such that $| h_p''(m) - h_p''(\mu) | < \varepsilon$ for all $| m - \mu | \le \delta$. For the last term on the right-hand side of the last identity we have, on the event $A_M(\delta)$, the bound
$$\left| \frac{h_p''(m) - h_p''(\mu)}{2} \right| \Big( \sqrt{M} \big( \bar{\mu}^{(M)} - \mu \big) \Big)^2 \;\le\; \frac{\varepsilon}{2} \Big( \sqrt{M} \big( \bar{\mu}^{(M)} - \mu \big) \Big)^2,$$
and the right-hand side is stochastically bounded by the CLT Equation (11). Since $\varepsilon > 0$ was arbitrary, the last term converges in probability to zero. This finishes the proof. □
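Proposition 4 implies that the excess expected generalization loss decays at rate $1/M$. A quick Monte Carlo check in the same toy setting as before (our own illustration: $p = 3/2$, true mean $1$, uniform predictors on $(0.5, 1.5)$) multiplies the excess loss by $M$; the product stabilizes near half the curvature of the expected loss at the true mean times the predictor variance, here $\sigma^2 = 1/12 \approx 0.083$.

```python
import numpy as np

rng = np.random.default_rng(4)

MU, SIGMA2 = 1.0, 1.0 / 12.0   # predictor mean and variance for U(0.5, 1.5)

def expected_loss(m):
    # m-dependent part of E[d(Y, m)], Tweedie p = 3/2, true mean MU
    return 4.0 * (MU / np.sqrt(m) + np.sqrt(m))

n_sim = 50_000
scaled = []
for M in (10, 40, 160):
    agg = rng.uniform(0.5, 1.5, size=(n_sim, M)).mean(axis=1)
    excess = expected_loss(agg).mean() - expected_loss(MU)
    scaled.append(excess * M)   # should stabilize near SIGMA2 (= 1/12 here)
    print(M, round(scaled[-1], 4))
```

The curvature of `expected_loss` at $m = 1$ equals $2$ in this toy example, so the limiting constant is $2/2 \cdot \sigma^2 = 1/12$, matching the printed values.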
We remark that Propositions 1–4 have been formulated for predictors $\hat{\mu}$ of the mean parameter $\mu$, and a crucial assumption in these propositions is that these predictors are unbiased for mean parameter $\mu$. We could state similar aggregation results on the canonical parameter scale for $\hat{\theta} = h(\hat{\mu})$ and $\theta = h(\mu)$. This would have the advantage that we do not need any side constraints on $\hat{\theta}$ because the cumulant function $\kappa_p$ is convex over the entire effective domain $\Theta$. The drawback of this latter approach is that typically $\hat{\theta}$ is not unbiased for $\theta$, in particular, if we choose the canonical link in a non-Gaussian situation. In general, we can only expect asymptotic unbiasedness in this latter situation.
6. Conclusions and Outlook
This work has defined the nagging predictor, which produces accurate and stable portfolio predictions on the basis of random network calibrations, and it has provided convergence results in the context of Tweedie's compound Poisson generalized linear models. Focusing on an example in motor third-party liability insurance pricing, we have shown that stable portfolio results are achieved after 20 network training runs. By increasing the number of network training runs to 400, we have shown that quite stable results are produced at the level of individual policies, which is an important requirement for the use of networks for insurance pricing, and for more general actuarial tasks. The coefficient of variation of the nagging predictor is shown to be a useful data-driven metric for measuring the relative difficulty with which a network is able to fit individual training examples, and we have used it to calibrate an accurate meta network which approximates the nagging predictor. Whereas this work has examined the stability of network predictions for a portfolio at a single point in time, another important aspect of consistency within insurance is stable pricing over time; thus, future work could consider methods for stabilizing network predictions as new information becomes available.