*Article* **Machine Learning in Least-Squares Monte Carlo Proxy Modeling of Life Insurance Companies**

#### **Anne-Sophie Krah 1,***∗***, Zoran Nikoli´c <sup>2</sup> and Ralf Korn 1,3**


Received: 30 December 2019; Accepted: 12 February 2020; Published: 21 February 2020

**Abstract:** Under the Solvency II regime, life insurance companies are asked to derive their solvency capital requirements from the full loss distributions over the coming year. Since the industry is currently far from being endowed with sufficient computational capacities to fully simulate these distributions, the insurers have to rely on suitable approximation techniques such as the least-squares Monte Carlo (LSMC) method. The key idea of LSMC is to run only a few wisely selected simulations and to process their output further to obtain a risk-dependent proxy function of the loss. In this paper, we present and analyze various adaptive machine learning approaches that can take over the proxy modeling task. The studied approaches range from ordinary and generalized least-squares regression variants over generalized linear model (GLM) and generalized additive model (GAM) methods to multivariate adaptive regression splines (MARS) and kernel regression routines. We justify the combinability of their regression ingredients in a theoretical discourse. Further, we illustrate the approaches in slightly disguised real-world experiments and perform comprehensive out-of-sample tests.

**Keywords:** least-squares monte carlo method; machine learning; proxy modeling; life insurance; Solvency II

#### **1. Introduction**

The Solvency II directive of the European Parliament and European Council (2009) requires from insurance companies a derivation of the solvency capital requirement (SCR) using the full probability distributions of losses over a one-year period. Some life insurers comply with this requirement by setting up internal models. Other insurers opt for the much simpler standard formula, which enables an aggregation of the company's exposures to single risks. Lacking an analytical valuation formula for the losses in a one-year period, life insurers with an internal model are supposed to utilize a Monte Carlo approach usually called nested simulations approach (Bauer et al. (2012)). In practice their cash-flow-projection (CFP) models need to be simulated several hundred thousand to several million times for a robust implementation of the nested simulations approach. But the insurers are currently far from being endowed with sufficient computational capacities to perform such expensive simulation tasks. By applying suitable approximation techniques like the least-squares Monte Carlo (LSMC) approach of Bauer and Ha (2015), the insurers are able to overcome these computational hurdles though. For example, they can implement the LSMC framework formalized by Krah et al. (2018) and applied by, for example, Bettels et al. (2014), to derive their full loss distributions. The central idea of this framework is to carry out a comparably small number of wisely chosen nested Monte

Carlo simulations and to feed the simulation results into a supervised machine learning algorithm that translates the results into a proxy function of the insurer's loss (output) with respect to the underlying risk factors (input).

Our starting point is the LSMC framework from Krah et al. (2018). In the following the same approach for the proxy derivation is assumed, we will only amend the calibration and validation steps. Therefore, we neither repeat the simulation setting nor the procedure for the full loss distribution forecast and SCR calculation here in detail. The purpose of this exposition is to introduce different machine learning methods that can be applied in the calibration step of the LSMC framework, to point out their similarities and differences and to compare their out-of-sample performances in the same slightly disguised real-world LSMC example already used in Krah et al. (2018).

We describe the data basis used for calibration and validation in Section 2.1, the structure of the calibration algorithm in Section 2.2 and our validation approach in Section 2.3. Our focus lies on out-of-sample performance rather than computational efficiency as the latter becomes only relevant if the former gives reason for it. We analyze a very realistic data basis with 15 risk factors and validate the proxy functions based on a very comprehensive and computationally expensive nested simulations test set comprising the SCR estimate.

The main idea of our approach is to combine different regression methods with an adaptive algorithm, in which the proxy functions are built up of basis functions in a stepwise fashion. In a four risk factor LSMC example, Teuguia et al. (2014) applied a full model approach, forward selection, backward elimination and a bidirectional approach as, for example, discussed in Hocking (1976) with orthogonal polynomial basis functions. They stated that only forward selection and the bidirectional approach were feasible when the number of risk factors or the polynomial degree exceeded 7, as then the resulting other models exploded. Life insurance companies covering a wide range of contracts in their portfolio are typically exposed to even more risk factors like, for example, 15. Complex business regulation frameworks such as those in Germany cause non-linear dependencies between risk factors and losses, which naturally lead to polynomials of higher degrees in the chosen proxy models. In these cases, even the standard forward selection and bidirectional approaches become infeasible as the sets of candidate terms from which the basis functions are chosen will explode then as well. We therefore follow the suggestion of Krah et al. (2018) to implement the so-called principle of marginality, an iteration-wise update technique of the set of candidate terms that lets the algorithm get along with comparably few carefully selected candidate terms.

Our main contribution is to identify, explain and illustrate a collection of regression methods and model selection criteria from the variety of regression design options that provide suitable proxy functions in the LSMC framework when applied in combination with the principle of marginality. After some general remarks in Section 3.1, we describe ordinary least-squares (OLS) regression in Section 3.2, generalized linear models (GLMs) by Nelder and Wedderburn (1972) in Section 3.3, generalized additive models (GAMs) by Hastie and Tibshirani (1986) and Hastie and Tibshirani (1990) in Section 3.4, feasible generalized least-squares (FGLS) regression in Section 3.5, multivariate adaptive regression splines (MARS) by Friedman (1991) in Section 3.6, and kernel regression by Watson (1964) and Nadaraya (1964) in Section 3.7. While some regression methods such as OLS and FGLS regression or GLMs can immediately be applied in conjunction with numerous model selection criteria such as Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallow's *CP* or generalized cross-validation (GCV), other regression methods such as GAMs, MARS, kernel, ridge or robust regression require well thought-through modifications thereof or work only with non-parametric alternatives such as *k*-fold or leave-one-out cross-validation. For adaptive approaches of FGLS, ridge and robust regression in life insurance proxy modeling, see also Hartmann (2015), Krah (2015) and Nikoli´c et al. (2017), respectively.

In the theory sections, we present the models with their assumptions, important properties and popular estimation algorithms and demonstrate how they can be embedded in the adaptive algorithm by proposing feasible implementation designs and combinable model selection criteria. While we shed light on the theoretical basic concepts of the models to lay the groundwork for the application and interpretation of the later following numerical experiments, we forego describing in detail technical enhancements or peculiarities of the involved algorithms and instead refer the interested reader to further sources. Additionally we provide the practicioners with R packages containing useful implementations of the presented regression routines. We complement the theory sections by corresponding empirical results in Section 4, throughout which we perform the same Monte Carlo approximation task to make the performance of the various methods comparable. We measure the approximation quality of the resulting proxy functions by means of aggregated validation figures on three out-of-sample test sets.

Conceivable alternatives to the entire adaptive algorithm are other typical machine learning techniques such as artificial neural networks (ANNs), decision tree learning or support vector machines. In particular, the classical feed forward networks proposed by Hejazi and Jackson (2017) and applied in various ways by Kopczyk (2018), Castellani et al. (2018), Born (2018) and Schelthoff (2019) were shown to capture the complex nature of CFP models well. A major challenge here is not only to find reliable hyperparameters such as the numbers of hidden layers and nodes in the network, batch size, weight initializer probability distribution, learning rate or activation functions but also the high dependence on the random seeds. We plan to contribute to this in a further publication which will be dedicated to hyperparameter search algorithms and stabilization methods such as ensemble methods. As an alternative to feed forward networks, Kazimov (2018) suggested to use radial basis function networks albeit so far none of the tested approaches performed better than the ordinary least squares regression in Krah et al. (2018).

In decision tree learning, random forests and tree-based gradient boosting machines were considered by Kopczyk (2018) and Schoenenwald (2019). While random forests were outperformed by feed forward networks but did better than the least absolute shrinkage and selection operator (LASSO) by Tibshirani (1996) in the example of the former author, they generally performed worse than the adaptive approaches by Krah et al. (2018) with OLS regression in numerous examples of the latter author. The gradient boosting machines, requiring more parameter tuning and thus being more versatile and demanding, came overall very close to the adaptive approaches.

Castellani et al. (2018) compared support vector regression (SVR) by Drucker et al. (1997) to ANNs and the adaptive approaches by Teuguia et al. (2014) in a seven risk factor example and found the performance of SVR placed somewhere inbetween the other two approaches with the ANNs getting closest to the nested simulations benchmark. As some further non-parametric approaches, Sell (2019) tested least-squares support-vector machines (LS-SVM) by Suykens and Vandewalle (1999) and shrunk additive least-squares approximations (SALSA) by Kandasamy and Yu (2016) in comparison to ANNs and the adaptive approaches by Krah et al. (2018) with OLS regression. In his examples, SALSA was able to beat the other two approaches whereas LS-SVM was left far behind. The analyzed machine learning alternatives have in common that they require at least to some degree a fine-tuning of some model hyperparameters. Since this is often a non-trivial but crucial task for generating suitable proxy functions, finding efficient and reliable search algorithms should become a subject of future research.

#### **2. Calibration and Validation in the LSMC Framework**

#### *2.1. Fitting and Validation Points*

#### 2.1.1. Outer Scenarios and Inner Simulations

Our starting point is the LSMC approach (Krah et al. (2018)). LSMC proxy functions are calibrated conditional on the *fitting points* generated by the Monte Carlo simulations of the CFP model. Additional out-of-sample *validation points* serve as a mean for an assessment of the goodness-of-fit. The explaining variables of a proxy function are financial and actuarial risks the insurance company is exposed to. Examples for these risks are changes in interest rates, equity, credit, mortality, morbidity, lapse and expense levels over the one-year period. The dependent variable is an economic variable like the

available capital, loss of available capital or best estimate of liabilites over the one-year period. Figure 1 plots the fitting values of an exemplary economic variable with respect to a financial risk factor. By an *outer scenario* we refer to a specific realized stress level combination of these risk factors over one year, and by an *inner simulation* to a stochastic path of an outer scenario in the CFP model under the given risk-neutral probability measure. Each outer scenario is assigned the probability weighted mean value of the economic variable over the corresponding inner simulations. In the LSMC context the fitting values are the mean values over only few inner simulations whereas the validation values are derived as the mean values over many inner simulations.

**Figure 1.** Fitting values of best estimate of liabilities with respect to a financial risk factor.

#### 2.1.2. Different Trade-Off Requirements

According to the law of large numbers, this construction makes the validation values comparably stable while the fitting values are very volatile. Typically, the very limited fitting and validation simulation budgets are of similar sizes. Hence the few inner simulations in the case of the fitting points allow a great diversification among the outer scenarios whereas the many inner simulations in the case of the validation points let the validation values be quite close to their expectations but at the cost of only little diversification among the outer scenarios. These opposite ways to deal with the trade-off between the numbers of outer scenarios and inner simulations reflect the different requirements for the fitting and validation points in the LSMC approach. While the fitting scenarios should cover the domain of the real-world scenarios well to serve as a good regression basis, the validation values should approximate the expectations of the economic variable at the validation scenarios well to provide appropriate target values for the proxy functions.

#### *2.2. Calibration Algorithm*

#### 2.2.1. Five Major Components

The calibration of the proxy function is performed by an adaptive algorithm that can be decomposed into the following five major components: (1) a set of allowed basis function types for the proxy function, (2) a regression method, (3) a model selection criterion, (4) a candidate term update principle, and (5) the number of steps per iteration and the directions of the algorithm. For illustration, we adopt the flowchart of the adaptive algorithm from Krah et al. (2018) and depict it in Figure 2. While components (1) and (5) enter the flowchart implicitly through the start proxy, candidate terms and the order of the processes and decisions in the chart, components (2), (3) and (4) are explicitly indicated through the labels "Regression", "Model Selection Criterion" and "Get Candidate Terms".

**Figure 2.** Flowchart of the calibration algorithm.

Let us briefly recapitulate the choice of components (1)–(5) from the successful applications of the adaptive algorithm in the insurance industry as described in Krah et al. (2018). As the function types for the basis functions (1), let only monomials be allowed. Let the regression method (2) be ordinary least-squares (OLS) regression and the model selection criterion (3) Akaike information criterion (AIC) from Akaike (1973). Let the set of candidate terms (4) be updated by the principle of marginality to which we will return in greater detail below. Lastly, when building up the proxy function iteratively, let the algorithm make only one step per iteration in the forward direction (5) meaning that in each iteration exactly one basis function is selected which cannot be removed anymore (adaptive forward stepwise selection).

#### 2.2.2. Iterative Procedure

The algorithm starts in the upper left side of Figure 2 with the specification of the start proxy basis functions. We specify only the intercept so that the first regression (*k* = 0) reduces to averaging over all fitting values. In order to harmonize the choices of OLS regression and AIC, we assume that the errors are normally distributed and homoscedastic because then the OLS estimator coincides with the maximum likelihood estimator. AIC is a relative measure for the goodness-of-fit of the proxy function and is defined as twice the negative of the maximum log-likelihood plus twice the number of degrees of freedom. The smaller the AIC score, the better the fit, and thus the trade-off between a too complex (overfitting) and too simple model (underfitting).

At the beginning of each iteration (*k* = 1, ... , *K* − 1), the set of candidate terms is updated by the *principle of marginality* which stipulates that a monomial basis function becomes a candidate if and only if all its derivatives are already included in the proxy function. The choice of a monomial basis is compatible to the principle of marginality. Using such a principle saves computational costs by selecting the basis functions conditionally on the current proxy function structure. In the first iteration (*k* = 1), all linear monomials of the risk factors become candidates as their derivatives are constant values which are represented by the intercept.

The algorithm proceeds on the lower left side of the flowchart with a loop in which all candidate terms are separately added to the proxy function structure and tested with regard to their additional explanatory power. With each candidate, the fitting values are regressed against the fitting scenarios and the AIC score is calculated. If no candidate reduces the currently smallest AIC score, the algorithm terminates, and otherwise, the proxy function is updated by the one which reduces AIC most. Then the next iteration (*k* + 1) begins with the update of the set of candidate terms, and so on. As long as no termination occurs, this procedure is repeated until the prespecified maximum number of terms *K*max is reached.

#### *2.3. Validation Figures*

#### 2.3.1. Validation Sets

Since it is the objective of this paper to propose suitable regression methods for the proxy function calibration in the LSMC framework, we introduce several validation figures serving as indicators for the approximation quality of the proxy functions. We measure the out-of-sample performance of each proxy function on three different validation sets by calculating five validation figures per set.

The three validation sets are a Sobol set, a nested simulations set and a capital region set. Unlike the Sobol set, the nested simulations and capital region sets do not serve as feasible validation sets in the LSMC routine as they become known only after evaluating the proxy function as explained below. Furthermore, they require massive computational capacities. Yet they can be regarded as the natural benchmark for the LSMC-based method and are thus very valuable for this analysis. Figure 3 plots the nested simulation values of an exemplary economic variable with respect to a financial risk factor. The Sobol set consists of, for example, between *L* = 15 and *L* = 200 Sobol validation points, of which the scenarios follow a Sobol sequence covering the fitting space uniformly. Thereby, the fitting space is the cube on which the outer fitting scenarios are defined. It has to cover the space of real-world scenarios used for the full loss distribution forecast sufficiently well. For interpretive reasons, sometimes the Sobol set is extended by points with, for example, one-dimensional risk

scenarios or scenarios producing a risk capital close to the SCR (= 99.5% value-at-risk) in previous risk capital calculations.

**Figure 3.** Nested simulation values of best estimate of liabilities with respect to a financial risk factor.

The nested simulations set comprises the, for example, *L* = 820 to *L* = 6554 validation points of which the scenarios correspond to the, for example, highest 2.5% to 5% losses from the full loss distribution forecast made by the proxy function that had been derived under the standard calibration algorithm choices described in Section 2.2. Like in the example of Chapter 5.2 in Krah et al. (2018), the order of these losses-which scenarios lead to which quantiles?following from the fourth and last step of the LSMC approach is very similar to the order following from the nested simulations approach. Therefore the scenarios of the nested simulations set are simply chosen by the order of the losses resulting from the LSMC approach. Several of these scenarios consist of stresses falling out of the fitting space. Compare Figures 1 and 3 which depict fitting and nested simulation values from the same proxy modeling task with respect to the same risk factor. Severe outliers due to extreme stresses far outside of the fitting space should be excluded from the set. The capital region set is a subset of the nested simulations set containing the nested simulations SCR estimate, that is, the scenario leading to the 99.5% loss, and the, for example, 64 losses above and below, which makes in total, for example, *L* = 129 validation points.

#### 2.3.2. Validation Figures

The five validation figures reported in our numerical experiments comprise two normalized mean absolute errors (MAEs), one with respect to the magnitude of the economic variable itself and one with respect to the magnitude of the corresponding market value of assets. They comprise further the mean error, that is, the mean of the residuals, as well as two validation figures based on the change of the economic variable from its base value (see the definition of the base value below): the normalized MAE with respect to the magnitude of the changes and the mean error of these changes. The smaller the normalized MAEs are, the better the proxy function approximates the economic variable. However, the validation values are afflicted with Monte Carlo errors so that the normalized MAEs serve only as meaningful indicators as long as the proxy functions do not become too precise. The means of the residuals should be possibly close to zero since they indicate systematic deviations of the proxy functions from the validation values. While the first three validation figues measure how well the proxy function reflects the economic variable in the CFP model, the latter two address the approximation effects on the SCR, compare Chapter 3.4.1 of Krah et al. (2018).

Let us write the absolute value as |·| and let *L* denote the number of validation points. Then we can express the MAE of the proxy function *f xi* evaluated at the validation scenarios *x<sup>i</sup>* versus the validation values *y<sup>i</sup>* as <sup>1</sup> *<sup>L</sup>* <sup>∑</sup>*<sup>L</sup> i*=1 *y<sup>i</sup>* <sup>−</sup> *<sup>f</sup> xi* . After normalizing the MAE with respect to the mean of the absolute values of the economic variable or the market value of assets, that is, <sup>1</sup> *<sup>L</sup>* <sup>∑</sup>*<sup>L</sup> i*=1 *di* with *<sup>d</sup><sup>i</sup>* <sup>∈</sup> *yi* , *ai* , we obtain the first two validation figures, that is,

$$\text{mae} = \frac{\sum\_{i=1}^{L} \left| y^i - \widehat{f}\left(\mathbf{x}^i\right) \right|}{\sum\_{i=1}^{L} \left| d^i \right|}. \tag{1}$$

In the following, we will refer to (1) with *d<sup>i</sup>* = *y<sup>i</sup>* as the MAE with respect to the *relative metric*, and to (1) with *d<sup>i</sup>* = *a<sup>i</sup>* as the MAE with respect to the *asset metric*. The mean of the residuals is given by

$$\text{res} = \frac{1}{L} \sum\_{i=1}^{L} \left( y^i - \hat{f}\left(\mathbf{x}^i\right) \right). \tag{2}$$

Let us refer by the *base value y*<sup>0</sup> to the validation value corresponding to the base scenario *x*<sup>0</sup> in which no risk factor has an effect on the economic variable. In analogy to (1) but only with respect to the relative metric, we introduce another normalized MAE by

$$\text{mae}^{0} = \frac{\sum\_{i=1}^{L} \left| \left( y^{i} - y^{0} \right) - \left( \widehat{f} \left( \mathbf{x}^{i} \right) - \widehat{f} \left( \mathbf{x}^{0} \right) \right) \right|}{\sum\_{i=1}^{L} \left| y^{i} - y^{0} \right|}. \tag{3}$$

The mean of the corresponding residuals is given by

$$\text{res}^0 = \frac{1}{L} \sum\_{i=1}^{L} \left( \left( y^i - y^0 \right) - \left( \hat{f} \left( \mathbf{x}^i \right) - \hat{f} \left( \mathbf{x}^0 \right) \right) \right). \tag{4}$$

In addition to these five validation figures, let us define the base residual which can be used as a substitute for (4) depending on personal taste. The base residual can easily be extracted from (2) and (4) by

$$\text{res}^{\text{base}} = y^0 - \widehat{f}\left(\mathbf{x}^0\right) = \text{res} - \text{res}^0. \tag{5}$$

#### **3. Machine Learning Regression Methods**

#### *3.1. General Remarks*

As the main part of our work, we will compare various types of machine learning regression approaches for determining suitable proxy functions in the LSMC framework. The methods we present in this section range from ordinary and generalized least-squares regression variants over GLM and GAM approaches to multivariate adaptive regression splines and kernel regression approaches.

The performance of the newly derived proxy functions when applied to the described validation sets is one way of comparing the different methods. Another way consists of ensuring compatibility with the principle of marginality and utilizing a suitable model selection criterion such as AIC in order to be able to compare iteration-wise the candidate models inside the approaches.

We will in the following sections shortly introduce the different methods, collect some theoretical properties and then concentrate on aspects of their implementation. Their numerical performance on the different validation sets is the subject of Section 4.

Our aim in the calibration step below is to estimate the conditional expectation *Y*(*X*) under the risk-neutral measure given an outer scenario *X*. In contrast to Krah et al. (2018) *Y*(*X*) does not necessarily have to be the available capital but can instead be, for example, the best estimate of liabilites or the market value of assets. The *D*-dimensional fitting scenarios are always generated under the physical probability measure P on the fitting space which itself is a subspace of R*D*.

#### *3.2. Ordinary Least-Squares (OLS) Regression*

#### 3.2.1. The Regression Model

In iteration *K* − 1 of the adaptive forward stepwise algorithm (as given in Section 2.2), the OLS approximation consists of a linear combination of suitable linearly independent basis functions *ek* (*X*) ∈ *L*<sup>2</sup> R*D*, B, P , *k* = 0, 1, . . . , *K* − 1, that is,

$$\mathcal{Y}(X) \stackrel{K \prec \infty}{\approx} f(X) = \sum\_{k=0}^{K-1} \beta\_k \mathfrak{e}\_k \, (X). \tag{6}$$

We call *f*(*X*) the predictor of *Y*(*X*) or the *systematic component*.

With the fitting points *xi* , *yi* , *i* = 1, ... , *N*, and uncorrelated errors *<sup>i</sup>* (the *random components*) having the same variance *σ*<sup>2</sup> > 0 (= homoscedastic errors), we obtain the classical linear regression model

$$\mathbf{y}^i = \sum\_{k=0}^{K-1} \beta\_k \mathbf{c}\_k \left(\mathbf{x}^i\right) + \boldsymbol{\epsilon}^i,\tag{7}$$

where *e*<sup>0</sup> *xi* = 1 and *β*<sup>0</sup> is the intercept. Then, the ordinary least-squares (OLS) estimator *β*OLS of the coefficients is given by

$$\hat{\mathcal{B}}\_{\text{OLS}} = \underset{\mathcal{\mathcal{B}} \in \mathbb{R}^K}{\text{arg}\min} \left\{ \sum\_{i=1}^N \left( y^i - \sum\_{k=0}^{K-1} \beta\_k c\_k \left( x^i \right) \right)^2 \right\}.\tag{8}$$

Using the notation *zik* = *ek xi* the OLS problem is solved explicitly by

$$
\hat{\boldsymbol{\beta}}\_{\rm OLS} = \left(\boldsymbol{Z}^{\rm T}\boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\rm T} \boldsymbol{\textbf{y}}.\tag{9}
$$

The proxy function *f* (*X*) for the economic variable *Y*(*X*) given an outer scenario *X* is

$$\mathcal{Y}(X) \stackrel{K, N < \infty}{\approx} \widehat{f}\left(X\right) = \sum\_{k=0}^{K-1} \widehat{\beta}\_{\text{OLS}, k} \mathfrak{e}\_k\left(X\right). \tag{10}$$

For a practical implementation see, for example, function *lm(*·*)* in the R package *stats* of R Core Team (2018).

3.2.2. Gauss-Markov Theorem, ML Estimation and AIC

Under the assumptions of strict exogeneity *E* [ | *Z*] = **0** (A1), a spherical error variance *<sup>V</sup>* [ | *<sup>Z</sup>*] = *<sup>σ</sup>*<sup>2</sup> *IN* with *IN* the *<sup>N</sup>*-dimensional identity matrix (A2), and linearly independent basis functions (A3), we have (compare, for example, Hayashi (2000)):


$$\text{AIC} = -2l \left( \hat{\mathfrak{k}}\_{\text{OLS}} \hat{\sigma}^2 \right) + 2 \left( K + 1 \right) = N \left( \log \left( 2 \pi \hat{\sigma}^2 \right) + 1 \right) + 2 \left( K + 1 \right). \tag{11}$$

#### *3.3. Generalized Linear Models (GLMs)*

#### 3.3.1. The Regression Model

The systematic component of a GLM (see Nelder and Wedderburn (1972) for its introduction) equals the linear predictor *η* = *f*(*X*) of the model in (6). However, one uses a monotonic link function *g*(·) that relates the economic variable *Y*(*X*) to the linear predictor via

$$\underbrace{g(\boldsymbol{\chi}(\boldsymbol{X}))}\_{=-\boldsymbol{\mu}} \overset{K<\infty}{\approx} \underbrace{f(\boldsymbol{X})}\_{=-\boldsymbol{\eta}} = \sum\_{k=0}^{K-1} \beta\_k z\_k = \mathbf{z}^T \boldsymbol{\mathfrak{f}},\tag{12}$$

with **<sup>z</sup>** <sup>=</sup> (*e*<sup>0</sup> (*X*),...,*eK*−<sup>1</sup> (*X*))*T*.

Of course, the choice of the link function *g*(.) is a critical aspect. A possible motivation is a non-negativity requirement on *Y*(*X*) that can be satisfied using *g*(*y*) = ln(*y*). Further comments on choices of the link function are motivated below.

#### 3.3.2. Canonical Link Function, GLM Estimation and IRLS Algorithm

While the normal distribution assumption for the random component allowed the derivation of nice properties in the linear model of the preceding section, the GLM considers random components with (conditional) distributions from the exponential family. Its canonical form with parameter *θ* is given by the density function

$$\pi(y \mid \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right),\tag{13}$$

where *a*(*φ*), *b*(*θ*) and *c*(*y*, *φ*) are specific functions. For example, a normally distributed economic variable with mean *μ* and variance *σ*<sup>2</sup> is given by *a*(*φ*) = *φ*, *b*(*θ*) = *<sup>θ</sup>*<sup>2</sup> <sup>2</sup> and *c*(*y*, *φ*) = −1 2 *y*2 *<sup>σ</sup>*<sup>2</sup> <sup>+</sup> log 2*πσ*<sup>2</sup> with *θ* = *μ* and *φ* = *σ*2.

For a random variable *Y* with a distribution from the exponential family, we have

$$E(\mathbf{Y}) = \boldsymbol{\mu} = \boldsymbol{b}'(\boldsymbol{\theta}), \; Var(\mathbf{Y}) = \boldsymbol{b}''(\boldsymbol{\theta}) \boldsymbol{a}(\boldsymbol{\phi}) =: \mathcal{V}\left[\boldsymbol{\mu}\right] \boldsymbol{a}(\boldsymbol{\phi}) \,. \tag{14}$$

*a*(*φ*) is called a dispersion parameter, *V*[.] the variance function. We will in the following make the simplifying assumption *a*(*φ<sup>i</sup>* ) = *φ*, *i* = 1, ..., *N* for a constant value of *φ* (A5) and then obtain the ML estimator in the GLM from Equation (13) as

$$\hat{\mathcal{B}}\_{\text{GLM}} = \underset{\mathcal{\mathcal{B}} \in \mathbb{R}^K}{\text{arg}\max} \left\{ \sum\_{i=1}^N \left( \frac{y^i \theta^i - b(\theta^i)}{\phi} + c(y^i, \phi) \right) \right\}. \tag{15}$$

Under (A5), there does in general not exist a closed-form solution for the GLM coefficient estimator (15). The resulting iterative method will be simplified for so-called canonical link functions *g*(*μ*) = *θ* which due to relation (14) are given by

$$\mathcal{g}(\mu) = (b')^{-1}(\mu),\tag{16}$$

with *b*(.) from the definition of the exponential family. Examples of pairs of canonical link functions and corresponding distributions are *g*(*μ*) = *μ* and the normal, *g*(*μ*) = 1/*μ* and the gamma, and *g*(*μ*) = 1/*μ*<sup>2</sup> and the inverse Gaussian distribution.

In Chapter 2.5, McCullagh and Nelder (1989) apply Fisher's scoring method to obtain an approximation to the GLM estimator. Further, McCullagh and Nelder (1989) justify how Fisher's scoring method can be cast in the form of the iteratively reweighted least squares (IRLS) algorithm. To state the IRLS algorithm in our context, we need some notation.

Let *<sup>η</sup><sup>i</sup>* (*t*) = *f xi* be the estimate for the linear predictor evaluated at fitting scenario *x<sup>i</sup>* , compare (12). Let *<sup>μ</sup><sup>i</sup>* (*t*) <sup>=</sup> *<sup>g</sup>*−<sup>1</sup> *ηi* (*t*) be the estimate for the economic variable, and *<sup>d</sup><sup>η</sup> dμ μi* (*t*) = *g μi* (*t*) the first derivative of the link function with respect to the economic variable evaluated at *μi* (*t*) . Furthermore, we introduce the weight matrix *<sup>W</sup>*(*t*) <sup>=</sup> diag *w*1 *<sup>β</sup>***(***t***)** ,..., *<sup>w</sup><sup>N</sup> <sup>β</sup>***(***t***)** with components given by

$$
\hat{w}^i \left( \hat{\mathcal{B}}^{(t)} \right) = \left( \frac{d\eta}{d\mu} \left( \hat{\mu}\_{(t)}^i \right) \right)^{-2} V \left[ \hat{\mu}\_{(t)}^i \right]^{-1} \,, \tag{17}
$$

and *V μi* (*t*) the variance function from above evaluated at *<sup>μ</sup><sup>i</sup>* (*t*) . Finally, we define *D*(*t*) = diag(*d*<sup>1</sup> (*t*) , ..., *d<sup>N</sup>* (*t*) ) with *d<sup>i</sup>* (*t*) = *g μi* (*t*) which allows us to formulate the IRLS algorithm for canonical link functions.

**IRLS algorithm.** *Perform the iterative approximation procedure below with an initialization of <sup>μ</sup><sup>i</sup>* (0) <sup>=</sup> *<sup>y</sup><sup>i</sup>* <sup>+</sup> 0.1 *and <sup>η</sup><sup>i</sup>* (0) = *g μi* (0) *as proposed by Dutang (2017) until convergence:*

$$
\hat{\boldsymbol{\beta}}^{\{t+1\}} = \left(\boldsymbol{Z}^{\mathrm{T}} \boldsymbol{W}^{(t)} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\mathrm{T}} \boldsymbol{W}^{(t)} \hat{\mathbf{s}}^{\{t\}} \left(\hat{\boldsymbol{\beta}}^{\{t\}}\right), \tag{18}
$$

$$\hat{\mathbf{s}}^{\{\mathbf{t}\}}\left(\hat{\boldsymbol{\beta}}^{\{\mathbf{t}\}}\right) = \ \ \ \ \mathbf{Z}\hat{\boldsymbol{\beta}}^{\{\mathbf{t}\}} + \ \ \mathbf{D}^{\{\mathbf{t}\}}(\boldsymbol{y} - \hat{\boldsymbol{\mu}}\_{t}) \tag{19}$$

*After convergence, we set <sup>β</sup>*GLM <sup>=</sup> *<sup>β</sup>***(***t***+1)***.*

Green (1984) proposes to solve the system *ZTW*(*t*)*Z <sup>β</sup>***(***t***+1)** <sup>=</sup> *<sup>Z</sup>TW*(*t*) *s***(***t***)** which is equivalent to (18) via a QR decomposition to increase numerical stability. For a practical implementation of GLMs using the IRLS algorithm, see, for example, function *glm(*·*)* in R package *stats* of R Core Team (2018).

By inserting (17), (19) and the GLM estimator into (18) and by using (12), we obtain

$$\hat{\mathcal{B}}\_{\text{GLM}} = \underset{\mathcal{B} \in \mathbb{R}^K}{\text{arg min}} \left\{ \sum\_{i=1}^N V \left[ \hat{\mu}\_{\text{GLM}}^i \right] \left( y^i - \hat{\mu}\_{\text{GLM}}^i \right)^2 \right\},\tag{20}$$

that is, the GLM estimator minimizes the squared sum of raw residuals scaled by the estimated individual variances of the economic variable.

The Pearson residuals are defined as the raw residuals divided by the estimated individual standard deviations, that is,

$$
\hat{\varepsilon}^i = \frac{y^i - \hat{\mu}\_{\text{GLM}}^i}{\sqrt{V\left[\hat{\mu}\_{\text{GLM}}^i\right]}}.\tag{21}
$$

#### 3.3.3. AIC and Dispersion Estimation

Since AIC depends on the ML estimators, it is combinable with GLMs in the adaptive algorithm. Here, it has the form

$$\text{AIC} = -2l \left( \hat{\mathfrak{f}}\_{\text{GLM}} \hat{\mathfrak{f}} \right) + 2 \left( K + p \right), \tag{22}$$

where *K* is the number of coefficients and *p* indicates the number of the additional model parameters associated with the distribution of the random component. For instance, in the normal model, we have *p* = 1 due to the error variance/dispersion. A typical estimate of the dispersion in GLMs is the Pearson residual chi-squared statistic divided by *N* − *K* as described by Zuur et al. (2009) and implemented, for example, in function *glm(*·*)* belonging to R package *stats*, that is,

$$
\hat{\phi} = \frac{1}{N - K} \sum\_{i=1}^{N} \left( \hat{\epsilon}^i \right)^2,\tag{23}
$$

with *<sup>i</sup>* given by (21). Even though this is not the ML estimator, it is a good estimate because, if the model is specified correctly, the Pearson residual chi-squared statistic divided by the dispersion is asymptotically *χ*<sup>2</sup> *<sup>N</sup>*−*<sup>K</sup>* distributed and the expected value of a chi-squared distribution with *<sup>N</sup>* <sup>−</sup> *<sup>K</sup>* degrees of freedom is *N* − *K*.

#### *3.4. Generalized Additive Models (GAMs)*

#### 3.4.1. The Regression Model

Generalized additive models (GAMs) as introduced by Hastie and Tibshirani (1986) and Hastie and Tibshirani (1990) can be regarded as richly parameterized GLMs with smooth functions. While GAMs inherit from GLMs the random component (13) and the link function (12), they inherit from the additive models of Friedman and Stuetzle (1981) the linear predictor with the smooth functions. In the adaptive algorithm, we apply GAMs of the form

$$\underbrace{g(\boldsymbol{Y}(\boldsymbol{X}))}\_{=\boldsymbol{\eta}} \overset{K<\infty}{\approx} \underbrace{f(\boldsymbol{X})}\_{=\boldsymbol{\eta}} = \boldsymbol{\beta}\_0 + \sum\_{k=1}^{K-1} h\_k(z\_k)\_\prime \tag{24}$$

where *zk* = *ek* (*X*), *β*<sup>0</sup> is the intercept and *hk* (·), *k* = 1, ... , *K* − 1, are the smooth functions to be estimated. In addition to the smooth functions, GAMs can also include simple linear terms of the basis functions as they appear in the linear predictor of GLMs. A smooth function *hk* (·) can be written as a basis expansion

$$h\_k\left(z\_k\right) = \sum\_{j=1}^{l} \beta\_{kj} b\_{kj}\left(z\_k\right) \tag{25}$$

with coefficients *βkj* and known basis functions *bkj* (*zk*), *j* = 1, ... , *J*, which should not be confused with their arguments, namely the first-order basis functions *zk* = *ek* (*X*), *k* = 0, ... , *K* − 1. The slightly adapted Figure 4 from Wood (2006) depicts an exemplary approximation of *y* by a GAM with a basis expansion in one dimension *zk* without an intercept. The solid colorful curves represent the pure basis functions *bkj* (*zk*), *j* = 1, ... , *J*, the dashed colorful curves show them after scaling with the coefficients *βkjbkj* (*zk*), *j* = 1, . . . , *J*, and the black curve is their sum (25).

**Figure 4.** Generalized additive model (GAM) with a basis expansion in one dimension.

Typical examples for basis functions are thin plate regression splines, duchon splines, cubic regression splines or Eilers and Marx style P-splines. See, for example, function *gam(*·*)* in R package *mgcv* of Wood (2018) for a practical implementation of GAMs admitting these types of basis functions and using the PIRLS algorithm, which we present below.

$$\begin{aligned} \text{In vector notation, we can write } \boldsymbol{\mathfrak{f}} &= \left(\boldsymbol{\beta}\_{0}, \boldsymbol{\mathfrak{f}}\_{1}^{T}, \dots, \boldsymbol{\mathfrak{f}}\_{K-1}^{T}\right)^{T} \text{ with } \boldsymbol{\mathfrak{f}}\_{k} = \left(\boldsymbol{\beta}\_{k1}, \dots, \boldsymbol{\beta}\_{k\bar{l}}\right)^{T} \text{ and } \boldsymbol{\mathfrak{a}} = \left(\boldsymbol{\beta}\_{k}, \boldsymbol{\mathfrak{f}}\_{1}^{T}, \dots, \boldsymbol{\mathfrak{f}}\_{K-1}^{T}\right)^{T} \text{ with } \boldsymbol{\mathfrak{f}} = \left(\boldsymbol{\beta}\_{k}, \boldsymbol{\mathfrak{f}}\_{1}^{T}, \dots, \boldsymbol{\mathfrak{f}}\_{K}^{T}\right)^{T} \text{ with } \boldsymbol{\mathfrak{a}} = \left(\boldsymbol{\beta}\_{k}, \boldsymbol{\mathfrak{f}}\_{k}^{T}, \dots, \boldsymbol{\mathfrak{f}}\_{K}^{T}\right)^{T}, \text{ hence (24) becomes} \\ \boldsymbol{\mathfrak{g}} = \underbrace{\begin{pmatrix} \mathbf{Y}(\boldsymbol{X}) \\ -\boldsymbol{\mu} \end{pmatrix}}\_{-\boldsymbol{\mu}} \stackrel{\mathcal{K}\text{<}\infty}{\text{as}} \underbrace{\boldsymbol{\mathfrak{f}}(\boldsymbol{\mathfrak{X}})}\_{-\boldsymbol{\eta}} = \mathbf{a}^{T}\boldsymbol{\mathfrak{f}}. \end{aligned} \tag{26}$$

In order to make the smooth functions *hk* (·), *k* = 1, ... , *K* − 1, identifiable, identifiability constraints ∑*<sup>N</sup> <sup>i</sup>*=<sup>1</sup> *hk* (*zik*) = 0 with *zik* = *ek xi* can be imposed. According to Wood (2006) this can be achieved by modification of the basis functions *bkj* (·) with one of them being lost.

#### 3.4.2. Penalization and GAM Estimation via PIRLS Algorithm

Let the deviance corresponding to observation *y<sup>i</sup>* be *D<sup>i</sup>* (*β*) = 2 *l i* sat − *l <sup>i</sup>* (*β*, *φ*) *φ* where *D<sup>i</sup>* (*β*) is independent of dispersion *φ*, where *l i* sat = max*β<sup>i</sup> l i βi* , *φ* is the saturated log-likelihood and *l <sup>i</sup>* (*β*, *φ*) the log-likelihood. Then the model deviance can be written as *D* (*β*) = ∑*<sup>N</sup> <sup>i</sup>*=<sup>1</sup> *D<sup>i</sup>* (*β*). It is a generalization of the residual sum of squares for ML estimation. For instance, in the normal model the unit deviance is *<sup>y</sup><sup>i</sup>* − *<sup>μ</sup><sup>i</sup>* 2 . For given smoothing parameters *λ<sup>k</sup>* > 0, *k* = 1, ... , *K* − 1, the GAM estimator *β*GAM of the coefficients is defined as the minimizer of the penalized deviance

$$\hat{\mathfrak{F}}\_{\text{GAN}} = \underset{\mathfrak{f} \in \mathbb{R}^{(K-1)/ +1}}{\arg\min} \left\{ D \left( \mathfrak{F} \right) + \sum\_{k=1}^{K-1} \lambda\_k \int h\_k^{\prime\prime} \left( z\_k \right)^2 \, \mathrm{d}z\_k \right\}, \text{where} \tag{27}$$

$$\int h\_k^{\prime\prime} \left( z\_k \right)^2 \, \mathrm{d}z\_k = \mathfrak{F}\_k^T \left( \int \mathbf{b}\_k^{\prime\prime} \left( z\_k \right) \mathbf{b}\_k^{\prime\prime} \left( z\_k \right)^T \, \mathrm{d}z\_k \right) \mathfrak{F}\_k = \mathfrak{F}\_k^T \mathcal{S}\_k \mathfrak{F}\_k$$

are the smoothing penalties. The smoothing parameters *λ<sup>k</sup>* control the trade-off between a too wiggly model (overfitting) and a too smooth model (underfitting). The larger the *λ<sup>k</sup>* values are, the more pronounced is the wiggliness of the basis functions reflected by their second derivatives in the minimization problem (27), and the higher is thus the penalty associated with the coefficients and the smoother is the estimated model.

A major advantage of the definition of GAMs via (24), (25), and (27) is its compatibility with information criteria and other model selection criteria such as generalized cross-validation. Besides, the resulting penalty matrix favors numerical stability in the PIRLS algorithm.

Since the saturated log-likelihood is a constant for a fixed distribution and set of fitting points, we can turn the minimization problem (27) into the maximization task of the penalized log-likelihood, that is,

$$\hat{\boldsymbol{\beta}}\_{\text{GAN}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^{(K-1)/l+1}}{\arg\max} \left\{ l \left( \boldsymbol{\beta}, \boldsymbol{\phi} \right) - \frac{1}{2} \sum\_{k=1}^{K-1} \lambda\_k \boldsymbol{\beta}\_k^T \boldsymbol{\mathcal{S}}\_k \boldsymbol{\mathcal{S}}\_k \boldsymbol{\beta}\_k \right\}. \tag{28}$$

Wood (2000) points out that Fisher's scoring method can be cast in a penalized version of the iteratively reweighted least squares (PIRLS) algorithm when being used to approximate the GAM coefficient estimator (28). We formulate the PIRLS algorithm based on Marx and Eilers (1998) who indicate the iterative solution explicitly.

Let *<sup>β</sup>***(***t***)** now be the GAM coefficient approximation in iteration *<sup>t</sup>*. Then the vector of the dependent variable **s(***t***)** <sup>=</sup> *s*1 *<sup>β</sup>***(***t***)** ,..., *s<sup>N</sup> <sup>β</sup>***(***t***)** *T* and the weight matrix given by *W*(*t*) = *Risks* **2020**, *8*, 21

diag *w*1 *<sup>β</sup>***(***t***)** ,..., *<sup>w</sup><sup>N</sup> <sup>β</sup>***(***t***)** have the same form as in the IRLS algorithm, see (19) and (17). Additionally, let *S* = blockdiag (0, *λ*1S1,..., *λK*−1S*K*−1) with *S*<sup>11</sup> = 0 belonging to the intercept be the penalty matrix.

**PIRLS algorithm.** *Perform the iterative approximation procedure below with initialization of <sup>μ</sup><sup>i</sup>* (0) <sup>=</sup> *<sup>y</sup><sup>i</sup>* <sup>+</sup> 0.1 *and <sup>η</sup><sup>i</sup>* (0) = *g μi* (0) *until convergence occurs:*

$$\begin{split} \hat{\boldsymbol{\mathcal{B}}}^{(t+1)} &= \underset{\boldsymbol{\mathcal{B}} \in \mathbb{R}^{(K-1)\times 1}}{\arg\min} \left\{ \sum\_{i=1}^{N} \boldsymbol{w}^{i} \left( \hat{\boldsymbol{\mathcal{B}}}^{(t)} \right)^{-1} \left( \hat{\boldsymbol{s}}^{i} \left( \hat{\boldsymbol{\mathcal{B}}}^{(t)} \right) - \boldsymbol{\mathcal{S}}\_{0} - \sum\_{k=1}^{K-1} \sum\_{j=1}^{I} \boldsymbol{\mathcal{S}}\_{kj} \boldsymbol{b}\_{kj} \left( \boldsymbol{z}\_{ik} \right) \right)^{2} + \sum\_{k=1}^{K-1} \lambda\_{k} \boldsymbol{\mathsf{f}}\_{k}^{T} \boldsymbol{\mathcal{S}}\_{k} \boldsymbol{\mathsf{f}}\_{k} \right\} \\ &= \left( \boldsymbol{Z}^{T} \boldsymbol{W}^{(t)} \boldsymbol{Z} + \boldsymbol{\mathcal{S}} \right)^{-1} \boldsymbol{Z}^{T} \boldsymbol{W}^{(t)} \hat{\boldsymbol{s}}^{(t)}. \end{split} \tag{29}$$

*After convergence, we set <sup>β</sup>*GAM <sup>=</sup> *<sup>β</sup>***(***t***+1)***.*

3.4.3. Smoothing Parameter Selection, AIC and Stagewise Selection

The smoothing parameters *λ<sup>k</sup>* can be selected such that they minimize a suitable model selection criterion, for the sake of consistency, preferably the one used in the adaptive algorithm for basis function selection. The GAM estimator (28) does not exactly maximize the log-likelihood, therefore AIC has another form for GAMs than for GLMs. Hastie and Tibshirani (1990) propose a widely used version of AIC for GAMs, which uses effective degrees of freedom df in place of the number of coefficients (*K* − 1)*J* + 1. This is

$$\text{AIC} = -2l \left( \hat{\mathcal{B}}\_{\text{GAN}}, \hat{\Phi} \right) + 2 \left( \text{df} + p \right), \tag{30}$$

where

$$\mathbf{df} = \text{tr}\left( (I + \mathcal{S})^{-1} I \right). \tag{31}$$

Note that *I* + *S* = *ZTWZ* + *S* is already approximately calculated in the PIRLS algorithm. For GAMs, an estimate of the dispersion *φ* is obtained similarly to GLMs by (23). The parameter *p* is defined as in (22).

Another popular and effective smoothing parameter selection criterion invented by Craven and Wahba (1979) is generalized cross-validation (GCV), that is,

$$\text{GCV} = \frac{\text{ND}\left(\hat{\mathfrak{K}}\_{\text{CAM}}\right)}{\left(N - \text{df}\right)^2},\tag{32}$$

with the model deviance *D <sup>β</sup>*GAM evaluated at the GAM estimator and the effective degrees of freedom defined just like for AIC.

Note that the adaptive forward stepwise algorithm depicted in Figure 2 can become computationally infeasible with GAMs as opposed to, for example, GLMs. In iteration *k*, a GAM has (*K* − 1)*J* + 1 coefficients which need to be estimated while a GLM has only *K* coefficients. This difference in the estimation effort is increased further due to the iterative nature of the IRLS and PIRLS algorithms. Moreover, GAMs involve the task of optimal smoothing parameter selection. To deal with this aspect, Wood (2000), Wood et al. (2015) and Wood et al. (2017) have developed practical GAM fitting methods for large data sets. However, the suitable application of these methods in the adaptive algorithm is beyond the scope of our analysis, in particular as our focus is not on computational performance. Besides parallelizing the candidate loop on the lower left side of Figure 2, we achieve the necessary performance gains in GAMs by replacing the stepwise algorithm by a stagewise algorithm. This means that in each iteration, a predefined number *L* or proportion of candidate basis functions is selected simultaneously until a termination criterion is fulfilled. Thereby we select in one stage those basis functions which reduce the model selection criterion of our choice most when added separately

to the current proxy function structure. When there are not at least as many basis functions as targeted, the algorithm shall be terminated after the ones which lead to a reduction in the model selection criterion have been selected.

#### *3.5. Feasible Generalized Least-Squares (FGLS) Regression*

#### 3.5.1. The Regression Model

The regression model here equals the OLS case. However, we now let the errors have the covariance matrix Σ = *σ*2Ω where Ω is positive definite and known and *σ*<sup>2</sup> > 0 is unknown. We transform the generalized regression model according to Hayashi (2000) to obtain a model (\*) which satisfies Assumptions (A1), (A2) and (A3) of the classical linear regression model. For this, choose an invertible matrix *H* with Ω−<sup>1</sup> = *HTH* which can, for example, be the Cholesky matrix. Then, the generalized response vector **y**∗, design matrix *Z*∗ and error vector ∗ are given by

$$\mathbf{y}^\* = H\mathbf{y}, \quad Z^\* = HZ, \quad \mathbf{e}^\* = \mathbf{y}^\* - Z^\*\mathbf{f} = H\left(\mathbf{y} - Z\mathbf{f}\right) = H\boldsymbol{\varepsilon}.\tag{33}$$

In analogy to the OLS estimator, the generalized least-squares (GLS) estimator *β*GLS of the coefficients is given as the minimizer of the generalized residual sum of squares, that is,

$$\hat{\mathcal{B}}\_{\text{GLS}} = \underset{\mathcal{B} \in \mathbb{R}^{K}}{\text{arg min}} \left\{ \sum\_{i=1}^{N} \left( \epsilon^{\*,i} \right)^{2} \right\}. \tag{34}$$

The closed-form expression of the GLS estimator is

$$\hat{\mathbf{g}}\_{\rm GLS} = \left(Z^{\*T}Z^{\*}\right)^{-1}Z^{\*T}\mathbf{y}^{\*} = \left(Z^{T}\Omega^{-1}Z\right)^{-1}Z^{T}\Omega^{-1}\mathbf{y},\tag{35}$$

and the proxy function becomes

$$\dot{f}\,(X) = \mathbf{z}^T \dot{\mathcal{H}}\_{\text{GLS}\_Y} \tag{36}$$

where **<sup>z</sup>** <sup>=</sup> (*e*<sup>0</sup> (*X*),...,*eK*−<sup>1</sup> (*X*))*T*. The scalar *<sup>σ</sup>*<sup>2</sup> can be estimated in analogy to OLS regression by *s*GLS = <sup>1</sup> *<sup>N</sup>*−*<sup>K</sup>* ∗*T*<sup>∗</sup> where <sup>∗</sup> <sup>=</sup> **<sup>y</sup>**<sup>∗</sup> <sup>−</sup> *<sup>Z</sup>*∗*β*GLS is the residual vector.

#### 3.5.2. Gauss-Markov-Aitken Theorem and ML Estimation

Under the assumptions (A1), (A3), and a covariance matrix Σ = *σ*2Ω of which Ω is positive definite and known (A6), we have:


As a consequence, given a known matrix Ω, we have a closed form solution for the GLS estimator that coincides with the ML estimator of the regression coefficients and the adaptive algorithm inside the LSMC approach goes through.

#### 3.5.3. Unknown Ω and FGLS Estimation via ML Algorithm

In the LSMC framework, Ω is unknown. However, if a consistent estimator Ω exists, we can apply feasible generalized least-squares (FGLS) regression, of which the estimator

$$
\hat{\boldsymbol{\beta}}\_{\text{FGLS}} = \left(\boldsymbol{Z}^{T}\hat{\boldsymbol{\Omega}}^{-1}\boldsymbol{Z}\right)^{-1}\boldsymbol{Z}^{T}\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y} \tag{37}
$$

has asymptotically the same properties as the GLS estimator (35).

With **<sup>z</sup>** <sup>=</sup> (*e*<sup>0</sup> (*X*),...,*eK*−<sup>1</sup> (*X*))*<sup>T</sup>* the FGLS proxy function is then given as

$$
\widehat{f}'(X) = \mathbf{z}^T \widehat{\boldsymbol{\beta}}\_{\text{FGLS}}.\tag{38}
$$

For the estimation of Ω we will in the following set *σ*<sup>2</sup> = 1 which can be done without loss of generality and consider Σ = Ω. Furthermore, we assume in addition to (A1), (A3) and (A7) that the elements of the covariance matrix Σ are twice differentiable functions of parameters *α* = (*α*0,..., *αM*−1) *<sup>T</sup>* with *<sup>K</sup>* <sup>+</sup> *<sup>M</sup>* <sup>≤</sup> *<sup>N</sup>*. We then write <sup>Σ</sup> <sup>=</sup> <sup>Σ</sup> (*α*) (A8). The following result is the basis of the iterative ML algorithm for the regression coefficients and the variance matrix.

**Theorem 1.** *The generalized regression model (7) under Assumptions (A1), (A3), (A7) and (A8) has the following first-order ML conditions:*

$$\hat{\mathbf{j}}\_{\text{ML}} = \left(\mathbf{Z}^{T}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{Z}\right)^{-1}\mathbf{Z}^{T}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{y}\_{\text{s}}\tag{39}$$

$$\frac{\partial l}{\partial \alpha\_m} = \frac{1}{2} \text{tr} \left( \frac{\partial \Sigma^{-1}}{\partial \alpha\_m} \Sigma \right)\_{\mathbf{a} = \hat{\mathbf{a}}\_{\text{ML}}} - \frac{1}{2} \hat{\boldsymbol{\varepsilon}}^T \left( \frac{\partial \Sigma^{-1}}{\partial \alpha\_m} \right)\_{\mathbf{a} = \hat{\mathbf{a}}\_{\text{ML}}} \hat{\boldsymbol{\varepsilon}} = 0,\tag{40}$$

*where m* <sup>=</sup> 0, . . . , *<sup>M</sup>* <sup>−</sup> <sup>1</sup>*,* <sup>Σ</sup> <sup>=</sup> <sup>Σ</sup> (*α*ML) *and* <sup>=</sup> **<sup>y</sup>** <sup>−</sup> *<sup>Z</sup>β*ML*.*

The system in (39) and (40) is then solved iteratively (see, for example, Magnus (1978)). We start the procedure with *β***(0)** and then use PORT optimization routines as described in Gay (1990) and implemented in function *nlminb(*·*)* belonging to R package *stats* of R Core Team (2018). In this iterative routine, *α***(***t***+1)** can be initialized, for example, by random numbers from the standard normal distribution.

**ML algorithm.** *Perform the following iterative approximation procedure with, for example, an initialization of <sup>β</sup>***(0)** <sup>=</sup> *<sup>β</sup>*OLS *until convergence:*


$$\begin{aligned} \widehat{\Sigma}^{(t+1)} &= \Sigma \left( \widehat{\mathbf{a}}^{\{t+1\}} \right), \\ \widehat{\boldsymbol{\beta}}^{\{t+1\}} &= \left( \boldsymbol{Z}^{\mathrm{T}} \left( \widehat{\Sigma}^{\{t+1\}} \right)^{-1} \boldsymbol{Z} \right)^{-1} \boldsymbol{Z}^{\mathrm{T}} \left( \widehat{\Sigma}^{\{t+1\}} \right)^{-1} \mathbf{y}. \end{aligned} \tag{41}$$

*Continue with the next iteration.*

*After convergence, we set <sup>β</sup>*ML <sup>=</sup> *<sup>β</sup>***(***t***+1)** *and α*ML <sup>=</sup> *α***(***t***+1)***.*

Theorem 5 of Magnus (1978) states that under some further regularity conditions the FGLS coefficient estimator can be derived as the ML coefficient estimator by the ML algorithm under Assumptions (A1), (A3), (A7) and (A8).

#### 3.5.4. Heteroscedasticity, Variance Model Selection and AIC

Besides Assumption (A8) about the structure of the covariance matrix, we assume that the errors are uncorrelated with possibly different variances (= heteroscedastic errors), that is, Σ = diag *σ*2 <sup>1</sup> ,..., *<sup>σ</sup>*<sup>2</sup> *N* . We model each variance *σ*<sup>2</sup> *<sup>i</sup>* , *i* = 1, ... , *N*, by a twice differentiable function in dependence of parameters *α* = (*α*0,..., *αM*−1) *<sup>T</sup>* and a suitable set of linearly independent basis functions *em* (*X*) <sup>∈</sup> *<sup>L</sup>*<sup>2</sup> R*D*, B, P , *<sup>m</sup>* <sup>=</sup> 0, 1, . . . , *<sup>M</sup>* <sup>−</sup> 1, with **<sup>v</sup>***<sup>i</sup>* <sup>=</sup> *e*0 *xi* ,...,*eM*−<sup>1</sup> *xi <sup>T</sup>* , that is,

$$
\sigma\_i^2 = \sigma^2 V \left[ \mathbf{u}, \mathbf{v}^i \right],
\tag{42}
$$

where *V* , *α*, **v***<sup>i</sup>* is referred to as the variance function in analogy to *V* [*μ*] for GLMs and GAMs. Without loss of generality, we set again *σ*<sup>2</sup> = 1.

Hartmann (2015) has already applied FGLS regression with different variance models in the LSMC framework. In her numerical examples, variance models with multiplicative heteroscedasticity led to the best performance of the proxy function in the validation. Therefore, we restrict our analyis on these kinds of structures, compare, for example, Harvey (1976), that is,

$$V\left[\mathfrak{a},\mathbf{v}^{i}\right] = \exp\left(\mathbf{v}^{iT}\mathfrak{a}\right). \tag{43}$$

Like the proxy function, the variance function (43) has to be calibrated to apply FGLS regression, which means that the variance function has to be composed of suitable basis functions. Again, such a composition can be found with the aid of a model selection criterion. We still choose AIC, but have to take care for the fact that in FGLS regression the covariance matrix now contains *M* unknown parameters instead of only one in the OLS case (the same variance for all observations). Under Assumption (A7), AIC is given as

$$\begin{split} \text{AIC} &= -2l \left( \hat{\mathcal{B}}\_{\text{FGLS}} \hat{\Sigma} \right) + 2 \left( K + M \right) \\ &= N \log \left( 2\pi \right) + \log \left( \det \hat{\Sigma} \right) + \left( \mathbf{y} - Z \hat{\mathcal{B}}\_{\text{FGLS}} \right)^{T} \hat{\Sigma}^{-1} \left( \mathbf{y} - Z \hat{\mathcal{B}}\_{\text{FGLS}} \right) + 2 \left( K + M \right) . \end{split} \tag{44}$$

When using a variance model with multiplicative heteroscedasticity, AIC becomes

$$\text{AIC} = N \log \left( 2\pi \right) + \left( \sum\_{i=1}^{N} \mathbf{v}^{i T} \right) \hat{\mathbf{a}} + \sum\_{i=1}^{N} \exp \left( -\mathbf{v}^{i T} \hat{\mathbf{a}} \right) \left( \hat{\boldsymbol{\varepsilon}}^{i} \right)^{2} + 2 \left( K + M \right) . \tag{45}$$

As an alternative or complement, the basis functions of the variance model can be selected with respect to their correlations with the final OLS residuals or based on graphical residual analysis.

For the final implementation of a variance model we use modified versions of two algorithms from Hartmann (2015). Our type I variant starts with the derivation of the proxy function by the standard adaptive OLS regression approach and then selects the variance model adaptively from the set of proxy basis functions of which the exponents sum up to at most two. The type II variant builds on the type I algorithm by taking the resulting variance model as given in its adaptive proxy basis function selection procedure with FGLS regression in each iteration.

Note further, that we should only apply FGLS regression as a substitute of OLS regression if heteroscedasticity prevails. This can be tested with the help of the Breusch-Pagan test of Breusch and Pagan (1979) for the following special structure of the variance function

$$V\left[\mathbf{a}, \mathbf{v}^{i}\right] = h\left(\mathbf{v}^{i,T}\mathbf{a}\right),\tag{46}$$

where the function *<sup>h</sup>*(·) is twice differentiable and the first element of **<sup>v</sup>***<sup>i</sup>* is <sup>v</sup>*<sup>i</sup>* <sup>0</sup> = 1. Further, the assumption of normally distributed errors is made. We use it in the numerical computations to check if heteroscedasticity still prevails during the iteration procedure.

#### *3.6. Multivariate Adaptive Regression Splines (MARS)*

#### 3.6.1. The Regression Model

The multivariate adaptive regression splines (MARS) were introduced by Friedman (1991). The classical MARS model is a form of the classical linear regression model (7) where the basis functions *ek xi* are so-called hinge functions. Therefore, the theory of OLS regression applies in this context. GLMs (12) can also be applied in conjunction with MARS models. In this case we speak of generalized MARS models.

We describe the standard MARS algorithm in the LSMC routine according to Chapter 9.4 of Hastie et al. (2017). The building blocks of MARS proxy functions are reflected pairs of piecewise linear functions with knots *t* as depicted in Figure 5, that is,

$$(X\_d - t)\_+ = \max\left(X\_d - t, 0\right), \ (t - X\_d)\_+ = \max\left(t - X\_d, 0\right), \tag{47}$$

where the *Xd*, *d* = 1, ... , *D*, represent the risk factors that together form the outer scenario *X* = (*X*1,..., *XD*) *T*.

**Figure 5.** Reflected pair of piecewise linear functions with a knot at *t*.

For each risk factor, reflected pairs with knots at each fitting scenario stress *x<sup>i</sup> <sup>d</sup>*, *i* = 1, ... , *N*, are defined. All pairs are united in the following collection serving as the initial candidate basis function set of the MARS algorithm, that is,

$$\mathcal{C}\_1 = \left\{ (X\_d - t)\_{+}, (t - X\_d)\_{+} \right\}\_{\mathbf{t} \in \left\{ \mathbf{x}\_{d'}^1 \mathbf{x}\_{d'}^2, \dots, \mathbf{x}\_d^N \right\} \mid d = 1, \dots, D}. \tag{48}$$

We call the elements of *C*<sup>1</sup> hinge functions and consider them as functions *h* (*X*) over the entire input space R*D*. *C*<sup>1</sup> contains in total 2*DN* basis functions.

The adaptive basis function selection algorithm now consists of two parts, the forward and the backward pass.

#### 3.6.2. Adaptive Forward Stepwise Selection and Forward Pass

The forward pass of the MARS algorithm can be viewed as a variation of the adaptive forward stepwise algorithm depicted in Figure 2. The start proxy function consists only of the intercept, that is, *h*<sup>0</sup> (*X*) = 1. In the classical MARS model, the regression method of choice is the standard OLS regression approach with the estimator (8), where in each iteration a reflected pair of hinge functions is selected instead of *ek xi* . Similarly, the regression method of choice in the generalized MARS model is the IRLS algorithm (18). Let us denote the MARS coefficient estimator by *β*MARS. Note that the theory on AIC cannot be transferred without any adjustments since the notion of the degrees of freedom has to be reconsidered due to the knots in the hinge functions acting as additional degrees of freedom.

After each iteration, the set of candidate basis functions is extended by the products of the last two selected hinge functions with all hinge functions in *C*<sup>1</sup> that depend on risk factors of which the last two selected hinge functions do not depend on. Let the reflected pair selected in the first iteration (*k* = 1) be

$$h\_1\left(X\right) = \left(X\_{d\_1} - t\_1\right)\_{+\cdot},$$

$$h\_2\left(X\right) = \left(t\_1 - X\_{d\_1}\right)\_+.\tag{49}$$

Further, let *C*1,<sup>−</sup> = *C*<sup>1</sup> \ {*h*<sup>1</sup> (*X*), *h*<sup>2</sup> (*X*)}. Then, the set of candidate basis functions is updated at the beginning of the second iteration (*k* = 2) such that

$$\begin{split} \mathbf{C}\_{2} = \mathbf{C}\_{1,-} \cup \left\{ \left( \mathbf{X}\_{d} - \mathbf{t} \right)\_{+} h\_{1} \left( \mathbf{X} \right)\_{\prime} \left( \mathbf{t} - \mathbf{X}\_{d} \right)\_{+} h\_{1} \left( \mathbf{X} \right)\_{\prime} \right\}\_{\mathrm{t} \in \left\{ \mathbf{x}\_{d}^{1} \mathbf{x}\_{d}^{2}, \dots, \mathbf{x}\_{d}^{N} \right\} \mid d = 1, \ldots, D, d \neq d\_{1} \\ \cup \left\{ \left( \mathbf{X}\_{d} - \mathbf{t} \right)\_{+} h\_{2} \left( \mathbf{X} \right)\_{\prime} \left( \mathbf{t} - \mathbf{X}\_{d} \right)\_{+} h\_{2} \left( \mathbf{X} \right)\_{\prime} \right\}\_{\mathrm{t} \in \left\{ \mathbf{x}\_{d}^{1} \mathbf{x}\_{d}^{2}, \dots, \mathbf{x}\_{d}^{N} \right\} \mid d = 1, \ldots, D, d \neq d\_{1} \right\}. \end{split} \tag{50}$$

The second set *C*<sup>2</sup> thus contains 2 (*DN* − 1) + 4 (*D* − 1) *N* basis functions. Often, the order of interaction is limited to improve the interpretability of the proxy functions. Besides the maximum allowed number of terms, a minimum threshold for the decrease in the residual sum of squares can be employed as a termination criterion in the forward pass. Typically, the proxy functions generated in the forward pass overfit the data since model complexity is only penalized conservatively by stipulating a maximum number of basis functions and a minimum threshold.

#### 3.6.3. Backward Pass and GCV

Due to the overfitting tendency of the proxy function generated in the forward pass, a backward pass is executed afterwards. Apart from the direction and slight differences, the backward pass is similar to the forward pass. In each iteration, the hinge function of which the removal causes the smallest increase in the residual sum of squares is removed and the backward model selection criterion for the resulting proxy function is evaluated. By this backward procedure, we generate the "best" proxy functions of each size in terms of the residual sum of squares. Out of all these best proxy functions, we finally select the one which minimizes the backward model selection criterion. As a result, the final proxy function will not only contain reflected pairs of hinge functions but also single hinge functions of which the complements have been removed. Optionally, the backward pass can also be omitted.

Let the number of basis functions in the MARS model be *K* and the number of knots be *T*. The standard choice for the backward model selection criterion is GCV defined as

$$\text{GCV} = \frac{\text{ND}\left(\hat{\mathcal{B}}\_{\text{MARS}}\right)}{\left(N - \text{df}\right)^2},\tag{51}$$

with the effective degrees of freedom df = *K* + 3*T*.

An especially fast MARS algorithm was later developed by Friedman (1993) and is implemented, for example, in function *earth(*·*)* of R package *earth* provided by Milborrow (2018).

#### *3.7. Kernel Regression*

#### 3.7.1. The One-dimensional Regression Model

Kernel regression (which goes back to Nadaraya (1964) and Watson (1964)) is a type of locally weighted OLS regression where the weights vary with the input variable (*the target scenario*). We start

with locally constant (LC) regression where for each *x*<sup>0</sup> ∈ R the fixed *univariate kernel* with given *bandwidth λ* > 0 be

$$K\_{\lambda}\left(\mathbf{x}\_{0},\mathbf{x}^{i}\right) = D\left(\frac{\left|\mathbf{x}^{i} - \mathbf{x}\_{0}\right|}{\lambda}\right),\tag{52}$$

where *D* (·) denotes the specified kernel function. Solving the corresponding least squares problem

$$\hat{\mathcal{B}}\_{\text{LC}}\left(\mathbf{x}\_{0}\right) = \underset{\mathcal{B}\left(\mathbf{x}\_{0}\right) \in \mathbb{R}}{\arg\min} \left\{ \sum\_{i=1}^{N} \mathcal{K}\_{\lambda}\left(\mathbf{x}\_{0}, \mathbf{x}^{i}\right) \left(\mathbf{y}^{i} - \mathcal{J}\_{0}\left(\mathbf{x}\_{0}\right)\right)^{2} \right\},\tag{53}$$

one obtains the Nadaraya-Watson kernel smoother as the kernel-weighted average at each *x*<sup>0</sup> over the fitting values *y<sup>i</sup>* , that is,

$$
\widehat{f}\_{\text{LC}}\left(\mathbf{x}\_{0}\right) = \widehat{\beta}\_{\text{LC}}\left(\mathbf{x}\_{0}\right) = \frac{\sum\_{i=1}^{N} K\_{\lambda}\left(\mathbf{x}\_{0\prime}\mathbf{x}^{i}\right) y^{i}}{\sum\_{i=1}^{N} K\_{\lambda}\left(\mathbf{x}\_{0\prime}\mathbf{x}^{i}\right)}.\tag{54}
$$

Typical examples for the fixed kernel are the Epanechnikov (see the green shaded areas of Figure 6 inspired by Hastie et al. (2017)), tri-cube and uniform kernels or gaussian kernel. Note that a kernel smoother is continuous and varies over the domain of the target scenarios *x*0, it needs to be estimated separately at all of them.

**Figure 6.** Locally constant (LC) and LL kernel regression using the Epanechnikov kernel with *λ* = 0.2 in one dimension.

The bias at the boundaries of the domain of the LC kernel estimator (53) (see the left panel of Figure 6) is mainly eliminated by fitting locally linear functions instead of locally constant functions, see the right panel of Figure 6. At each target *x*0, the LL kernel estimator is defined as the minimizer of the kernel-weighted residual sum of squares, that is,

$$\hat{\boldsymbol{\beta}}\_{\text{LL}}\left(\mathbf{x}\_{0}\right) = \underset{\boldsymbol{\beta}\left(\mathbf{x}\_{0}\right) \in \mathbb{R}^{2}}{\arg\min} \left\{ \sum\_{i=1}^{N} K\_{\lambda}\left(\mathbf{x}\_{0}, \mathbf{x}^{i}\right) \left(\mathbf{y}^{i} - \boldsymbol{\beta}\_{0}\left(\mathbf{x}\_{0}\right) - \boldsymbol{\beta}\_{1}\left(\mathbf{x}\_{0}\right)\mathbf{x}^{i}\right)^{2} \right\},\tag{55}$$

with *β* (*x*0) = (*β*<sup>0</sup> (*x*0), *β*<sup>1</sup> (*x*0))*T*. The proxy function at *x*<sup>0</sup> is given by

$$
\hat{f}\_{\rm LL}(\mathbf{x}\_0) = \hat{\beta}\_{\rm LL,0}(\mathbf{x}\_0) + \hat{\beta}\_{\rm LL,1}(\mathbf{x}\_0) \,\mathbf{x}\_0. \tag{56}
$$

Again the minimization problem (55) must be solved separately for all target scenarios so that the coefficients of the proxy function vary across their domain. For each target scenario *x*<sup>0</sup> a weighted least-squares (WLS) problem with weights *K<sup>λ</sup> x*0, *x<sup>i</sup>* has to be solved. Its solution is the WLS estimator

$$
\hat{\boldsymbol{\beta}}\_{\rm LL}(\mathbf{x}\_0) = \left(\boldsymbol{Z}^T \boldsymbol{W}(\mathbf{x}\_0) \, \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^T \boldsymbol{W}\left(\mathbf{x}\_0\right) \mathbf{y}\_{\prime} \tag{57}
$$

with **y** the response vector, *W* (*x*0) = diag *K<sup>λ</sup> x*0, *x*<sup>1</sup> ,..., *K<sup>λ</sup> x*0, *xN* the weight matrix and *Z* the design matrix which contains row-wise the vectors 1, *x<sup>i</sup> T* . We call *<sup>H</sup>* the hat matrix if **<sup>y</sup>** <sup>=</sup> *<sup>H</sup>***<sup>y</sup>** such that **<sup>y</sup>** <sup>=</sup> *f* LL *x*1 ,..., *f* LL *x<sup>N</sup><sup>T</sup>* contains the proxy function values at their target scenarios.

When we use proxy functions in LL regression that are composed of polynomial basis functions with exponents greater than one, we could also speak of local polynomial regression.

#### 3.7.2. The Multidimensional Regression Model

We generalize LC regression to R*<sup>K</sup>* by expressing the kernel with respect to the basis function vector **<sup>z</sup>** <sup>=</sup> (*e*<sup>0</sup> (*X*),...,*eK*−<sup>1</sup> (*X*))*<sup>T</sup>* following from the adaptive forward stepwise selection with OLS regression and small *<sup>K</sup>*max. At each target scenario vector **<sup>z</sup>**<sup>0</sup> ∈ R*<sup>K</sup>* with elements *<sup>z</sup>*0*k*, basis function vector **<sup>z</sup>***<sup>i</sup>* ∈ R*<sup>K</sup>* with elements *zik* evaluated at fitting scenario *<sup>x</sup><sup>i</sup>* and given bandwidth vector *<sup>λ</sup>* = (*λ*0,..., *λK*−1) *<sup>T</sup>*, the multivariate kernel is defined as the product of univariate kernels, that is,

$$\mathbb{K}\_{\mathbf{A}}\left(\mathbf{z}\_{0},\mathbf{z}^{j}\right) = \prod\_{k=0}^{K-1} D\left(\frac{|z\_{ik} - z\_{0k}|}{\lambda\_{k}}\right). \tag{58}$$

The LC kernel estimator in R*<sup>K</sup>* is defined at each **z**<sup>0</sup> as

$$
\hat{f}\_{\text{LC}}\left(\mathbf{z}\_{0}\right) = \hat{\boldsymbol{\beta}}\_{\text{LC}}\left(\mathbf{z}\_{0}\right) = \frac{\sum\_{i=1}^{N} K\_{\lambda}\left(\mathbf{z}\_{0\prime}\mathbf{z}^{i}\right) y^{i}}{\sum\_{i=1}^{N} K\_{\lambda}\left(\mathbf{z}\_{0\prime}\mathbf{z}^{i}\right)}.\tag{59}
$$

Since we let *e*<sup>0</sup> (*X*) represent the intercept so that *zi*<sup>0</sup> = *z*<sup>00</sup> = 1, the corresponding univariate kernel *D* |*zi*0−*z*00| *λ*0 = *D* (0) is constant over all fitting points, thus cancels in (59) and can be omitted in (58).

The LL kernel estimator in R*<sup>K</sup>* is given as the multidimensional analogue of (55) at each **z**0, that is,

$$\hat{\boldsymbol{\beta}}\_{\text{LL}}\left(\mathbf{z}\_{0}\right) = \underset{\boldsymbol{\beta}(\mathbf{z}\_{0}) \in \mathbb{R}^{K}}{\arg\min} \left\{ \sum\_{i=1}^{N} \mathcal{K}\_{\lambda}\left(\mathbf{z}\_{0}, \mathbf{z}^{i}\right) \left(\mathbf{y}^{i} - \mathbf{z}^{i,T}\boldsymbol{\beta}\left(\mathbf{z}\_{0}\right)\right)^{2} \right\},\tag{60}$$

with *<sup>β</sup>* (**z**0) <sup>=</sup> (*β*<sup>0</sup> (**z**0),..., *<sup>β</sup>K*−<sup>1</sup> (**z**0))*<sup>T</sup>* and the proxy function at **<sup>z</sup>**<sup>0</sup> is given by

$$
\hat{f}\_{\rm LL} \left( \mathbf{z}\_0 \right) = \mathbf{z}\_0^T \hat{\boldsymbol{\beta}}\_{\rm LL} \left( \mathbf{z}\_0 \right) \,. \tag{61}
$$

The LL kernel estimator can again be computed by WLS regression, that is,

$$
\hat{\boldsymbol{\beta}}\_{\rm LL} \left( \mathbf{z}\_0 \right) = \left( \boldsymbol{Z}^T \boldsymbol{W} \left( \mathbf{z}\_0 \right) \boldsymbol{Z} \right)^{-1} \boldsymbol{Z}^T \boldsymbol{W} \left( \mathbf{z}\_0 \right) \mathbf{y}\_{\prime} \tag{62}
$$

where *W* (**z**0) = diag *K<sup>λ</sup>* **z**0, **z**<sup>1</sup> ,..., *K<sup>λ</sup>* **z**0, **z***N* is the weight matrix and *Z* the design matrix containing row-wise the vectors **<sup>z</sup>***i*,*T*. The hat matrix *<sup>H</sup>* satisfies **<sup>y</sup>** <sup>=</sup> *<sup>H</sup>***<sup>y</sup>** with **<sup>y</sup>** <sup>=</sup> *f* LL **z**1 ,..., *f* LL **z***N<sup>T</sup>* containing the proxy function values at their target scenario vectors.

#### 3.7.3. Bandwidth Selection, AIC and LOO-CV

The bandwidths *λ<sup>k</sup>* in kernel regression can be selected similarly to the smoothing parameters in GAMs by minimization of a suitable model selection criterion. In fact, kernel smoothers can be interpreted as local non-parametric GLMs with identity link functions. More precisely, at each target scenario the kernel smoother can be viewed as a GLM (12) where the parametric weights *V* , *μi* GLMin (20) are the non-parametric kernel weights *K<sup>λ</sup>* **z**0, **z***<sup>i</sup>* in (60). Since GLMs are special cases of GAMs and the bandwidths in kernel regression can be understood as smoothing parameters, kernel smoothers and GAMs are sometimes lumped together in one category. If the numbers *N* of the fitting points and *K* of the basis functions are large, from a computational perspective it might be beneficial to perform bandwidth selection based on a reduced set of fitting points.

Hurvich et al. (1998) propose to select the bandwidths *λ*1, ... , *λK*−<sup>1</sup> based on an improved version of AIC which works in the context of non-parametric proxy functions that can be written as linear combinations of the observations. It has the form

$$\text{AIC} = \log\left(\hat{\upsilon}^2\right) + \frac{1 + \text{tr}\left(H\right)/N}{1 - \left(\text{tr}\left(H\right) + 2\right)/N} \tag{63}$$

where *σ*<sup>2</sup> <sup>=</sup> <sup>1</sup> *<sup>N</sup>* (**<sup>y</sup>** <sup>−</sup> **<sup>y</sup>**) *<sup>T</sup>* (**<sup>y</sup>** <sup>−</sup> **<sup>y</sup>**) and *<sup>H</sup>* is the hat matrix.

As an alternative, leave-one-out cross-validation (LOO-CV) is suggested by Li and Racine (2004) for bandwidth selection. Let us refer to

$$\hat{\boldsymbol{\beta}}\_{\text{LL},-j}(\mathbf{z}\_0) = \underset{\boldsymbol{\beta}(\mathbf{z}\_0) \in \mathbb{R}^K}{\text{arg min}} \left\{ \sum\_{i \neq j, i=1}^N \boldsymbol{K}\_{\lambda} \left( \mathbf{z}\_{0\prime} \mathbf{z}^i \right) \left( \mathbf{y}^i - \mathbf{z}^{i,T} \boldsymbol{\beta} \left( \mathbf{z}\_0 \right) \right)^2 \right\} \tag{64}$$

as the leave-one-out LL kernel estimator and to *f* LL,−*<sup>j</sup>* (**z**0) <sup>=</sup> **<sup>z</sup>***<sup>T</sup>* <sup>0</sup> *β*LL,−*<sup>j</sup>* (**z**0) as the leave-one-out proxy function at **z**0. The objective of LOO-CV is to choose the bandwidths *λ*1,..., *λK*−<sup>1</sup> which minimize

$$\text{CV} = \frac{1}{N} \sum\_{i=1}^{N} \left( y^i - \hat{f}\_{\text{LL}, -i}(\mathbf{z}\_0) \right)^2. \tag{65}$$

#### 3.7.4. Adaptive Forward Stepwise OLS Selection

A practical implementation of kernel regression can be found, for example, via the combination of functions *npreg(*·*)* and *npregbw(*·*)* from R package *np* of Racine and Hayfield (2018).

In the other sections, basis function selection depends on the respective regression methods. Since the crucial process of bandwidth selection in kernel regression takes a very long time in the implementation of our choice, it would be infeasible to proceed here in the same way. Therefore, we derive the basis functions for LC and LL regression by adaptive forward stepwise selection based on OLS regression, by risk factor wise linear selection or a combination thereof. Thereby, we keep the maximum allowed number *K*max of terms rather small as we aim to model the subtleties by kernel regression.

#### **4. Numerical Experiments**

#### *4.1. General Remarks*

#### 4.1.1. Data Basis

In our slightly disguised real-world example, the life insurance company has a portfolio with a large proportion of traditional German annuity business. This choice was made in order to challenge the regression techniques since German traditional annuity business features high interest rate guarantees which may lead to large losses in low interest rate environments. We let the insurance company be exposed to *D* = 15 relevant financial and actuarial risk factors. For the derivation of the fitting points, we run its CFP model conditional on *N* = 25, 000 fitting scenarios with each of these outer scenarios entailing two antithetic inner simulations. For a subset of the resulting fitting values of the best estimate of liabilities (BEL), see Figure 1, for summary statistics, the left column of Table 1, and for a histogram, the left panel of Figure 7.


**Table 1.** Summary statistics of fitting and nested simulation values of best estimate of liabilities (BEL).

The Sobol validation set is generated based on *L* = 51 validation scenarios with 1000 inner simulations, comprising 26 Sobol scenarios, 15 one-dimensional risk scenarios, 1 base scenario and 9 scenarios that turned out to be capital region scenarios in the previous year risk capital calculations. The nested simulations set which is due to its high computational costs not available in the regular LSMC approach reflects the highest 5% real-world losses and is based on *L* = 1638 outer scenarios with respectively 4000 inner simulations. From the 1638 real-world scenarios, 14 exhibit extreme stresses far beyond the bounds of the fitting space and are therefore excluded from the analysis. For the remaining nested simulation values of BEL, see Figure 3, for summary statistics, the right column of Table 1, and for a histogram, the right panel of Figure 7. The capital region set consists of the *L* = 129 nested simulations points which correspond to the nested simulations SCR estimate (= 99.5% highest loss) and the 64 losses above and below (= 99.3% to 99.7% highest losses).

**Figure 7.** Histograms of fitting and nested simulation values of BEL.

#### 4.1.2. Validation Figures

We will output validation figure (1) with respect to the relative and asset metric, and additionally figures (2)–(4). While figures (3) and (4) are evaluated with respect to a base value resulting from 1000 inner simulations on the Sobol set, that is, v.mae0, v.res0, they are computed with respect to a base value resulting from 16, 000 inner simulations on the nested simulations set, that is, ns.mae0, ns.res0, and capital region set, that is, cr.mae0, cr.res0. The latter base value is supposed to be the more reliable

validation value since it is the one associated with a lower standard error. Therefore it is worth noting here that figure v.res<sup>0</sup> can easily be transformed such that it is also evaluated with respect to the latter base value by subtracting from it the difference of 14 which the two different base values incur. We will not explicitly state the base residual (5) as it is just (2) minus (4).

#### 4.1.3. Economic Variables

We derive the OLS proxy functions for two economic variables, namely for the best estimate of liabilities (BEL) and the available capital (AC) over a one-year risk horizon, that is, *Y*(*X*) ∈ {BEL(*X*), AC(*X*)}. Their approximation quality is assessed by validation figures (1) with respect to the relative and asset metric and (2). Essentially, AC is obtained as the market value of assets minus BEL, which means that AC reflects the negative behavior of BEL. Therefore, we will only derive BEL proxy functions with the other regression methods. The profit resulting from a certain risk constellation captured by an outer scenario *X* can be computed as AC(*X*) minus the base AC. Validation figures (3) and (4) address the approximation quality of this difference. Taking the negative of the profit yields the loss and evaluating the loss at all real-world scenarios the real-world loss distribution from which the SCR is derived as the 99.5% value-at-risk. The out-of-sample performances of two different OLS proxy functions of BEL on the Sobol, nested simulations and capital region sets serve as the benchmark for the other regression methods.

#### 4.1.4. Numerical Stability

Let us discuss the subject of numerical stability of QR decompositions in the OLS regression design under a monomial basis. If the weighting in the weighted least-squares problems associated with GLMs, heteroscedastic FGLS regression and kernel regression is good-natured, similar arguments apply as they can also be solved via QR decompositions according to Green (1984) where the weighting is just a scaling. However, the weighting itself raises additional numerical questions that need to be taken into consideration when making the regression design choices. In GLMs, these choices are the random component (13) and link function (12), in FGLS regression it is the functional form of the heteroscedatic variance model (42) and in kernel regression it is the kernel function (58). The following arguments do not apply to GAMs and MARS models as these are constructed out of spline functions, see (25) and (47), respectively. In GAMs, the penalty matrix increases numerical stability.

McLean (2014) justifies that from the perspective of numerical stability performing a QR decomposition on a monomial design matrix *Z* is asymptotically equivalent to using a Legendre design matrix *Z* and transforming the resulting coefficient estimator into the monomial one. Under the assumption of an orthonormal basis, Weiß and Nikoli´c (2019) have derived an explicit upper bound for the condition number of non-diagonal matrix <sup>1</sup> *<sup>N</sup>* (*Z* )*T*(*Z* ) for *N* < ∞, where the factor <sup>1</sup> *<sup>N</sup>* is used for technical reasons. This upper bound increases in (1) the number of basis functions, (2) the Hardy-Krause variation of the basis, (3) the convergence constant of the low-discrepancy sequence, and (4) the outer scenario dimension. Our previously defined type of restriction setting controls aspect (1) through the specification of *K*max and aspect (2) through the limitation of exponents *d*1*d*2*d*3. Aspects (3) and (4) are beyond the scope of the calibration and validation steps of the LSMC framework and therefore left aside here.

#### 4.1.5. Interpolation and Extrapolation

In the LSMC framework, let us refer by interpolation to prediction inside the fitting space and by extrapolation to prediction outside the fitting space. Runge (1901) found that high-degree polynomial interpolation at equidistant points can oscillate toward the ends of the interval with the approximation error getting worse the higher the degree is. In a least-squares problem, Runge's phenomenon was shown by Dahlquist and Björck (1974) not to apply to polynomials of degree *d* fitted based on *N* equidistant points if the inequality *d* < 2 <sup>√</sup>*<sup>N</sup>* holds. With *<sup>N</sup>* = 25,000 fitting points the inequality becomes *d* < 316 so that we clearly do not have to impose any further restrictions in OLS, FGLS and

kernel regression as well as in GLMs to keep this phenomenon under control. Splines as they occur in GAMs and MARS models do not suffer from this oscillation issue by construction.

Since Runge's phenomenon concerns the ends of the interval and the real-world scenarios for the insurer's full loss distribution forecast in the fourth step of the LSMC framework partly go beyond the fitting space, its scope comprises the extrapolation area as well. High-degree polynomial extrapolation can worsen the approximation error and play a crucial role if many real-world scenarios go far beyond the fitting space.

#### 4.1.6. Principle of Parsimony

Another problem that can occur in an adaptive algorithm is overfitting. Burnham and Anderson (2002) state that overfitted models often have needlessly large sampling variances which means that their precision of the predictions is poorer than that of more parsimonious models which are also free of bias. In cases where AIC leads to overfitting, implementing restriction settings of the form *K*max *d*1*d*2*d*<sup>3</sup> becomes relevant for adhering to the principle of parsimony.

#### *4.2. Ordinary Least-Squares (OLS) Regression*

#### 4.2.1. Settings

We build the OLS proxy functions (10) of *Y*(*X*) ∈ {BEL(*X*), AC(*X*)} with respect to an outer scenario *X* out of monomial basis functions that can be written as *ek* (*X*) = ∏<sup>15</sup> *<sup>l</sup>*=<sup>1</sup> *<sup>X</sup>r<sup>l</sup> k <sup>l</sup>* with *<sup>r</sup><sup>l</sup> <sup>k</sup>* ∈ N<sup>0</sup> so that each basis function can be represented by a 15-tuple *r*1 *<sup>k</sup>* ,...,*r*<sup>15</sup> *k* . The final proxy function depends on the restrictions applied in the adaptive algorithm. The purpose of setting restrictions is to guarantee numerical stability, to keep the extrapolation behavior under control and the proxy functions parsimonious. In order to illustrate the impact of restrictions, we run the adaptive algorithm for BEL under two different restriction settings with the second one being so relaxed that it will not take effect in our example. Additionally, we run the adaptive algorithm under the first restriction setting for AC to give an example of how the behavior of BEL can transfer to AC. As the first ingredient of our restriction setting acts the maximum allowed number of terms *K*max. Furthermore, we limit the exponents in the monomial basis. Firstly we apply a uniform threshold to all exponents, that is, *rl <sup>k</sup>* <sup>≤</sup> *<sup>d</sup>*1. Secondly we restrict the degree, that is, <sup>∑</sup><sup>15</sup> *<sup>l</sup>*=<sup>1</sup> *<sup>r</sup><sup>l</sup> <sup>k</sup>* ≤ *d*2. Thirdly we restrict the exponents in interaction basis functions, that is, if there are some *l*<sup>1</sup> = *l*<sup>2</sup> with *r l*1 *<sup>k</sup>* , *r l*2 *<sup>k</sup>* > 0, we require *r l*1 *<sup>k</sup>* , *r l*2 *<sup>k</sup>* ≤ *d*3. Let us denote this type of restriction setting by *K*max - *d*1*d*2*d*3.

As the first and second restriction settings, we choose 150–443 and 300–886, respectively, motivated by Teuguia et al. (2014) who found in their LSMC example in Chapter 4 with four risk factors and 50,000 fitting scenarios entailing two inner simulations that the validation error computed based on 14 validation scenarios started to stabilize at degree 4 when using monomial or Legendre basis functions in different adaptive basis function selection procedures. Furthermore, they pointed out that the LSMC approach becomes infeasible for degrees higher than 12.

We apply R function *lm(*·*)* implemented in R package *stats* of R Core Team (2018).

#### 4.2.2. Results

Table A1 contains the final BEL proxy function derived under the first restriction setting 150–443 with the basis function representations and coefficients. Thereby reflect the rows the iterations of the adaptive algorithm and depict thus the sequence in which the basis functions are selected. Moreover, the iteration-wise AIC scores and out-of-sample MAEs (1) with respect to the relative metric in % on the Sobol, nested simulations and capital region sets are reported, that is, v.mae, ns.mae and cr.mae. Table A2 contains the AC counterpart of the BEL proxy function derived under 150–443 and Table A3 the final BEL proxy function derived under the more relaxed restriction setting 300–886. Tables A4 and A5 indicate respectively for the BEL and AC proxy functions derived under 150–443 the AIC scores and all five previously defined validation figures evaluated on the Sobol, nested simulations

and capital region sets after each tenth iteration. Similarly, Table A6 reports these figures for the BEL proxy function derived under 300-886. Here the last row corresponds to the final iteration.

Lastly, we manipulate the validation values on all three validation sets twice insofar as we subtract respectively add pointwise 1.96 times the standard errors from respectively to them (inspired by 95% confidence interval of gaussian distribution). We then evaluate the validation figures for the final BEL proxy functions under both restriction settings on these manipulated sets of validation value estimates and depict them in Table A7 in order to assess the impact of the Monte Carlo error associated with the validation values.

#### 4.2.3. Improvement by Relaxation

Tables A1 and A2 state that the adaptive algorithm terminates under 150–443 for both BEL and AC when the maximum allowed number of terms is reached. This gives reason to relax the restriction setting to, for example, 300–886 which eventually lets the algorithm terminate due to no further reduction in the AIC score without hitting restrictions 886, compare Table A3 for BEL. In fact, only restrictions 224–464 are hit. Except for the already very small figures cr.mae, cr.mae*<sup>a</sup>* and cr.res all validation figures are further improved by the additional basis functions, see Tables A4 and A6. The largest improvement takes place between iterations 180 and 190. The result that at maximum degrees 464 are selected is consistent with the result of Teuguia et al. (2014) who conclude in their numerical examples of Chapter 4 that under a monomial, Legendre or Laguerre basis the optimum degree is probably 4 or 5. Furthermore, Bauer and Ha (2015) derive a similar result in their one risk factor LSMC example of Chapter 6 when using 50, 000 fitting scenarios and Legendre, Hermite, Chebychev basis functions or eigenfunctions.

According to our Monte Carlo error impact assessment in Table A7, the slight deterioration at the end of the algorithm is not sufficient to indicate a slight overfitting tendency of AIC. Under the standard choices of the five major components, compare Section 2.2, the adaptive algorithm manages thus to provide a numerically stable and parsimonious proxy function even without a restriction setting. Here, allowing a priori unlimited degrees of freedom is thus beneficial to capturing the complex interactions in the CFP model.

#### 4.2.4. Reduction of Bias

Overall, the systematic deviations indicated by the means of residuals (2) and (4) are reduced significantly on the three validation sets by the relaxation but not completely eliminated. For the 300–886 OLS residuals on the three sets, see the diamond-shaped residuals in Figures 8–10, respectively. While the reduction of the bias comes along with the general improvement stated above, the remainder of the bias indicates that sample size is not sufficiently large or that the functional form is not flexible enough to replicate the complex interactions in CFP models. Note that if the functional form is correctly specified, Proposition 3.2 of Bauer and Ha (2015) states that if sample size is not sufficiently large, the AC proxy function will on average be positively biased in the tail reflecting the high losses and the BEL proxy function will thus be negatively biased there. Since Propositions 1 and 2 of Gordy and Juneja (2010) state that this result holds for the nested simulations estimators as well, the validation values of the nested simulations and capital region sets need to be more accurate in order to serve for bias detection in this case. For an illustration of such as bias, see Figures 5 and 6 of Bauer and Ha (2015). The bias in our one sample example is in the opposite systematic direction, which is an indication of insufficiency of polynomials. This is also consistent with the observations in the industry that the polynomials seem not to able to replicate the sudden changes in steepness of AC and BEL which are a consequence of regulation and complex management actions in the CFP models.

**Figure 8.** Residual plots on Sobol set.

**Figure 9.** Residual plots on nested simulations set.

Unlike figures (1) and (2), figures (3) and (4) do not forgive a bad fit of the base value if the validation values are well approximated by a proxy function. Contrariwise, if a proxy function shows the same systematic deviation from the validation values and the base value, (3) and (4) will be close to zero whereas (1) and (2) will be not. The comparisons <sup>|</sup>v.res<sup>|</sup> <sup>&</sup>lt; v.res0 , <sup>|</sup>cr.res<sup>|</sup> <sup>&</sup>lt; cr.res0 but <sup>|</sup>ns.res<sup>|</sup> <sup>&</sup>gt; ns.res<sup>0</sup> , holding under both restrictions settings, indicate that on the Sobol and capital region sets primarily the base value is not approximated well whereas on the nested simulations set not only the base value but also the validation values are missed. The MAEs capture this result, too, that is, v.mae, cr.mae < ns.mae but ns.mae<sup>0</sup> < v.mae0, cr.mae0.

**Figure 10.** Residual plots on capital region set.

#### 4.2.5. Relationship between BEL and AC

The MAEs with respect to the relative metric for BEL are much smaller than for AC since the two economic variables are subject to similar absolute fluctuations with, for example, in the base case BEL being approximately 20 times the size of AC. The similar absolute fluctuations are reflected by the iteration-wise very similar MAEs with respect to the asset metric of BEL and AC, compare v.mae*a*, ns.mae*<sup>a</sup>* and cr.mae*<sup>a</sup>* given in % in Tables A4 and A5. Furthermore, they manifest themselves in the iteration-wise opposing means of residuals v.res, v.res0, ns.res and cr.res as well as in the similar-sized MAEs v.mae0, ns.mae0 and cr.mae0.

#### *4.3. Generalized Linear Models (GLMs)*

#### 4.3.1. Settings

We derive the GLMs (12) of BEL under restriction settings 150–443 and 300–886 which we also employed for the derivation of the OLS proxy functions. Thereby, we run each restriction setting with the canonical choices of random components for continuous (non-negative) response variables, that is, the gaussian, gamma and inverse gaussian distributions, compare McCullagh and Nelder (1989). In cases where the economic variable can also attain negative values (for example, AC), a suitable shift of the response values in a preceding step would be required. We combine each of the three random component choices with the commonly used identity, inverse and log link functions, that is, *g* (*μ*) ∈ . id (*μ*), <sup>1</sup> *<sup>μ</sup>* , log (*μ*) / , compare Hastie and Pregibon (1992). In combination with the inverse gaussian random component, we consider additionally link function <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> . Further choices are conceivable but go beyond this first shot.

We take R function *glm(*·*)* implemented in R package *stats* of R Core Team (2018).

#### 4.3.2. Results

While Tables A8–A10 display the AIC scores and five previously defined validation figures after each tenth iteration for the just mentioned combinations under 150–443, Tables A11–A13 do so under 300-886 and include furthermore the final iterations. Table A14 gives an overview of the AIC scores and validation figures corresponding to all considered final GLMs and highlights in green and red respectively the best and worst values observed per figure.

#### 4.3.3. Improvement by Relaxation

The OLS regression is the special case of a GLM with gaussian random component and identity link function which is why the first sections of Tables A8 and A11 coincide respectively with Tables A4 and A6. The adaptive algorithm terminates under 150–443 not only for this combination but also for all other ones when the maximum allowed number of terms is reached. Under 300–886 termination occurs due to no further reduction in the AIC score without hitting the restrictions-the different GLMs stop between 208–454 and 250–574.

For all GLMs except for the one with gamma random component and identity link, the AIC scores and eight most significant validation figures for measuring the approximation quality, namely leftmost figure v.mae to rightmost figure ns.res in the tables, are improved through the relaxation as can be seen in Table A14. For gamma random component with identity link, the deteriorations are negligible. Overall, figures ns.mae<sup>0</sup> and cr.mae0 are deteriorated by at maximum 0.5% points and figures ns.res<sup>0</sup> and cr.res0 by at maximum 4 units. Figures cr.mae and cr.mae*<sup>a</sup>* are especially small under 150–443 so that slight deteriorations by at maximum 0.05% points under 300-886 towards the levels of v.mae and v.mae*<sup>a</sup>* or ns.mae and ns.mae*<sup>a</sup>* are not surprising. Similar arguments apply to the acceptability of the maximum deterioration of cr.res by 13 to 17 units for inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link. We conclude that the more relaxed restriction setting 300–886 performs better than 150–443 for all GLMs in our numerical example. This result appears plausible in comparison with the OLS result from the previous section and hence also compared to the OLS results of Teuguia et al. (2014) and Bauer and Ha (2015).

AIC cannot be said to show an overfitting tendency according to Tables A11–A13 and also Table A7 since the validation figures do not deteriorate in the late iterations more than they underly Monte Carlo fluctuations, compare the OLS interpretation. Using GLMs instead of OLS regression in the standard adaptive algorithm, compare Section 2.2, lets the algorithm thus maintain its property to yield numerically stable and parsimonious proxy functions even without restriction settings.

#### 4.3.4. Reduction of Bias

According to Table A14, inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link shows the most significant decrease in v.mae by −0.088% points when moving from 150–443 to 300–886. Under 300–886 this combination even outperforms all other ones (highlighted in green) whereas under 150–443 it is vice versa (highlighted in red). Hence, the performance of a random component link combination under 150–443 does not generalize to 300–886. On the Sobol and nested simulations sets, the MAEs (1) are not only considerably lower for inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link than for all others but also the closest together even when the capital region set is included. This speaks for a great deal of consistency.

In fact, the systematic overestimation of 81% of the points on the nested simulations set by inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link is certainly smaller than, for example, that of 89% by gaussian with identity link but still very pronounced. On the capital region set, the overestimation rates for these two combinations are 41% and 56%, respectively, meaning that here the bias is negligibe. Surprisingly, for most GLMs the bias is here smaller than for inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link but since this result does not generalize to the nested simulations set, we regard it as a chance event and do not question the rather mediocre performance of inverse gaussian with <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link here further. Interpreting the mean of residuals (2) provides similar insights.

In particular, for inverse gaussian <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link GLM the reduction of the bias comes along with the general improvement by the relaxation. The small remainder of the bias indicates not only that this GLM is a promising choice here but also that identifying suitable regression methods and functional forms is crucial to further improving the accuracy of the proxy function. For the residuals on the three sets, see the triangle-shaped residuals in Figures 8–10, respectively.

#### 4.3.5. Major and Minor Role of Link Function and Random Component

Apart from the just considered case, for all three random components, the relaxation to 300–886 yields the largest out-of-sample performance gains in terms of v.mae with identity link (between −0.047% and −0.058% points), closely followed by log link (between −0.033% and −0.047% points), and the least gains with inverse link (between −0.017% and −0.020% points). While with identity link the largest improvements before finalization take place for gaussian, gamma and inverse gaussian random components between iterations 180 to 190, 170 to 180, and 150 to 160, respectively, with log link they occur much sooner between iterations 120 to 130, 110 to 120, and 110 to 120, respectively, see Tables A11–A13. As a result of this behavior, under 150–443 log link performs better than identity link for gaussian and inverse gaussian whereas under 300–886 it is vice versa. Inverse link always performs worse than identity and log links, in particular under 300–886.

Applying the same link with different random components does not bring much variation under 300–886 with gamma and inverse gaussian being slightly better than gaussian for all considered links though. A possible explanation is that the distribution of BEL is slightly skewed conditional on the outer scenarios. Thereby results the skewness in the inner simulations from an asymmetric profit sharing mechanism in the CFP model. While the policyholders are entitled to participate at the profits of an insurance company, see, for example, Mourik (2003), the company has to bear its losses fully by itself. Since gaussian performs only slightly worse than the skewed distributions, it should still be considered for practical reasons because it has a closed-form solution and a great deal of statistical theory has been developed for it, compare, for example, Dobson (2002). By conclusion, the choice of the link is more important than that of the random component so that trying alternative link functions might be beneficial.

#### *4.4. Generalized Additive Models (GAMs)*

#### 4.4.1. Settings

For the derivation of the GAMs (26) of BEL, we apply only restriction settings *K*max-443 with *K*max ≤ 150 in the adaptive algorithm since we use smooth functions (25) constructed out of splines that may already have exponents greater than 1 to which the monomial first-order basis functions are raised. As the model selection criterion we take GCV (32) used by our chosen implementation by default. We vary different ingredients of GAMs while holding others fixed to carve out possible effects of these ingredients on the approximation quality of GAMs in adaptive algorithms and our application.

We rely on R function *gam(*·*)* implemented in R package *mgcv* of Wood (2018).

#### 4.4.2. Results

Table A15 contains the validation figures for GAMs with varying number of spline functions per smooth function, that is, *J* ∈ {4, 5, 8, 10}, after each tenth and the finally selected smooth function. In the case of adaptive forward stepwise selection the iteration numbers coincide with the numbers of selected smooth functions. In contrast, table sections with adaptive forward stagewise selection results do not display the iteration numbers in the smooth function column *k*. In Table A16, we display the effective degrees of freedom, p-values and significance codes of each smooth function of the *J* = 4 and *J* = 10 GAMs from the previous table at stages *k* ∈ {50, 100, 150}. The p-values and significance codes are based on a test statistic of Marra and Wood (2012) having its foundations in the frequentist properties of Bayesian confidence intervals analyzed in Nychka (1988). Tables A17 and A18 report the validation figures respectively for GAMs with numbers *J* = 5 and *J* = 10, where the types of the spline functions are varied. Thin plate regression splines, penalized cubic regression splines, duchon splines and Eilers and Marx style P-splines are considered. Thereafter, Tables A19 and A20 display the validation figures respectively for GAMs with numbers *J* = 4 and *J* = 8 and different random component link function combinations. As in GLMs, we apply the gaussian, gamma and inverse gaussian distributions with identity, log, inverse and <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> (only inverse gaussian) link functions.

Table A21 compares by means of two exemplary GAMs the effects of adaptive forward stagewise selection of length *L* = 5 and adaptive forward stepwise selection. Last but not least, Table A22 contains a mixture of GAMs challenging the results which we will have deduced from the other GAM tables. Table A23 gives an overview of the validation figures corresponding to all derived final GAMs and highlights in green and red respectively the best and worst values observed per figure.

#### 4.4.3. Efficiency and Performance Gains by Tailoring the Spline Function Number

Table A15 indicates that the MAEs (1) and (3) of the exemplary GAMs built up of thin plate regression splines with gaussian random component and identity link tend to increase with the number *J* of spline functions per dimension until *k* = 100. Running more iterations reverses this behavior until *k* = 150. Hence, as long as comparably few smooth functions have been selected in the adaptive algorithm fewer spline functions tend to yield better out-of-sample performances of the GAMs whereas many smooth functions tend to perform better with more spline functions. A possible explanation of this observation is that an omitted-variable bias due to too few smooth functions is aggravated here by an overfitting due to too many spline functions. For more details on an omitted-variable bias, see, for example, Pindyck and Rubinfeld (1998), and for the needlessly large sampling variances and thus low estimation precision of overfitted models, see, for example, Burnham and Anderson (2002). Differently, the absolute values of the means of residuals (2) and (4) tend to become smaller with increasing *J* regardless of *k*.

According to Table A16, the components of the effective degrees of freedom (31) associated with each smooth function tend to decrease for *J* = 4 and *J* = 10 slightly in *k*. This is plausible as the explanatory power of each additionally selected smooth term is expected to decline by trend in the adaptive algorithm. Conditional on df > 1, that is for proportions of at least 40% of all smooth terms, the averages of the effective degrees of freedom belonging to *k* ∈ {50, 100, 150} amount for *J* = 4 and *J* = 10 to {2.494, 2.399, 2.254} and {5.366, 4.530, 4.424}, respectively. The values are by construction smaller than *J* − 1 since one degree of freedom per smooth function is lost to the identifiability constraints. Hence, for at least 40% of the smooth functions, on average *J* = 6 is a reasonable choice to capture the CFP model properly while maintaining computational efficiency, compare Wood (2017). The other side of the coin here is that up to 60% of the smooth functions are supposed to be replacable by simple linear terms without losing accuracy so that here tremendous efficiency gains can be realized by making the GAMs more parsimonious. Furthermore, setting *J* individually for each smooth function can help improve computational efficiency (if *J* should be set below average) and out-of-sample performance (if *J* should be set above average). However, such a tailored approach entails the challenge that the optimal *J* per smooth function is not stable across all *k*, compare row-wise the degrees of freedom in the table for *J* = 4 and *J* = 10.

#### 4.4.4. Dependence of Best Spline Function Type

According to Tables A17 and A18, the adaptive algorithm terminates only due to no further decrease in GCV when the GAMs are composed of duchon splines discussed in Duchon (1977). Whether GCV has an overfitting tendency here cannot be deduced from this example since only restriction settings with *K*max ≤ 150 are tested. The thin plate regression splines of Wood (2003) and penalized cubic regression splines of Wood (2017) perform similarly and significantly better than the duchon splines for both *J* = 5 and *J* = 10. For *J* = 5 the Eilers and Marx style P-splines proposed by Eilers and Marx (1996) perform by far best when *K*max = 100 smooth functions are allowed. However, for *J* = 10 they are outperformed by both the thin plate regression splines and penalized cubic regression splines when between *K*max = 125 and 150 smooth functions are allowed. This result illustrates well that the best choice of the spline function type varies with *J* and *K*max, meaning that it should be selected together with these parameters.

#### 4.4.5. Minor Role of Link Function and Random Component

For GLMs, we have seen that varying the random component barely alters the validation results whereas varying the link function can make a noticeable impact. While this result mostly applies to the earlier compositions of GAMs as well, it certainly does not to the later ones. See for instance early composition *k* = 40 in Table A19. Here identity link GAMs with gamma and inverse gaussian random components perform more similar to each other than identity and log link GAMs with gamma random component or identity and log link GAMs with inverse gaussian random component do. Log link GAMs with gamma and inverse gaussian random components show such a behavior as well. However identity link GAM with the less flexible gaussian random component (no skewness) does not show at all a behavior similar to that of identity link GAMs with gamma or inverse gaussian random components. Now see later compositions *k* ∈ {70, 80} to verify that all available GAMs in the table produce very similar validation results.

For another example see Table A20. For early composition *k* = 50, identity link GAMs with gaussian and gamma random components behave very similar to each other just like log link GAMs with gaussian and gamma random components do. For later compositions *k* ∈ {100, 110}, again all available GAMs produce very similar validation results. A possible explanation of this result is that the impact of the link function and random component decreases with the number of smooth functions as the latter take the modeling over. By conclusion, the choices of the random component and link function do not play a major role when the GAM is built up of many smooth functions.

#### 4.4.6. Consistency of Results

Table A21 shows based on two exemplary GAMs constructed out of *J* = 8 thin plate regression splines per dimension varying in the random component and link function that the adaptive forward stagewise selection of length *L* = 5 and adaptive forward stepwise selection lead to very similar GAMs and validation results. As a result, stagewise selection should be preferred due to its considerable run time advantage. As we will see in the following, the run time can be further reduced without any drawbacks by dynamically selecting even more than 5 smooth functions per iteration.

The purpose of Table A22 is to challenge the hypotheses deduced above. Like Table A15, this table contains the results of GAMs with varying spline function number *J* ∈ {5, 8, 10} and fixed spline function type. Instead of thin plate regression splines, now Eilers and Marx style P-splines are considered. Since adaptive forward stepwise and stagewise selection do not yield significant differences in the examples of Table A21, we do not expect that permutations thereof affect the results much here as well. This allows us to randomly assign three different adaptive forward selection approaches to the three exemplary proxy function derivation procedures. As one of these approaches, we choose a dynamic stagewise selection approach in which *L* is determined in each iteration as the proportion 0.25 of the size of the candidate term set. Again we see that as long as only *k* ∈ {90, 100} smooth functions have been selected, *J* = 5 performs better than *J* = 8 and *J* = 8 better than *J* = 10. However, *k* = 150 smooth functions are not sufficient this time for *J* = 10 to catch up with the performance of *J* = 5. The observed performance order is consistent with the hypotheses of a high stability of the GAMs with respect to the adaptive selection procedure and random component link function combination.

#### 4.4.7. Potential of Improved Interaction Modeling

Table A23 presents as the most suitable GAM the one with highest allowed maximum number of smooth functions *K*max = 150 and highest number of spline functions *J* = 10 per dimension. The slight deterioration after *k* = 130 reported by Table A15 indicates that at least one of the parameters is already comparably high. According to Table A16, there are a few smooth terms which might benefit from being composed of more than ten spline functions and increasing *K*max might be helpful to capturing the interactions in the CFP model more appropriately, particularly in the light of the fact that the best GLM, having 250 basis functions, outperforms the best GAM on both the Sobol and nested simulations set, compare Table A14, with the best GAM showing a comparably low bias across the three validation sets though, see the dot-shaped residuals in Figures 8–10, respectively. Variations in the random component link function combination and adaptive selection procedure are not expected to change the performance much. By conclusion, we recommend the fast gaussian identity link GAMs (several expressions in the PIRLS algorithm simplify) with tailored spline function numbers per smooth function and simple linear terms under stagewise selection approaches of suitable lengths *L* ≥ 5 and more relaxed restriction settings where *K*max > 150.

#### *4.5. Feasible Generalized Least-Squares (FGLS) Regression*

#### 4.5.1. Settings

Like the OLS proxy functions and GLMs, we derive the FGLS proxy functions (38) under restriction settings 150–443 and 300–886. For the performance assessment of FGLS regression, we apply type I and II algorithms with variance models of different complexity, where type I results are obtained as a by-product of type II algorithm since the latter algorithm builds upon the former one. We control the complexity through the maximum allowed numbers of variance model terms *M*max ∈ {2; 6; 10; 14; 18; 22}.

We combine R functions *nlminb(*·*)* and *lm(*·*)* implemented in R package *stats* of R Core Team (2018).

#### 4.5.2. Results

Tables A24 and A25 display respectively the adaptively selected FGLS variance models of BEL corresponding to maximum allowed numbers of terms *M*max based on final 150–443 and 300–886 OLS proxy functions given in Tables A1 and A3. For reasons of numerical stability and simplicity, only basis functions with exponents summing up to at max two are considered as candidates. Additionally, the AIC scores and MAEs with respect to the relative metric are reported in the tables. By construction, these results are also the type I algorithm outcomes. Tables A26 and A27 summarize respectively under 150–443 and 300–886 all iteration-wise out-of-sample test results. The results of type II algorithm after each tenth and the final iteration of adaptive FGLS proxy function selection are respectively displayed by Tables A28 and A29. Table A30 gives an overview of the AIC scores and validation figures corresponding to all final FGLS proxy functions and highlights as in the previous overview tables in green and red respectively the best and worst values observed per figure.

#### 4.5.3. Consistency Gains by Variance Modeling

By looking at Tables A24 and A25 we see similar out-of-sample performance patterns during adaptive variance model selection based on the basis function sets of 150–443 and 300–886 OLS proxy functions. In both cases, the p-values of Breusch-Pagan test indicate that heteroscedasticity is not eliminated but reduced when the variance models are extended, that is, when *M*max is increased. In fact, in a more good-natured LSMC example Hartmann (2015) shows that a type I alike algorithm manages to fully eliminate heteroscedasticity. While the MAEs (1) barely change on the Sobol set, they decrease significantly on the nested simulations set and increase noticeably on the capital region set. Under 300–886 the effects are considerably smaller than under 150–443 since the capital region performance of 300–886 OLS proxy function is less extraordinarily good than that of 150–443 OLS proxy function. The three MAEs approach each other under both restriction settings. Hence the reductions in heteroscedasticity lead to consistency gains across the three validation sets.

Tables A26 and A27 complete the just discussed picture. The remaining validation figures on the Sobol set improve through type I FGLS regression slightly compared to OLS regression. Like ns.mae, figure ns.res and the base residual improve a lot with increasing *M*max under 150–443 and a little less under 300-886 but ns.mae0 and ns.res<sup>0</sup> do not alter much as the aforementioned two figures cancel each other out here. On the capital region set, the figures deteriorate or remain comparably high in absolute values. The type I FGLS figures converge fast so that increasing *M*max successively from 10 to 22 barely affects the out-of-sample performance anymore. As a result of heteroscedasticity modeling, the proxy functions are shifted such that overall approximation quality increases. Unfortunately, this does not guarantee an improvement in the relevant region for SCR estimation as our example illustrates well.

#### 4.5.4. Monotonicity in Complexity

Let us address the type II FGLS results under 150-443 in Table A28 now. For *M*max = 2, figures (3) and (4) are improved on all three validation sets significantly compared to OLS regression with the type I figures lying inbetween. The other validation figures are similar for OLS, type I and II FGLS regression, which traces the performance gains in (3) and (4) back to a better fit of the base value. For *M*max = 6 to 22, the type II figures show the same effects as the type I ones but more pronouncedly, see the previous two paragraphs. These effects are by trend the more distinct the more complex the variance model becomes. The type II figures stabilize less than the type I ones because of the additional variability coming along with adaptive FGLS proxy function selection. Hartmann (2015) shows in terms of Sobol figures in her LSMC example that increasing the complexity while omitting only one regressor from the simpler variance model can deteriorate the out-of-sample performance dramatically. Intuitively, it is plausible that the FGLS validation figures are the farther from the OLS figures away the more elaborately heteroscedasticity is modeled.

Now let us relate the type II FGLS results under 300-886 in Table A29 to the other FGLS results. Under 300–886 for *M*max = 2, figures (3) and (4) are already at a comparably good level with both OLS and type I FGLS regression so that they do not alter much or even deteriorate with type II FGLS regression. Like under 150–443 for *M*max = 6 to 22, the type II figures show the effects of the type I ones more pronouncedly. Under both restriction settings, ns.mae and ns.res decrease thereby significantly. While this barely causes ns.res<sup>0</sup> to change under 150–443, it lets ns.res<sup>0</sup> increase in absolute values under 300–886. The slight improvements on the Sobol set and the deteriorations on the capital region set carry over to 300–886. When *M*max is increased up to 22, the type II FGLS validation figures under 300–886 do not stop fluctuating. The variability entailed by adaptive FGLS proxy function selection intensifies thus through the relaxation of the restriction setting in this numerical example. According to Breusch-Pagan test, heteroscedasticity is neither eliminated by the type II algorithm here nor by a type II alike approach of Hartmann (2015) in her more good-natured example.

#### 4.5.5. Improvement by Relaxation

Among all FGLS proxy functions listed in Table A30, we consider type II with *M*max = 14 in variance model selection under 300–886 as the best performing one. Apart from nested simulations validation under type I algorithm, 300–886 performs better than 150–443. Since on the other hand type II algorithm performs better than type I algorithm under the respective restriction settings, 300–886 and type II algorithm are the most promising choices here. Differently *M*max = 14 does not constitute a stable choice due to the high variability coming along with 300–886 and type II algorithm.

While all type I FGLS proxy functions are by definition composed of the same basis functions as the OLS proxy function, the compositions of type II FGLS proxy functions vary with *M*max because of their renewed adaptive selection. Consequently, under 300–886 all type I FGLS proxy functions hit the same restrictions 224–464 as the OLS proxy function does, whereas the restrictions hit by type II FGLS proxy functions vary between 224–454 and 258–564. This variation is consistent with the OLS and GLM results from the previous sections and hence the OLS results of Teuguia et al. (2014) and Bauer and Ha (2015).

AIC does not have an overfitting tendency according to Tables A26–A29 as the validation figures do not deteriorate in the late iterations more than they underly Monte Carlo fluctuations, compare the OLS and GLM interpretations. Using FGLS instead of OLS regression in the standard adaptive algorithm, compare Section 2.2, lets the algorithm thus yield numerically stable and parsimonious proxy functions without restriction settings as well.

#### 4.5.6. Reduction of Bias

The type II *M*max = 14 FGLS proxy function under 300-886 reaches with 258 terms the highest observed number across all numerical experiments and not only outperforms all derived GLMs and GAMs in terms of combined Sobol and nested simulations validation, it also shows by far the smallest bias on these two validation sets and approximates the base value comparably well. This observation speaks for a high interaction complexity of the CFP model. The reduction of the bias comes again along with the general improvement by the relaxation. Given the fact that the capital region set presents the most extreme and challenging validation set in our analysis, the still mediocre performance here can be regarded as acceptable for now. Nevertheless, especially the bias on this set motivates the search for even more suitable regression methods and functional forms. For the residuals of the 300–886 FGLS proxy function on the three sets, see the x-shaped residuals in Figures 8–10, respectively.

#### *4.6. Multivariate Adaptive Regression Splines (MARS)*

#### 4.6.1. Settings

We undertake a two-step approach to identify suitable generalized MARS models out of numerous possibilities. In the first step, we vary several MARS ingredients over a wide range and obtain in this way a large number of different MARS models. To be more specific, we vary the maximum allowed number of terms *K*max ∈ {50, 113, 175, 237, 300} and the minimum threshold for the decrease in the residual sum of squares *<sup>t</sup>*min <sup>∈</sup> {0, 1.25, 2.5, 3.75, 5} · <sup>10</sup>−<sup>5</sup> in the forward pass, the order of interaction *o* ∈ {3, 4, 5, 6}, the pruning method *p* ∈ {'n', 'b', 'f', 's'} with 'n' = 'none', 'b' = 'backward', 'f' = 'forward' and 's' = 'seqrep' in the backward pass, as well as the random component link function combination of the GLM extension. In addition to the 10 random component link function combinations applied in the numerical experiments of the GLMs, compare, for example, Table A14, we use poisson random component with identity, log and squareroot link functions. We work with the default fast MARS parameter fast.k = 20 of our chosen implementation.

We use R function *earth(*·*)* implemented in R package *earth* of Milborrow (2018).

#### 4.6.2. Results

In total, these settings yield 4 · 5 · 5 · 4 · 13 = 5200 MARS models with a lot of duplicates in our first step. We validate the 5200 MARS models on the Sobol, nested simulations and capital region sets through evaluation of the five validation figures. Then we collect the five best performing MARS models in terms of each validation figure per set which gives us in total 5 · 5 = 25 best performing models per first step validation set. Since the MAEs (1) with respect to the relative and asset metric entail the same best performing models, only 5 · 4 = 20 of the collected models per first step set are potentially different. Based on the ingredients of each of these 20 MARS models per first step set, we define 5 · 5 = 25 new sets of ingredients varying only with respect to *K*max and *t*min and derive the corresponding new but similar MARS models in the second step. As a result, we obtain in total 20 · 25 = 500 new MARS models per first step set. Again, we assess their out-of-sample performances through evaluation of the five validation figures on the three validation sets. Out of the 500 new MARS models per first step set, we collect then the best performing ones in terms of each validation figure per second step set. Now this gives us in total 5 · 3 = 15 best MARS models per first step set, or taking into account that the MAEs (1) with respect to the relative and asset metric entail once more the same best performing models, 4 · 3 = 12 potentially different best models per first step set. In total, this makes 12 · 3 = 4 · 9 = 36 best MARS models, which can be found in Table A31 sorted by first and second step validation sets.

#### 4.6.3. Poor Interaction Modeling and Extrapolation

In Table A31, the out-of-sample performances of all MARS models derived in our two-step approach are sorted using the first step validation set as the primary and the second step validation set as the secondary sort key. Let us address the first step second step validation set combinations by the headlines in Table A31. By construction, the combinations Sobol set<sup>2</sup> , Nested simulations set<sup>2</sup> and Capital region set<sup>2</sup> yield respectively the MARS models with the best validation figures (1)–(4) on the Sobol, nested simulations and capital region sets. See that in the table all corresponding diagonal elements are highlighted in green. But the best MAEs (1) and (3) are not even close to what OLS regression, GLMs, GAMs and FGLS regression achieve. Finding small residuals (2) and (4) regardless of the other validation figures is not sufficient. The performances on the nested simulations and capital region sets, comprising several scenarios beyond the fitting space, are especially poor. All these results indicate that MARS models do not seem very suitable for our application. Despite the possibility to select up to 300 basis functions, the MARS algorithm selects only at maximum 148 basis functions, which suggests that without any alterations, the algorithm is not able to capture the behavior of the CFP model properly, in particular extrapolation behavior is comparably poor.

The MARS model with the set of ingredients *K*max = 50, *t*min = 0, *o* = 4, *p* = 'b', inverse gaussian random component and identity link function is selected as the best one six times out of 36, or once for each Sobol and nested simulations first step validation set combination. Furthermore, this model performs best in terms of v.res0, ns.mae<sup>0</sup> and ns.mae*a*. Since there is no other MARS model with a similar high occurrence and performance, we consider it the best performing and most stable one found in our two-step approach. For illustration of a MARS model, see this one in Table A32. The fact that this best MARS model performs worse than other ones in terms of several validation figures stresses the infeasibility of MARS models in this application.

#### 4.6.4. Limitations

Table A31 suggests that, up to a certain upper limit, the higher the maximum allowed number of terms *K*max the higher tends the performance on the Sobol set to be. However, this result does not generalize to the nested simulations and capital region sets. Since at maximum 148 basis functions are selected here even if up to 300 basis functions are allowed, extending the range of *K*max in the first step of this numerical experiment would not affect the output in this regard. The threshold *t*min is an instrument controlling the number of basis functions selected in the forward pass up to *K*max which cannot be extended below zero, meaning that its variability has already been exhausted here as well. For the interaction order *o* similar considerations as for *K*max apply. The pruning method *p* used in the backward pass does not play a large role compared to the other ingredients as it only helps reduce the set of selected basis functions. In terms of Sobol validation, inverse gaussian random component with identity link performs best, whereas in terms of nested simulations and capital region validation, inverse gaussian random component with any link or log link with gaussian or poisson random component perform best. We conclude that if there was a suitable MARS model for our application, our two-step approach would have found it.

#### *4.7. Kernel Regression*

#### 4.7.1. Settings

We make a series of adjustments affecting either the structure or the derivation process of the multidimensional LC and LL proxy functions (59) and (61) to get as broad a picture of the potential of kernel regression in our application as possible. Our adjustments concern the kernel function and its order, the bandwidth selection criterion, the proportion of fitting points used for bandwidth selection, and the sets of basis functions of which the local proxy functions are composed of. Thereby we combine in various ways the gaussian, Epanechnikov and uniform kernels, orders *o* ∈ {2, 4, 6, 8}, bandwidth selection criteria LOO-CV and AIC, and between 2500 (proportion bw = 0.1) and 25,000 (proportion bw = 1) fitting points for bandwidth selection.

We work with R functions *npregbw(*·*)* and *npreg(*·*)* implemented in R package *np* of Racine and Hayfield (2018).

#### 4.7.2. Results

Furthermore, we alternate the four basis function sets contained in Tables A33 and A34. The first two basis function sets with *K*max ∈ {16, 27} are derived by adaptive forward stepwise selection based on OLS regression, the third one with *K*max = 15 by risk factor wise linear selection and the last one with *K*max = 22 by a combination thereof. All combinations including their out-of-sample performances can be found in Table A35. Again, the best and worst values observed per validation figure are highlighted in green and red, respectively.

#### 4.7.3. Poor Interaction Modeling and Extrapolation

We draw the following conclusions based on the validation results in Table A35. The comparisons of LC and LL regression applied with gaussian kernel and 16 basis functions or Epanechnikov kernel and 15 basis functions suggest that LL regression performs better than LC regression. However, even the best Sobol, nested simulations and capital region results of LL regression are still outperformed by OLS regression, GLMs, GAMs and FGLS regression. Possible explanations for this observation are that kernel regression is not able to model the interactions of the risk factors equally well with its few basis functions and that local regression approaches perform rather poorly close to and especially beyond the boundary of the fitting space because of the thinned out to missing data basis in this region. While the first explanation applies to all three validation sets, the latter one applies only to the nested simulations and capital region sets on which the validation figures are indeed worse than on the Sobol set. While LC regression produces interpretable results with the sets of 22 and 27 basis functions, the more complex LL regression does not in most cases.

#### 4.7.4. Limitations

On the Sobol and capital region sets, both LC and LL regression show similar behaviors when relying on gaussian kernel and 16 basis functions compared to Epanechnikov kernel and 15 basis functions. But on the nested simulations set, gaussian kernel and 16 basis functions are the superior choices. Using a uniform kernel with LC regression deteriorates the out-of-sample performance. The results of LC regression indicate furthermore that an extension of the basis function sets from 15 to 27 only slightly affects the validation performance. With gaussian kernel switching from 16 to 27 basis functions barely has an impact and with Epanechnikov kernel only the nested simulations and capital region validation performance improve when using 27 as opposed to 15, 16 or 22 basis functions. While increasing the order of the gaussian or Epanechnikov kernel deteriorates the validation figures dramatically, for the uniform kernel the effects can go in both directions. AIC performs worse than LOO-CV when used for bandwidth selection of the gaussian kernel in LC regression. For LC regression, increasing the proportion of fitting points entering bandwidth selection improves all validation figures until a specific threshold is reached. But thereafter the nested simulations and capital region figures are deteriorated. For LL regression no such deterioration is observed.

Overall we do not see much potential in kernel regression for our practical example compared to most of the previously analyzed regression methods. Nonetheless in order to achieve comparably good kernel regression results, we consider LL regression more promising than LC regression due to the superior but still poor modeling close to and beyond the boundary of the fitting space. We would apply it with gaussian, Epanechnikov or other similar kernel functions. A high proportion of fitting points for bandwidth selection is recommended and it might be worth trying alternative comparably small basis function sets reflecting, for example, the risk factor interactions better than in our examples.

#### **5. Conclusions**

For high-dimensional variable selection applications such as the calibration step in the LSMC framework, we have presented various machine learning regression approaches ranging from ordinary and generalized least-squares regression variants over GLM and GAM approaches to multivariate adaptive regression splines and kernel regression approaches. At first we have justified the combinability of the ingredients of the regression routines such as the estimators and proposed model selection criteria in a theoretical discourse. Afterwards we have applied numerous configurations of these machine learning routines to the same slightly disguised real-world example in the LSMC framework. With the aid of different validation figures, we have analyzed the results, compared the out-of-sample performances and adviced to use certain routine designs.

In our slightly disguised real-world example and given LSMC setting, the adaptive OLS regression, GLM, GAM and FGLS regression algorithms turned out to be suitable machine learning methods for proxy modeling of life insurance companies with potential for both performance and computational efficiency gains by fine-tuning model hyperparameters and implementation designs. Differently, the MARS and kernel regression algorithms were not found to be convincing in our application. In order to study the robustness of our results, the approaches can be repeated in multiple other LSMC examples.

After all, none of our tested approaches was able to completely eliminate the bias observed in the validation figures and to yield consistent results across the three validation sets though. Investigations on whether these observations are systematic for the approaches, a result of the Monte Carlo error or a combination thereof help further narrow down the circle of recommended regression techniques. In order to assess the variance and bias of the proxy estimates conditional on an outer scenario, seed stability analyses in which the sets of fitting points are varied and convergence analyses in which sample size is increased need to be carried out. While such analyses would be computationally very costly, they would provide valuable insights into how to further improve approximation quality, that is, whether additional fitting points are necessary to reflect the underlying CFP model more accurately, whether more suitable functional forms and estimation assumptions are required for a more appropriate proxy modeling, or whether both aspects are relevant. Furthermore, one could deduce from such an analysis the sample sizes needed by the different regression algorithms to meet certain validation criteria. Since the generation of large sample sizes is currently computationally expensive for the industry, algorithms getting along with comparably few fitting points should be striven for.

Picking a suitable calibration algorithm is most important from the viewpoint of capturing the CFP model and hence the SCR appropriately. Therefore, if the bias observed in the validation figures indicates indeed issues with the functional forms of our approaches, doing further research on techniques not entailing such a bias or at least a smaller one is vital. On the one hand, one can fine-tune the approaches of this exposition and try different configurations thereof, and on the other hand, one can analyze further machine learning alternatives such as the ones mentioned in the introduction and already used in other LSMC applications. Ideally, various approaches like adaptive OLS regression, GLM, GAM and FGLS regression algorithms, artificial neural networks, tree-based methods and support vector machines would be fine-tuned and compared based on the same realistic and comprehensive data basis. Since the major challenges of machine learning calibration algorithms are hyperparameter selection and in some cases their dependence on randomness, future research should be dedicated to efficient hyperparameter search algorithms and stabilization methods such as ensemble methods.

**Author Contributions:** Conceptualization, A.-S.K., Z.N. and R.K.; Formal analysis, A.-S.K.; Investigation, A.-S.K.; Methodology, A.-S.K., Z.N. and R.K.; Project administration, A.-S.K.; Resources, Z.N.; Software, A.-S.K.; Supervision, R.K.; Validation, Z.N. and R.K.; Visualization, A.-S.K.; Writing–original draft, A.-S.K. and R.K.; Writing–review and editing, Z.N. and R.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The first author would like to thank Christian Weiß for his valuable comments which greatly helped to improve the paper. Furthermore, she is grateful to Magdalena Roth, Tamino Meyhöfer and her colleagues who have been supportive by providing her with academic time and computational resources. Additionally, we gratefully acknowledge very constructive comments by two anonymous reviewers.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Ordinary least squares (OLS) proxy function of BEL derived under 150–443 in the adaptive algorithm with the final coefficients. Furthermore, Akaike information criterion (AIC) scores and out-of-sample mean absolute errors (MAEs) in % after each iteration.


**Table A1.** *Cont.*



**Table A2.** OLS proxy function of available capital (AC) derived under 150–443 in the adaptive algorithm with the final coefficients. Furthermore, AIC scores and out-of-sample MAEs in % after each iteration.

**Table A2.** *Cont.*



**Table A3.** OLS proxy function of BEL derived under 300–886 in the adaptive algorithm with the final coefficients. Furthermore, AIC scores and out-of-sample MAEs in % after each iteration.

**Table A3.** *Cont.*


**Table A3.** *Cont.*



**Table A4.** Out-of-sample validation figures of the OLS proxy function of BEL under 150–443 after each tenth iteration.

**Table A5.** Out-of-sample validation figures of the OLS proxy function of AC under 150–443 after each tenth iteration.



**Table A6.** Out-of-sample validation figures of the OLS proxy function of BEL under 300–886 after each tenth and the final iteration.

**Table A7.** Out-of-sample validation figures of the derived OLS proxy functions of BEL under 150–443 and 300–886 after the final iteration based on three different sets of validation value estimates. Thereby emerges the first set of validation value estimates from pointwise subtraction of 1.96 times the standard errors from the original set of validation values. The second set is the original set. The third set is the addition counterpart of the first set.


**Table A8.** AIC scores and out-of-sample validation figures of the gaussian generalized linear models (GLMs) of BEL with identity, inverse and log link functions under 150–443 after each tenth iteration.


**Table A9.** AIC scores and out-of-sample validation figures of the gamma GLMs of BEL with identity, inverse and log link functions under 150–443 after each tenth iteration.


**Table A10.** AIC scores and out-of-sample validation figures of the inverse gaussian GLMs of BEL with identity, inverse, log and <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link functions under 150–443 after each tenth iteration.


**Table A11.** AIC scores and out-of-sample validation figures of the gaussian GLMs of BEL with identity, inverse and log link functions under 300–886 after each tenth and the final iteration.


**Table A12.** AIC scores and out-of-sample validation figures of the gamma GLMs of BEL with identity, inverse and log link functions under 300–886 after each tenth and the final iteration.


**Table A13.** AIC scores and out-of-sample validation figures of the inverse gaussian GLMs of BEL with identity, inverse, log and <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link functions under 300–886 after each tenth and the final iteration.


**Table A13.** *Cont.*


**Table A14.** AIC scores and out-of-sample validation figures of the gaussian, gamma and inverse gaussian GLMs of BEL with identity, inverse, log and <sup>1</sup> *<sup>μ</sup>*<sup>2</sup> link functions under 150–443 and 300–886 after the final iteration. Highlighted in green and red respectively the best and worst AIC scores and validation figures.



**Table A15.** Out-of-sample validation figures of selected generalized additive models (GAMs) of BEL with varying spline function number per dimension and fixed spline function type under 150–443 after each tenth and the finally selected smooth function.

**Table A16.** Effective degrees of freedom, p-values and significance codes per dimension of GAMs of BEL built up of thin plate regression splines with gaussian random component and identity link function under 150–443 for spline function numbers *J* ∈ {4, 10} per dimension at stages *k* ∈ {50, 100, 150}. The confidence levels corresponding to the indicated significance codes are \*\*\* = 0.001, \*\* = 0.01, \* = 0.05, = 0.1, = 1.


**Table A16.** *Cont.*


**Table A17.** Out-of-sample validation figures of selected GAMs of BEL with varying spline function type and fixed spline function number of 5 per dimension under 100–443 after each tenth and the finally selected smooth function.


**Table A18.** Out-of-sample validation figures of selected GAMs of BEL with varying spline function type and fixed spline function number of 10 per dimension under between 100–443 and 150–443 after each tenth and the finally selected smooth function.


**Table A19.** Out-of-sample validation figures of selected GAMs of BEL with varying random component link function combination and fixed spline function number of 4 per dimension under between 40–443 and 150–443 after each tenth and the finally selected smooth function.


**Table A20.** Out-of-sample validation figures of selected GAMs of BEL with varying random component link function combination and fixed spline function number of 8 per dimension under between 50–443 and 150–443 after each tenth and the finally selected smooth function.


**Table A21.** Out-of-sample validation figures of selected GAMs of BEL in adaptive forward stepwise and stagewise selection of length 5 under between 25–443 and 100–443 after each tenth and the finally selected smooth function.


**Table A22.** Out-of-sample validation figures of selected GAMs of BEL with varying spline function number per dimension and fixed spline function type under between 91–443 and 150–443 after each tenth and the finally selected smooth function or after each dynamically stagewise selected smooth function block. Thereby furthermore a variation in the random component link function combination.


**Table A23.** Maximum allowed numbers of smooth functions and out-of-sample validation figures of all derived GAMs of BEL under between 25–443 and 150–443 after the final iteration. Highlighted in green and red respectively the best and worst validation figures.



**Table A24.** Feasible generalized least-squares (FGLS) variance models of BEL corresponding to *M*max ∈ {2, 6, 10, 14, 18, 22} derived by adaptive selection from the set of basis functions of the 150–443 OLS proxy function given in Table A1 with exponents summing up to at max two. Furthermore, *p*-values of Breusch-Pagan test, AIC scores and out-of-sample MAEs in % after each iteration.

**Table A25.** FGLS variance models of BEL corresponding to *M*max ∈ {2, 6, 10, 14, 18, 22} derived by adaptive selection from the set of basis functions of the 300–886 OLS proxy function given in Table A3 with exponents summing up to at max two. Furthermore, *p*-values of Breusch-Pagan test, AIC scores and out-of-sample MAEs in % after each iteration.



**Table A26.** Iteration-wise out-of-sample validation figures in adaptive variance model selection of BEL corresponding to *M*max ∈ {2, 6, 10, 14, 18, 22} based on the 150–443 OLS proxy function given in Table A1 with exponents summing up to at max two. Simultaneously type I FGLS regression results.

**Table A27.** Iteration-wise out-of-sample validation figures in adaptive variance model selection of BEL corresponding to *M*max ∈ {2, 6, 10, 14, 18, 22} based on the 300–886 OLS proxy function given in Table A3 with exponents summing up to at max two. Simultaneously type I FGLS regression results.


**Table A28.** AIC scores and out-of-sample validation figures of type II FGLS proxy functions of BEL under 150–443 with variance models of varying complexity *M*max after each tenth iteration.


**Table A28.** *Cont.*



**Table A29.** AIC scores and out-of-sample validation figures of type II FGLS proxy functions of BEL under 300–886 with variance models of varying complexity *M*max after each tenth and the final iteration.

**Table A29.** *Cont.*


**Table A30.** AIC scores and out-of-sample validation figures of all derived FGLS proxy functions of BEL under 150–443 and 300–886 after the final iteration. Highlighted in green and red respectively the best and worst AIC scores and validation figures.


**Table A31.** Settings and out-of-sample validation figures of best performing multivariate adaptive regression splines (MARS) models derived in a two-step approach sorted by first and second step validation sets. Highlighted in green and red respectively the best and worst validation figures.



**Table A32.** Best MARS model of BEL derived in a two-step approach with the final coefficients.


**Table A33.** Basis function sets of LC and LL proxy functions of BEL corresponding to *K*max ∈ {16, 27} derived by adaptive OLS selection.

**Table A34.** Basis function sets of LC and LL proxy functions of BEL corresponding to *K*max ∈ {15, 22} derived by risk factor wise or combined risk factor wise and adaptive OLS selection.


**Table A35.** Settings and out-of-sample validation figures of LC and LL proxy functions of BEL using basis function sets from Tables A33 and A34. Highlighted in green and red respectively the best and worst validation figures.


#### **References**


Dahlquist, Germund, and Åke Björck. 1974. *Numerical Methods*. Englewood Cliffs: Prentice-Hall.


#### *Risks* **2020**, *8*, 21

Hastie, Trevor, and Daryl Pregibon. 1992. *Chapter 6 'Generalized Linear Models' in Statistical Models in S*. Boca Raton, London, New York, and Washington: Wadsworth & Brooks/Cole.

Hastie, Trevor, and Robert Tibshirani. 1986. Generalized additive models. *Statistical Science* 1: 297–318. [CrossRef] Hastie, Trevor, and Robert Tibshirani. 1990. *Generalized Additive Models*. London: Chapman & Hall.

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2017. *The Elements of Statistical Learning*, 2nd ed. New York: Springer Series in Statistics.

Hayashi, Fumio. 2000. *Econometrics*. Princeton: Princeton University Press.


Nadaraya, Elizbar A. 1964. On estimating regression. *Theory of Probability and Its Applications* 9: 141–42. [CrossRef]


Watson, Geoffrey S. 1964. On estimating regression. *Sankhya: The Indian Journal of Statistics, Series A* 26: 359–72.


c 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
