A Sarmanov Distribution with Beta Marginals: An Application to Motor Insurance Pricing

Bolancé, Catalina; Guillen, Montserrat; Pitarque, Albert

doi:10.3390/math8112020


by Catalina Bolancé ^†, Montserrat Guillen ^*,† and Albert Pitarque ^†

Department Econometrics, Riskcenter-IREA, Universitat de Barcelona, E08034 Barcelona, Spain

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Mathematics 2020, 8(11), 2020; https://doi.org/10.3390/math8112020

Submission received: 26 October 2020 / Accepted: 9 November 2020 / Published: 13 November 2020

(This article belongs to the Special Issue Multivariate Sarmanov Distributions and Applications)


Abstract

Background: The Beta distribution is useful for fitting variables that measure a probability or a relative frequency. Methods: We propose a Sarmanov distribution with Beta marginals specified as generalised linear models. We analyse its theoretical properties and its dependence limits. Results: We use a real motor insurance sample of drivers and analyse the percentage of kilometres driven above the posted speed limit and the percentage of kilometres driven at night, together with some additional covariates. We fit a Beta model for the marginals of the bivariate Sarmanov distribution. Conclusions: We find negative dependence in the high quantiles indicating that excess speed and night-time driving are not uniformly correlated.

Keywords: beta regression; dependence; bivariate Sarmanov distribution; estimation; telematics; insurance

1. Introduction

We analyse a bivariate model based on the Sarmanov distribution with marginal Beta distributions. These marginals are specified based on a generalized linear model (Beta-GLM) or Beta regression as defined by Ferrari and Cribari-Neto [1]. The objective is to fit data defined in the (0, 1) interval.

Many authors have analysed bivariate Beta distributions (see, for example, [2,3,4,5]). However, these distributions pose several difficult challenges: their generalization to higher dimensions and their specification as a generalized linear model are not straightforward. The Sarmanov distribution provides a way to address these challenges.

The Sarmanov distribution was originally introduced in its bivariate form by Sarmanov [6]; its multivariate version was suggested by Lee [7] and generalized by Bairamov et al. [8]. Its use to model the bivariate behaviour of random variables with a marginal Beta(α, β) distribution was proposed by Gupta and Wong [3]. These authors defined the five-parameter bivariate Beta distribution from what is known as Morgenstern’s distribution [9] with Beta marginals, which is a particular case of the Sarmanov distribution.

The bivariate Sarmanov distribution is characterized by its flexibility in the marginal distributions. Furthermore, because its functional form clearly separates the marginals from the dependence model, its specification as a bivariate generalized linear model is natural. Generalizing from two dimensions to higher dimensions is simple (see [10] for an example of a trivariate Sarmanov distribution specified as a generalized linear model with Negative Binomial marginals).

In this work, we show an application of the bivariate Sarmanov distribution with Beta marginals specified as generalised linear models to predict two of the most relevant telematics variables in motor insurance [11]. Telematics variables are obtained from GPS/inertial devices installed in vehicles and they provide an abundant source of information to motor insurers. In our case study, a bivariate model is specified for the proportion of kilometres driven above the posted speed limit and the proportion of kilometres driven at night. These two variables seem to be related, but researchers have not yet found a good way to understand their association. The explanatory variables are the characteristics of the insured policyholder and the vehicle. The database used in our application has already been analysed in various works published in statistical, transport and risk analysis journals (see [11,12,13,14,15,16,17]). In all previous studies, the two telematics variables that we analyse here were used as predictors of the accident rate, and they were assumed to be uncorrelated.

In Section 2, the new bivariate Sarmanov model is specified and the particular case with marginal Beta-GLM with a domain in the (0, 1) interval is analysed; the estimation method is also discussed. The results of our case study are shown in Section 3. Finally, Section 4 contains the conclusions.

2. The Models

Let (Y_{1}, Y_{2}) be a bivariate random vector that, for convenience, is defined in {(0, 1)}^{2}. Its distribution depends on a set of k quantitative or binary covariates, whose values are represented by the vector x_{j} = {(x_{1 j}, \dots, x_{k j})}^{'}, j = 1, 2, where x_{1 j} = 1 is a constant term. The relationship between Y_{j} and the covariates is given by the linear predictor x_{j}^{'} β^{j}, where β^{j} = {(β_{1}^{j}, \dots, β_{k}^{j})}^{'}, j = 1, 2, are vectors of parameters to be estimated. To simplify the notation, the covariates are assumed to be the same for j = 1 and j = 2, and so the vector of explanatory variables is denoted as x. The bivariate probability density function (pdf) associated with the Sarmanov distribution is:

\begin{aligned}
f_{Y_{1}, Y_{2}} (y_{1}, y_{2} | x^{'} β^{1}, x^{'} β^{2}) = {} & f_{1} (y_{1} | x^{'} β^{1}) f_{2} (y_{2} | x^{'} β^{2}) \\
& \times [1 + ω ϕ_{1} (y_{1} | x^{'} β^{1}) ϕ_{2} (y_{2} | x^{'} β^{2})], \quad y_{1}, y_{2} \in (0, 1),
\end{aligned}

(1)

where ω is the dependence parameter and ϕ_{j}, j = 1, 2, are bounded kernel functions. For the function defined in (1) to be a pdf, the following conditions must hold:

\int_{0}^{1} ϕ_{j} (y_{j} | x^{'} β^{j}) f_{j} (y_{j} | x^{'} β^{j}) d y_{j} = 0, j = 1, 2

(2)

and

1 + ω ϕ_{1} (y_{1} | x^{'} β^{1}) ϕ_{2} (y_{2} | x^{'} β^{2}) \geq 0, \quad \forall (y_{1}, y_{2}) \in {(0, 1)}^{2} .

(3)

For given values of x^{'} β^{j}, j = 1, 2, we define:

m_{j} (x^{'} β^{j}) = \inf_{0 < y_{j} < 1} ϕ_{j} (y_{j} | x^{'} β^{j}) \quad \text{and} \quad M_{j} (x^{'} β^{j}) = \sup_{0 < y_{j} < 1} ϕ_{j} (y_{j} | x^{'} β^{j}), \quad j = 1, 2 .

Taking into account the condition defined in (3), bounds can be defined for the dependency parameter ω. However, as this parameter does not depend on the linear predictor, new extreme values are defined as m_{j}^{⋆} = \max_{\forall x^{'} β^{j}} m_{j} (x^{'} β^{j}) and M_{j}^{⋆} = \min_{\forall x^{'} β^{j}} M_{j} (x^{'} β^{j}), so that the bounds of the dependency parameter are:

max \{- \frac{1}{m_{1}^{⋆} m_{2}^{⋆}}, - \frac{1}{M_{1}^{⋆} M_{2}^{⋆}}\} \leq ω \leq min \{- \frac{1}{m_{1}^{⋆} M_{2}^{⋆}}, - \frac{1}{M_{1}^{⋆} m_{2}^{⋆}}\} .

(4)

The previous condition holds for every vector of covariates x, which implies that the dependency parameter must be located within the narrowest bounds. In practice, we will assume that the vectors observed in the sample dataset lead to the entire domain of values of linear predictors x^{'} β^{j}, j = 1, 2. In the insurance context of our illustration, this means assuming that all possible risk profiles that can be insured by the company are already present in the portfolio.

For each vector of covariate observations x, we can also obtain the covariance between the dependent variables as:

cov (Y_{1}, Y_{2}) = ω v_{1} (x) v_{2} (x),

(5)

where v_{j} (x) = \int_{0}^{1} y_{j} ϕ_{j} (y_{j} | x^{'} β^{j}) f_{j} (y_{j} | x^{'} β^{j}) d y_{j}, j = 1, 2. The correlation is obtained by dividing by the product of standard deviations.

There exist many possible specifications for the kernel functions ϕ_{j}, j = 1, 2 (see [18] for a description of kernel functions proposed in the literature). When fitting the bivariate Beta distribution without covariates, Gupta and Wong [3] propose the kernel function ϕ_{j} = 2 F_{j} - 1, where F_{j} is the cumulative distribution function (cdf). This specification has the advantage that the bounds for the dependency parameter are given by - 1 \leq ω \leq 1 for any vector x. However, this model does not yield closed-form expressions for some magnitudes of interest, such as the conditional moments. In this work, we propose to use kernels ϕ_{j} = y_{j}^{r} - E (Y_{j}^{r}), where r is a value to be determined by the analyst. Next, we analyse some results for the particular case of the Sarmanov distribution with marginal Beta(α, β) distributions and r = 1. This case intuitively corresponds to a situation of linear dependency, controlled by the dependence parameter ω.
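As a quick check, not part of the original derivation, the proposed kernels satisfy condition (2) for any r, and for r = 1 the infimum and supremum used in the bounds follow directly. The lines below drop the conditioning on the linear predictor for brevity.

```latex
\int_{0}^{1} \left( y_{j}^{r} - E (Y_{j}^{r}) \right) f_{j} (y_{j}) \, d y_{j}
  = E (Y_{j}^{r}) - E (Y_{j}^{r}) = 0,
\qquad
r = 1: \quad
m_{j} = \inf_{0 < y_{j} < 1} \left( y_{j} - E (Y_{j}) \right) = - E (Y_{j}),
\quad
M_{j} = \sup_{0 < y_{j} < 1} \left( y_{j} - E (Y_{j}) \right) = 1 - E (Y_{j}) .
```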

2.1. The Bivariate Beta GLM Model

The pdf of a random variable Y with Beta(α, β) distribution, with parameters α, β > 0, is:

f_{Y} (y; α, β) = \frac{Γ (α + β)}{Γ (α) Γ (β)} y^{α - 1} {(1 - y)}^{β - 1} = \frac{1}{B (α, β)} y^{α - 1} {(1 - y)}^{β - 1}

and its cdf is:

F_{Y} (y; α, β) = \frac{B (y, α, β)}{B (α, β)},

where Γ (\cdot) and B (\cdot, \cdot) are the Gamma and Beta functions, respectively, and B (y, \cdot, \cdot) is the incomplete Beta function.

Beta regression was proposed by Ferrari and Cribari-Neto [1], with the reparametrization μ = \frac{α}{α + β} and ψ = α + β, so that:

f (y; μ, ψ) = \frac{1}{B (μ ψ, (1 - μ) ψ)} y^{μ ψ - 1} {(1 - y)}^{(1 - μ) ψ - 1},

where E (Y) = μ, with 0 < μ < 1, and V (Y) = \frac{μ (1 - μ)}{(1 + ψ)}, with ψ > 0, where ψ^{- 1} is the scale parameter. We note that, for any admissible values of μ and ψ, it holds that V (Y) < 0.25. The specification as a GLM is defined as follows (note that we use μ (x) to emphasize that μ depends on the linear predictor):

g [μ (x)] = x^{'} β,

where g [\cdot] is a link function that can be defined in different ways. In this work, we use the logit link, g [μ (x)] = \log [\frac{μ (x)}{1 - μ (x)}].
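The reparametrized density and the logit link can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names, the coefficient vector `beta_coef` and the covariate vector `x` are hypothetical and do not come from the paper.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def beta_pdf_mean_precision(y, mu, psi):
    """Beta density in the (mu, psi) form, with alpha = mu*psi and beta = (1-mu)*psi."""
    a, b = mu * psi, (1.0 - mu) * psi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def beta_variance(mu, psi):
    """V(Y) = mu (1 - mu) / (1 + psi)."""
    return mu * (1.0 - mu) / (1.0 + psi)

# Hypothetical coefficients and covariates (first covariate is the constant term).
beta_coef = [-1.0, 0.5]
x = [1.0, 0.8]
mu_x = inv_logit(sum(c * v for c, v in zip(beta_coef, x)))
```

With μ = 0.5 and ψ = 2 the density reduces to the uniform density on (0, 1), which is a convenient sanity check.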

To simplify the notation, from now on we omit the linear predictors from the conditioning part. The pdf of the bivariate random vector (Y_{1}, Y_{2}) with a Sarmanov distribution and Beta GLM marginals, which we call the Sarmanov-Beta-GLM, is:

\begin{aligned}
f_{Y_{1}, Y_{2}} (y_{1}, y_{2}) = {} & \frac{1}{B (μ_{1} (x) ψ_{1}, (1 - μ_{1} (x)) ψ_{1})} y_{1}^{μ_{1} (x) ψ_{1} - 1} {(1 - y_{1})}^{(1 - μ_{1} (x)) ψ_{1} - 1} \\
& \times \frac{1}{B (μ_{2} (x) ψ_{2}, (1 - μ_{2} (x)) ψ_{2})} y_{2}^{μ_{2} (x) ψ_{2} - 1} {(1 - y_{2})}^{(1 - μ_{2} (x)) ψ_{2} - 1} \\
& \times [1 + ω (y_{1} - μ_{1} (x)) (y_{2} - μ_{2} (x))], \quad y_{1}, y_{2} \in (0, 1) .
\end{aligned}

(6)
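A minimal sketch of density (6) for a single covariate profile follows, together with a crude numerical check that it integrates to one. The parameter values are hypothetical, and the midpoint rule is used only for illustration.

```python
import math

def beta_pdf(y, mu, psi):
    """Beta density with mean mu and precision psi."""
    a, b = mu * psi, (1.0 - mu) * psi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def sarmanov_beta_pdf(y1, y2, mu1, psi1, mu2, psi2, omega):
    """Density (6): product of Beta marginals times the linear-kernel dependence term."""
    dependence = 1.0 + omega * (y1 - mu1) * (y2 - mu2)
    return beta_pdf(y1, mu1, psi1) * beta_pdf(y2, mu2, psi2) * dependence

def integrate_unit_square(func, n=200):
    """Midpoint rule on an n-by-n grid over (0, 1)^2."""
    h = 1.0 / n
    mids = [(i + 0.5) * h for i in range(n)]
    return sum(func(u, v) for u in mids for v in mids) * h * h

# Hypothetical parameter values; omega is chosen inside the admissible range.
params = dict(mu1=0.3, psi1=10.0, mu2=0.4, psi2=8.0, omega=1.5)
total_mass = integrate_unit_square(lambda u, v: sarmanov_beta_pdf(u, v, **params))
```

Because the kernels integrate to zero against the marginals, the dependence term leaves the total mass unchanged, so `total_mass` should be close to 1 for any admissible ω.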

For the previous expression to be a pdf, the dependency parameter must be located within the bounds defined in (4), which, for the kernel functions that we propose, are:

\begin{aligned}
& \max \left\{ - \frac{1}{\max_{\forall x^{'} β^{1}} (- μ_{1} (x)) \max_{\forall x^{'} β^{2}} (- μ_{2} (x))}, - \frac{1}{\min_{\forall x^{'} β^{1}} (1 - μ_{1} (x)) \min_{\forall x^{'} β^{2}} (1 - μ_{2} (x))} \right\} \\
& \qquad \leq ω \leq \\
& \min \left\{ - \frac{1}{\min_{\forall x^{'} β^{1}} (1 - μ_{1} (x)) \max_{\forall x^{'} β^{2}} (- μ_{2} (x))}, - \frac{1}{\max_{\forall x^{'} β^{1}} (- μ_{1} (x)) \min_{\forall x^{'} β^{2}} (1 - μ_{2} (x))} \right\} .
\end{aligned}

(7)
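For a single covariate profile, the kernels y_{j} - μ_{j} (x) give m_{j} (x) = - μ_{j} (x) and M_{j} (x) = 1 - μ_{j} (x). The sketch below does not evaluate (7) literally: it applies the form of (4) profile by profile and then keeps the interval common to all profiles, which is one way to read the requirement that ω lie within the narrowest bounds. Function names and the fitted means are hypothetical.

```python
def omega_bounds_for_profile(mu1, mu2):
    """Bounds of the form (4) for one profile, with m_j = -mu_j and M_j = 1 - mu_j."""
    m1, big_m1 = -mu1, 1.0 - mu1
    m2, big_m2 = -mu2, 1.0 - mu2
    lower = max(-1.0 / (m1 * m2), -1.0 / (big_m1 * big_m2))
    upper = min(-1.0 / (m1 * big_m2), -1.0 / (big_m1 * m2))
    return lower, upper

def omega_bounds_for_sample(fitted_means):
    """Intersect the per-profile intervals over all (mu1, mu2) pairs."""
    bounds = [omega_bounds_for_profile(mu1, mu2) for mu1, mu2 in fitted_means]
    return max(lo for lo, _ in bounds), min(hi for _, hi in bounds)

# Hypothetical fitted means for three profiles.
fitted_means = [(0.5, 0.5), (0.2, 0.5), (0.1, 0.3)]
lower, upper = omega_bounds_for_sample(fitted_means)
```

Any ω inside `[lower, upper]` keeps the dependence term 1 + ω (y_{1} - μ_{1})(y_{2} - μ_{2}) non-negative on the unit square for every listed profile.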

The bivariate cdf associated with a Sarmanov-Beta-GLM is obtained directly from the double integral of the bivariate pdf defined in (6):

\begin{aligned}
F_{Y_{1}, Y_{2}} (y_{1}, y_{2}) = {} & \frac{B (y_{1}, ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} \times \frac{B (y_{2}, ψ_{2} μ_{2} (x), (1 - μ_{2} (x)) ψ_{2})}{B (ψ_{2} μ_{2} (x), (1 - μ_{2} (x)) ψ_{2})} \\
& \times \left[ 1 + ω \left( \frac{B (y_{1}, ψ_{1} μ_{1} (x) + 1, (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} - μ_{1} (x) \frac{B (y_{1}, ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} \right) \right. \\
& \left. \times \left( \frac{B (y_{2}, ψ_{2} μ_{2} (x) + 1, (1 - μ_{2} (x)) ψ_{2})}{B (ψ_{2} μ_{2} (x), (1 - μ_{2} (x)) ψ_{2})} - μ_{2} (x) \frac{B (y_{2}, ψ_{2} μ_{2} (x), (1 - μ_{2} (x)) ψ_{2})}{B (ψ_{2} μ_{2} (x), (1 - μ_{2} (x)) ψ_{2})} \right) \right],
\end{aligned}

(8)

where y_{1}, y_{2} \in (0, 1).

Proposition 1.

The conditioned pdf is:

\begin{aligned}
f_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) = {} & \frac{1}{B (μ_{1} (x) ψ_{1}, (1 - μ_{1} (x)) ψ_{1})} y_{1}^{μ_{1} (x) ψ_{1} - 1} {(1 - y_{1})}^{(1 - μ_{1} (x)) ψ_{1} - 1} \\
& \times [1 + ω (y_{1} - μ_{1} (x)) (y_{2} - μ_{2} (x))], \quad y_{1}, y_{2} \in (0, 1)
\end{aligned}

(9)

and similarly for f_{Y_{2} | Y_{1}} (y_{2} | Y_{1} = y_{1}). Integrating the previous expression, the conditional cdf is obtained as

\begin{aligned}
F_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) = {} & F_{1} (y_{1}) \times [1 + ω (y_{2} - μ_{2} (x)) (1 - μ_{1} (x))] \\
& - ω (y_{2} - μ_{2} (x)) \frac{y_{1} (1 - y_{1})}{ψ_{1} μ_{1} (x)} f_{1} (y_{1}), \quad y_{1}, y_{2} \in (0, 1) .
\end{aligned}

(10)

Proof.

The conditioned pdf is obtained directly as

f_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) = \frac{f_{Y_{1}, Y_{2}} (y_{1}, y_{2})}{f_{Y_{2}} (y_{2})} .

Integrating the result of f_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) in (9), we obtain:

\begin{aligned}
F_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) = {} & \int_{0}^{y_{1}} f_{1} (t) d t + ω (y_{2} - μ_{2} (x)) \int_{0}^{y_{1}} f_{1} (t) (t - μ_{1} (x)) d t \\
= {} & F_{1} (y_{1}) + ω (y_{2} - μ_{2} (x)) \left[ \frac{B (y_{1}, ψ_{1} μ_{1} (x) + 1, (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} - μ_{1} (x) F_{1} (y_{1}) \right] .
\end{aligned}

(11)

In addition, since

\begin{aligned}
& \frac{B (y_{1}, ψ_{1} μ_{1} (x) + 1, (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} \\
& \qquad = \frac{B (y_{1}, ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})}{B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} - \frac{y_{1}^{μ_{1} (x) ψ_{1}} {(1 - y_{1})}^{(1 - μ_{1} (x)) ψ_{1}}}{ψ_{1} μ_{1} (x) B (ψ_{1} μ_{1} (x), (1 - μ_{1} (x)) ψ_{1})} \\
& \qquad = F_{1} (y_{1}) - \frac{y_{1} (1 - y_{1})}{ψ_{1} μ_{1} (x)} f_{1} (y_{1}),
\end{aligned}

substituting the previous expression in (11), (10) follows directly. □

The conditioned quantile is obtained from the inverse of expression (10), for which a numerical method (such as Newton’s method) can be used.

Proposition 2.

The conditional expectation is:

E (Y_{1} | Y_{2} = y_{2}) = μ_{1} (x) + ω (y_{2} - μ_{2} (x)) V (Y_{1} | x),

(12)

where V (Y_{1} | x) = \frac{μ_{1} (x) (1 - μ_{1} (x))}{(ψ_{1} + 1)} is the variance, which also depends on the vector of covariates. Similarly, E (Y_{2} | Y_{1} = y_{1}) can be found.

Proof.

The conditional expectation is obtained directly by solving the integral:

\begin{aligned}
E (Y_{1} | Y_{2} = y_{2}) = {} & \int_{0}^{1} y_{1} f_{Y_{1} | Y_{2}} (y_{1} | Y_{2} = y_{2}) d y_{1} \\
= {} & \int_{0}^{1} y_{1} f_{Y_{1}} (y_{1}) (1 + ω (y_{1} - μ_{1} (x)) (y_{2} - μ_{2} (x))) d y_{1} \\
= {} & \int_{0}^{1} y_{1} f_{Y_{1}} (y_{1}) d y_{1} \\
& + ω (y_{2} - μ_{2} (x)) \left( \int_{0}^{1} y_{1}^{2} f_{Y_{1}} (y_{1}) d y_{1} - μ_{1} (x) \int_{0}^{1} y_{1} f_{Y_{1}} (y_{1}) d y_{1} \right) \\
= {} & μ_{1} (x) + ω (y_{2} - μ_{2} (x)) (E (Y_{1}^{2} | x) - μ_{1} {(x)}^{2}) \\
= {} & μ_{1} (x) + ω (y_{2} - μ_{2} (x)) V (Y_{1} | x) .
\end{aligned}

Likewise, the corresponding result is obtained for E (Y_{2} | Y_{1} = y_{1}). □
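Expression (12) can be checked numerically against the conditional pdf (9). The sketch below is illustrative only: the parameter values are hypothetical and a midpoint rule stands in for exact integration.

```python
import math

def beta_pdf(y, mu, psi):
    """Beta density with mean mu and precision psi."""
    a, b = mu * psi, (1.0 - mu) * psi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def conditional_mean(y2, mu1, psi1, mu2, omega):
    """Closed form (12): mu1 + omega * (y2 - mu2) * V(Y1 | x)."""
    variance1 = mu1 * (1.0 - mu1) / (psi1 + 1.0)
    return mu1 + omega * (y2 - mu2) * variance1

def conditional_mean_numeric(y2, mu1, psi1, mu2, omega, n=2000):
    """Midpoint-rule integral of y1 times the conditional pdf (9)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        y1 = (i + 0.5) * h
        weight = 1.0 + omega * (y1 - mu1) * (y2 - mu2)
        total += y1 * beta_pdf(y1, mu1, psi1) * weight
    return total * h

# Hypothetical values for one profile.
closed_form = conditional_mean(0.6, mu1=0.3, psi1=10.0, mu2=0.4, omega=1.5)
numeric = conditional_mean_numeric(0.6, mu1=0.3, psi1=10.0, mu2=0.4, omega=1.5)
```

The conditional mean is linear in y_{2}, with slope ω V (Y_{1} | x), so a positive ω shifts the conditional mean of Y_{1} upwards as y_{2} rises above μ_{2} (x).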

Proposition 3.

From (5), the conditional covariance, which depends on the vector of covariates x, is:

cov (Y_{1}, Y_{2}) = ω V (Y_{1}) V (Y_{2}) = ω \frac{μ_{1} (x) (1 - μ_{1} (x))}{(ψ_{1} + 1)} \frac{μ_{2} (x) (1 - μ_{2} (x))}{(ψ_{2} + 1)}

(13)

and the correlation is:

corr (Y_{1}, Y_{2}) = ω \sqrt{\frac{μ_{1} (x) (1 - μ_{1} (x))}{(ψ_{1} + 1)}} \sqrt{\frac{μ_{2} (x) (1 - μ_{2} (x))}{(ψ_{2} + 1)}} .

(14)

Proof.

Note that the covariance and the correlation are calculated directly if, in expression (5), we see that:

\begin{aligned}
v_{j} (x) = {} & \int_{0}^{1} y_{j} ϕ_{j} (y_{j} | x^{'} β^{j}) f_{j} (y_{j} | x^{'} β^{j}) d y_{j} \\
= {} & \int_{0}^{1} y_{j} (y_{j} - μ_{j} (x)) f_{j} (y_{j} | x^{'} β^{j}) d y_{j} = E (Y_{j}^{2} | x) - μ_{j} {(x)}^{2} = V (Y_{j} | x), \quad j = 1, 2 .
\end{aligned}

□
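Expressions (13) and (14) translate directly into code. The sketch below uses hypothetical function names and parameter values; it also shows that (14) is (13) divided by the product of the marginal standard deviations.

```python
import math

def beta_variance(mu, psi):
    """V(Y | x) = mu (1 - mu) / (psi + 1)."""
    return mu * (1.0 - mu) / (psi + 1.0)

def sarmanov_cov(omega, mu1, psi1, mu2, psi2):
    """Covariance (13): omega times the product of the marginal variances."""
    return omega * beta_variance(mu1, psi1) * beta_variance(mu2, psi2)

def sarmanov_corr(omega, mu1, psi1, mu2, psi2):
    """Correlation (14): omega times the product of the marginal standard deviations."""
    return omega * math.sqrt(beta_variance(mu1, psi1) * beta_variance(mu2, psi2))

# Hypothetical values for one profile.
cov_12 = sarmanov_cov(1.5, 0.3, 10.0, 0.4, 8.0)
corr_12 = sarmanov_corr(1.5, 0.3, 10.0, 0.4, 8.0)
```

Since both quantities depend on μ_{j} (x), the same ω implies a different correlation for each covariate profile.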

The dependence parameter of the model proposed in Gupta and Wong [3], which uses kernel functions ϕ_{j} = 2 F_{j} - 1, j = 1, 2, is located in the interval - 1 \leq ω \leq 1, which is the same for all x. Our proposal bounds the dependence parameter to the narrowest interval among those obtained from all x. However, the advantage of our proposal is that it yields closed-form expressions for some magnitudes of interest, such as bivariate moments (covariance) and conditional moments. In the numerical analysis section, we also compare the correlations estimated from our model and from that of Gupta and Wong [3].

2.2. Estimation

In practice, we start from a bivariate sample of n observations. Let us denote the sample information as (Y_{i 1}, Y_{i 2}), i = 1, \dots, n, where for each i we know the values of the covariates X_{i} = {(X_{i 1}, \dots, X_{i k})}^{'}. Our objective is to estimate the parameter vectors β^{j}, the scale parameters ψ_{j}, j = 1, 2, and the dependency parameter ω by maximizing the logarithm of the likelihood function associated with the Sarmanov distribution:

\begin{aligned}
l (β^{1}, β^{2}, ψ_{1}, ψ_{2}, ω) = {} & \sum_{i = 1}^{n} \log f_{1} (Y_{i 1} | X_{i}^{'} β^{1}) + \sum_{i = 1}^{n} \log f_{2} (Y_{i 2} | X_{i}^{'} β^{2}) \\
& + \sum_{i = 1}^{n} \log (1 + ω ϕ_{1} (Y_{i 1} | X_{i}^{'} β^{1}) ϕ_{2} (Y_{i 2} | X_{i}^{'} β^{2})) \\
= {} & l_{1} (β^{1}, ψ_{1}) + l_{2} (β^{2}, ψ_{2}) + l_{12} (ω, β^{1}, β^{2}, ψ_{1}, ψ_{2}) .
\end{aligned}

(15)
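The log-likelihood (15) for the Sarmanov-Beta-GLM with logit link and kernels y_{j} - μ_{j} (x) can be written as a plain function. This is a sketch with hypothetical names and toy data, not the authors' implementation; it returns minus infinity when the dependence term is not positive, which is one simple way to flag infeasible parameter values.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def beta_logpdf(y, mu, psi):
    """Log-density of a Beta variable with mean mu and precision psi."""
    a, b = mu * psi, (1.0 - mu) * psi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)

def sarmanov_beta_loglik(y1, y2, X, beta1, beta2, psi1, psi2, omega):
    """Log-likelihood (15): l_1 + l_2 + l_12, summed over observations."""
    total = 0.0
    for obs1, obs2, row in zip(y1, y2, X):
        mu1 = inv_logit(sum(c * v for c, v in zip(beta1, row)))
        mu2 = inv_logit(sum(c * v for c, v in zip(beta2, row)))
        dependence = 1.0 + omega * (obs1 - mu1) * (obs2 - mu2)
        if dependence <= 0.0:
            return -math.inf  # outside the region where (1) is a valid pdf
        total += beta_logpdf(obs1, mu1, psi1) + beta_logpdf(obs2, mu2, psi2)
        total += math.log(dependence)
    return total

# Hypothetical toy data: intercept-only design, two observations.
X = [[1.0], [1.0]]
y1, y2 = [0.7, 0.2], [0.9, 0.4]
loglik = sarmanov_beta_loglik(y1, y2, X, [0.0], [0.0], 2.0, 2.0, omega=2.0)
```

With zero intercepts and ψ_{j} = 2 both marginals are uniform, so only the dependence term contributes to `loglik` in this toy case.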

The maximization of (15) cannot be carried out directly, because the parameter space is restricted and, as shown in expression (4), the bounds of the dependence parameter are closely related to the parameters of the marginals. Thus, infeasible solutions will often be reached during the maximization unless a careful numerical procedure is specifically designed. One way to address these difficulties is to rely on the IFM (Inference Functions for Margins) method, which has been widely used in the estimation of copulas (see [19] for a review). For the estimation of the Sarmanov distribution, the IFM method was already used by Bolancé and Vernic [10] for the case of GLM marginals with Negative Binomial distributions.

The IFM method is implemented as follows:

Initialization. The parameters for the marginals are estimated as:

({\hat{β}}^{1 (0)}, {\hat{ψ}}_{1}^{(0)}) = \arg\max_{β^{1}, ψ_{1}} l_{1} (β^{1}, ψ_{1})

(16)

({\hat{β}}^{2 (0)}, {\hat{ψ}}_{2}^{(0)}) = \arg\max_{β^{2}, ψ_{2}} l_{2} (β^{2}, ψ_{2}) .

(17)

For the initial estimation, the betareg() function of the betareg R package is used. With these parameters of the marginals, we start the iterative process in the two steps described below.

Step 1. Given the marginal parameters estimated in iteration m - 1 and taking into account the limits of the dependence parameter ω defined in (4), we estimate ω with the optim() function and the L-BFGS-B method in R, by maximizing the likelihood function for fixed values of the marginal parameters:

{\hat{ω}}^{(m)} = \arg\max_{ω} l_{ω | 12} (ω | {\hat{β}}^{1 (m - 1)}, {\hat{β}}^{2 (m - 1)}, {\hat{ψ}}_{1}^{(m - 1)}, {\hat{ψ}}_{2}^{(m - 1)}),

(18)

where l_{ω | 12} is the likelihood as a function of ω given the estimated parameters for the marginals in iteration m - 1.

Step 2. Given the dependency parameter {\hat{ω}}^{(m)} estimated in Step 1, the marginal parameters are re-estimated in iteration m as:

({\hat{β}}^{1 (m)}, {\hat{ψ}}_{1}^{(m)}, {\hat{β}}^{2 (m)}, {\hat{ψ}}_{2}^{(m)}) = \arg\max_{β^{1}, ψ_{1}, β^{2}, ψ_{2}} l_{12 | ω} (β^{1}, ψ_{1}, β^{2}, ψ_{2} | {\hat{ω}}^{(m)}),

(19)

where l_{12 | ω} is the likelihood as a function of the marginal parameters given the dependence parameter estimated in Step 1. This maximization is also performed with the optim() function and the L-BFGS-B method in R.

Steps 1 and 2 described above are repeated until reaching the convergence criterion based on the differences between parameter estimates obtained in two consecutive iterations.
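Step 1 can be illustrated with a small Python sketch. It is not the procedure used in the paper, which relies on optim() with L-BFGS-B in R: here a ternary search is swapped in, which is enough because, for fixed marginal means, the term of (15) involving ω is concave in ω. The kernels are y_{j} - μ_{j}, and all names and data are hypothetical.

```python
import math

def dependence_loglik(omega, kernel_products):
    """Sum of log(1 + omega * phi1 * phi2); -inf where the pdf would be invalid."""
    total = 0.0
    for p in kernel_products:
        term = 1.0 + omega * p
        if term <= 0.0:
            return -math.inf
        total += math.log(term)
    return total

def estimate_omega(y1, y2, mu1, mu2, lower, upper, tol=1e-8):
    """Ternary search for omega on [lower, upper], marginal means held fixed."""
    products = [(a - m1) * (b - m2) for a, b, m1, m2 in zip(y1, y2, mu1, mu2)]
    lo, hi = lower, upper
    while hi - lo > tol:
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if dependence_loglik(left, products) < dependence_loglik(right, products):
            lo = left
        else:
            hi = right
    return 0.5 * (lo + hi)

# Hypothetical toy sample with all fitted means equal to 0.5,
# so the admissible range for omega is [-4, 4].
y1 = [0.2, 0.8, 0.6, 0.4]
y2 = [0.3, 0.3, 0.7, 0.6]
mu1 = [0.5] * 4
mu2 = [0.5] * 4
omega_hat = estimate_omega(y1, y2, mu1, mu2, lower=-4.0, upper=4.0)
```

In a full IFM loop this step would alternate with a re-estimation of the marginal parameters for fixed ω, until the estimates stop changing between consecutive iterations.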

Remark 1.

In the initialization process, if the dependent variables contain zeros or ones, the correction {\tilde{Y}}_{j} = (Y_{j} (n - 1) + 0.5) / n, j = 1, 2, proposed by Smithson and Verkuilen [20], is applied.
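The correction in Remark 1 is a one-line transformation. A minimal sketch, with a hypothetical function name:

```python
def squeeze_unit_interval(values):
    """Map observations in [0, 1] into (0, 1) via (y * (n - 1) + 0.5) / n."""
    n = len(values)
    return [(y * (n - 1) + 0.5) / n for y in values]

# Hypothetical sample containing an exact zero and an exact one.
squeezed = squeeze_unit_interval([0.0, 0.5, 1.0])
```

After the transformation every value lies strictly inside (0, 1), so the Beta log-density is finite for all observations.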

In practice, the algorithm described above is based on the optimization of conditional likelihood functions and not on the likelihood function defined in (15). However, in the last stage, the parameters estimated with the IFM method can be used as initial parameters in the process of maximizing the full likelihood function defined in (15). For this purpose, function optim() and method L-BFGS-B of R are used again.

Remark 2.

To estimate the Sarmanov model proposed by Gupta and Wong [3], it is not necessary to use the two-step process, since the bounds of the dependence parameter do not depend on the parameters of the marginal distributions.

3. Numerical Analysis

We analyse a database corresponding to a car insurance portfolio, in which some of the variables have been measured via a telematics system. The objective of our analysis is to model the joint behaviour of the percentage of kilometres driven above the posted speed limits (Y_{1}) and the percentage of kilometres driven at night (Y_{2}). It is well known that both variables are related to the risk of having an accident. In Table 1, we show the main descriptive statistics of the dependent variables and the covariates used in the modelling process. For the estimation of the Sarmanov-Beta-GLM, the dependent variables have been transformed as indicated in Remark 1 in Section 2.2. Furthermore, to avoid very low coefficient values due to the scale of some covariates, the variables age (X_{1}), age of driving license (X_{2}) and age of the vehicle (X_{5}) have been divided by 10; the vehicle power variable (X_{6}) has been divided by 100 and the total annual distance driven in kilometres (X_{7}) has been divided by 1000. In addition, we have included the driver’s gender (X_{3}) and an indicator of private garage (X_{4}) as covariates.
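The rescaling of the covariates can be sketched as follows. The dictionary keys are hypothetical field names, not the names used in the dataset; the example record uses the values given for Profile 1 later in this section.

```python
# Divisors stated in the text: ages by 10, power by 100, annual distance by 1000.
SCALE = {
    "age": 10.0,          # X1
    "licence_age": 10.0,  # X2
    "vehicle_age": 10.0,  # X5
    "power_hp": 100.0,    # X6
    "annual_km": 1000.0,  # X7
}

def rescale_covariates(record):
    """Divide the scaled covariates by their divisor; leave indicators untouched."""
    return {k: (v / SCALE[k] if k in SCALE else v) for k, v in record.items()}

# Values of Profile 1; "male" (X3) and "parking" (X4) are binary indicators.
profile_1 = {"age": 27, "licence_age": 7, "male": 1, "parking": 1,
             "vehicle_age": 8, "power_hp": 100, "annual_km": 7000}
scaled_profile_1 = rescale_covariates(profile_1)
```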

The last row of Table 1 shows the Pearson correlation between the two dependent variables. This correlation is compared with the corresponding estimate obtained from the Sarmanov model with Beta marginals proposed here and with the one proposed by Gupta and Wong [3], from now on the GW model. With this objective, Table 2 shows the dependence parameters estimated with both models, and the AIC and BIC statistics, both without and with covariates. Using expression (14) and without covariates, from the dependence parameter \hat{ω} = 14.883, it can be deduced that the estimated correlation is 0.0601, which is within the confidence interval of the Pearson correlation shown in the last row of Table 1. By contrast, if we use the five-parameter Beta distribution, the (residual) correlation obtained from the numerical calculation of expression (5) is practically zero. This means that the association is captured by the bivariate model. Comparing both models, with and without covariates, using the AIC and BIC statistics, the results of Table 2 show that the fit is better for the model proposed here than for the GW model.
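As back-of-the-envelope arithmetic, and not a value reported in the paper, the two figures above and expression (14) imply the product of the marginal standard deviations in the model without covariates:

```latex
\sqrt{V (Y_{1}) V (Y_{2})}
  = \frac{\operatorname{corr} (Y_{1}, Y_{2})}{\hat{ω}}
  \approx \frac{0.0601}{14.883}
  \approx 0.00404 .
```

This illustrates why a dependence parameter well above 1 is compatible with a small correlation: under (14), ω is scaled by the marginal standard deviations, which are small for proportions concentrated near zero.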

Table 3 shows the results of our Sarmanov-Beta-GLM using different vectors of covariates. Model I includes all the explanatory variables, among which are the age (X_{1}), the age of the driving license (X_{2}) and the total distance driven annually (X_{7}); these three variables are associated with driving experience. To analyse the robustness of the results, age (X_{1}) is eliminated in Model II and, in addition, the age of the driving license (X_{2}) is eliminated in Model III. The results of Model I show that the effect of age is negative on both Y_{1} and Y_{2}, that the effect of the age of the driving license is positive on Y_{1} and negative on Y_{2}, and that the effect of total distance, X_{7}, is positive on both dependent variables. By eliminating age (X_{1}) in Model II, the signs of the parameters associated with X_{2} and X_{7} are maintained, although the value is smaller in the case of X_{2} and remains practically the same for X_{7}. After eliminating variables X_{1} and X_{2}, we see that the effect of the total annual distance driven remains practically the same. The effects of the remaining covariates are practically the same in Models I, II and III. A male driver (X_{3}) with a powerful vehicle (X_{6}) would have larger Y_{1} and Y_{2} than the rest, all other characteristics being the same. However, using parking at night (X_{4}) has a positive effect on the percentage of speeding distance (Y_{1}) and a negative effect on the percentage of night-time driving (Y_{2}); the opposite happens with the age of the vehicle (X_{5}). The effect of X_{5} indicates that, when the vehicle is older, drivers tend to reduce the percentage of distance driven above the speed limit, while the percentage of night-time driving is larger.

To visualize the dependence between Y_{1} and Y_{2} at different quantiles, the following three examples of insured drivers are graphically analysed:

Profile 1 corresponds to a 27-year-old man, who drives about 7000 kilometres per year, with a 7-year-old driving license, with parking, with a vehicle of about 8 years and 100 HP.
Profile 2 corresponds to a 20-year-old man, who drives about 4000 kilometres per year, with a 2-year-old driving license, with parking, with a vehicle of about 2 years and 75 HP.
Profile 3 corresponds to a 36-year-old man, who drives about 10,000 kilometres per year, with a 15-year-old driving license, without parking, with a vehicle of about 15 years and 200 HP.

Profile 1 represents the average insured individual of the portfolio; Profile 2 is a younger male driver, less experienced than Profile 1 and with a newer and less powerful vehicle; finally, Profile 3 is an older male driver, more experienced than Profile 1 and with an older and more powerful vehicle. Figure 1 represents different quantiles of the percentage of kilometres driven above the speed limit (Y_{1}) on the y-axis, given the values of the percentage of kilometres driven at night (Y_{2}) on the x-axis, for Profile 1. Quantiles have been obtained from expression (10). Note that, if the dependence parameter were zero, all the curves would remain constant. The fitted dependence structure results in conditional quantiles with a negative nonlinear relationship and, furthermore, the curves for the different quantile levels are non-parallel. Figure 1 indicates that, for Profile 1, the higher the percentage of kilometres driven at night (Y_{2}), the greater the caution in driving and, therefore, the lower the percentage of distance driven above the speed limits (Y_{1}). The same quantiles at the 75 % (plot on the left) and 95 % (plot on the right) levels are represented in Figure 2. These plots show that the curves are non-parallel and that Profile 3 is the most risky, followed by Profiles 1 and 2.

4. Conclusions

We have developed a bivariate model based on the Sarmanov distribution with marginal Beta GLM which has allowed us to model two important variables in modern motor insurance telematics databases. Our model is an alternative to a proposal previously made by Gupta and Wong [3] based on what is known as Morgenstern’s distribution, which is a particular case of the Sarmanov distribution. Our proposal allows for obtaining closed expressions for some magnitudes of interest, such as the bivariate cdf and conditioned moments, covariance and correlation, which are fundamental in risk analysis. We have shown that our Sarmanov-Beta-GLM model presents better fits than previous proposals also based on the Sarmanov distribution.

The results of our case study have shown that, for a specific example, the dependence parameter is positive, which directly implies a positive relationship between the conditional mean and the value of the conditioning variable. Nevertheless, the conditional quantiles show that the relationship between the conditional quantile and the value of the conditioning variable may be negative for high quantile levels, a result that is consistent with the expected behaviour of drivers.

Author Contributions

Conceptualization, C.B. and M.G.; methodology, C.B.; software, A.P.; validation, C.B., M.G., and A.P.; formal analysis, A.P.; investigation, A.P.; resources, M.G.; data curation, M.G.; writing—original draft preparation, C.B.; writing—review and editing, M.G. and A.P.; visualization, C.B.; supervision, C.B.; project administration, M.G.; funding acquisition, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Ministry of Science and Innovation grant PID2019–105986GB-C21, Fundación BBVA Research on Big Data and ICREA Academia.

Acknowledgments

We thank seminar participants and members of the Riskcenter, Universitat de Barcelona.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Ferrari, S.; Cribari-Neto, F. Beta regression for modelling rates and proportions. J. Appl. Stat. 2004, 31, 799–815.
2. Arnold, B.; Ng, H. Flexible bivariate Beta distributions. J. Multivar. Anal. 2011, 102, 1194–1202.
3. Gupta, A.; Wong, C. On three and five parameter bivariate beta distributions. Metrika 1985, 32, 85–91.
4. Olkin, I.; Liu, R. A bivariate beta distribution. Stat. Probab. Lett. 2003, 62, 407–412.
5. Olkin, I.; Trikalinos, T. Constructions for a bivariate beta distribution. Stat. Probab. Lett. 2015, 96, 54–60.
6. Sarmanov, O. Generalized normal correlation and two-dimensional frechet classes. Doclady Soviet Math. 1966, 168, 596–599.
7. Lee, M. Properties and applications of the sarmanov family of bivariate distributions. Commun. Stat. Theory Methods 1996, 25, 1207–1222.
8. Bairamov, I.; Altinsoy, B.; Kerns, G. On generalized Sarmanov bivariate distributions. TWMS J. Appl. Eng. Math. 2011, 1, 86–97.
9. Morgenstern, D. Einfache beispiele zweidimen-sionaler verteilungen. Mitteilingsblatt Math. Stat. 1956, 8, 234–235.
10. Bolancé, C.; Vernic, R. Multivariate count data generalized linear models: Three approaches based on the sarmanov distribution. Insur. Math. Econ. 2019, 85, 89–103.
11. Guillen, M.; Nielsen, J.; Ayuso, M.; Pérez-Marín, A. The use of telematics devices to improve automobile insurance rates. Risk Anal. 2019, 39, 662–672.
12. Ayuso, M.; Guillen, M.; Nielsen, J. Improving automobile insurance ratemaking using telematics: Incorporating mileage and driver behaviour data. Transportation 2019, 46, 735–752.
13. Pérez-Marín, A.; Guillen, M. Semi-autonomous vehicles: Usage-based data evidences of what could be expected from eliminating speed limit violations. Accid. Anal. Prev. 2019, 123, 99–106.
14. Pérez-Marín, A.; Ayuso, M.; Guillen, M. Do young insured drivers slow down after suffering an accident? Transp. Res. Part F Psychol. Behav. 2019, 62, 690–699.
15. Pérez-Marin, A.; Guillen, M.; Alcañiz, M.; Bermúdez, L. Quantile regression with telematics information to assess the risk of driving above the posted speed limit. Risks 2019, 7, 80.
16. Pesantez-Narvaez, J.; Guillen, M.; Alcañiz, M. Predicting motor insurance claims using telematics data-xgboost versus logistic regression. Risks 2019, 7, 70.
17. Sun, S.; Bi, J.; Guillen, M.; Pérez-Marín, A. Assessing driving risk using internet of vehicles data: An analysis based on generalized linear models. Sensors 2020, 20, 2712.
18. Bahraoui, Z.; Bolancé, C.; Pelican, E.; Vernic, R. On the bivariate Sarmanov distribution and copula. An application on insurance data using truncated marginal distributions. Stat. Oper. Res. Trans. SORT 2015, 39, 209–230.
19. Joe, H.; Xu, J. The estimation method of inference functions for margins for multivariate models. Open Collect. 1996.
20. Smithson, M.; Verkuilen, J. A better lemon squeezer? maximum-likelihood regression with beta-distributed dependent variables. Psychol. Methods 2006, 11, 54–71.

Figure 1. Quantiles of the percentage of kilometres driven over the speed limit ($Y_{1}$), on the y-axis, for Profile 1 given the values of the percentage of kilometres driven at night ($Y_{2}$), on the x-axis.

Figure 2. Quantiles of the percentage of kilometres driven over the speed limit ($Y_{1}$) for each driver profile given the values of the percentage of kilometres driven at night ($Y_{2}$): (left) 75% level and (right) 95% level.

Table 1. Definition of variables and descriptive statistics: mean, standard deviation (STD), minimum (Min) and maximum (Max). The last row shows the linear correlation between the dependent variables and its confidence interval at the 95% level.

Variable	Description	Mean	STD	Min	Max
$Y_{1}$	Percentage of kilometres driven above the speed limit	0.063	0.068	0.000	0.704
$Y_{2}$	Percentage of kilometres driven at night	0.069	0.064	0.000	1.000
$X_{1}$	Age of the driver	27.565	3.094	19.849	36.904
$X_{2}$	Age of the driving licence	7.174	3.053	1.810	15.910
$X_{3}$	Gender (1 = man, 0 = woman)	0.489	0.500	0.000	1.000
$X_{4}$	Night parking (1 = yes, 0 = no)	0.774	0.418	0.000	1.000
$X_{5}$	Age of the vehicle	8.749	4.174	1.938	20.468
$X_{6}$	Power of the vehicle in Horse Power (HP)	97.226	27.772	12.000	500.000
$X_{7}$	Total distance driven (km)	7159.510	4191.753	1.590	50,035.560
$ρ$	Pearson correlation between dependent variables (95% CI)	0.070 (0.057, 0.082)
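
As an illustration of how a confidence interval such as the one in the last row of Table 1 can be obtained, the following minimal sketch applies the Fisher z-transformation to a Pearson correlation. This is an assumption made for illustration only: the interval method is not stated here, and the sample size `n = 25_000` is a hypothetical placeholder, not the size of the telematics sample.

```python
import math


def pearson_ci(r, n, z_crit=1.959963984540054):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is treated as roughly
    normal with standard error 1 / sqrt(n - 3). The default z_crit is
    the two-sided 95% normal quantile.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r is the sample correlation reported in Table 1.
# n is a HYPOTHETICAL sample size chosen only for illustration.
lower, upper = pearson_ci(0.070, n=25_000)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero, as the reported one does, indicates a small but statistically significant positive linear correlation between $Y_{1}$ and $Y_{2}$.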

Table 2. Estimated dependence from Sarmanov-Beta models and goodness of fit criteria.

Model	Specification	$\hat{ω}$ (p-value)	AIC	BIC
Proposed model	No covariates	14.883 (<0.001)	−171,282.2	−171,241.5
	With all covariates	2.388 (0.055)	−177,508.8	−177,354.4
GW model	No covariates	0.002 (0.346)	−171,165.4	−171,124.8
	With all covariates	0.002 (0.356)	−177,497.2	−177,342.8

Table 3. Parameter estimates (p-values) for the Sarmanov-Beta models and goodness of fit statistics.

	Model I		Model II		Model III	
	$Y_{1}$	$Y_{2}$	$Y_{1}$	$Y_{2}$	$Y_{1}$	$Y_{2}$
Cons.	−3.055 (<0.001)	−2.556 (<0.001)	−3.819 (<0.001)	−2.975 (<0.001)	−3.796 (<0.001)	−3.061 (<0.001)
$X_{1}$	−0.339 (<0.001)	−0.185 (<0.001)	-	-	-	-
$X_{2}$	0.294 (<0.001)	−0.052 (0.018)	0.048 (0.002)	−0.187 (<0.001)	-	-
$X_{3}$	0.097 (<0.001)	0.274 (<0.001)	0.107 (<0.001)	0.281 (<0.001)	0.109 (<0.001)	0.274 (<0.001)
$X_{4}$	0.108 (<0.001)	−0.031 (0.007)	0.107 (<0.001)	−0.031 (0.007)	0.107 (<0.001)	−0.031 (0.007)
$X_{5}$	−0.043 (0.001)	0.055 (<0.001)	−0.043 (0.001)	0.055 (<0.001)	−0.043 (0.001)	0.055 (<0.001)
$X_{6}$	0.653 (<0.001)	0.077 (<0.001)	0.654 (<0.001)	0.079 (<0.001)	0.664 (<0.001)	0.038 (0.027)
$X_{7}$	0.045 (<0.001)	0.035 (<0.001)	0.046 (<0.001)	0.035 (<0.001)	0.046 (<0.001)	0.035 (<0.001)
$ϕ_{1}$	18.480 (<0.001)		18.300 (<0.001)		18.294 (<0.001)
$ϕ_{2}$	14.823 (<0.001)		14.782 (<0.001)		14.703 (<0.001)
$ω$	2.388 (0.055)		2.325 (0.059)		2.214 (0.060)
AIC	−177,508.8		−177,238.5		−177,113.5
BIC	−177,354.4		−177,100.3		−176,991.6
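
To help read the estimates in Table 3, the following minimal sketch maps a linear predictor and a precision parameter to the mean and shape parameters of a Beta distribution in the mean-precision parameterisation of Beta regression. It assumes a logit link, which is an assumption made here for illustration. It also evaluates the predictor with all covariates set to zero, so the resulting mean refers to that reference point only; how the covariates were scaled is not stated in this section.

```python
import math


def expit(eta):
    """Inverse logit: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_from_glm(eta, phi):
    """Return (mu, a, b, variance) for a Beta mean-precision model.

    ASSUMPTION: logit link, so mu = expit(eta). The Beta shapes are
    a = mu * phi and b = (1 - mu) * phi, which gives
    Var(Y) = mu * (1 - mu) / (1 + phi).
    """
    mu = expit(eta)
    a = mu * phi
    b = (1.0 - mu) * phi
    variance = mu * (1.0 - mu) / (1.0 + phi)
    return mu, a, b, variance


# Model I, Y1: constant and precision taken from Table 3,
# with every covariate set to zero (illustrative reference point).
mu1, a1, b1, var1 = beta_from_glm(eta=-3.055, phi=18.480)
print(f"mu = {mu1:.4f}, a = {a1:.3f}, b = {b1:.3f}, sd = {math.sqrt(var1):.4f}")
```

Under these assumptions, a positive coefficient raises the expected share of kilometres on the logit scale, while a larger precision $ϕ$ concentrates the distribution around its mean.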

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bolancé, C.; Guillen, M.; Pitarque, A. A Sarmanov Distribution with Beta Marginals: An Application to Motor Insurance Pricing. Mathematics 2020, 8, 2020. https://doi.org/10.3390/math8112020

