We achieve improved TC seasonal forecasting with a new variable selection and model averaging approach. Variable selection is performed over the Poisson regression model space by our stochastic search algorithm MGRS. Model averaging then integrates, and accordingly improves, the forecasting performance of a small number of optimal Poisson models determined by variable selection.
3.2. Model Selection and Averaging
We proceed to present the statistical and computational details underpinning model selection and averaging. Let $\gamma = (\gamma_1, \ldots, \gamma_p)^\top \in \{0,1\}^p$ be a binary vector. Define a subset of the covariate vector $x = (x_1, \ldots, x_p)^\top$, denoted $x_\gamma$, such that $x_j \in x_\gamma$ if and only if $\gamma_j = 1$. Namely, the $j$th element of $\gamma$ indicates whether the $j$th covariate $x_j$ is included in the underlying candidate Poisson regression model. The candidate model specified by $\gamma$, or equivalently by $x_\gamma$, denoted by $M_\gamma$, is written as
$$\log \lambda = \beta_0 + x_\gamma^\top \beta_\gamma, \qquad (2)$$
where $\beta_\gamma$ is a sub-vector of the coefficient vector $\beta$ indexed by $\gamma$, which can also be estimated by the maximum likelihood method. Let $\Gamma = \{0,1\}^p$ be the space of all candidate models. Now we can use $\gamma \in \Gamma$ to represent the corresponding candidate model $M_\gamma$. By model selection we mean finding the best model $\gamma^*$, or equivalently $M_{\gamma^*}$, in the candidate model space $\Gamma$.
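As a small illustration of the binary-vector encoding of candidate models (the covariate names below are hypothetical placeholders, not the paper's actual predictors):

```python
# Illustration of the binary-vector encoding of candidate models.
# Covariate names are hypothetical placeholders, not the paper's predictors.
covariates = ["SST", "SOI", "OLR", "U850", "NINO34"]
p = len(covariates)

gamma = [1, 0, 1, 0, 1]  # gamma_j = 1  <=>  j-th covariate enters the model

# x_gamma: the covariate subset indexed by gamma
x_gamma = [name for name, g in zip(covariates, gamma) if g == 1]
print(x_gamma)

# |Gamma| = 2^p candidate models in total
print(2 ** p)
```

Each distinct 0/1 pattern corresponds to exactly one candidate Poisson model, so the model space has $2^p$ elements.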
To evaluate the utility of a candidate model (2), the Akaike information criterion (AIC) ([37]) is a popular and widely used choice (e.g., [38]). It is efficient and aims to select the model with the smallest predictive mean squared error; that is, the model with the minimal AIC value is expected to have the best prediction performance. However, AIC has a tendency to over-fit, especially in cases with small samples. Recall that there are only 45 samples in the seasonal tropical cyclone data set in Section 2. This suggests an improvement to be made on AIC. One such improvement, called the corrected AIC (AICc), is developed in [22] and is defined for each model $M_\gamma$ as
$$\mathrm{AICc}(\gamma) = -2\,\ell(\hat\beta_\gamma) + 2 q_\gamma + \frac{2 q_\gamma (q_\gamma + 1)}{n - q_\gamma - 1},$$
where $\ell(\hat\beta_\gamma)$ is the log-likelihood for model $M_\gamma$ evaluated at the maximum likelihood estimator $\hat\beta_\gamma$, $n$ is the sample size, and $q_\gamma$ is the number of parameters included in the model $M_\gamma$.
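A minimal sketch of the AICc computation, using the small-sample penalty form above together with a toy intercept-only Poisson log-likelihood (the data values here are made up for illustration):

```python
import math

def aicc(loglik, n, q):
    """Corrected AIC: AIC plus a small-sample penalty term.
    loglik: maximized log-likelihood; n: sample size; q: number of parameters."""
    aic = -2.0 * loglik + 2.0 * q
    return aic + (2.0 * q * (q + 1)) / (n - q - 1)

def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y at fitted means mu."""
    return sum(-m + yi * math.log(m) - math.lgamma(yi + 1)
               for yi, m in zip(y, mu))

# Toy example: intercept-only fit, whose MLE mean is the sample mean
y = [3, 5, 4, 6, 2]
mu = [4.0] * 5                      # mean(y) = 4
ll = poisson_loglik(y, mu)
print(round(aicc(ll, n=5, q=1), 3))
```

The extra term vanishes as $n \to \infty$, so AICc agrees with AIC for large samples while penalizing complexity more heavily for small $n$ such as the 45-season data set here.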
Treating AICc as a function on the candidate model space $\Gamma$, finding the best model is equivalent to finding the minimal value of $\mathrm{AICc}(\gamma)$. When $p$ is small, it is feasible to apply an all-subset selection procedure that calculates $\mathrm{AICc}(\gamma)$ for every model $\gamma \in \Gamma$. However, the number of candidate models $|\Gamma| = 2^p$ grows exponentially with $p$, and all-subset selection becomes infeasible even when $p$ is moderately large. A step-wise selection method has been used to tackle this difficulty; however, it cannot guarantee selection of the globally best model, only a locally optimal one, see [39].
To address the aforementioned difficulties, in this paper we turn to computationally feasible stochastic search approaches, which are capable of finding the globally best model with probability 1 in the limit. Specifically, we propose a stochastic search algorithm of Metropolis–Gibbs sampling with random scan, denoted by MGRS, for model selection from $\Gamma$. This new algorithm is motivated by the Gibbs sampler variable selection method developed in [23] but improves on its computing performance. We describe the new algorithm in the following.
Define a probability distribution $\pi$ on the candidate model space $\Gamma$ by
$$\pi(\gamma) = Z^{-1} \exp\{-k\,\mathrm{AICc}(\gamma)\}, \qquad \gamma \in \Gamma,$$
where $Z = \sum_{\gamma \in \Gamma} \exp\{-k\,\mathrm{AICc}(\gamma)\}$ is a normalization constant, and $k > 0$ is a tuning parameter. Clearly it is difficult to directly calculate the normalization constant $Z$ when $|\Gamma|$ is enormous. On the other hand, the conditional distribution of each $\gamma_j$ given the remaining components $\gamma_{-j}$ is Bernoulli, which does not involve $Z$ and has the probability of "success"
$$\pi(\gamma_j = 1 \mid \gamma_{-j}) = \frac{\exp\{-k\,\mathrm{AICc}(\gamma^{(j,1)})\}}{\exp\{-k\,\mathrm{AICc}(\gamma^{(j,0)})\} + \exp\{-k\,\mathrm{AICc}(\gamma^{(j,1)})\}},$$
where $\gamma^{(j,a)}$ denotes $\gamma$ with its $j$th component set to $a$. Generating samples of $\gamma$ from the joint distribution $\pi$ can be achieved by Gibbs sampling, by generating from all the conditional distributions $\pi(\gamma_j \mid \gamma_{-j})$, $j = 1, \ldots, p$, sequentially in a random order and progressively. By [24], each conditional distribution can be generated, not directly, but by Metropolis acceptance sampling, resulting in the aforementioned MGRS Algorithm 1.
Four remarks on the Metropolis–Gibbs random scan algorithm MGRS are in order.
First, from the definition of the probability distribution $\pi$, it is obvious that the best model, with the minimal AICc value, has the largest probability.
Second, for the tuning parameter $k$, a large $k$ puts more probability on the better models having relatively small AICc values and expedites the equilibrium of the generated Markov chain, while it tends to hold on to a locally best model; the converse is true for a small $k$. In practice, a pilot run of the algorithm is often needed to determine a plausible range for $k$. In the current seasonal TC forecasting study, we find it sufficient to set $k$ between and 10.
Third, the permutation $\sigma$ is used to harness the random scan. That is, the order in which the components of $\gamma$ are randomly and sequentially updated in each round is determined by the current permutation. This way the generated Markov chain is recurrent and reversible, so that the globally best model can be selected with probability 1 in the limit.
Last, the Metropolis sampling used to update each component $\gamma_j$ in the Gibbs sampler in effect generates $\gamma_j$ with a probability larger than the conditional probability specified by the corresponding conditional distribution in the Gibbs sampler. This implies that the resultant Markov chain converges to equilibrium faster than one generated by the Gibbs sampler alone.
Algorithm 1: Metropolis–Gibbs sampling with random scan (MGRS)
i. Randomly choose an initial model $\gamma^{(0)}$ from $\Gamma$, e.g., $\gamma^{(0)} = (0, \ldots, 0)^\top$.
ii. Suppose $\gamma^{(0)}, \ldots, \gamma^{(t)}$ have been generated; next is to generate $\gamma^{(t+1)}$. Let $\sigma$ be a random permutation of $(1, \ldots, p)$ and set $\gamma \leftarrow \gamma^{(t)}$. For $j = 1, \ldots, p$: set $\gamma'$ to $\gamma$ with its $\sigma(j)$th component flipped, and accept $\gamma'$ (vs. $\gamma$) with the probability
$$\min\bigl\{1,\ \exp\{-k\,[\mathrm{AICc}(\gamma') - \mathrm{AICc}(\gamma)]\}\bigr\};$$
if accepted, set $\gamma \leftarrow \gamma'$.
iii. Set $\gamma^{(t+1)} = \gamma$. Then repeat step ii to generate a Markov chain $S = \{\gamma^{(1)}, \gamma^{(2)}, \ldots, \gamma^{(n)}\}$, which can be regarded as samples of $\gamma$ from $\pi$ when $n$ is sufficiently large.
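The sweep structure of MGRS can be sketched as follows. To keep the sketch self-contained, a toy quadratic score stands in for AICc (fitting the actual Poisson models is outside this sketch), and the Metropolis acceptance uses the flip-proposal form described above:

```python
import math, random

def mgrs_chain(score, p, k=1.0, n_iter=200, seed=0):
    """Metropolis-Gibbs sampling with random scan over {0,1}^p.
    score(gamma) plays the role of AICc; lower is better.
    Each sweep visits the components in a fresh random order (random scan),
    proposing to flip each one and accepting the flip with probability
    min(1, exp(-k * (score(proposal) - score(current))))."""
    rng = random.Random(seed)
    gamma = [0] * p                      # initial model, e.g., the null model
    chain = []
    for _ in range(n_iter):
        order = list(range(p))
        rng.shuffle(order)               # random permutation for this sweep
        for j in order:
            prop = gamma.copy()
            prop[j] = 1 - prop[j]        # flip the j-th inclusion indicator
            accept_prob = min(1.0, math.exp(-k * (score(prop) - score(gamma))))
            if rng.random() < accept_prob:
                gamma = prop             # Metropolis acceptance
        chain.append(gamma.copy())
    return chain

# Toy score whose unique minimizer is gamma = (1, 1, 0, 0)
target = [1, 1, 0, 0]
toy_score = lambda g: sum((gi - ti) ** 2 for gi, ti in zip(g, target))

chain = mgrs_chain(toy_score, p=4, k=3.0, n_iter=300)
best = min(chain, key=toy_score)
print(best)
```

In the real application, `score` would refit the Poisson model indexed by `gamma` and return its AICc, so each sweep costs $p$ model fits.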
Let $S = \{\gamma^{(1)}, \ldots, \gamma^{(n)}\}$ be the generated Markov chain sequence. It can be shown that $S$ is an irreducible and reversible Markov chain with its stationary distribution being $\pi$, because $\pi$ satisfies the detailed balance condition, cf. [40].
Practically, we choose to use the I-chart method to determine whether the sample sequence $S$ reaches equilibrium after taking a burn-in period ([41]). This is equivalent to checking whether the corresponding sequence of AICc values has reached equilibrium. With this generated AICc sequence, we can plot a modified individual-point control chart, or I-chart, to do the checking. Knowing that AICc is actually a function of the random variable $\gamma$, which follows the probability distribution $\pi$ defined on $\Gamma$, it can be easily proved that $\mathrm{AICc}(\gamma) \geq \min_{\gamma' \in \Gamma} \mathrm{AICc}(\gamma')$ for any $\gamma \in \Gamma$. Using this result, the lower control limit of the I-chart can be set to the minimum of the AICc series. The upper control limit of the I-chart can be chosen, with a multiplier $b$, from $\hat{s}^2$, $\bar{a}$, and $a_{\min}$, which are, respectively, the sample variance, the sample mean, and the sample minimum of the second half of the AICc series.
The generated AICc sequence is always above the lower control limit. On the other hand, when this sequence has reached equilibrium and $\hat{s}^2$, $\bar{a}$, and $a_{\min}$ are consistent estimates, only a small fraction of the AICc sequence should lie above the upper control limit. When this is the case, we can claim that there is no significant statistical evidence that the AICc sequence, and accordingly the generated model sequence $S$, have not reached equilibrium. Consequently, we treat the model sequence $S$ as though it were generated from the probability distribution $\pi$, and then perform model selection and model averaging based on $S$ and the associated AICc sequence. The I-chart proposed here is said to be created by a level rule; usually we use a level 90% rule to create the I-chart, corresponding to a specific choice of $b$.
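A sketch of the equilibrium check on a toy AICc series. The upper-limit formula here is an assumption for illustration (mean plus a multiple $b$ of the standard deviation of the second half); the paper's exact construction may differ:

```python
import statistics

def ichart_limits(aicc_seq, b=1.28):
    """Control limits for an AICc I-chart (assumed form, for illustration).
    LCL: minimum of the whole series (AICc can never fall below it).
    UCL: sample mean + b * sample sd of the second half of the series."""
    lcl = min(aicc_seq)
    half = aicc_seq[len(aicc_seq) // 2:]
    ucl = statistics.mean(half) + b * statistics.stdev(half)
    return lcl, ucl

# Toy AICc trace: an initial burn-in descent, then fluctuation near a plateau
seq = [210.0, 190.0, 175.0, 172.0, 171.0,
       172.5, 171.2, 171.8, 172.1, 171.5]
lcl, ucl = ichart_limits(seq)

# Fraction of the post-burn-in values above the upper control limit
second_half = seq[len(seq) // 2:]
frac_above = sum(a > ucl for a in second_half) / len(second_half)
print(lcl, round(ucl, 2), frac_above)
```

If only a small fraction of the plateau values exceed the upper limit, the chart gives no evidence against equilibrium, and the chain can be treated as samples from $\pi$.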
Suppose the $n$ generated models $\gamma^{(1)}, \ldots, \gamma^{(n)}$ have formed a Markov chain in equilibrium, which is then also true for the sequence $\{\mathrm{AICc}(\gamma^{(t)})\}_{t=1}^{n}$. It is easy to find a model $\hat\gamma^*$, named the estimated optimal model, such that $\mathrm{AICc}(\hat\gamma^*) = \min_{1 \le t \le n} \mathrm{AICc}(\gamma^{(t)})$. Given a small positive integer $s$, it is also feasible to identify the top $s$ models from $S$, denoted $\hat\gamma_{(1)}, \ldots, \hat\gamma_{(s)}$, such that $\mathrm{AICc}(\hat\gamma_{(1)}) \le \cdots \le \mathrm{AICc}(\hat\gamma_{(s)})$.
In addition to being used for finding the best models, $S$ and the associated AICc sequence can be used to rank the importance of each covariate component of $x$ to the response $Y$. The importance of each covariate $x_j$ is determined by the frequency of the event $\{\gamma_j = 1\}$ appearing in $S$. This frequency gives a consistent estimate of the marginal probability of $\{\gamma_j = 1\}$ with respect to the probability distribution $\pi$. The marginal distribution of $\gamma_j$ with respect to $\pi$ is Bernoulli with the probability of "success"
$$p_j = \pi(\gamma_j = 1) = \sum_{\gamma \in \Gamma:\ \gamma_j = 1} \pi(\gamma),$$
which is the probability of covariate $x_j$ being included in a model generated by the MGRS algorithm. Note that using covariate importance ranking for model selection is similar to Bayesian variable selection ([42]).
Let $S_j = \{\gamma_j^{(1)}, \ldots, \gamma_j^{(n)}\}$ be the sub-chain of $S$ consisting of all the $j$th components of the $\gamma^{(t)}$'s. Although $S_j$ may not be a Markov chain when $p > 1$, the sampling distribution of $S_j$ converges to the marginal distribution of $\gamma_j$ with respect to $\pi$, where the probability of "success" $p_j$ is estimated by
$$\hat p_j = \frac{1}{n} \sum_{t=1}^{n} \gamma_j^{(t)}.$$
Now we can use $\hat p_j$ to represent the importance of covariate $x_j$ regarding its effect on the response $Y$. The most important covariates in $x$ are therefore those having the largest values among the $\hat p_j$'s. Specifically, we can identify the important covariates by a pre-specified threshold $c$, retaining those with $\hat p_j \geq c$.
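The inclusion-frequency estimate $\hat p_j$ is simply a column mean of the chain, as the toy chain below shows:

```python
def inclusion_frequencies(chain):
    """Estimate p_j = pi(gamma_j = 1) by the frequency of gamma_j = 1
    among the chain's states (a column mean of the 0/1 chain matrix)."""
    n = len(chain)
    p = len(chain[0])
    return [sum(gamma[j] for gamma in chain) / n for j in range(p)]

# Toy chain over p = 3 covariates
chain = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]
phat = inclusion_frequencies(chain)
print(phat)

# Important covariates by a pre-specified threshold c
c = 0.5
important = [j for j, pj in enumerate(phat) if pj >= c]
print(important)
```

Covariates whose indicator is "on" in most visited models receive $\hat p_j$ near 1 and rank as the most important.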
In cases where the top $s$ models from $S$ do not have a clear-cut winner, model averaging may be a better approach to finding a new model with improved predictive performance. Recall that the top $s$ models $\hat\gamma_{(1)}, \ldots, \hat\gamma_{(s)}$ have their respective coefficient estimates $\hat\beta_{\hat\gamma_{(1)}}, \ldots, \hat\beta_{\hat\gamma_{(s)}}$. We define an extended smooth-AICc weight for each model in the top $s$ by
$$w_l = \frac{\exp\{-k\,\mathrm{AICc}(\hat\gamma_{(l)})\}}{\sum_{m=1}^{s} \exp\{-k\,\mathrm{AICc}(\hat\gamma_{(m)})\}}, \qquad l = 1, \ldots, s.$$
When $k = 1/2$, the $w_l$'s become the Akaike weights used in [43], which are commonly used among model averaging methods. Here we choose the same $k$ value used in the MGRS algorithm. Then the Poisson averaged regression model is defined as
$$\hat\lambda(x) = \sum_{l=1}^{s} w_l\, \hat\lambda_{(l)}(x), \qquad (6)$$
where $\hat\lambda_{(l)}(x)$ is the fitted Poisson mean under model $M_{\hat\gamma_{(l)}}$. When prediction is the goal, it is recommended to use the model averaging method rather than the best-model strategy ([44]). Model (6) is obtained by a Gibbs sampling induced model averaging approach, and is thus named the GMA model.
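The weighting step can be sketched as follows, with hypothetical AICc values and per-model predictions (the min-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import math

def smooth_aicc_weights(aicc_values, k=0.5):
    """Extended smooth-AICc weights: w_l proportional to exp(-k * AICc_l).
    k = 1/2 recovers the classical Akaike weights.
    Subtracting the minimum first keeps the exponentials numerically stable
    without changing the normalized weights."""
    a_min = min(aicc_values)
    raw = [math.exp(-k * (a - a_min)) for a in aicc_values]
    total = sum(raw)
    return [r / total for r in raw]

# Top s = 3 models with hypothetical AICc values
aiccs = [170.0, 171.0, 173.0]
w = smooth_aicc_weights(aiccs, k=0.5)
print([round(wl, 3) for wl in w])

# GMA-style prediction: weighted average of the per-model fitted means
mu_hat = [12.1, 11.6, 13.0]          # hypothetical per-model Poisson means
mu_avg = sum(wl * ml for wl, ml in zip(w, mu_hat))
print(round(mu_avg, 3))
```

Because the weights are a convex combination, the averaged prediction always lies between the smallest and largest per-model predictions, smoothing out the instability of picking a single winner.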