1.1. Information Criteria and Generalization Error
A well-known result by Stone [2] shows that the MLE $\hat{\theta}$ is a biased estimator of the minimizer of the KL divergence $D(f \| g_\theta)$, because the likelihood is evaluated on the data $x$, which was used to fit $\hat{\theta}$. Cross-validation was developed as a model selection technique to select the model from a group that actually minimizes $D(f \| g_\theta)$, and not merely the in-sample likelihood, in the limit of large $n$. Takeuchi [3] and Akaike [4] explicitly modeled this bias (generalization error) of an estimation procedure $\hat{\theta}(x)$.
Definition 6 (Generalization Error of estimation procedure $\hat{\theta}$).

$$b(\hat{\theta}) = E_x\!\left[E_y\!\left[-\frac{1}{n}\sum_{i=1}^{n}\ln f\!\left(y_i \mid \hat{\theta}(x)\right)\right] + \frac{1}{n}\sum_{i=1}^{n}\ln f\!\left(x_i \mid \hat{\theta}(x)\right)\right] \quad (7)$$

Here $y$ is a second sample drawn from the true distribution $f$, independent of the fitting data $x$.

Akaike's Information Criterion (AIC) [4] was one of the earliest attempts to correct for this bias. AIC is able to correct for generalization errors when comparing MLE estimates for a restricted class of models. This work was extended by Takeuchi's TIC [3] to expand the class of models, while still requiring that the MLE estimates be used for comparison.
In particular, note that it has long been known (e.g., in [4]) that for the MLE estimate $\hat{\theta}$, the bias $b$ is asymptotically

$$b = \frac{1}{n}\,\mathrm{tr}\!\left(I J^{-1}\right) + o\!\left(n^{-1}\right) \quad (9)$$

almost surely. So, for instance, if the model family contains the true distribution, then $I = J$ and $\mathrm{tr}(I J^{-1}) = m$ almost surely. Hence, for the MLE estimate $\hat{\theta}$ of a model with $m$ parameters, we have that $b = \frac{m}{n} + o(n^{-1})$ almost surely. Proofs of this fact are found in both [4], for a restricted subset of models, and [3], for a broader class of models.
The goal of AIC, TIC, and ICE is to reduce the generalization error by reducing the order of this bias term, incorporating a more negative power of n. This does not guarantee superior performance. In the case of TIC particularly, numerical instability can cause this term to have an unexpectedly large constant factor. However, if numerical instability is effectively controlled, it is expected that many problems could benefit from these techniques for moderate sample sizes, as will be shown in later sections.
1.2. TIC
In [3], Takeuchi developed the information criterion

$$\mathrm{TIC} = -\frac{1}{n}\sum_{i=1}^{n}\ln f\!\left(x_i \mid \hat{\theta}\right) + \frac{1}{n}\,\mathrm{tr}\!\left(\hat{I}\,\hat{J}^{-1}\right), \quad (10)$$

where $\hat{I}$ and $\hat{J}$ are the empirical estimates of $I$ and $J$ evaluated at $\hat{\theta}$. The second term here will periodically be referred to as the "trace term", as it appears in ICE as well in later sections. This was an extension of AIC, which had previously been developed by Akaike in [4]:

$$\mathrm{AIC} = -\frac{1}{n}\sum_{i=1}^{n}\ln f\!\left(x_i \mid \hat{\theta}\right) + \frac{m}{n}. \quad (11)$$

Here, for convenience, we use the convention that TIC (and AIC) is $O(1)$. In other work, it is often multiplied by $n$ to produce a result that is $O(n)$.
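As a concrete illustration, a minimal numpy sketch of Equations (10) and (11) under the conventions above is given below; the arrays `scores` (per-observation gradients of $\ln f$ at $\hat{\theta}$) and `hessians` (per-observation Hessians) are assumed inputs, not quantities defined in this paper.

```python
import numpy as np

def tic(nll_mean, scores, hessians):
    """Equation (10): nll_mean is the average negative log-likelihood at
    the MLE; scores is n x m; hessians is n x m x m."""
    n = scores.shape[0]
    I_hat = scores.T @ scores / n          # empirical score covariance
    J_hat = -hessians.mean(axis=0)         # empirical negative Hessian
    trace_term = np.trace(np.linalg.solve(J_hat, I_hat))  # tr(I J^{-1})
    return nll_mean + trace_term / n

def aic(nll_mean, m, n):
    """Equation (11): the trace term collapses to m when I = J."""
    return nll_mean + m / n
```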
Takeuchi then showed that AIC is a limiting case of TIC. It was shown by Stone in [
2] that AIC model selection and model selection via cross validation are equivalent whenever AIC is valid. By extension, TIC is also equivalent to cross validation under these circumstances.
If two models are to be compared using TIC or AIC, then the model with the lower value of TIC is on average the better model. Given two models, $g_1$ with $\mathrm{TIC}_1$ and $g_2$ with $\mathrm{TIC}_2$, the model $g_1$ is actually the better model with probability approximately

$$P\!\left(D(f \| g_1) < D(f \| g_2)\right) \approx \frac{1}{1 + e^{-n\left(\mathrm{TIC}_2 - \mathrm{TIC}_1\right)}}, \quad (12)$$

where $D(f \| g)$ is the KL divergence between the true distribution $f$ and a model generated distribution $g$. This follows directly from the fact that the exponential of a (scaled) TIC value approximates a likelihood, so the difference of two TIC values exponentiates to a likelihood ratio, and the logic then proceeds in the usual way for likelihood ratio statistics [5].
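Under the form of Equation (12) reconstructed above, the selection probability is a simple logistic function of the scaled TIC difference, e.g.:

```python
import numpy as np

def p_first_model_better(tic_1, tic_2, n):
    # Logistic in the scaled TIC difference; note that when many models are
    # compared, this difference must be large for selection to be reliable.
    return 1.0 / (1.0 + np.exp(-n * (tic_2 - tic_1)))
```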
In this way, TIC (as with any information criterion) can be used to select the better model from a family of fit models. However, it requires that all models be fit using maximum likelihood estimation (MLE).
1.3. Additional Information Criteria
Modern machine learning models often have a very large number of parameters, in some cases having more parameters than observations in the fitting set. Recalling Equation (9), it can be seen that using the MLE estimate $\hat{\theta}$ in this regime is likely to produce models that generalize poorly. For these models, information criteria have therefore fallen out of favor. Using an information criterion to choose the best model from a small set of models fit using MLE is unlikely to find an accurate model. If the number of models is very large, then Equation (12) dictates that the information criterion differences must be very large in order to reliably find the best model, and again the result is unlikely to perform well. Additionally, each fit of a model such as this may carry considerable expense, so producing a large number of fits to filter with information criteria may be cost prohibitive as well.
Konishi and Kitagawa [6] developed GIC, which extended TIC to no longer require MLE estimation, allowing regularization and similar generalization error reducing approaches. See [7] for an overview of typical regularization techniques that might be paired with GIC in this way. Unfortunately, GIC is not viable as written for modern machine learning models, as it still has a form similar to Equation (10), and as discussed in [1,8,9], these equations are numerically unstable for large $m$, regardless of $n$.
Ishiguro et al. developed an alternate approach named the Extended Information Criterion (EIC) in [10]. The main idea is that TIC and AIC use a Taylor expansion of the generalization error, and their correction terms are simply the leading order terms of that expansion. However, the generalization error itself (Equation (7)) actually takes the form of an expectation over the true distribution. This expectation may be estimated directly by resampling from the empirical distribution, avoiding the need for an expansion.
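A rough sketch of this resampling idea, with hypothetical helpers `fit` (returning fitted parameters) and `nll_mean` (average negative log-likelihood), might look as follows:

```python
import numpy as np

def bootstrap_bias(x, fit, nll_mean, n_boot=100, seed=0):
    """Estimate the generalization error (Equation (7)) directly by
    treating the empirical distribution as the true one."""
    rng = np.random.default_rng(seed)
    n, bias = len(x), 0.0
    for _ in range(n_boot):
        xb = x[rng.integers(0, n, size=n)]   # resample the empirical dist.
        theta_b = fit(xb)                    # refit on the resample
        # out-of-sample (original data) loss minus in-sample (resample) loss
        bias += nll_mean(x, theta_b) - nll_mean(xb, theta_b)
    return bias / n_boot
```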
Additional analysis of EIC was performed by Kitagawa and Konishi in [
11]. Their analysis indicates that EIC (in its most basic implementation) adds noise to the objective function proportional to the sample size. This means it may not be appropriate for large datasets without adjustment, and adjustments to reduce this issue are then proposed and analyzed.
1.4. ICE
In the discussions of information criteria in the previous sections, the models would be fit using MLE, or some other procedure, and then model selection would be performed afterwards using an information criterion. The exact fitting procedure is not specified. These approaches assume that some procedure can be found which will produce models with reasonable levels of accuracy, but that is hardly a given if the model has a very high parameter count m relative to the observation count n.
Though GIC allows the use of regularization (and various other techniques), $L_2$ regularization itself is not always effective. For instance, see Figures 1 and 3 from Section 4 of [1] for examples where $L_2$ regularization is not helpful. Models as simple as the estimation of the mean and variance of a Gaussian through MLE are always harmed by $L_2$ regularization. This gives good cause to believe that cases where $L_2$ regularization is not helpful, or not efficient, are fairly common. Approaches beyond regularization, such as early stopping or drop-out, tend to have hyper-parameters which can be difficult to estimate, just as regularization does. Additionally, there is little theoretical reason to believe that these approaches reduce the generalization error efficiently.
An example of a highly parameterized model is a modern MNIST challenge leader [
12] that has 1,514,187 parameters, but was fit on a dataset with only 30,000 observations. A discussion of why this often occurs within the field of machine learning is beyond the scope of this paper, but it is enough to know that this is an important use case for model fitting that is not well served by existing information criteria.
In [1], the ICE objective function is defined.

Definition 7 (ICE Objective).

$$\ell_{ICE}(\theta) = \ell(\theta) + \frac{1}{n}\,\mathrm{tr}\!\left(I(\theta)\,J(\theta)^{-1}\right), \qquad \ell(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\ln f\!\left(x_i \mid \theta\right) \quad (13)$$

Let $\hat{\theta}^{ICE}$ denote the minimizer of (13).

This takes the same form as TIC, but it was shown that, with only slightly stronger assumptions (see Theorem 1 below), the trace term from Definition 7 is still the leading order generalization error term in a neighborhood around the MLE $\hat{\theta}$.
The important properties of this objective function are encapsulated in Theorem 1.
Theorem 1 (ICE Behavior). Suppose the following conditions hold:

1. $f(x \mid \theta)$ satisfies White's regularity conditions A1–A6 (see [13]).
2. $\hat{\theta}$ is a global minimum of $\ell(\theta)$ in the compact space $\Theta$ defined in A2.
3. There exists an $\epsilon > 0$ such that $\ell(\theta') > \ell(\hat{\theta}) + \epsilon$ for all other local minima $\theta'$.
4. For $k \leq 3$, the derivative $\nabla^k_\theta \ell(\theta)$ exists, is continuous, and is bounded on an open set around $\hat{\theta}$.
5. For $k \leq 3$, the variance $\mathrm{Var}\!\left[\nabla^k_\theta \ell(\theta)\right] \to 0$ as $n \to \infty$ on an open set around $\hat{\theta}$.

Then, for sufficiently large $n$, there exists a compact subset $U$ containing $\hat{\theta}$ such that:

1. For $k \leq 3$, the derivative $\nabla^k_\theta \ell_{ICE}(\theta)$ exists, is continuous, and is bounded on $U$, almost surely.
2. For $k \leq 3$, $\mathrm{Var}\!\left[\nabla^k_\theta \ell_{ICE}(\theta)\right] \to 0$ as $n \to \infty$ on $U$, almost surely.
3. $\hat{\theta}^{ICE} \to \hat{\theta}$ as $n \to \infty$, almost surely.
4. $\hat{\theta}^{ICE}$ is a local minimum of $\ell_{ICE}$ within $U$, almost surely.
5. $b(\hat{\theta}^{ICE}) = O(n^{-2})$, almost surely.
Theorem 1 would guarantee that $b(\hat{\theta}^{ICE}) = O(n^{-2})$ if $I$ and $J$ could be known exactly. Taking the expectations over the empirical distribution, and using the estimates $\hat{I}(\theta)$ and $\hat{J}(\theta)$ for their true values, Definition 7 can be rewritten (approximately) as

$$\ell_{ICE}(\theta) \approx \ell(\theta) + \frac{1}{n}\,\mathrm{tr}\!\left(\hat{I}(\theta)\,\hat{J}(\theta)^{-1}\right). \quad (14)$$

This substitution was also used by Takeuchi in [3].
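Continuing the numpy sketch from Section 1.2 (and assuming hypothetical helpers `scores(x, theta)` and `hessians(x, theta)` that return per-observation gradients and Hessians of $\ln f$), the direct, unstabilized form of Equation (14) would be:

```python
import numpy as np

def ice_objective(x, theta, nll_mean, scores, hessians):
    """Equation (14), evaluated at an arbitrary theta during optimization;
    later sections replace the explicit inverse with stabilized
    diagonal approximations."""
    n = len(x)
    G = scores(x, theta)                       # n x m score matrix
    I_hat = G.T @ G / n
    J_hat = -hessians(x, theta).mean(axis=0)   # m x m
    trace_term = np.trace(np.linalg.solve(J_hat, I_hat))
    return nll_mean(x, theta) + trace_term / n
```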
Equation (14) (rather than Definition 7) will be the starting point for the analysis in the remainder of the paper. It is expected that, since $\hat{I} \to I$ and $\hat{J} \to J$, this equation would converge to Definition 7 and its minimizer to $\hat{\theta}^{ICE}$; however, that will not be proven here since it is orthogonal to the analysis performed.
This paper is primarily concerned with the empirical consequences of approximating Equation (
14) through various means. The consequence of using Equation (
14) instead of Definition 7 is not directly observable or relevant to that analysis. As the analysis below makes clear, numerical instability would make any approach using the Hessian directly unviable, regardless of whether or not one could gain access to the actual true Hessian.
For background on the consequences of using an empirical approximation to the Hessian or Fisher Information in lieu of the actual unobservable value, see [
14].
Experiments in [1] showed that the neighborhood of validity for this approach is typically large enough to contain $\hat{\theta}^{ICE}$. Thus, if some care is taken with the optimization itself (techniques for this are also described in [1]), then this approach is quite widely applicable.
For realistic data set sizes $n$, reducing the generalization error $b$ from $O(n^{-1})$ to $O(n^{-2})$ might effectively eliminate overfitting, without requiring any hyper-parameters or cross-validation. For an analysis of the scale of leading order bias terms, see [15], where it is seen in numerical simulations that first order corrections such as this can drastically reduce generalization errors.
Notice that for any models fit using ICE, it is sufficient to compare values of $\ell_{ICE}$ for model selection, as these are also valid approximations of TIC values. Both TIC and ICE approximate the log likelihood that would be computed using cross-validation, and if computed at $\hat{\theta}$, the MLE parameter estimate, Equations (14) and (10) are identical.
As ICE is a generalization of TIC, most of this paper will focus on ICE, with the exception of sections comparing TIC to AIC.
The ICE approach, as with TIC, has a few main drawbacks:
- Computation of $\hat{J}^{-1}$ is expensive.
- Computation of $\hat{J}^{-1}$ is numerically unstable.
- Since $\hat{J}$ must be positive definite, this is only valid in a neighborhood around the MLE.
- Computing derivatives of $\mathrm{tr}(\hat{I}\hat{J}^{-1})$ is expensive and potentially unstable.
Several proposals are made in [
1] to overcome some of these issues. The purpose of this paper is to further analyze some of those proposals in the context of a large and more realistic problem, while also contributing additional improvements.
1.7. Gradient Computation
To utilize TIC, the approximation in Equation (
16) is sufficient; however, efficient computation of ICE also requires an approximate derivative. Since this derivative will be used only in optimizers, it need not be exact, but it is helpful if it generally has a high cosine similarity with the true derivative of Equation (
16).
The gradient of the ICE objective function may be derived from Equation (14), and written as

$$\nabla \ell_{ICE}(\theta) = \nabla \ell(\theta) + \frac{1}{n}\,\nabla\,\mathrm{tr}\!\left(\hat{I}(\theta)\,\hat{J}(\theta)^{-1}\right), \quad (17)$$

which simplifies to

$$\nabla \ell_{ICE}(\theta) = \nabla \ell(\theta) - \frac{2}{n^{2}}\sum_{i=1}^{n}\hat{J}_i\,\hat{J}^{-1} g_i - \frac{1}{n}\,\mathrm{tr}\!\left(\hat{I}\,\hat{J}^{-1}\left(\nabla \hat{J}\right)\hat{J}^{-1}\right), \quad (18)$$

where $g_i = \nabla_\theta \ln f(x_i \mid \theta)$ and $\hat{J}_i$ is the value of $\hat{J}$ computed from the single observation $x_i$. Notice that $\hat{J}_i\,\hat{J}^{-1}$ does not reduce to the identity, since it is a multiplication of the $J$ matrix computed for a single observation against the inverse computed over all observations.

Direct computation of this quantity costs $O(m^3 + nm^2)$ time, so an approximation is needed. Begin by applying the approximation from Equation (16) to obtain

$$\nabla \ell_{ICE}(\theta) \approx \nabla \ell(\theta) + \frac{1}{n}\sum_{k=1}^{m}\left[\frac{\nabla \hat{I}_{kk}}{\hat{J}_{kk}} - \frac{\hat{I}_{kk}\,\nabla \hat{J}_{kk}}{\hat{J}_{kk}^{2}}\right]. \quad (19)$$

This is much improved, but still requires the computation of $\nabla \hat{I}_{kk}$ and $\nabla \hat{J}_{kk}$, both of which cost $O(nm^2)$ in time and $O(m^2)$ in space. Further approximations are available to us here due to the fact that the optimizers that will use this gradient do not need its exact value. It is enough if the gradient is generally pointing "downhill" with respect to the objective function, and it is not necessary for it to be very accurate other than that. This translates to a requirement that the gradient approximation typically has a positive cosine-similarity with respect to the actual value.
If n is not too small, then the trace term is small compared to $\ell(\theta)$, and $\nabla \ell(\theta)$ is small at $\hat{\theta}$, but $D$ is not, where $D$ is the diagonal matrix with entries $\hat{J}_{kk}$. Similarly, $\nabla D$ need not be especially small or large in the neighborhood of $\hat{\theta}$. Therefore, near $\hat{\theta}$, the first correction term should be larger than the second, having only one factor of $D^{-1}$ instead of two.
Behavior far from $\hat{\theta}$ would generally be less important than behavior near this optimum, since it is expected that $\nabla \ell(\theta)$ would dominate these gradient terms in that case. Additionally, the numerical stabilization discussed in the next section (Equation (23)) will also tend to reduce the importance of any term containing $D$ outside of the region where $D$ is positive definite, by forcing (in the limit) that each stabilized trace element approaches the constant $1.0$. In this limit, both of the correction terms would become a simple scalar multiple of $\nabla \ell(\theta)$, which would make them irrelevant to the optimization.
This reduces the equation to

$$\nabla \ell_{ICE}(\theta) \approx \nabla \ell(\theta) + \frac{1}{n}\sum_{k=1}^{m}\frac{\nabla \hat{I}_{kk}}{\hat{J}_{kk}}. \quad (20)$$

The computation here is still $O(nm^2)$, but one more approximation can be applied. Near $\hat{\theta}$, asymptotically, $\hat{J}_i \approx g_i g_i^T$ under certain conditions. Therefore, making this substitution produces

$$\nabla \ell_{ICE}(\theta) \approx -\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{2}{n}\,g_i^T D^{-1} g_i\right) g_i, \quad (21)$$

where now the inner quantity is the original ICE correction for this specific observation, and that is used as a weight for the unadjusted gradient. This quantity can then be stabilized using the techniques in Equation (23). Computationally, this is extremely efficient, requiring only $O(mn)$ time and space. This is one of the approaches that will be considered for gradient calculation.
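A minimal numpy sketch of the weighted-gradient form of Equation (21), under the reconstruction above (and before the stabilization of Equation (23)), is:

```python
import numpy as np

def ice_gradient_eq21(G, D_diag):
    """G: n x m matrix of per-observation gradients of ln f;
    D_diag: length-m diagonal of J-hat, e.g. carried over from the
    previous optimizer iteration as discussed below."""
    n = G.shape[0]
    corr = (G * G / D_diag).sum(axis=1)   # g_i^T D^{-1} g_i, per observation
    weights = 1.0 + 2.0 * corr / n        # per-observation ICE weight
    return -(weights[:, None] * G).mean(axis=0)   # O(mn) overall
```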
The only remaining difficulty here is that this computation requires either two passes, or $O(mn)$ space, because the matrix $D$ must be computed first, and then applied element by element using Equation (21). Therefore, the gradients must either be computed twice (since they are used in the computation of $D$), or their values stored. Alternatively, at a minor cost in accuracy, the $D$ from the previous iteration could be used. This approach was used in the mortgage model examined here.
An alternative to the approximation in Equation (21) is to assume instead that $\hat{J}_i \approx D_i$, here using the diagonal matrix $D_i$ in place of the full $\hat{J}_i$. If that approximation is used, then

$$\nabla \ell_{ICE}(\theta) \approx -\frac{1}{n}\sum_{i=1}^{n}\left(g_i + \frac{2}{n}\,D_i D^{-1} g_i\right). \quad (22)$$

This approximation may also be computed in $O(mn)$ time and space. However, this computation appears to be less stable than Equation (21), due to the likelihood of the non-positive definiteness of $D_i$ for some observations $x_i$, and it will be seen in later sections that this is indeed the case.
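The analogous sketch for Equation (22), with `H_diag` holding a hypothetical n x m array of per-observation Hessian diagonals $D_i$, is:

```python
import numpy as np

def ice_gradient_eq22(G, H_diag, D_diag):
    """Equation (22): rows of H_diag may be non-positive for individual
    observations, which is the source of the instability noted above."""
    n = G.shape[0]
    corr = H_diag / D_diag                # D_i D^{-1}, element-wise
    return -((G + 2.0 * corr * G / n)).mean(axis=0)
```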
The results section will analyze both Equations (
21) and (
22) numerically to determine which approach is more effective in this problem space.
1.8. Numerical Stabilization
Equations (16), (21) and (22) can all suffer from the potential singularity or ill conditioning of $D$. This is a more severe problem for ICE than for TIC, since ICE must necessarily operate far from the MLE optimum $\hat{\theta}$, where $D$ may be actually singular or not positive definite.
The analysis performed in [1] shows only that the trace term is the leading order generalization error term in a neighborhood $U$ around $\hat{\theta}$, and it need not even be positive outside of that neighborhood. Additional theories around the relationships between $\hat{\theta}$ and $\hat{\theta}^{ICE}$ are not developed here, but should $\hat{\theta}^{ICE}$ be close enough to $\hat{\theta}$ that it falls within $U$, and hence $J$ is positive definite, then it is sufficient to ensure that the optimization over $\theta$ is able to reach $U$, and then within that neighborhood it can converge to $\hat{\theta}^{ICE}$.
First of all, to improve numerical stability, we truncate to zero any gradient element $g_k$ with $|g_k| < \sqrt{\epsilon}\,\|g\|$, where here $\epsilon$ is double precision machine epsilon, approximately $2.2 \times 10^{-16}$. These terms are too small to change the outcome of a dot product within a machine error. A vector so truncated is indistinguishable via dot products from one that has not been; however, it is possible for such terms to add numerical instability due to rounding errors in the computation of $\hat{I}_{kk}/\hat{J}_{kk}$. Similarly, for each element $\hat{J}_{kk}$ of $D$, the value of $\hat{I}_{kk}/\hat{J}_{kk}$ is computed as

$$w\,\frac{\hat{I}_{kk}}{\hat{J}_{kk}} + (1 - w),$$

where the weight $w$ is computed as

$$w = \min\!\left(1,\;\max\!\left(0,\;\frac{\hat{J}_{kk}}{\sqrt{\epsilon}}\right)\right). \quad (23)$$
This weight is a continuous function of $\hat{J}_{kk}$, and goes to zero as $\hat{J}_{kk}$ becomes small enough that it is dominated by rounding errors. In addition, for negative values of $\hat{J}_{kk}$, when multiplied by the square of the gradient in Equation (16), the term becomes $1.0$, thus preventing instability from forming when the optimizer is not near the MLE solution and $\hat{J}$ is not positive definite.
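A numpy sketch of this stabilization, combining the gradient truncation with the reconstructed weight of Equation (23) (the $\sqrt{\epsilon}$ ramp threshold being an assumption), is:

```python
import numpy as np

EPS = np.finfo(np.float64).eps      # about 2.2e-16

def truncate_gradient(g):
    """Zero out elements too small to change any dot product with g."""
    keep = np.abs(g) >= np.sqrt(EPS) * np.linalg.norm(g)
    return np.where(keep, g, 0.0)

def stabilized_trace_elements(I_diag, J_diag):
    """Equation (23): blend each I_kk / J_kk toward the constant 1.0 as
    J_kk shrinks toward zero or becomes negative."""
    w = np.clip(J_diag / np.sqrt(EPS), 0.0, 1.0)
    safe_J = np.where(J_diag > 0.0, J_diag, 1.0)   # avoid dividing by <= 0
    return w * (I_diag / safe_J) + (1.0 - w)
```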
Geometrically, this means that far outside of $U$ the trace term is approximately the constant $\frac{m}{n}$ for sample size $n$, thus the optimizer will move towards $U$ if the MLE optimization would have done so. As the optimizer draws closer to $U$, individual elements of the trace term start to take on values other than $1.0$. Deep within $U$, the objective $\ell_{ICE}$ is essentially unchanged from the value that it would have in the absence of this correction, and thus the optimizer can freely converge to $\hat{\theta}^{ICE}$. Proving that the adjustment from Equation (23) will always allow convergence to $\hat{\theta}^{ICE}$ if $\hat{\theta}^{ICE} \in U$ is beyond the scope of this paper. Qualitatively, this would be expected to usually work, and this behavior is analyzed numerically in later sections.