1. Introduction
Fisher information matrices are widely used for making predictions for the errors and covariances of parameter estimates. They characterise the expected shape of the likelihood surface in parameter space, subject to an assumption that the likelihood surface is a multivariate Gaussian when viewed as a function of the model parameters. Diagonal terms are the inverse variances of the parameters, conditional on all others being known, and non-zero off-diagonal terms indicate correlations between inferred parameters. Diagonal terms of the inverse Fisher matrix yield the variances of parameters when all others are marginalised over. The Cramér–Rao inequality shows that the variances deduced from the Fisher matrix are lower bounds.
Fisher matrices have been extensively used in cosmology, where future experiments have been designed in order to deduce as precisely as possible the parameters of the standard cosmological model, so-called ΛCDM (Cold Dark Matter, with a cosmological constant Λ), and they are routinely used to give “figures-of-merit” [1] for the power of each experiment. Normally, these studies are standard applications of Fisher matrix theory, often simplified by an approximation (which is very good for observations of the Early Universe) that the data are Gaussian-distributed.
In this article, I review a number of generalisations of the Fisher matrix approach. In Section 2, the derivation of the Fisher matrix for Gaussian data is sketched out; in Section 3, we consider Fisher matrices for data pairs that have errors in both x and y; in Section 4, we show how Fisher matrices may be used to estimate biases when some parameters are fixed at incorrect values; in Section 5, we explore better approximations for the likelihood surface (“DALI”), from expansions to higher order in derivatives; and in Section 6, we generalise the use of Gaussian likelihood surfaces to model selection and Bayesian evidence.
2. Gaussian Fields
In cosmology, one is very often dealing with Gaussian random fields, which are characterised statistically entirely by their mean and covariance. A pedagogical derivation of the Fisher matrix when the data are Gaussian appears in [2]. The negative log-likelihood is
$$-2\ln L = \ln\det C + (\mathbf{x}-\boldsymbol{\mu})^T C^{-1}(\mathbf{x}-\boldsymbol{\mu}) + \text{const},\qquad(1)$$
where in general both the mean vector $\boldsymbol{\mu}$ and the covariance matrix $C$ depend on the model parameters $\theta_\alpha$. If $\mathbf{x}$ represents 1-point statistics, such as Fourier coefficients, then typically $\boldsymbol{\mu}=\mathbf{0}$, and all the parameter dependence is in $C$. If $\mathbf{x}$ represents 2-point statistics, then for Gaussian fields they have only approximately a Gaussian distribution, and the analysis is only approximately correct. In this case, the covariance matrix has some parameter dependence through the 4-point function, which for Gaussian fields can be written as products of the 2-point function.
The Gaussian assumption is widely applicable in cosmology, since the quantum fluctuations that are thought to give rise to the density and radiation fields should ensure this, and limits on departures from gaussianity are very tight [3]. Defining the data matrix $D \equiv (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T$ and using the matrix identity for positive definite square matrices $\ln\det C = \mathrm{Tr}\ln C$, where Tr indicates trace, we can re-write Equation (1) as
$$-2\ln L = \mathrm{Tr}\left[\ln C + C^{-1}D\right] + \text{const}.\qquad(2)$$
Using standard comma notation for partial derivatives, $\boldsymbol{\mu}_{,\alpha} \equiv \partial\boldsymbol{\mu}/\partial\theta_\alpha$, and using the matrix identities $(\ln\det C)_{,\alpha} = \mathrm{Tr}\left(C^{-1}C_{,\alpha}\right)$ and $(C^{-1})_{,\alpha} = -C^{-1}C_{,\alpha}C^{-1}$, we find after taking two derivatives and then the expectation value,
$$F_{\alpha\beta} \equiv \left\langle -\frac{\partial^2\ln L}{\partial\theta_\alpha\,\partial\theta_\beta}\right\rangle = \frac{1}{2}\mathrm{Tr}\left[C^{-1}C_{,\alpha}C^{-1}C_{,\beta}\right] + \boldsymbol{\mu}_{,\alpha}^T C^{-1}\boldsymbol{\mu}_{,\beta}.\qquad(3)$$
The great advantage of the Fisher matrix approach is seen in this example: no data (real or simulated) are required to compute the expected log-likelihood surface, only the statistical properties of the data. This can be a big advantage if simulation is computationally expensive.
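As a concrete (purely illustrative) sketch of how little is needed, the following evaluates the standard Gaussian-data Fisher matrix for a toy straight-line model with a parameter-independent covariance, so that only the mean-derivative term contributes; the model, noise level, and sampling are our assumptions, not taken from the article.

```python
import numpy as np

# Toy model (illustrative only): straight-line mean mu_i = a + b*x_i,
# fixed diagonal covariance C = sigma^2 I, parameters theta = (a, b).
x = np.linspace(0.0, 1.0, 10)
sigma = 0.1
C = sigma**2 * np.eye(len(x))
Cinv = np.linalg.inv(C)

# Derivatives of the mean with respect to the parameters:
# mu_,a = 1 for every data point, mu_,b = x_i.
dmu = [np.ones_like(x), x]

# Fisher matrix: F_ab = mu_,a^T C^{-1} mu_,b (the covariance is
# parameter-independent here, so the trace term vanishes).
F = np.array([[da @ Cinv @ db for db in dmu] for da in dmu])

# Forecast marginal 1-sigma errors: sqrt of the diagonal of F^{-1};
# no data, real or simulated, were needed at any point.
errors = np.sqrt(np.diag(np.linalg.inv(F)))
print(F)
print(errors)
```

The conditional errors (inverse square roots of the diagonal of F) are smaller than the marginal errors whenever the off-diagonal terms are non-zero, reflecting the parameter correlation between intercept and slope.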
3. Fisher Matrix with Errors in x as Well as y
The previous section gives the standard analysis, where only the covariance of the y values is considered. Let us now consider the fairly general case where the data consist of pairs $(x_i, y_i)$, with errors in both x and y. We can compute the Fisher matrix via the application of a Bayesian hierarchical model, provided that the errors in x are small (this will be defined later). The full analysis is given in [4].
We assume $\mathbf{x}$ and $\mathbf{y}$ are length-$m$ and length-$n$ vectors (for data pairs, $m=n$, but in fact the analysis is more general), and have Gaussian errors around true values $\mathbf{X}$, $\mathbf{Y}$, with a covariance matrix $C$, which also allows correlations between $\mathbf{x}$ and $\mathbf{y}$. $\mathbf{X}$ and $\mathbf{Y}$ are not observed, being latent variables, and are essentially nuisance parameters. In fact, the $Y_i$ are not independent nuisance parameters, as they are assumed to be related precisely to $\mathbf{X}$ through a deterministic theoretical model $\boldsymbol{\mu}(\mathbf{X};\boldsymbol{\theta})$ (however, a stochastic element could easily be included). Given the observed data, $(\mathbf{x},\mathbf{y})$, we seek the posterior $p(\boldsymbol{\theta}|\mathbf{x},\mathbf{y})$. With a uniform prior for $\boldsymbol{\theta}$, this is proportional to the likelihood $L(\mathbf{x},\mathbf{y}|\boldsymbol{\theta})$. We write this as the marginalised distribution over $\mathbf{X}$ and $\mathbf{Y}$:
$$L(\mathbf{x},\mathbf{y}|\boldsymbol{\theta}) = \int d^m X\, d^n Y\; p(\mathbf{x},\mathbf{y}|\mathbf{X},\mathbf{Y})\, p(\mathbf{Y}|\mathbf{X},\boldsymbol{\theta})\, p(\mathbf{X}).\qquad(4)$$
A deterministic $\mathbf{Y}(\mathbf{X})$ relation gives a delta function,
$$p(\mathbf{Y}|\mathbf{X},\boldsymbol{\theta}) = \delta_D\!\left[\mathbf{Y}-\boldsymbol{\mu}(\mathbf{X};\boldsymbol{\theta})\right],\qquad(5)$$
and assuming a uniform prior for $\mathbf{X}$ (a more general prior is considered in [4]), integration over $\mathbf{Y}$ gives
$$L(\mathbf{x},\mathbf{y}|\boldsymbol{\theta}) \propto \int d^m X\; p\!\left[\mathbf{x},\mathbf{y}\,|\,\mathbf{X},\boldsymbol{\mu}(\mathbf{X};\boldsymbol{\theta})\right].\qquad(6)$$
We now assume that the errors in $\mathbf{x}$ are small, for which we require that we can truncate at the linear term of the Taylor expansion of $\boldsymbol{\mu}(\mathbf{X};\boldsymbol{\theta})$:
$$\boldsymbol{\mu}(\mathbf{X};\boldsymbol{\theta}) \simeq \boldsymbol{\mu}(\mathbf{x};\boldsymbol{\theta}) + T\,(\mathbf{X}-\mathbf{x}),\qquad T_{ij} \equiv \frac{\partial\mu_i}{\partial x_j},\qquad(7)$$
where $T$ is diagonal for data consisting of pairs.
We assume a multivariate Gaussian for $p(\mathbf{x},\mathbf{y}|\mathbf{X},\mathbf{Y})$ (independent of $\boldsymbol{\theta}$), and write the covariance matrix of the data in block form as
$$C = \begin{pmatrix} C_{xx} & C_{xy}\\ C_{yx} & C_{yy}\end{pmatrix}.\qquad(8)$$
Note that $C_{xy}$ is not symmetrical, nor invertible or even square in general, although $C_{xx}$ and $C_{yy}$ are. The covariance matrix may include a number of elements, such as intrinsic scatter and measurement noise, with individual covariance matrices adding to give the final $C$. We also assume that the function $\boldsymbol{\mu}$ is linear across the width of the Gaussian error distribution of $\mathbf{x}$, in which case the likelihood may be integrated analytically, as follows. We write
$$L(\mathbf{x},\mathbf{y}|\boldsymbol{\theta}) \propto \int d^m X\,\exp\left(-\frac{Q}{2}\right),\qquad(9)$$
where $Q = \mathbf{z}^T C^{-1}\mathbf{z}$, and $\mathbf{z} = (\mathbf{a},\mathbf{b})$, where $\mathbf{a}$ and $\mathbf{b}$ are $m$- and $n$-dimensional vectors: $a_i = x_i - X_i$ and $b_j = y_j - \mu_j(\mathbf{X};\boldsymbol{\theta})$, for $i = 1,\ldots,m$ and $j = 1,\ldots,n$. The inverse of $C$ in block form is
$$C^{-1} = \begin{pmatrix} E & F\\ F^T & G\end{pmatrix},\qquad(10)$$
where
$$E = \left(C_{xx} - C_{xy}C_{yy}^{-1}C_{yx}\right)^{-1},\qquad G = \left(C_{yy} - C_{yx}C_{xx}^{-1}C_{xy}\right)^{-1},\qquad F = -E\,C_{xy}C_{yy}^{-1}.\qquad(11)$$
Defining $\mathbf{b}_0 \equiv \mathbf{y}-\boldsymbol{\mu}(\mathbf{x};\boldsymbol{\theta})$, and $A \equiv E + FT + T^T F^T + T^T G\,T$, we find that $Q$ has the quadratic form
$$Q = \mathbf{a}^T A\,\mathbf{a} + 2\,\mathbf{B}^T\mathbf{a} + \mathbf{b}_0^T\, G\,\mathbf{b}_0,\qquad(12)$$
where
$$\mathbf{B} \equiv \left(F + T^T G\right)\mathbf{b}_0.\qquad(13)$$
With the definition of $Q$ in Equation (12), the Gaussian integral of Equation (9) can be performed, using
$$\int d^m a\,\exp\left(-\frac{1}{2}\mathbf{a}^T A\,\mathbf{a} - \mathbf{B}^T\mathbf{a}\right) = (2\pi)^{m/2}\left(\det A\right)^{-1/2}\exp\left(\frac{1}{2}\mathbf{B}^T A^{-1}\mathbf{B}\right),\qquad(14)$$
and noting that $A$ is independent of $\mathbf{X}$. The likelihood then simplifies to
$$L(\mathbf{x},\mathbf{y}|\boldsymbol{\theta}) \propto \left(\det C_{\rm eff}\right)^{-1/2}\exp\left(-\frac{1}{2}\mathbf{b}_0^T\, C_{\rm eff}^{-1}\,\mathbf{b}_0\right),\qquad(15)$$
where the inverse of the marginal covariance matrix of $\mathbf{y}$ is $C_{\rm eff}^{-1} = G - (F^T + GT)\,A^{-1}(F + T^T G)$. This is obtained using the Woodbury formula [5], $(P + UQV)^{-1} = P^{-1} - P^{-1}U\left(Q^{-1} + VP^{-1}U\right)^{-1}VP^{-1}$, giving
$$C_{\rm eff} = C_{yy} + T\,C_{xx}T^T - T\,C_{xy} - C_{yx}T^T.\qquad(16)$$
This is a key result. We see that this looks just like a standard Gaussian (in terms of data) likelihood, but with the covariance matrix $C$ ($C_{yy}$ in our current notation) replaced by $C_{\rm eff}$. Hence to compute the Fisher matrix, we can use the standard formula found in Equation (3) and Equation (15) of [2], and simply replace $C$ by $C_{\rm eff}$:
$$F_{\alpha\beta} = \frac{1}{2}\mathrm{Tr}\left[C_{\rm eff}^{-1}C_{{\rm eff},\alpha}C_{\rm eff}^{-1}C_{{\rm eff},\beta}\right] + \boldsymbol{\mu}_{,\alpha}^T\, C_{\rm eff}^{-1}\,\boldsymbol{\mu}_{,\beta}.\qquad(17)$$
Note that $C_{\rm eff}$ depends not only on the standard covariance $C_{yy}$, but also on the covariance in the independent variable, $C_{xx}$, the meta-covariance, $C_{xy}$, and the first partial derivatives of the model function $\boldsymbol{\mu}$. In the case of uncorrelated data pairs, the result reduces to that found in [6]. For the simple case of no correlations between $\mathbf{x}$ and $\mathbf{y}$ values ($C_{xy}=0$), and with diagonal covariance matrices $C_{xx} = \mathrm{diag}(\sigma_{x,i}^2)$ and $C_{yy} = \mathrm{diag}(\sigma_{y,i}^2)$, we recover the propagation-of-errors result that the variance of $y$ for each data point is effectively
$$\sigma_i^2 = \sigma_{y,i}^2 + \left(\frac{\partial\mu_i}{\partial x_i}\right)^2\sigma_{x,i}^2,\qquad(18)$$
and $C$ can be replaced in the standard Fisher expression, Equation (3), by a diagonal matrix with these enhanced entries.
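The effective covariance is straightforward to evaluate numerically. A minimal sketch (function name, model, and noise values are our own illustrative choices) for uncorrelated straight-line data pairs, checking that the diagonal reproduces the propagation-of-errors result:

```python
import numpy as np

# Effective covariance for data with errors in both x and y:
#   C_eff = C_yy + T C_xx T^T - T C_xy - C_yx T^T,
# where T_ij = d mu_i / d x_j (diagonal for independent data pairs)
# and C_yx = C_xy^T for a symmetric joint covariance.
def effective_covariance(Cxx, Cyy, Cxy, T):
    """Marginal covariance of y after integrating out the latent x values."""
    return Cyy + T @ Cxx @ T.T - T @ Cxy - Cxy.T @ T.T

# Toy example: straight line mu(x) = a + b*x with slope b = 2,
# uncorrelated x and y errors.
b = 2.0
sigx, sigy = 0.1, 0.3
npts = 3
Cxx = sigx**2 * np.eye(npts)
Cyy = sigy**2 * np.eye(npts)
Cxy = np.zeros((npts, npts))
T = b * np.eye(npts)        # d mu_i/d x_i = b for a straight line

Ceff = effective_covariance(Cxx, Cyy, Cxy, T)
# Each diagonal entry should be sigy^2 + b^2 sigx^2 = 0.09 + 4*0.01 = 0.13.
print(np.diag(Ceff))
```

With a nonlinear model, T would instead carry the local slope at each data point, so the effective error bars vary along the curve.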
Generalising Still Further
The analysis above is applicable not just to the simple case of data with errors in x as well as y, but to any system where the “data” depend (in a locally linear way) on any parameters that have some error associated with them.
4. Systematic Errors, or Errors from Simplified Nested Models
The Fisher matrix can also be useful to determine the errors in parameter inference that arise if one parameter is fixed at an erroneous value. This could arise in a number of contexts, such as a nuisance parameter (e.g., a calibration setting) being fixed at an incorrect value, or when considering nested models. An example of the latter would be cosmological models where the Universe is assumed to be flat. This is an example of a nested model, being a subset of a more general model, but with the curvature parameter (usually given the symbol $\Omega_k$) set to zero. In these cases, the maximum likelihood values of all the other parameters are, in general, shifted from their maximum likelihood values in the more general model. See Figure 1 for an illustration of this in two dimensions. With the usual Fisher assumption that the likelihood surface is a Gaussian function of the parameters, these shifts can be computed using the Fisher matrix.
We consider two models: $M$, which has more ($n+p$) parameters than a simpler nested model $M'$, which has $n$. The extra parameters are designated $\psi_\zeta$, and these are fixed in $M'$ at values that are $\delta\psi_\zeta$ from their maximum likelihood values in $M$. In this case, the maximum likelihood values of all other parameters of $M'$, $\theta_\alpha$, are systematically shifted by [7,8]
$$\delta\theta_\alpha = -\left(F'^{-1}\right)_{\alpha\beta} G_{\beta\zeta}\,\delta\psi_\zeta,\qquad(19)$$
where $F'$ is the Fisher matrix of the simpler model and
$$G_{\beta\zeta} \equiv \left\langle -\frac{\partial^2\ln L}{\partial\theta_\beta\,\partial\psi_\zeta}\right\rangle,\qquad(20)$$
which we recognise as a subset of the Fisher matrix.
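A short numerical sketch of this bias formula (the quadratic toy model and all numbers are our own assumptions): the full model has three parameters, the third is wrongly fixed, and the shifts of the two retained parameters follow from the Fisher blocks.

```python
import numpy as np

# Bias in retained parameters when a nested-model parameter is fixed at an
# offset delta_psi from its maximum-likelihood value:
#   delta_theta = -(F'^{-1}) G delta_psi,
# with F' the retained-parameter block and G the cross block of the full
# Fisher matrix.
x = np.linspace(0.0, 1.0, 20)
sigma = 0.1
Cinv = np.eye(len(x)) / sigma**2

# Full model mu = a + b*x + c*x^2; c plays the role of the fixed 'psi'.
dmu = np.vstack([np.ones_like(x), x, x**2])   # derivatives w.r.t. (a, b, c)
F = dmu @ Cinv @ dmu.T                         # full 3x3 Fisher matrix

Fprime = F[:2, :2]                # n x n block for the retained (a, b)
G = F[:2, 2:]                     # n x p cross block
delta_psi = np.array([0.05])      # offset of the fixed parameter c

delta_theta = -np.linalg.solve(Fprime, G @ delta_psi)
print(delta_theta)                # systematic shifts in (a, b)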
5. Beyond the Gaussian Approximation—DALI
The Fisher matrix approach assumes that the likelihood surface is a multivariate Gaussian, which will be asymptotically true near the peak, but may not be a good approximation over the range of parameter values of interest. A generalisation of the Fisher matrix is DALI, the Derivative Approximation for LIkelihoods [9], which expands the likelihood surface to include derivatives of higher order than the second. This is a rather elegant expansion, in derivatives rather than parameters, which ensures that the approximate distribution is a genuine probability distribution, i.e., it is non-negative and normalisable, non-divergent, and asymptotically approaches the true likelihood.
The starting point is a Taylor expansion of the log-likelihood:
$$\ln L(\boldsymbol{\theta}) = \ln L_p + \frac{1}{2}\left.\frac{\partial^2\ln L}{\partial\theta_\alpha\,\partial\theta_\beta}\right|_p\Delta_\alpha\Delta_\beta + \frac{1}{3!}\left.\frac{\partial^3\ln L}{\partial\theta_\alpha\,\partial\theta_\beta\,\partial\theta_\gamma}\right|_p\Delta_\alpha\Delta_\beta\Delta_\gamma + \ldots,\qquad(21)$$
where $L_p$ is a normalization constant (the likelihood at the peak), $\Delta_\alpha \equiv \theta_\alpha - \hat\theta_\alpha$, and the derivatives are evaluated at the peak, $\hat{\boldsymbol{\theta}}$, where the first derivative vanishes.
If the expansion is arranged in order of derivatives of $\boldsymbol{\mu}$, the expressions are normalisable and positive-definite. For example, to second order in the derivatives, and assuming $C$ is independent of the parameters, we have
$$-2\ln L = -2\ln L_p + \boldsymbol{\mu}_{,\alpha}^T C^{-1}\boldsymbol{\mu}_{,\beta}\,\Delta_\alpha\Delta_\beta + \boldsymbol{\mu}_{,\alpha\beta}^T C^{-1}\boldsymbol{\mu}_{,\gamma}\,\Delta_\alpha\Delta_\beta\Delta_\gamma + \frac{1}{4}\boldsymbol{\mu}_{,\alpha\beta}^T C^{-1}\boldsymbol{\mu}_{,\gamma\delta}\,\Delta_\alpha\Delta_\beta\Delta_\gamma\Delta_\delta.\qquad(22)$$
This is apparently true at every order (see [9] for the third-order expansion, and [10] for the case where the parameter dependence is in the covariance matrix).
Figure 2 shows the improvement in the expected likelihood surfaces for a supernova cosmology experiment.
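A one-parameter sketch of the second-order DALI expansion (the exponential toy model and all numbers are our own assumptions, chosen so the mean is genuinely nonlinear in the parameter). The expansion collects into a complete square, which is what guarantees it stays non-negative:

```python
import numpy as np

# Second-order DALI expansion for a one-parameter nonlinear model
# mu_i(theta) = exp(theta * x_i) with fixed diagonal covariance sigma^2 I:
#   -2 dlnL = F d^2 + (mu_,tt C^-1 mu_,t) d^3 + (1/4)(mu_,tt C^-1 mu_,tt) d^4.
x = np.linspace(0.0, 1.0, 5)
sigma = 0.5
theta0 = 1.0                       # fiducial (peak) parameter value

mu1 = x * np.exp(theta0 * x)       # first derivative  d mu / d theta
mu2 = x**2 * np.exp(theta0 * x)    # second derivative d^2 mu / d theta^2

F = mu1 @ mu1 / sigma**2           # Fisher 'matrix' (a scalar here)
S = mu2 @ mu1 / sigma**2           # cubic DALI coefficient
Qd = mu2 @ mu2 / sigma**2          # quartic DALI coefficient

def minus2_delta_lnL(delta):
    """Fisher term plus the DALI corrections of the expansion above."""
    return F * delta**2 + S * delta**3 + 0.25 * Qd * delta**4

print(minus2_delta_lnL(0.3))
```

The expression equals $|C^{-1/2}(\boldsymbol{\mu}_{,\theta}\Delta + \tfrac12\boldsymbol{\mu}_{,\theta\theta}\Delta^2)|^2$, a perfect square, so it is non-negative for every $\Delta$, unlike a generic truncated Taylor series.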
6. The Expected Bayesian Evidence—Generalising Fisher Matrices to Model Selection
At the root of the Fisher matrix formalism is the Laplace approximation, i.e., the assumption that the likelihood surface is a multivariate Gaussian when viewed as a function of the model parameters. We can generalise this to the higher-level question of model selection, where we compute the posterior probabilities of different models, given the data collected, but regardless of the model parameters. The ratio of these probabilities is the ratio of the prior model probabilities, multiplied by the “Bayes factor”, which is the ratio of the marginal likelihoods (or Bayesian evidences) of the models, where the evidence for a model $M$ is
$$p(\mathbf{d}|M) = \int d\boldsymbol{\theta}\; p(\mathbf{d}|\boldsymbol{\theta},M)\,p(\boldsymbol{\theta}|M).\qquad(23)$$
With the Laplace approximation for the first, likelihood term, and a uniform prior (which can be generalised to a Gaussian prior), we can compute the expected evidence (conditional on some fiducial set of parameters) by performing Gaussian integrals. For nested models (with $n$ and $n+p$ parameters respectively), the considerations of Section 4 on the location of the peak likelihood are relevant, and the result depends on the shifts of the fiducial parameters away from the values that are fixed in the lower-dimensional model, $\delta\psi_\zeta$. If we further approximate the expected Bayes factor as the ratio of the expected evidences, then it is (see [7] for details)
$$\langle B\rangle = (2\pi)^{-p/2}\sqrt{\frac{\det F}{\det F'}}\,\exp\left(-\frac{1}{2}\delta\theta_\alpha F_{\alpha\beta}\,\delta\theta_\beta\right)\prod_{q=1}^{p}\Delta\theta_q,\qquad(24)$$
where $\Delta\theta_q$ are the prior ranges of the additional $p$ parameters in the extended model, and the offsets $\delta\theta_\alpha$ are given by Equation (19) (for the shared parameters; for the additional parameters the offsets are $\delta\psi_\zeta$). Note that $F$ is an $(n+p)\times(n+p)$ matrix, $F'$ is $n\times n$, and $G$ is an $n\times p$ block of the full $(n+p)\times(n+p)$ Fisher matrix $F$, given by Equation (20). The expression we find is a specific example of the Savage–Dickey density ratio [11]; here we explicitly use the Laplace approximation to compute the offsets in the parameter estimates which accompany the wrong choice of model.
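The expected Bayes factor is simple to evaluate once the Fisher matrix is known. The following sketch uses an invented 3-parameter Fisher matrix and prior width (the function name and all numbers are our illustrative assumptions):

```python
import numpy as np

# Expected Bayes factor under the Laplace approximation for nested models:
#   <B> = (2 pi)^{-p/2} sqrt(det F / det F')
#         * exp(-0.5 delta^T F delta) * prod(prior widths),
# where F is the full Fisher matrix, F' its shared-parameter block, and delta
# holds the parameter offsets induced by fixing the p extra parameters.
def expected_bayes_factor(F, n, delta, prior_ranges):
    """F: (n+p)x(n+p) Fisher matrix; delta: offsets of all n+p parameters;
    prior_ranges: uniform-prior widths of the p extra parameters."""
    p = F.shape[0] - n
    Fprime = F[:n, :n]
    amp = (2.0 * np.pi)**(-p / 2.0) * np.sqrt(np.linalg.det(F) / np.linalg.det(Fprime))
    return amp * np.exp(-0.5 * delta @ F @ delta) * np.prod(prior_ranges)

# Toy example: two shared parameters, one extra parameter fixed with offset 0.3.
F = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
n = 2
delta_psi = np.array([0.3])
G = F[:n, n:]                                             # n x p cross block
delta_theta = -np.linalg.solve(F[:n, :n], G @ delta_psi)  # shifts of shared parameters
delta = np.concatenate([delta_theta, delta_psi])
B = expected_bayes_factor(F, n, delta, prior_ranges=[1.0])
print(B)
```

Because the exponential factor is at most one, increasing the offset of the fixed parameter can only reduce the expected evidence for the simpler model, which is the behaviour behind the cusp discussed for Figure 3.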
Figure 3 shows the ratio of expected evidences, assuming the Laplace approximation (as the Fisher matrix does), for nested cosmological models. Details are in the caption, but essentially one parameter is fixed in the simpler model, but allowed to vary in the more complex model. If the more complex model applies, then the data will favour the simpler model if the parameter is close to the fixed value. This is shown in the figure by the cusp in the graph: $\ln\langle B\rangle$ is positive to the left of the cusp, and negative to the right.