Wednesday, 1 August 2007

Statistical Modelling: The Bits Pt. 2

In my last post, I outlined the ideas of statistical inference: that we have a model, and we have a philosophical scheme that tells us what maths to do to estimate the parameters of the model. But knowing what maths to do is not the same as doing it in practice. The equations are often complex, and difficult or impossible to solve (anyone fancy trying to integrate in 1000 dimensions?). Because of this, a lot of methods have been devised for estimating the parameters, and particularly for estimating their variability (which, from now on, I will write about as estimating the standard error).

For maximum likelihood, the problem of getting the estimates is an optimisation problem: find the set of parameters that gives the largest likelihood. For simple linear regression, this is easy, because the equations were worked out a couple of hundred years ago, by some French mathematicians trying to work out where Paris was. For more complex models, such as generalised linear models, there are algorithms that have been shown to iterate efficiently towards the ML estimates. In still more complex models, more general or complicated search algorithms might be needed. For example, there is the EM algorithm, which is useful for dealing with missing data (so useful that the missing data is sometimes invented, in order to use it).
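To make the "easy" case concrete, here is a minimal sketch of the closed-form solution for a straight line, assuming normally distributed errors (in which case the least-squares and maximum likelihood estimates of the intercept and slope coincide). The function names and the data are made up for illustration; they are not from the original analysis.

```python
import math

def fit_line(x, y):
    """Closed-form least-squares (and Gaussian ML) estimates of a and b in y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def log_likelihood(a, b, sigma, x, y):
    """Gaussian log-likelihood of the line a + b*x with residual standard deviation sigma."""
    n = len(x)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - rss / (2 * sigma ** 2)

# Made-up illustrative data (an assumption, not real measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = fit_line(x, y)
print("intercept:", a_hat, "slope:", b_hat)

# Moving away from the estimates (with sigma held fixed) can only lower the likelihood.
print(log_likelihood(a_hat, b_hat, 1.0, x, y) > log_likelihood(a_hat, b_hat + 0.1, 1.0, x, y))
```

For models without a closed form, the same log-likelihood function would instead be handed to an iterative optimiser.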

These algorithms only give point estimates. We also want to know the standard error, and there are methods for doing this. Again, it may be that the equations can be derived directly, or can be found with a simple numerical search. For less standard problems, more complex algorithms can be used. Two popular ones are called the bootstrap and the jackknife. They both work by re-sampling the data, and using that variation to estimate the standard error. What is less well known is that they can also be used to estimate the bias in the point estimates.
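As a rough sketch of how the two re-sampling schemes work, the code below applies both to a deliberately biased estimator: the variance that divides by n rather than n - 1. The data, the function names and the number of bootstrap replicates are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def plugin_variance(sample):
    """Variance dividing by n (biased downwards), used here as the example estimator."""
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / len(sample)

def bootstrap(sample, estimator, n_boot=2000, seed=1):
    """Resample with replacement; return (standard error, bias) estimates."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [estimator([rng.choice(sample) for _ in range(n)])
                 for _ in range(n_boot)]
    theta_hat = estimator(sample)
    return statistics.stdev(estimates), statistics.mean(estimates) - theta_hat

def jackknife(sample, estimator):
    """Leave one observation out at a time; return (standard error, bias) estimates."""
    n = len(sample)
    theta_hat = estimator(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    se = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out))
    bias = (n - 1) * (loo_mean - theta_hat)
    return se, bias

# Made-up illustrative data (an assumption).
data = [4.1, 5.3, 2.8, 6.0, 5.1, 3.7, 4.9, 5.6]

boot_se, boot_bias = bootstrap(data, plugin_variance)
jack_se, jack_bias = jackknife(data, plugin_variance)
print("bootstrap SE and bias:", boot_se, boot_bias)
print("jackknife SE and bias:", jack_se, jack_bias)

# Subtracting the estimated bias gives a bias-corrected estimate.
print("jackknife-corrected variance:", plugin_variance(data) - jack_bias)
```

For this particular estimator, the jackknife correction recovers the usual n - 1 sample variance, which is a handy check that the bias estimate is doing something sensible.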

Even more complex problems have even more complex solutions. For Bayesians, the most common method for fitting their models is a technique called MCMC (Markov chain Monte Carlo). This estimates the parameters (actually their full distribution) by simulation. The technique is not simple to explain, but it is used because it is very flexible and for a lot of problems works well in practice. However, using MCMC is not the same as doing a Bayesian analysis: MCMC can be used by frequentists as well, and other complicated methods can also be used to fit Bayesian models (e.g. sequential importance sampling).
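The flavour of MCMC can be seen in a toy case. The sketch below uses a random-walk Metropolis sampler (one of the simplest MCMC algorithms) to simulate from the posterior of a normal mean, assuming a known standard deviation of 1 and a flat prior. The data, step size, chain length and burn-in are all assumptions picked for illustration, and real problems need much more care over convergence.

```python
import math
import random
import statistics

def log_posterior(mu, data, sigma=1.0):
    """Log-posterior of mu up to a constant: flat prior, so just the Gaussian log-likelihood."""
    return -sum((y - mu) ** 2 for y in data) / (2 * sigma ** 2)

def metropolis(data, n_iter=20000, step=0.5, start=0.0, seed=42):
    """Random-walk Metropolis: propose a nearby value, accept it with probability min(1, ratio)."""
    rng = random.Random(seed)
    mu = start
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            mu, current = proposal, candidate
        draws.append(mu)  # a rejected proposal repeats the current value
    return draws

# Made-up illustrative data (an assumption).
data = [4.1, 5.3, 2.8, 6.0, 5.1, 3.7, 4.9, 5.6]

draws = metropolis(data)
kept = draws[2000:]  # discard the early draws as burn-in

print("posterior mean:", statistics.mean(kept))
print("posterior s.d.:", statistics.stdev(kept))
```

In this toy case the answer is known exactly (the posterior is normal, centred on the sample mean, with standard deviation 1/sqrt(n)), so the simulation can be checked against it. The draws give the whole distribution, not just a point estimate and a standard error.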

So, now we have a model fitted. There is still one part of the process that is often done: model selection. This is not always done, and is perhaps done more often than it should be. It also creates a lot of heated debates, which may surprise some of you. That's because model selection is often disguised as hypothesis testing.

A general description of the role of model selection is that it is a way of choosing between different mathematical models for the data, to find the best one. For frequentists there are two principal ways of doing this: null hypothesis significance testing and using information criteria to compare models.

Null hypothesis significance testing (NHST) is the method everyone is taught to test hypotheses. What one does is set up a null hypothesis, for example that there is no difference between two groups, and then test whether this hypothesis is supported by the data. If it is, the null hypothesis is accepted (strictly, it is not rejected); otherwise it is rejected. In one common formulation, whenever a null hypothesis is set up, a more general alternative hypothesis is also proposed: rejecting the null hypothesis means accepting the alternative hypothesis.

Where is the model selection in this? We can take the Harry Potter example, and test the null hypothesis that the change in book length over time is zero. The relevant part of the mathematical model is

m_i = a + b X_i

where m_i is the expected number of pages, and X_i is the year of publication. The null hypothesis is that b = 0. But this is equivalent to this model

m_i = a

And m_i = a + b X_i is the alternative model for the alternative hypothesis. Hence, hypothesis testing becomes a problem of comparing two models. The models are compared by calculating the amount of variation in the data explained by the two models, and seeing if the larger model (i.e. the one with the effect of time) explains a significantly greater amount of variation.
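For a straight-line model with normal errors, one standard way of making this comparison is an F statistic built from the residual sums of squares of the two models. The sketch below assumes that setting; the data are made up (they are not the Harry Potter page counts) and the function names are mine.

```python
def rss_null(y):
    """Residual sum of squares for the null model m_i = a (a is estimated by the mean)."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)

def rss_line(x, y):
    """Residual sum of squares for the alternative model m_i = a + b X_i."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def f_statistic(x, y):
    """Variation explained by the extra parameter, relative to the leftover variation."""
    n = len(y)
    rss0, rss1 = rss_null(y), rss_line(x, y)
    extra_params = 1   # the larger model adds only the slope b
    resid_df = n - 2   # observations minus the two parameters a and b
    return ((rss0 - rss1) / extra_params) / (rss1 / resid_df)

# Made-up illustrative data (an assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

f_stat = f_statistic(x, y)
print("F statistic:", f_stat)
```

The statistic would then be compared with an F distribution on 1 and n - 2 degrees of freedom to get a p-value; that step is left out here because it needs a statistics library.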

The alternative method for model comparison uses information criteria. These are measures of model adequacy: they consist of a measure of model fit (the deviance: the smaller the deviance, the closer the fitted model is to the data), and a penalisation term for complexity (the more parameters in the model, the higher this penalisation). The different models being compared are fitted to the data, and the relevant information criterion (e.g. AIC or BIC) is calculated for each one. The model with the lowest criterion is declared the best. If there are several models with similar criteria, they are sometimes all examined, and one of them chosen to be the best, for other reasons (e.g. because it makes sense scientifically).
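As a sketch of the calculation, here are AIC and BIC for the same pair of models (with and without a slope), assuming normal errors. The deviance is taken as -2 times the maximised log-likelihood, and I count the residual variance as a parameter; both are conventions I am assuming here, and the data are again made up.

```python
import math

def residual_ss(x, y, with_slope):
    """Residual sum of squares for y = a (with_slope=False) or y = a + b*x (with_slope=True)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = 0.0
    if with_slope:
        b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def gaussian_deviance(rss, n):
    """-2 * maximised Gaussian log-likelihood, with the ML variance rss / n plugged in."""
    return n * math.log(2 * math.pi * rss / n) + n

def aic(deviance, k):
    """Fit plus a penalty of 2 per parameter."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """Fit plus a penalty of log(n) per parameter."""
    return deviance + k * math.log(n)

# Made-up illustrative data (an assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

dev_null = gaussian_deviance(residual_ss(x, y, with_slope=False), n)
dev_full = gaussian_deviance(residual_ss(x, y, with_slope=True), n)

# Parameter counts: (a, sigma) for the null model, (a, b, sigma) for the larger one.
aic_null, aic_full = aic(dev_null, 2), aic(dev_full, 3)
bic_null, bic_full = bic(dev_null, 2, n), bic(dev_full, 3, n)

print("AIC:", aic_null, aic_full)
print("BIC:", bic_null, bic_full)
```

The larger model always has the smaller deviance; the question the criterion answers is whether the improvement is big enough to pay for the extra parameter.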

Both of these methods have their counterparts in Bayesian analysis: hypothesis testing can be carried out using Bayes factors, for instance, and there is an information criterion for Bayesians (the DIC; alas, not the Bayesian information criterion!).

So, there we have it. Of course, this is not all there is in statistical inference: for example, I have not dealt with model checking (i.e. asking whether the model actually fits the data well), and there are many details I have left out. But I hope this gives a scheme that separates out the different parts of fitting statistical models, and reduces some of the confusion. To summarise, here is a list of the parts of the process, and examples of the actual methods used in each part:

  1. The mathematical model

    • regression, ANOVA

    • ARIMA (time series)

    • Generalised linear models, and Generalised linear mixed models

  2. The inferential framework

    • Frequentist (Maximum Likelihood and REML)

    • Bayesian

    • least squares

    • non-parametric

  3. Model fitting (i.e. the computations)

    • bootstrap and jackknife

    • MCMC

    • importance sampling and sequential importance sampling

  4. Model selection

    • Significance tests

    • information methods (?IC: AIC, BIC, DIC, etc)


Anonymous said...

So, how did Laplace work out the mass of Saturn?

Bob O'Hara said...

Err, perhaps he put it on some scales?

OK, it's a few years since I read about the history. I'll have to check the article where I read this, to see what Stigler wrote.

But I'll probably use the same one-liner, because it's funnier. :-)


Anonymous said...

Sigh ... try again, please.