## Monday, 30 July 2007

### Statistical Modelling: The Bits Pt. 1

It's a sad fact of life that it's too short, so we don't have time to learn everything that we need to know. For most biologists (and probably most scientists, and indeed social scientists), one of the things they need to learn in order to do science well is statistics. Of course, the poor dears are too busy learning about things like biochemistry or PCR to have time to learn about the important stuff.

Because of this, there is a lot they don't know, and also a lot of confusion about the bits they do know. I therefore thought it was worth writing a few posts about some of these issues, to try and dispel at least some of the confusion.

Now we're below the fold, I should admit that I don't mind too much the lack of knowledge - there are large swathes of statistics that I don't know about either (I was too busy admiring the paper aeroplane building skills of my fellow biochemistry students). The confusion is more of a concern, and I think it is largely because biologists get little training in statistics as undergraduates, so don't have the grounding in the theory when they start doing research. And, yes, I am generalising - some biologists do become very good statisticians. But there is still plenty of confusion.

A lot of basic confusion comes from not understanding the different parts of using models to analyse data. For the moment I am ignoring things like model checking, and how to interpret the results. This is not because they are unimportant, but just because I want to write a blog entry, not a monograph. Anyway, there are, I think, four parts that can be separated out:

1. the mathematical model
2. the inferential framework
3. model fitting (i.e. the computations)
4. model selection

Point 1 is often passed over, and points 3 and 4 are often confused with point 2. I will deal with the first two points in this post, and treat the others later.

To help disentangle the different parts, it is useful to have an example. For this, we can use a simple linear regression, such as the one I used here. The mathematical model is self-explanatory. For the regression, it can be written as

$$
\begin{aligned}
m_i &= a + b X_i \\
Y_i &\sim N(m_i, s^2).
\end{aligned}
$$

The first line is the equation for a straight line. $X_i$ is the covariate (year, in the example). This is then related to an expected value, $m_i$. The second line says that each data point, $Y_i$, is normally distributed with mean $m_i$ and variance $s^2$.

Note that the data are assumed to be drawn from some distribution. Hence, they are random. However, they depend on parameters, in this case the mean and variance. Here the mean is modelled further, using an equation, so it is deterministic, i.e. if we knew the parameters, we would know the true value of the mean.
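The split between the deterministic mean and the random data can be sketched in a few lines of Python. This is only an illustration: the function names and the parameter values (an intercept of 200, a slope of 90 and a standard deviation of 50) are invented for the example, not taken from any real data set.

```python
import random


def expected_value(a, b, x):
    """Deterministic part of the model: m_i = a + b * X_i."""
    return a + b * x


def simulate(a, b, s, xs, seed=None):
    """Random part: draw one Y_i ~ N(m_i, s^2) for each covariate value."""
    rng = random.Random(seed)
    return [rng.gauss(expected_value(a, b, x), s) for x in xs]


# Hypothetical covariate values and parameters, for illustration only.
years = [1, 2, 3, 4, 5, 6, 7]

# Given the parameters, the means are known exactly and never change.
means = [expected_value(200.0, 90.0, x) for x in years]

# The data are random: two draws from the same model give different values.
pages_1 = simulate(200.0, 90.0, 50.0, years, seed=1)
pages_2 = simulate(200.0, 90.0, 50.0, years, seed=2)
```

Running `simulate` twice with the same parameters but different seeds gives two different data sets scattered around the same straight line, which is all that "the data are random, the mean is deterministic" amounts to.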

More complex analyses also have a model underneath them: the usual ANOVA actually has a model that is almost the same. The mean can be a function of several parameters, and might not be the mean, but just a parameter that controls the likelihood. The variance might also be modelled. And the parameters that are modelled could themselves be functions of more parameters. And all these parameters might themselves be random. Yes, things can get complex.

The problem, then, is to estimate the parameters. There are several schemes for doing this. For the model above, the two commonly used schemes are the frequentist and the Bayesian approaches. Each of these gives us a way to estimate the parameters. They are, in essence, philosophical approaches to understanding what we mean by probability and randomness. In the frequentist scheme, what we observe is random, with fixed parameters, and hence any statistic we calculate (such as the slope of the straight line) is a random function of the data. In contrast, the Bayesian approach is to say that anything we are uncertain about is random. Hence, we treat the data as fixed (because we know them: we have observed them), and the parameters are random.
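The frequentist idea that a statistic is "a random function of the data" can be made concrete by simulation. The sketch below assumes some made-up fixed parameters, generates many data sets from the same model, and calculates the slope for each one using the usual straight-line formula. The names `fitted_slope` and `slopes_over_repeats` are mine, invented for this example.

```python
import random
import statistics


def fitted_slope(xs, ys):
    """Slope of the fitted straight line through the points (xs, ys)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def slopes_over_repeats(a, b, s, xs, n_repeats, seed=0):
    """Simulate n_repeats data sets with fixed a, b, s; return each slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        ys = [rng.gauss(a + b * x, s) for x in xs]
        slopes.append(fitted_slope(xs, ys))
    return slopes


# Hypothetical "true" parameters: they stay fixed, only the data change.
xs = list(range(1, 11))
slopes = slopes_over_repeats(a=200.0, b=90.0, s=50.0, xs=xs, n_repeats=2000)

print(statistics.fmean(slopes))  # centred near the fixed slope of 90
print(statistics.stdev(slopes))  # how much the estimate varies between data sets
```

The parameters never change between repeats, yet the calculated slope does, because each repeat has different data. In real life we only get one data set, of course, which is why we have to reason about this variation rather than observe it.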

Both of these philosophical schemes are useful in that they then can be used to construct methods for estimating the parameters of the mathematical model. Both use the likelihood, i.e. the probability of the data given values of the parameters, albeit in different ways. For example, the frequentist scheme tells us that we should find the values of the parameters which give us the maximum value of the likelihood. Hence, this is often spoken about as a maximum likelihood approach (there is another philosophical scheme which leads to the same equations, but to me it always looks like asking Fisher to wave his fiducial wand to make everything mysteriously right. I assume I'm missing something important).
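Here is a rough sketch of how the same likelihood gets used in the two schemes. To keep it small I have made some simplifying assumptions that are not part of the discussion above: the intercept and standard deviation are treated as known, only the slope is estimated, the candidate slopes sit on a grid, and the Bayesian prior is flat over that grid. The data are invented. (Strictly, for continuous data the likelihood is a probability density rather than a probability.)

```python
import math


def log_likelihood(a, b, s, xs, ys):
    """Log density of the data given a, b and s, for Y_i ~ N(a + b*X_i, s^2)."""
    n = len(xs)
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * s ** 2) - rss / (2 * s ** 2)


def grid_summary(xs, ys, a, s, b_grid):
    """Use one likelihood two ways over a grid of candidate slopes."""
    log_liks = [log_likelihood(a, b, s, xs, ys) for b in b_grid]

    # Frequentist use: take the slope that maximises the likelihood.
    b_ml = b_grid[log_liks.index(max(log_liks))]

    # Bayesian use: likelihood times a flat prior, rescaled to sum to one.
    # Subtracting the maximum first avoids numerical underflow in exp().
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    posterior = [w / total for w in weights]
    return b_ml, posterior


# Invented data, roughly following y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.8, 14.1]
b_grid = [i / 100 for i in range(200, 401)]  # candidate slopes 2.00 to 4.00

b_ml, posterior = grid_summary(xs, ys, a=2.0, s=0.5, b_grid=b_grid)
```

The maximum likelihood answer is a single value, `b_ml`, whereas the Bayesian answer is a whole distribution over the candidate slopes. With a flat prior the peak of that distribution sits at the same place as the maximum likelihood estimate. With an informative prior it generally would not.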

Estimating the parameters is not enough, though. We also want to know how reliable the estimates are. If we estimate that J.K. Rowling writes an extra 300 pages every year, it makes a difference whether the range of possible values is 290-310 or 0-600. In the former case, we have a pretty good idea how much time to book off work to read her latest book: in the latter, we have little idea about whether we need to have a headache or major heart surgery. We can summarise the reliability in a variety of ways: a popular one is the probability that the estimate is less than zero. Alternatives include giving a range of likely values (typically a range for which there is a 95% probability that the estimate is within it), or a statistic such as a standard deviation to summarise the variability in the estimate (in the frequentist scheme, this standard deviation is called the standard error).
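As a sketch of the last two summaries, the following calculates a slope, its standard error and an approximate 95% range for the straight-line model. The page counts are invented, and the function name is mine. I have also used a normal approximation for the range, because Python's standard library has no t distribution. With only a handful of data points the usual t-based range would be somewhat wider.

```python
import math
import statistics


def slope_with_uncertainty(xs, ys, level=0.95):
    """Return the slope, its standard error and an approximate interval."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = sxy / sxx            # estimated slope
    a = y_bar - b * x_bar    # estimated intercept

    # Residual variance, dividing by n - 2 because two parameters were fitted.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)

    se = math.sqrt(s2 / sxx)  # standard error of the slope

    # Normal approximation to the interval (an assumption; see the text above).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return b, se, (b - z * se, b + z * se)


# Invented cumulative page counts by year, for illustration only.
years = [1, 2, 3, 4, 5, 6]
pages = [310, 590, 905, 1180, 1520, 1790]

b, se, (low, high) = slope_with_uncertainty(years, pages)
print(f"slope {b:.1f}, standard error {se:.1f}, range {low:.1f} to {high:.1f}")
```

A small standard error gives a narrow range, like the 290-310 case, and a large one gives a wide range, like the 0-600 case, even when the estimated slope itself is identical.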

In my next post, I will describe the model fitting and model selection, i.e. how these schemes are used in practice to make inferences. I suspect this will include a rant about p-values.

Before I stop to work myself up into a frenzy, one last point is worth mentioning. The approaches discussed above assume that the data are normally distributed. If we discard this assumption, we can use different approaches. One, called least squares, is based on optimising a property of the model and data (i.e. minimising the sum of the squares of the differences between the data and the expected values). It turns out that this gives the same estimates as maximising the likelihood for the normal model above, i.e. it is mathematically the same as the frequentist approach. I, therefore, use the frequentist interpretation of the equations, and it gets round a lot of fiddly problems. Another, called non-parametric statistics, does away with any distributional assumption, but also throws away part of the data: rather than use the data themselves, the analysis concentrates on using the ranks of the data (i.e. whether each $Y_i$ is the smallest, second smallest etc.). In some cases, I think this can still have a likelihood interpretation, but I must admit that I haven't checked: it's just a hunch.
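For anyone who wants to see why least squares and maximum likelihood agree for the straight-line model, here is a short derivation. It assumes the data points are independent, normally distributed and share a common variance $s^2$. The symbol $n$, for the number of data points, is my addition. The log of the likelihood is

$$
\log L(a, b, s^2) = -\frac{n}{2}\log(2\pi s^2) - \frac{1}{2 s^2}\sum_{i=1}^{n} (Y_i - a - b X_i)^2 .
$$

For any fixed $s^2$, the first term does not involve $a$ or $b$, and the second term is a negative constant times the sum of squares. So the values of $a$ and $b$ that make the likelihood as large as possible are exactly the values that make $\sum_{i=1}^{n} (Y_i - a - b X_i)^2$ as small as possible, which is the least squares criterion.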