Showing posts with label statistics.

Tuesday, 22 January 2008

Gender Differences: Need More Data!


Blogging on Peer-Reviewed Research
I was asked about this last week by a colleague, and now it's hit the blogosphere, so I thought I would publicly leap into a dispute about sexism in science. And make a plea for people to actually look at their data.

This was all started by a group of biologists who have been working at NCEAS on publication biases in ecology (the biggest bias is, of course, that not enough of my papers get accepted straight away). They managed to get their latest results published in TREE.

The received wisdom is that there is a bias against women in science. One area where this might be seen is in acceptance of papers for publication – referees and editors might have a bias (conscious or subconscious) against women. If this is true, the proportion of papers published by women should be higher in journals where the gender of the author is not known.

For this and other reasons there have been suggestions floating around that journals shift to a system of double-blind reviews. At the moment most journals have single blinding: the authors' identities are known to the referees, but the referees' identities are not revealed to the authors (unless the referees wish to do so). In double blinding, the referee doesn't know the identity of the author. Hence, any bias due to gender of the author should be removed. So, if a journal shifts from single blinding to double blinding, the proportion of papers by female authors should increase.

In 2001 the journal Behavioural Ecology moved to double blinding. But did this change the proportion of female authors? Or, more exactly, was there a bias against women that was removed? After all, the proportion of female authors might be changing in the rest of science – the null expectation is that the change in Behavioural Ecology should be the same as in similar journals, rather than that there should be no change at all. So, the group also gathered data on the number of papers by male and female first authors, from before and after Behavioural Ecology switched to double blinding, for five similar journals. And then they compared the change in the proportion of female authors in Behavioural Ecology to that in the other journals.

Err, no.

What they did was to compare the change in the proportion of female authors in each journal to zero. What they found was that Behavioural Ecology and Biological Conservation had increases that were significantly different from zero, but the other journals did not. They therefore concluded that there was an effect of double blinding, and that the increase in Biological Conservation must have been due to other factors. Oddly, though, at no point did they seem to make a direct comparison between the journals. It is not clear that they looked at the data either. Had they done so, they would have seen this:



The lines show the change from before Behavioural Ecology went double blind to afterwards. The vertical lines are the standard errors. Behavioural Ecology is the thick black line. We see that the proportion of female authors increases in all of the journals, but also that the increase is greatest in Behavioural Ecology. But is that increase significantly (in any sense) greater than in the other journals? Well, comparing it to zero obviously inflates the apparent significance, because the other journals are all increasing too.


We can get an idea of whether the data show anything with a more focussed analysis. This is also simplified, as I am ignoring some variation, but a more sophisticated analysis (= too much hassle to explain) comes to the same conclusion (and yes, for those who have read the paper, so does including the "don't knows").

What we can do is calculate the difference between the before and after proportions of female authors for the "control group" of journals, and estimate the distribution of differences that would be expected if double blinding had not been implemented. Then we can ask whether the difference in proportion for Behavioural Ecology falls so far outside this distribution that it is unlikely to be explained by the same background change.

These are the differences:

Journal | Percentage Before | Percentage After | Difference (%)
Behavioural Ecology | 23.7 | 31.6 | 7.9
Behavioral Ecology & Sociobiology | 25.1 | 26.3 | 1.3
Animal Behaviour | 27.4 | 31.6 | 4.2
Biological Conservation | 13.8 | 20.6 | 6.8
Journal of Biogeography | 14.4 | 16.5 | 2.0
Landscape Ecology | 19.5 | 23.4 | 3.9


For the five control journals (i.e. everything except Behavioural Ecology), the mean difference is 3.65%, with a standard deviation of 2.15%. If these values were exact, there would be a 95% chance that the change for another, similar journal would be between -0.6% and 7.9%. So, Behavioural Ecology is right on the edge.

But this assumes that the variance is known. In reality it is estimated, and estimated from only 5 data points (i.e. not a lot). If we take this into account, we find that the prediction for a new journal would fall between -2.3% and 9.6% (with 95% probability). Now Behavioural Ecology is comfortably inside the limits. Even someone wanting to do a one-sided test will find it inside.
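Here is a quick R sketch of that calculation, using the rounded differences from the table above (the mean comes out fractionally different from the 3.65% quoted, which was presumably computed from the unrounded data):

# Differences (after minus before) for the five control journals
diffs <- c(1.3, 4.2, 6.8, 2.0, 3.9)
m <- mean(diffs)   # ~3.6%
s <- sd(diffs)     # ~2.15%

# Treating the mean and standard deviation as exactly known (normal quantiles):
m + c(-1, 1) * qnorm(0.975) * s        # roughly -0.6% to 7.9%

# Acknowledging that they are estimated from only 5 points (t on 4 df):
m + c(-1, 1) * qt(0.975, df = 4) * s   # roughly -2.3% to 9.6%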

So, the analysis shows little evidence for any effect of double blinding. But there are a couple of caveats, which could push in opposite directions. The first is simply that there is not a lot of data – only 6 data points. We would really need more journals to be able to come to any firm conclusion. In particular, there may have been other changes at Behavioural Ecology that could have had an effect.

The second caveat is more subtle. Suppose you were a journal editor, and you introduce a rule that authors have to admit in their acknowledgements that statisticians are the highest form of life. After a couple of years, you notice that the proportion of authors called Fisher has increased. You wonder if this is because of the new rule. So, you compare it with other journals, and find no increase there. You therefore declare that Fishers appreciate statisticians, but other people don't. But what about all those other effects you didn't look for? What about the changes in the numbers of Boxes, Coxes, and Nelders? Humans are very good at detecting patterns, but very bad at judging whether they are random. And using the same data from which you spotted a pattern to assess whether it is real is naughty – of course you're going to see an effect, because you've already noticed it in the mass of all possible things that could have happened. Now, I don't know if the authors are guilty here – they don't say how they came to decide to examine this particular aspect of the data, but the introduction is a bit arm-wavy about the effect of double blinding on the sex ratio of authors.

Of course, the solution to both caveats is simple – get more data. Anyone fancy trawling through the literature this weekend?

EDIT: Oops. I should have hat-tipped Grrlscientist for her post, which encouraged me to write this. Hedwig - I'm sorry. Please don't set Orpheus onto me...

Reference
Budden, A., Tregenza, T., Aarssen, L., Koricheva, J., Leimu, R. & Lortie, C. (2008). Double-blind review favours increased representation of female authors. Trends in Ecology & Evolution, 23(1), 4-6. DOI: 10.1016/j.tree.2007.07.008

Monday, 21 January 2008

Damn Lies in So Many Languages

The ISI Glossary of Statistical Terms

Isn't this just great? You too can find out what the Afrikaans is for heteroscedasticity.

Seriously, this is the sort of resource that is really useful for a small number of people. The web is ideal for it. If only it were easy to find with Google.

I wanted to use the site to find out what the Finnish for "ordinal regression" is. Of course, it's one of the few terms they don't have. So, please tell me! Or tell me what it is in any other language that might be useful - help the ISI!



Monday, 31 December 2007

A physicist stumbles into a statistical field


By a virtual game of Chinese Whispers (in which no Chinese were involved), I found out about a paper on arΧiv where a poor unsuspecting physicist wanders into a curious corner of statistics. I'm actually something of a bystander in this area, but that's not going to stop me commenting on it.

OK, so the paper is by a guy called Bruce Knuteson, from MIT. He's interested in working out the scientific worth of a piece of empirical work, and being a physicist, he wants to measure it.

So, the first problem is to decide what worth is. Knuteson decides to measure it in terms of "surprisal", i.e. how surprised we are by a result. So, if we collected data and got a result (say, a measurement of a parameter) x_i, how shocked would we be by it? From this, Knuteson decides that ...


The scientific merit of a particular result x_i should thus (i) be a monotonically decreasing function of the expectation p(x_i) that the result would be obtained, and (ii) be appropriately additive.


and so suggests -log(Pr(x_i)) as a measure, as it has these properties. He then suggests that the worth of an experiment can be estimated as the expected value of this, i.e. the sum of -Pr(x_i) log(Pr(x_i)) over the possible outcomes. This is a measure called entropy: something beloved of physicists and engineers, but rather opaque to the rest of us. The idea is that a larger entropy will mean that the experiment is better - we will expect more surprising results.

But is this a good measure? Perhaps a good way of tackling this is to view it as a problem in decision theory. How can we decide what is the best course of action to take when we are uncertain what the results will be? For example, if we have a choice of experiments we can carry out, how can we decide which one to do? To do this we first need to define "best". This has to be measured, and the numerical value for each outcome is called the utility, U. This might, for example, be the financial gain or loss (e.g. if we are gambling), or might be something more prosaic, like one's standing in the scientific community (however that is measured. h-index?). All the effects of each action, both positive and negative, go into this number. So, for example, we would include the gain in prestige from publishing a good paper, and the cost (e.g. financial, or the effect on our notoriety if the results are a turkey). The second part of the decision analysis is to give a probability for each outcome, so for action A the probability might be 0.3 that we get a Nature paper, and 0.7 that we get a Naturens verden paper. For action B it might be 0.9 and 0.1 respectively. We then calculate the average utility for each action, i.e. sum the probability of each result multiplied by the utility for that result.

This is what Knuteson does to get his measure. The problem is that his only utility is surprisal, and in general this doesn't make sense. Two things are missing. Firstly, there is no cost element. So, if we want to measure the time it takes an apple to fall on a physicist's head, it makes no difference whether we pay a couple of students $1 or £30,000,000 to do it. The second problem is that there is no measure of scientific worth. Finding out how the next toss of a €1 coin lands is treated exactly the same as finding out whether the Higgs boson is green.

This leads to clearly nonsensical results. If there are only two possible outcomes of an experiment, then the maximum expected surprisal occurs when the probability of one outcome is 0.5. Therefore the optimal experiment is one with this property. For example, tossing a €1 coin. According to Knuteson, then, we should fund lots of coin-tossing experiments (hmm, there's an Academy of Finland application deadline coming up).
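A quick R sketch of the expected surprisal (entropy) of a two-outcome experiment, confirming that it is maximised when the two outcomes are equally likely:

# Expected surprisal (entropy, in nats) of an experiment with two outcomes
entropy2 <- function(p) -(p * log(p) + (1 - p) * log(1 - p))

p <- seq(0.01, 0.99, by = 0.01)
plot(p, entropy2(p), type = "l", xlab = "Pr(outcome 1)", ylab = "Expected surprisal")
p[which.max(entropy2(p))]   # 0.5: the fair coin toss is the "best" experiment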

The second issue is where the probabilities come from. These are probabilities of outcomes that have not been observed, so in general they cannot be measured (without doing the experiment...). Therefore one has to assign them based on one's subjective opinion. Now we are on familiar Bayesian ground, and this is something that has been argued about for years. But here I think Knuteson can use a sneaky trick to sidestep the problems. Put simply, he could argue that in practice the estimation of merit is made by people, so they can assign their own probabilities. If someone else disagrees, fine. This way, it is clearer where the disagreement lies (e.g. which probabilities are being assigned differently).

So, estimating the merit of a piece of work before it is done can be problematic (and I haven't touched on comparing experiments with different numbers of possible outcomes!). But Knuteson develops his ideas even further. How about, he asks, working out the merit of an experiment after it has been done?

Before doing this, Knuteson sorts out a little wrinkle. It is not generally the experimental results themselves that are of interest - it is how they impact on our understanding of the natural world. We can think of this as having several models, M1, M2, ..., Mk, for the world (these might correspond to theories, e.g. that the world is flat, or that it is a torus). The worth of an experiment could then be measured in terms of how much it changes what we learn about these models, i.e. how the probability of each Mj changes with the data, x_i. This can simply be measured as the entropy over the models, the Mj's, rather than over the experimental outcomes.

Knuteson goes through the maths of this, and finds that the appropriate measure of the merit of an experiment is a measure of how far the probabilities of each model are shifted by the experiment. To be precise, it is a measure known as the Kullback-Leibler divergence (I will spare you the equations). Now, this again is something that is familiar. A big problem in statistics is deciding which model is, in some sense, best. This can be done by asking how well a model will predict an equivalent data set to the one being analysed. After going through a few hoops, we find that the appropriate tool is the K-L divergence between the fitted model and the "true" model. Of course, we don't know the true model, but there are several tweaks and approximations that can be made so that we don't need to - it is the relative divergence of different possible models that is important. The result of this is a whole bunch of criteria that are all TLAs ending in IC - AIC, BIC, DIC, TIC, and several CICs.

The optimal experiment is the one which will maximise the difference between our prior and posterior probabilities of the different models (yes, Bayesian again). The idea is natural - the greater the difference, the more we have learned, and hence the better the experiment. Of course, we still have the same problems as above, i.e. assigning the probabilities and getting the utility right, but we are in the right area. Indeed, it turns out (after browsing wiki) that the idea is not original - the method proposed by Knuteson is the same as something called Bayesian D-optimality. And (after reading the literature), the idea goes back to 1956!
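Here is a minimal R sketch of the idea, with made-up prior and posterior model probabilities (the numbers are purely illustrative):

# Kullback-Leibler divergence from the prior to the posterior model probabilities
kl <- function(post, prior) sum(post * log(post / prior))

prior <- c(M1 = 0.5, M2 = 0.3, M3 = 0.2)   # beliefs before the experiment
post  <- c(M1 = 0.1, M2 = 0.2, M3 = 0.7)   # beliefs after seeing the data
kl(post, prior)   # the bigger this is, the more the experiment shifted our beliefs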

So, does this help? For the general problem of estimating scientific merit, I doubt it. There are too many problems with the measure. It may be useful for structuring thinking about the problem, but in that case it is little different from using a decision analytic framework.

In experimental design it is of more use, but then the idea is not original. The other area where it might be useful is in summarising the worth of an experiment for estimating a parameter, such as the speed of light. There will be cases where physical constants have to be measured. Previous measurements can then be used to form the prior (there are standard meta-analysis methods for this), and then the K-L divergence of several experiments can be calculated, to see which gives the largest divergence. This is some way from the ideas Knuteson is thinking about (he explicitly rejects estimating parameters as being of merit!). But I think more grandiose schemes will die because of naysayers like me nagging at the details.

Reference
Lindley, D.V. (1956). On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, 27, 986-1005.

Wednesday, 5 December 2007

Focus on DIC


This is something that's only of interest to Bayesians, so the rest of you can look away.
In a couple of analyses with BUGS, I've seen comparisons of DICs from different models where the values were almost exactly the same. This sort of thing is suspicious, and eventually I worked out why. I thought it was worth posting about this, so that everyone else can share The Secret.

It's best to explain the problem with an example (the complete R code for this is below). I simulated data from a couple of simple hierarchical models:

Y_ij ~ N(θ_i, σ_y²)
θ_i ~ N(μ_i, σ_θ²)

(j = 1,...,n; i = 1,...,m: there are m groups, each with n observations). I then had two models for μ_i:

Model 1
μ_i = φ + β (i - 5.5)

Model 2
μ_i = φ

The first model has a covariate (cunningly equal to the identity of the group), and the second has none. The data are plotted below. The effect of the covariate is clear, so DIC should be able to pick it up.


Now, I fit each of the models to each data set. This is easy to do by running it through the R2WinBUGS package in R. From this I can extract the DIC:

Data 1
Model 1 DIC: 1464.0 pD: 10.9
Model 2 DIC: 1464.4 pD: 11.2

Data 2
Model 1 DIC: 1407.7 pD: 10.9
Model 2 DIC: 1407.7 pD: 10.9

So, in both cases the DIC is the same (for Data 2 the difference is in the third decimal place!). But for Data 1, Model 1 should be better - hey, we can see it on the figure! So, what's going on?

We can get a clue from plotting the posteriors for μi from the groups, from the two models:


Only the error bars are plotted (i.e. plus/minus 1 posterior standard deviation), and the 1:1 line is drawn in a fetching shade called "hotpink". Obviously the two models are predicting the same means for the groups, and hence we get the same deviance. We can see why this is happening from the group-level standard deviations (σ_θ; posterior means, with standard errors in parentheses):

Data 1
Model 1: 0.87 (0.280)
Model 2: 4.4 (1.31)

Data 2
Model 1: 1.1 (0.34)
Model 2: 1.1 (0.31)

So, for the data where there is a trend but none is fitted, σ_θ is much larger - essentially, the lack of the linear trend is compensated for by an increase in the group-level variance. The difference is not in the model for θ at all, but higher in the hierarchy (or hier in the higherarchy?).

Of course, this is obvious from looking at the models. The solution is to change the focus from θ to φ and β. This means calculating the marginal deviance, marginalising over θ, i.e. looking at P(Y | φ, β) and integrating over the distribution of θ. This can be done analytically (hat-tip to David Spiegelhalter for correcting my errors, and for refraining from making any justified comments about my mathematical ability!), whence we find that the deviance can be calculated because

Ȳ_i. ~ N(μ_i, σ_y²/n + σ_θ²)

When we do this, we get these results:

Data 1
Model 1 DIC: 26.9 pD: 2.7
Model 2 DIC: 57.8 pD: 1.7

Data 2
Model 1 DIC: 30.9 pD: 2.6
Model 2 DIC: 30.3 pD: 1.7


Now this makes more sense: for the data with an effect, the DIC massively favours the correct model. Without the effect in the data, the DICs are pretty similar. In both cases, also note that pD is larger by about 1 for the model with one extra parameter. Which is what should happen!

What lessons can we draw from this? Firstly, that DIC is not an automatic panacea - you do have to focus it on the right part of the model. If the focus is not at the level immediately above the data (i.e. θ here), then you can't use the DIC given by BUGS. The correctly focussed DIC is more complex to get at: you have to calculate it yourself. For more complex models this might be awkward: if there are no analytical results, then the parameters to be integrated out have to be simulated, for example by MCMC. But this requires some further thought...



library(R2WinBUGS)
# Simulate data
Cov=1:10
Group = rep(Cov, each=50)
beta = 1
# With covariate
GrpMean1=rnorm(length(Cov), beta*Cov, 1)
Value1=rnorm(length(Group), GrpMean1[Group], 1)
DataToBUGS1=list(N=length(Value1), NGrp=length(Cov), Group=Group, Y=Value1)
# Without covariate
GrpMean2=rnorm(length(Cov), 5.5, 1)
Value2=rnorm(length(Group), GrpMean2[Group], 1)
DataToBUGS2=list(N=length(Value2), NGrp=length(Cov), Group=Group, Y=Value2)

# Plot the data
# png("C:/Bob/Blog/Data.png", width = 960, height = 480)
par(mfrow=c(1,2), mar=c(2.1,2.1,1.1,1.1), oma=c(2,2,0,0), las=1)
plot(jitter(Group), Value1, pch=3, col="grey70")
points(Cov, GrpMean1, col=1, pch=3, cex=1.5)

plot(jitter(Group), Value2, pch=3, col="grey70")
points(Cov, GrpMean2, col=1, pch=3, cex=1.5)
mtext("Group", 1, outer=T)
mtext("Y", 2, outer=T)
#dev.off()

# Write BUGS models to files
model1 <- function(){
  for (i in 1:N){ Y[i] ~ dnorm(muGrp[Group[i]], tau.y) }
  for (j in 1:NGrp){
    muGrp[j] ~ dnorm(muG[j], tau.Grp)
    muG[j] <- mu0 + betaGrp*(j-5.5)
  }
  mu0 ~ dnorm (0.0, 1.0E-6)
  betaGrp ~ dnorm (0.0, 1.0E-6)
  tau.y <- pow(sigma.y, -2)
  sigma.y ~ dunif (0, 1000)
  tau.Grp <- pow(sigma.Grp, -2)
  sigma.Grp ~ dunif (0, 1000)
}
write.model(model1, "C:/Bob/Blog/model1.txt")

model2 <- function(){
  for (i in 1:N){ Y[i] ~ dnorm(muGrp[Group[i]], tau.y) }
  for (j in 1:NGrp){ muGrp[j] ~ dnorm(mu0, tau.Grp) }
  mu0 ~ dnorm (0.0, 1.0E-6)
  tau.y <- pow(sigma.y, -2)
  sigma.y ~ dunif (0, 1000)
  tau.Grp <- pow(sigma.Grp, -2)
  sigma.Grp ~ dunif (0, 1000)
}
write.model(model2, "C:/Bob/Blog/model2.txt")

# Initial values
Inits1=list(list(mu0=0, betaGrp=1, sigma.Grp=5, sigma.y=1), list(mu0=2, betaGrp=0, sigma.Grp=5, sigma.y=1) )
Inits2=list(list(mu0=0, sigma.Grp=5, sigma.y=1), list(mu0=2, sigma.Grp=5, sigma.y=1) )

# Fit the models
# Data 1, Model 1
Data1Model1.post=openbugs(DataToBUGS1, inits=Inits1, c("mu0", "muGrp", "betaGrp", "sigma.y", "sigma.Grp"), model.file = "C:/Bob/Blog/model1.txt", n.chains = 2, n.iter = 11000, n.burnin = 1000, n.thin = 10)
# Data 1, Model 2 (no slope in model)
Data1Model2.post=openbugs(DataToBUGS1, inits=Inits2, c("mu0", "muGrp", "sigma.y", "sigma.Grp"), model.file = "C:/Bob/Blog/model2.txt", n.chains = 2, n.iter = 11000, n.burnin = 1000, n.thin = 10)
# Data 2, Model 1
Data2Model1.post=openbugs(DataToBUGS2, inits=Inits1, c("mu0", "muGrp", "betaGrp", "sigma.y", "sigma.Grp"), model.file = "C:/Bob/Blog/model1.txt", n.chains = 2, n.iter = 11000, n.burnin = 1000, n.thin = 10)
# Data 2, Model 2 (no slope in model)
Data2Model2.post=openbugs(DataToBUGS2, inits=Inits2, c("mu0", "muGrp", "sigma.y", "sigma.Grp"), model.file = "C:/Bob/Blog/model2.txt", n.chains = 2, n.iter = 11000, n.burnin = 1000, n.thin = 10)  # model 2 has no betaGrp node, so it is not monitored
print(Data1Model1.post)
print(Data1Model2.post)
print(Data2Model1.post)
print(Data2Model2.post)

# Plot mu
# png("C:/Bob/Blog/Theta.png", width = 960, height = 480)
par(mfrow=c(1,2), mar=c(2.1,2.1,1.1,1.1), oma=c(2,2,1,0), las=1)
plot(Data1Model1.post$mean$muGrp, Data1Model2.post$mean$muGrp, type="n", main="Data 1")
segments(Data1Model1.post$mean$muGrp,Data1Model2.post$mean$muGrp-Data1Model2.post$sd$muGrp, Data1Model1.post$mean$muGrp, Data1Model2.post$mean$muGrp+Data1Model2.post$sd$muGrp)
segments(Data1Model1.post$mean$muGrp-Data1Model1.post$sd$muGrp, Data1Model2.post$mean$muGrp, Data1Model1.post$mean$muGrp+Data1Model1.post$sd$muGrp, Data1Model2.post$mean$muGrp)
abline(0,1, col="hotpink")

plot(Data2Model1.post$mean$muGrp, Data2Model2.post$mean$muGrp, type="n", main="Data 2")
segments(Data2Model1.post$mean$muGrp, Data2Model2.post$mean$muGrp-Data2Model2.post$sd$muGrp, Data2Model1.post$mean$muGrp, Data2Model2.post$mean$muGrp+Data2Model2.post$sd$muGrp)
segments(Data2Model1.post$mean$muGrp-Data2Model1.post$sd$muGrp, Data2Model2.post$mean$muGrp, Data2Model1.post$mean$muGrp+Data2Model1.post$sd$muGrp, Data2Model2.post$mean$muGrp)
abline(0,1, col="hotpink")
mtext(expression(paste(theta[i], ", Model 1", sep="")), 1, outer=T)
mtext(expression(paste(theta[i], ", Model 2", sep="")), 2, outer=T, las=0)
#dev.off()

#######################################################################
# Deviance with the focus on theta: P(Y | theta)
DevCalc1=function(mcmc, data) {
  NamesGrp=paste("muGrp[", data$Group, "]", sep="")
  -2*sum(dnorm(data$Y, mcmc[NamesGrp], mcmc["sigma.y"], log=TRUE))
}
# Marginal deviance with the focus on phi, integrating theta out (Model 2, no slope)
DevCalc2=function(mcmc, data) {
  Mean=unlist(tapply(data$Y, list(data$Group), mean))
  Mu=mcmc["mu0"]
  N=length(data$Group)/length(unique(data$Group))
  -2*sum(dnorm(Mean, Mu, sqrt((mcmc["sigma.y"]^2)/N + mcmc["sigma.Grp"]^2), log=TRUE))
}
# As DevCalc2, but including the slope term (Model 1)
DevCalc2beta=function(mcmc, data) {
  Mean=unlist(tapply(data$Y, list(data$Group), mean))
  Mu=mcmc["mu0"] + mcmc["betaGrp"]*(as.numeric(names(Mean)) - 5.5)
  N=length(data$Group)/length(unique(data$Group))
  -2*sum(dnorm(Mean, Mu, sqrt((mcmc["sigma.y"]^2)/N + mcmc["sigma.Grp"]^2), log=TRUE))
}
DIC.calc=function(dat, mcmc, func) {
  Dev=apply(mcmc$sims.array, c(1,2), func, data=dat)  # deviance at each MCMC draw
  postMean=apply(mcmc$sims.array, 3, mean)            # posterior means of the parameters
  pD=mean(Dev) - func(postMean, dat)                  # effective number of parameters
  DIC=mean(Dev) + pD
  return(list(DIC=DIC, pD=pD, Dbar=mean(Dev)))
}
PrintDIC=function(DIC, Label) { cat(Label, " DIC:", DIC$DIC, " pD:", DIC$pD, "\n") }

# Focus on theta
DIC.D1M1=DIC.calc(DataToBUGS1, Data1Model1.post, DevCalc1)
DIC.D1M2=DIC.calc(DataToBUGS1, Data1Model2.post, DevCalc1)
DIC.D2M1=DIC.calc(DataToBUGS2, Data2Model1.post, DevCalc1)
DIC.D2M2=DIC.calc(DataToBUGS2, Data2Model2.post, DevCalc1)
PrintDIC(DIC.D1M1, "Data 1, Model 1")
PrintDIC(DIC.D1M2, "Data 1, Model 2")
PrintDIC(DIC.D2M1, "Data 2, Model 1")
PrintDIC(DIC.D2M2, "Data 2, Model 2")

# Focus on phi
DIC2.D1M1=DIC.calc(DataToBUGS1, Data1Model1.post, DevCalc2beta)
DIC2.D1M2=DIC.calc(DataToBUGS1, Data1Model2.post, DevCalc2)
DIC2.D2M1=DIC.calc(DataToBUGS2, Data2Model1.post, DevCalc2beta)
DIC2.D2M2=DIC.calc(DataToBUGS2, Data2Model2.post, DevCalc2)
PrintDIC(DIC2.D1M1, "Data 1, Model 1")
PrintDIC(DIC2.D1M2, "Data 1, Model 2")
PrintDIC(DIC2.D2M1, "Data 2, Model 1")
PrintDIC(DIC2.D2M2, "Data 2, Model 2")



Saturday, 29 September 2007

My Week in Pictures


I had a good week work-wise: I got a few things to work. I also realised that I can pretty much summarise it in pictures.

The first is from something I was playing with the previous Friday afternoon, seeing if I could get the ODE solver in BUGS to work. For this I fitted the Lotka-Volterra equations to some famous data on lynx-hare cycles from the Hudson's Bay Company. I got it working this week, and the model fits reasonably well:


I'm happy with the lynx fit, but the hares aren't great - I might have to add carrots into the model. But considering how simple the model is, it's pretty good.

The second result is something I've been working on for some time, a multi-trait QTL analysis. This is genetics - trying to locate genes that affect different traits. We've been looking at models that can include several traits in the same analysis. Here I plotted what is roughly the probability that each locus (on the x-axis) affects each trait (each line).

Locus 12 seems to be affecting several traits - yes, we have pleiotropy! The numbers aren't totally convincing, but it looks like there's something there.

At the same time, I've also been doing another predator-prey analysis, on voles and weasels. That worked as well, and this is a plot showing that the interactions are local:

The idea is that if the circles overlap, there is a correlation between the sites. Actually, this interpretation may be a bit dodgy, but the main thing that comes out of this is that most of the sites are behaving independently.

Finally, this Friday's afternoon playing around involved more ODEs, modelling the numbers of butterflies over a season. This was fairly quick to fit, although it could behave better.

It also needs a bit more work on it (hey, it was only about an hour's work!), because there is more variation than the model accounts for.

Other than showing I had a good, interesting week, does this really say anything? One thing is that none of the graphs will look like this in their final versions - these are all produced as part of the process of doing the analysis. But "quick and dirty" graphs are still very useful for looking at the results, and getting a feel for what's happening. It also shows the worth of having packages with good, flexible graphing functions - two of the plots are standard BUGS plots, the other two are drawn in R. Plots are also an easy way of showing collaborators what you've done - I've drawn several other plots this week, and not all of them will end up in the final write-ups of the work.

Now I just have to write all this up.

Wednesday, 1 August 2007

Statistical Modelling: The Bits Pt. 2

In my last post, I outlined the ideas of statistical inference: that we have a model, and we have a philosophical scheme that tells us what maths to do to estimate the parameters of the model. But this is not the same as doing this in practice. The equations are often complex, and difficult or impossible to solve (anyone fancy trying to integrate in 1000 dimensions?). Because of this, a lot of methods have been devised for estimating the parameters, and particularly estimating their variability (which, from now on, I will write about as estimating the standard error).

For maximum likelihood, the problem of getting the estimates is an optimisation problem: find the set of parameters that give the largest likelihood. For the simple linear regression, this is easy, because the equations were worked out a couple of hundred years ago, by some French mathematicians trying to work out where Paris was. For more complex models, such as generalised linear models, there are algorithms that have been shown to efficiently iterate towards the ML estimates. In more complex models, more general or complicated search algorithms might be needed (for example, there is the EM algorithm which is useful for dealing with missing data. So useful that the missing data is sometimes invented, in order to use it).

These algorithms only give point estimates. We also want to know the standard error, and there are methods for doing this. Again, it may be that the equations can be derived directly, or can be found with a simple numerical search. For less standard problems, more complex algorithms can be used. Two popular ones are called the bootstrap and the jackknife. They both work by re-sampling the data, and using that variation to estimate the standard error. What is less well known is that they can also be used to estimate the bias in the point estimates.
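A minimal sketch of the bootstrap idea, estimating the standard error of a mean using nothing but base R (made-up data; real applications would usually bootstrap something less trivial, e.g. via the boot package):

# Bootstrap estimate of the standard error of a mean (made-up data)
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)

# Resample the data with replacement many times, recomputing the statistic each time
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(boot.means)           # bootstrap estimate of the standard error
sd(x) / sqrt(length(x))  # the usual analytical estimate, for comparison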

Even more complex problems have even more complex solutions. For Bayesians, the most common method for fitting their models is a technique called MCMC. This estimates the parameters (actually their full distribution) by simulation. The technique is not simple to explain, but it is used because it is very flexible and for a lot of problems works well in practice. However, this is not the same as a Bayesian analysis: MCMC can be used by frequentists as well, and other complicated methods can also be used to fit Bayesian models (e.g. sequential importance sampling).

So, now we have a fitted model. There is still one part of the process that is often done: model selection. This is not always done, and is perhaps done more often than it should be. It also creates a lot of heated debate, which may surprise some of you. That's because model selection is often disguised as hypothesis testing.

A general description of the role of model selection is that it is a way of choosing between different mathematical models for the data, to find the best one. For frequentists there are two principal ways of doing this: null hypothesis statistical testing and using information criteria to compare models.

Null hypothesis statistical testing (NHST) is the method everyone is taught for testing hypotheses. What one does is set up a null hypothesis, for example that there is no difference between two groups, and then test whether this hypothesis is supported by the data. If it is, then the null hypothesis is accepted (strictly, it is not rejected); otherwise it is rejected. In one common formulation, whenever a null hypothesis is set up, a more general alternative hypothesis is also proposed: rejecting the null hypothesis means accepting the alternative hypothesis.

Where is the model selection in this? We can take the Harry Potter example, and test the null hypothesis that the change in book length over time is zero. The relevant part of the mathematical model is

m_i = a + b X_i

where m_i is the expected number of pages, and X_i is the year of publication. The null hypothesis is that b = 0. But this is equivalent to the model

m_i = a

And m_i = a + b X_i is the alternative model, corresponding to the alternative hypothesis. Hence, hypothesis testing becomes a problem of comparing two models. The models are compared by calculating the amount of variation in the data explained by the two models, and seeing if the larger model (i.e. the one with the effect of time) explains a significantly greater amount of variation.

The alternative method for model comparison uses information criteria. These are measures of model adequacy: they consist of a measure of model fit (the deviance: the smaller the deviance, the closer the fitted model is to the data), and a penalisation term for complexity (the more parameters in the model, the higher this penalisation). The different models being compared are fitted to the data, and the relevant information criterion (e.g. AIC or BIC) is calculated for each one. The model with the lowest criterion is declared the best. If there are several models with similar criteria, they are sometimes all examined, and one of them chosen to be the best, for other reasons (e.g. because it makes sense scientifically).
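Here is a small R sketch of both approaches for the Harry Potter example, with approximate page counts typed in purely for illustration:

# Approximate page counts for the seven books (illustrative numbers only)
year  <- c(1997, 1998, 1999, 2000, 2003, 2005, 2007)
pages <- c(223, 251, 317, 636, 766, 607, 607)

m0 <- lm(pages ~ 1)     # null model: m_i = a
m1 <- lm(pages ~ year)  # alternative model: m_i = a + b X_i

anova(m0, m1)  # NHST: does the slope explain significantly more variation?
AIC(m0, m1)    # information criterion: the model with the smaller AIC wins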

Both of these methods have their counterparts in Bayesian analysis: hypothesis testing can be carried out using Bayes factors, for instance, and there is an information criterion for Bayesians (DIC. Alas, not the Bayesian information criterion!).

So, there we have it. Of course, this is not all there is to statistical inference: for example, I have not dealt with model checking (i.e. asking whether the model actually fits the data well), and there are many details I have left out. But I hope this gives a scheme that separates out the different parts of fitting statistical models, and reduces some of the confusion. To summarise, here is a list of the parts of the process, and examples of the actual methods used in each part:


  1. The mathematical model


    • regression, ANOVA

    • ARIMA (time series)

    • Generalised linear models, and Generalised linear mixed models


  2. The inferential framework


    • Frequentist (Maximum Likelihood and REML)

    • Bayesian

    • least squares

    • non-parametric


  3. Model fitting (i.e. the computations)


    • bootstrap and jackknife

    • MCMC

    • importance sampling and sequential importance sampling


  4. Model selection


    • Significance tests

    • information methods (?IC: AIC, BIC, DIC, etc)




Monday, 30 July 2007

Statistical Modelling: The Bits Pt. 1

It's a sad fact of life that it's too short, so we don't have time to learn everything that we need to know. For most biologists (and probably most scientists, and indeed social scientists), one of the things they should learn to do science well is statistics. Of course, the poor dears are too busy learning about things like biochemistry or PCR, to have time to learn about the important stuff.

Because of this, there is a lot they don't know, and also a lot of confusion about the bits they do know. I therefore thought it was worth writing a few posts about some of these issues, to try and dispel at least some of the confusion.

Now we're below the fold, I should admit that I don't mind too much the lack of knowledge - there are large swathes of statistics that I don't know about either (I was too busy admiring the paper aeroplane building skills of my fellow biochemistry students). The confusion is more of a concern, and I think it is largely because biologists get little training in statistics as undergraduates, so don't have the grounding in the theory when they start doing research. And, yes, I am generalising - some biologists do become very good statisticians. But there is still plenty of confusion.

A lot of basic confusion comes from not understanding the different parts of using models to analyse data. For the moment I am ignoring things like model checking, and how to interpret the results. This is not because they are unimportant, but just because I want to write a blog entry, not a monograph. Anyway, there are, I think, 4 parts that can be separated out:

  1. the mathematical model
  2. the inferential framework
  3. model fitting (i.e. the computations)
  4. model selection

Point 1 is often passed over, and points 3 and 4 are often confused with point 2. I will deal with the first two points in this post, and treat the others later.

To help disentangle the different parts, it is useful to have an example. For this, we can use a simple linear regression, such as the one I used here. The mathematical model is self-explanatory. For the regression, it can be written as

m_i = a + b X_i
Y_i ~ N(m_i, s²).

The first line is the equation for a straight line. X_i is the covariate (year, in the example). This is then related to an expected value, m_i. The second line says that each data point, Y_i, is normally distributed with mean m_i and variance s².

Note that the data are assumed to be drawn from some distribution. Hence, they are random. However, they depend on parameters, in this case the mean and variance. Here the mean is modelled further, using an equation, so it is deterministic. i.e. if we knew the parameters, we would know the true value of the mean.

More complex analyses also have a model underneath them: the usual ANOVA actually has a model that is almost the same. The mean can be a function of several parameters, and might not be the mean, but just a parameter that controls the likelihood. The variance might also be modelled. And the parameters that are modelled could themselves be functions of more parameters. And all these parameters might themselves be random. Yes, things can get complex.

The problem, then, is to estimate the parameters. There are several schemes for doing this. For the model above, the two commonly used schemes are the frequentist and the Bayesian approaches. Each of these gives us a way to estimate the parameters. They are, in essence, philosophical approaches to understanding what we mean by probability and randomness. In the frequentist scheme, what we observe is random, with fixed parameters, and hence any statistic we calculate (such as the slope of the straight line) is a random function of the data. In contrast, the Bayesian approach is to say that anything we are uncertain about is random. Hence, we treat the data as fixed (because we know it: we have observed it), and the parameters are random.

Both of these philosophical schemes are useful in that they then can be used to construct methods for estimating the parameters of the mathematical model. Both use the likelihood, i.e. the probability of the data given values of the parameters, albeit in different ways. For example, the frequentist scheme tells us that we should find the values of the parameters which give us the maximum value of the likelihood. Hence, this is often spoken about as a maximum likelihood approach (there is another philosophical scheme which leads to the same equations, but to me it always looks like asking Fisher to wave his fiducial wand to make everything mysteriously right. I assume I'm missing something important).
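As a sketch of what "find the values of the parameters which give us the maximum value of the likelihood" means in practice, here is the straight-line model fitted to some made-up data by direct optimisation; lm() finds the same line analytically:

set.seed(1)
X <- 1:10
Y <- 2 + 3 * X + rnorm(10)   # made-up data from a known straight line

negloglik <- function(par) {
  a <- par[1]; b <- par[2]; s <- exp(par[3])   # exp() keeps the sd positive
  m <- a + b * X                               # expected values
  -sum(dnorm(Y, mean = m, sd = s, log = TRUE)) # minus the log-likelihood
}
fit <- optim(c(mean(Y), 0, log(sd(Y))), negloglik)
fit$par          # maximum likelihood estimates of a, b, and log(s)
coef(lm(Y ~ X))  # the same intercept and slope, found analytically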

Estimating the parameters is not enough, though. We also want to know how reliable they are. It makes a difference to our estimate that for every year J.K. Rowling writes an extra 300 pages whether the range of possible values is 290-310 or 0-600. In the former case, we have a pretty good idea how much time to book off work to read her latest book; in the latter, we have little idea about whether we need to have a headache or major heart surgery. We can summarise the reliability in a variety of ways: a popular one is the probability that the estimate is less than zero. Alternatives include giving a range of likely values (typically a range for which there is a 95% probability that the estimate is within it), or a statistic such as a standard deviation to summarise the variability in the estimate (in the frequentist scheme, this standard deviation is called the standard error).
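And a quick sketch of how that reliability is usually summarised in R, for the same sort of made-up straight-line data:

set.seed(1)
X <- 1:10
Y <- 2 + 3 * X + rnorm(10)
fit <- lm(Y ~ X)

summary(fit)$coefficients["X", ]  # estimate, standard error, t value, p-value for the slope
confint(fit, "X", level = 0.95)   # a range of likely values for the slope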

In my next post, I will describe the model fitting and model selection, i.e. how these schemes are used in practice to make inferences. I suspect this will include a rant about p-values.

Before I stop to work myself up into a frenzy, one last point is worth mentioning. The approaches discussed above assume that the data are normally distributed. If we discard this assumption, we can use different approaches. One, called least squares, is based on optimising a property of the model and data (i.e. minimising the sum of the squares of the differences between the data and the expected values). It turns out that this is the same as maximising the likelihood, i.e. it is mathematically the same as the frequentist approach. I therefore use the frequentist interpretation of the equations, and it gets round a lot of fiddly problems. Another, called non-parametric statistics, does away with any distributional assumption, but also throws away part of the data: rather than use the data themselves, the analysis concentrates on using the ranks of the data (i.e. whether each Y_i is the smallest, second smallest, etc.). In some cases, I think this can still have a likelihood interpretation, but I must admit that I haven't checked: it's just a hunch.

Tuesday, 24 July 2007

Book review: Bayesian Methods for Ecology

Review of Bayesian Methods for Ecology (NHBS) by Mick McCarthy

I've been meaning to write this for a couple of weeks, but I've either been too busy, or the moment I sit down to write it, a cat appears and sits on the keyboard. Well, I sorted the cat problem: he leapt up, I cut his nails.

So, the title of the book is fairly explanatory: it's about Bayesian methods for ecology. It's clearly aimed at ecologists who are not trained in statistics, but who need to use statistical methods. Mick spends a lot of the first half of the book giving the background to Bayesian methods, justifying their use, and criticising the use of hypothesis testing. He then moves on to describing standard models (regression and ANOVA) and how to fit them in a Bayesian way. The final part of the book consists of case studies using Bayesian methods, showing how they work in practice.

On the whole, I really like the book: it provides a simple, easy-to-follow explanation of what Bayesian methods are, and of how to use BUGS to fit simple models. The latter point is important, as it means that ecologists can see how to use the methods in practice, and the code and data are available on the web.

One thing I did like about the book is its emphasis on using information from outside the data to improve the estimation. This is an aspect of Bayesian methods that I use less than I should, but it is perhaps particularly important in practical conservation problems, where there is a lot of background information, and the aim is not to demonstrate some theory (where informative priors can bias the demonstration). If this book encourages scientists to use Bayesian methods for problems where several strands of information have to be synthesised, then it will have done a useful service.

I have a couple of criticisms. The first is that I feel the strength of the Bayesian approach is in the way it can handle complex models. This relies on a hierarchical scheme for modelling data. Although this is mentioned, I would have liked to have seen more on it: I think a whole chapter would have been worth aiming for (possibly combining the regression and ANOVA chapters: they're really the same thing). My second criticism is over the excessive use of DIC. This is a criterion for comparing how adequately different models fit the data. This is really a philosophical complaint: the problem with using criteria like DIC is similar to one problem with hypothesis testing: it says something about the statistical properties of the models, but nothing about the substantive, ecological, properties. Given enough data, DIC will show that the more complex model is better. It will not show whether the extra effects are important in any real way. I would rather see a greater emphasis on examining the parameters, and using them to decide if the more complex model explains anything: does removing the parameters reduce the predictive ability substantively (rather than statistically)?

So, the book isn't perfect, but I would still recommend it to ecologists wanting to understand the basics of Bayesian methods: not only does it give a glimpse of what they can do, but it also allows them to do it. I hope it will help ecologists (and others!) get over the first few steps to doing Bayesian analyses, by giving a simple explanation, and code to run (and change!) on simple problems. After that, what follows is the same, but bigger. Once the first hurdle is passed, more advanced problems can be tackled with the help of Andrew Gelman's latest book, so it should not be too difficult to progress to the complex models of the sort I find myself working on.


Tuesday, 17 July 2007

On the growth of literary yeast genes

I think one reason I enjoy working as a statistician so much is that we can use the tools of my trade to tease out patterns in data, and actually learn something about the world we live in. One aspect of this is the realisation that if one is trained in statistics, then one is aware of how much more can be extracted from the data. Oh, and also spot subtle mistakes. So, whilst I feel statistics is a service industry: there to help others in their work, there is also a strong element of either pride or egotism in my work. In the light of this...

A couple of computer scientists have just had a paper released in PNAS on "Temporal patterns of genes in scientific publications" (all figures below are from the paper). They looked at how references to genes changed over time in the scientific literature, in particular yeast genes, and whether there was variation in the way different genes were treated. I suspect this was done because it looked like a fun thing to do (a motivation I thoroughly approve of), but it does have implications for the way science is done. We might expect that some "super genes" are intrinsically more important, and hence there is more work on them than on others. Or the patterns of investigation might just be a result of following the herd: people study genes just because others are studying them. Now, a paper like this can't give a full explanation for the patterns of investigations, but it can give some useful information: it can show what the patterns are, and so help focus a more detailed investigation on the interesting aspects of what's going on.

What they did was to get their computers to trawl through the scientific literature and find all the mentions of thousands of yeast genes between 1975 and 2005. Having done that, they then asked how the pattern of references to these genes evolved. The data for two genes look like this:


ACT1 is the most popular gene, whereas PET54 is more typical. The solid lines give the actual data, the dotted line is from a simulation. They assumed a simple model where the number of references to a gene depended on three things:

  • the number of references to that gene in the previous year,
  • the number of references to all genes in the previous year, and
  • a background rate of reference.

So, they're imagining a system where work on a gene starts randomly (this is the background rate), but the more work there is on yeast genes overall, the more likely it is that work on a new gene will start. Makes sense: for whatever reason, someone gets a grant and starts to work on a gene. And this can be seeded by work on another related gene that is part of the same biochemical system. Once work on a gene has started, it should beget more work on that gene; it's good scientific practice: every answer brings up 3 more questions (and hence three more grant applications).

This model wasn't quite good enough though, so they added a saturation term: looking at the data, eventually the rate of citations slows down (for those who are interested, they use the Maynard-Smith and Slatkin model of density dependence). They then fitted this model to the data, assuming the same parameters for all genes (*). At about this point I think "whooo, dodgy". Determined to thwart me in proving my superiority, they then show that this model predicts the distribution of references pretty well. They suggest that what this is showing is that the assumption that the parameters are the same is reasonable. In other words, the genes are behaving equivalently, and there aren't any genes that are more important (and hence more highly cited) than others.
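To make the mechanics concrete, here is a minimal simulation sketch of this sort of model in R, ignoring the saturation term and using made-up parameter values rather than the paper's estimates:

set.seed(42)
n.genes <- 200; n.years <- 30
a  <- 0.0005  # effect of the field-wide total of references (made-up value)
b  <- 0.2     # effect of the gene's own past references (made-up value)
c0 <- 0.01    # background rate at which work on a gene starts (made-up value)

P <- matrix(0, n.genes, n.years)   # cumulative references for each gene
for (t in 2:n.years) {
  lambda <- a * sum(P[, t - 1]) + b * P[, t - 1] + c0   # expected new references
  P[, t] <- P[, t - 1] + rpois(n.genes, lambda)         # add this year's references
}
matplot(t(P[1:20, ]), type = "l", lty = 1,
        xlab = "Year", ylab = "Cumulative references")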

What this implies is that there is nothing intrinsically more important about any one gene as compared to any others: they are behaving as if they are equal, and the reason any one gene is researched more is just happenstance. Once genes become popular, they get researched on more. I guess this could be explained simply because there are more questions that can be asked about a better studied gene.

A few things bother me about this research, though. First, the interpretation of the parameters is off. The authors claim that, if there is no saturation, the growth rate of citations is 0.23, the sum of the growth rates for the gene effect and the general rate of gene reference. Well, um, OK. Let's look at the equation (ignoring saturation, for simplicity):

N_i,t = 0.028 P*_t-1 + 0.20 P_i,t-1 + 0.005

N_i,t is the number of new references about gene i in year t, P*_t-1 is the total number of references about all genes (including gene i) up to year t-1, and P_i,t-1 is the total number of references about gene i up to year t-1. So, the growth rate is only 0.23 (= 0.028 + 0.20) if the number of citations for any one gene (P_i,t-1) equals the total number of citations of all genes (P*_t-1) (and so the total number of genes is... work it out for yourself). There are over 4000 genes in the database, so the strength of the effect of the total should be much higher: on average, the effect of P*_t-1 should dominate.

They compound this error by claiming that most of the growth is caused by the gene itself. The model above says that an increase of one reference to a gene in one year leads to an average of 0.2 extra references to that gene the next year, whereas an increase of one reference to yeast genes in general leads to 0.028 extra references (on average) the next year. Looks like the effect of a single gene is stronger, doesn't it?

No. There are over 4000 genes, so there is far more variation in the total number of references across all genes. If we had only 100 genes, all equally cited, and they all gained one reference in a year, then P_i,t-1 would increase by 1, leading to an increase of 0.2 references from this term, but P*_t-1 would increase by 100, causing an increase of 0.028 × 100 = 2.8, i.e. over 10 times higher.

Statistically, the point is that the interpretation of regression coefficients depends on the variation in the covariate: one can get larger coefficients simply by changing the measurements from nanometres to kilometres. Substantively, the point is that the dynamics are probably not being driven by single genes, but by the overall rate of growth in yeast genetics. This could be because yeast genetics overall is experiencing growth, or because there has been an inflation in the number of papers generally in science. But it suggests that the dynamics are even more neutral: work on a gene does not beget more work on that same gene. But I am not convinced that we can conclude this, because...

My other concern is that it's not at all clear to me that the model is working well. One problem is that the model seems to predict too few intermediate genes. This is their plot:



The dots are the observed frequencies, the crosses are the simulated frequencies. In the middle of the plot, the dots are all above the crosses, so too few genes are being predicted with intermediate frequencies. I'm not sure what this means: usually one would expect the opposite pattern, as variation in the parameters makes some genes more extreme. My initial guess is that the rate at which genes appear is not homogeneous: if genes appear at a greater rate in the middle of the time period studied, then this is what we would see. The simulation would spread these genes out over the whole time period, so there would be more genes appearing early (so having more citations) and late (so having very few citations).

Unfortunately, I'm guessing about this, which brings me on to one of my big Statisticians Are Important points. The way the model is checked is very simple, so it could be missing a lot of problems. My main worry is that the data were collected at the gene level, so I would like to see how the model predicts the behaviour of individual genes. In the first plot above we see a couple plotted, and it looks like the prediction for ACT1 is poor: it badly under-predicts the behaviour from about 1995 onwards. However, this is only one simulation: what about the rest? If they all do this, then there is a problem with predicting the popular genes, i.e. the ones that yeast biologists have decided are important.

There is a lot of data, but something as simple as plotting the residuals (the difference between the observed values and those predicted by the model) against the predicted values is often very telling. For example, if the model fit plot above were plotted as a residual plot, it would give a curve that would be frowning at us, whereas a good residual plot shouldn't have any visible pattern. This could simply be done for all the data (giving one huge plot!), or for subsets (e.g. sub-setted by popularity). I suspect all sorts of problems would be seen: some important, others trivial and ignorable.

So, OK. I'm not a great fan of the analysis. What would I do? I think the first thing would be to look at the data. There is too much to look at it all, but it might be possible to cluster the genes into groups according to their publication record, and look at representatives from each group. From that, I could get a feel for how they are behaving: are most increasing exponentially? Do many reach a plateau? Do they decrease in publication (as researchers get bored and move on to other genes)? Do different genes become popular at different times? Or do all the plots look similar? This would guide me in deciding what sort of model to fit.

Then I would look at fitting a model with different parameters for each gene. I might drop rare genes first: the ones that only get mentioned once or twice. Any estimates for them will be crap anyway. Now, this leads to a lot of parameters, but we can use an approach called hierarchical modeling: basically, assume that the parameters are related, so knowing about 20 parameter values tells us about the 21st. This gives us a direct way of measuring how similarly the genes are behaving: if the parameter estimates are almost the same, then the genes are acting the same. We can even ask where the differences are: is it that genes are appearing in the literature at different times? Or are the growth rates different - i.e. do some genes spawn 10 publications per publication whilst other only spawn 1 or 2?

Once this has been done, we can then check if the model actually works, i.e. if it actually predicts the data well. If it doesn't, then we have to do the process again, based on what we have learned. This can be something of an iterative process, but eventually it should converge to a reasonable model. Hopefully.

The advantage of the approach I'm envisioning is that it becomes more of a conversation with the data (or an interrogation, if it doesn't want to behave). The process is one of learning what the data are telling us, using the plots and models to pick out the important information. We then need to interpret this information (correctly!) to be able to say something about the system - i.e. whether the genes are behaving the same, or what the differences in rates of reference are. And also the extent to which work on one gene is driven by previous work on the same gene, or whether yeast genetics is working by looking at genes in general. One could even try to ask whether this has changed over time: has there been a shift from a focus on single genes to a more general, holistic approach (i.e. do the regression parameters in the equation above change over time? More hierarchical models!)? These are all questions that could be tackled: the techniques are there, as long as the data are prepared to cooperate.

Stepping back a bit, and trying not to trip over the cat, this shows a response I have to many papers. The questions are interesting, but the data analysis and interpretation can be improved. For this paper, the problems in the analysis are fairly typical: I suspect the authors are not aware of the techniques that are available. This is understandable, as statistics is its own field, and I would be as lost in, say, yeast genetics. It's because of this that statisticians are needed: even if it's just to have a chat to over coffee, or a pint of Guinness. The interpretation aspect is more of a problem for me: it does not need a great technical knowledge to understand what is wrong, just an appreciation of what the equations mean.

Overall, could do better. But I hope I've shown how to do better. Constructive criticism, and all that.

Reference: Pfeiffer, T. & Hoffmann, R. (2007) Temporal patterns of genes in scientific publications. Proc. Natl. Acad. Sci., 104: 12052-12056. DOI: 10.1073/pnas.0701315104.


* Technical note: they used a general optimisation algorithm (optim in R) to do this. However, given the values of alpha and PS, the model is a generalised linear model with an identity link. So, they could have used this fact, and just written a function that would optimise alpha and PS, fitting the model as part of the function. Actually, I think I would have started with a density dependence function with one less parameter: even easier!


Monday, 25 June 2007

OpenBUGS3.0.1 released

Of the many "it seemed like a good idea at the time" things I have done in my life, one was to volunteer to set up the OpenBUGS pages. I did this in the spirit of helping out, rather than one of showing off my html skills (I hope this is obvious from the pages). Nowadays this mainly means transferring a zip file from a memory stick to our web server, and then checking I haven't broken anything. The real programming work is done by Andrew Thomas, so he's the one who should get all the thanks now.


Anyway, as you might have guessed from the title, I've just put version 3.0.1 of OpenBUGS up on the web.  Actually, this is a slight cheat - I put a version up a couple of weeks ago, and then we promptly found a couple of bugs/features.  Andrew solved both of them, although one solution is geopolitical - giving the Åland islands back to Sweden.

The good news about the changes is that it gave Andrew more time to work on the BUGS to LaTeX language converter, ably supported by Mikko Sillanpää.  Of course, the first thing I tried with it was throwing some of my more complex models at it, but it seems to manage.  For example, here is what the LaTeX code produces for this paper:

OK, it's not perfect (and it looks better on its own: bits are disappearing off the edge of the window here), but it is fairly easy to edit the LaTeX code. There are also a few changes to the BUGS code that would make the output more readable (e.g. writing tau.S rather than tauS, so that BUGS recognises the tau). Overall, I'm impressed, and suspect I'm going to drift even deeper into the clutches of LaTeX now.

So, I'm happy for a bit - even if I should be doing N other things this week.
