Tuesday 17 July 2007

On the growth of literary yeast genes

I think one reason I enjoy working as a statistician so much is that we can use the tools of my trade to tease out patterns in data, and actually learn something about the world we live in. One aspect of this is the realisation that if one is trained in statistics, then one is aware of how much more can be extracted from the data. Oh, and one can also spot subtle mistakes. So, whilst I feel statistics is a service industry (there to help others in their work), there is also a strong element of either pride or egotism in my work. In the light of this...

A couple of computer scientists have just had a paper released in PNAS on "Temporal patterns of genes in scientific publications" (all figures below are from the paper). They looked at how references to genes changed over time in the scientific literature, in particular yeast genes, and whether there was variation in the way different genes were treated. I suspect this was done because it looked like a fun thing to do (a motivation I thoroughly approve of), but it does have implications for the way science is done. We might expect that some "super genes" are intrinsically more important, and hence there is more work on them than on others. Or the patterns of investigation might just be a result of following the herd: people study genes just because others are studying them. Now, a paper like this can't give a full explanation for the patterns of investigations, but it can give some useful information: it can show what the patterns are, and so help focus a more detailed investigation on the interesting aspects of what's going on.

What they did was to get their computers to trawl through the scientific literature and find all the mentions of thousands of yeast genes between 1975 and 2005. Having done that, they then asked how the pattern of references to these genes evolved. The data for two genes look like this:

[Figure from the paper: yearly references to ACT1 and PET54; solid lines are the data, dotted lines a simulation.]

ACT1 is the most popular gene, whereas PET54 is more typical. The solid lines give the actual data, the dotted lines are from a simulation. They assumed a simple model where the number of references to a gene depended on three things:

  • the number of references to that gene in the previous year,
  • the number of references to all genes in the previous year, and
  • a background rate of reference.

So, they're imagining a system where work on a gene starts at random (this is the background rate), but the more work there is on yeast genes, the more likely it is that work on a new gene will start. Makes sense: for whatever reason, someone gets a grant and starts to work on a gene. And this can be seeded by work on another related gene that is part of the same biochemical system. Once work on a gene has started, it should beget more work on that gene; it's good scientific practice, as every answer brings up three more questions (and hence three more grant applications).
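
To make the mechanism concrete, here is a toy simulation of this sort of process in R. The parameter values are invented for illustration (they are not the fitted values from the paper), and I'm reading the "all genes" term as the running total of references across every gene:

    # Toy simulation of the reference-accumulation model (invented parameters).
    # P[i, t] is the cumulative number of references to gene i up to year t.
    set.seed(42)
    n.genes <- 200
    n.years <- 30
    a  <- 0.001   # effect of references to all genes (invented)
    b  <- 0.2     # effect of references to the gene itself (invented)
    bg <- 0.005   # background rate of reference (invented)

    P <- matrix(0, nrow = n.genes, ncol = n.years)
    for (t in 2:n.years) {
      P.star <- sum(P[, t - 1])                     # references to all genes so far
      lambda <- a * P.star + b * P[, t - 1] + bg    # expected new references per gene
      P[, t] <- P[, t - 1] + rpois(n.genes, lambda)
    }

    # Cumulative references for the most popular gene in the simulation
    plot(P[which.max(P[, n.years]), ], type = "l",
         xlab = "Year", ylab = "Cumulative references")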

This model wasn't quite good enough though, so they added a saturation term: looking at the data, the rate of citations eventually slows down (for those who are interested, they use the Maynard Smith and Slatkin model of density dependence). They then fitted this model to the data, assuming the same parameters for all genes (*). At about this point I think "whooo, dodgy". Determined to thwart me in proving my superiority, they then show that this model predicts the distribution of references pretty well. They suggest that what this is showing is that the assumption of equal parameters is reasonable. In other words, the genes are behaving equivalently, and there aren't any genes that are more important (and hence more highly cited) than others.
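
One plausible way to write the saturating version is sketched below. This is only a sketch: I'm quoting the Maynard Smith and Slatkin form from memory, and K and beta are invented values, not anything from the paper:

    # Expected new references with a damping term of the Maynard Smith / Slatkin
    # type (as I recall it); K and beta are invented for illustration.
    expected.new.refs <- function(P.i, P.star, a = 0.001, b = 0.2, bg = 0.005,
                                  K = 500, beta = 1) {
      growth <- a * P.star + b * P.i + bg   # the three terms from before
      growth / (1 + (P.i / K)^beta)         # damps the growth as P.i gets large
    }

    # For heavily cited genes the expected number of new references levels off:
    curve(expected.new.refs(x, P.star = 10000), from = 0, to = 5000,
          xlab = "Cumulative references to the gene",
          ylab = "Expected new references")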

What this implies is that there is nothing intrinsically more important about any one gene as compared to the others: they are behaving as if they are equal, and the reason any one gene is researched more is just happenstance. Once genes become popular, they get researched more. I guess this could be explained simply by there being more questions that can be asked about a better-studied gene.

A few things bother me about this research, though. First, the interpretation of the parameters is off. The authors claim that, if there is no saturation, the growth rate of citations is 0.23, the sum of the growth rates for the gene effect and the general rate of gene reference. Well, um, OK. Let's look at the equation (ignoring saturation, for simplicity):

N(i,t) = 0.028 P*(t-1) + 0.20 P(i,t-1) + 0.005

N(i,t) is the number of new references about gene i in year t, P*(t-1) is the total number of references about all genes (including gene i) up to year t-1, and P(i,t-1) is the total number of references about gene i up to year t-1. So, the growth rate is only 0.23 (= 0.028 + 0.20) if the number of citations for any one gene (P(i,t-1)) equals the total number of citations of all genes (P*(t-1)) (and so the total number of genes is... work it out for yourself). There are over 4000 genes in the database, so the effect of the total over all genes should be much stronger: on average, the P*(t-1) term should dominate.

They compound this error by claiming that most of the growth is caused by the gene itself. The model above says that an increase of one reference to a single gene in one year leads to an average of 0.2 extra references the next year, whereas an increase of one reference to yeast genes in general leads to 0.028 extra references (on average) the next year. Looks like the effect of the single gene is stronger, doesn't it?

No. There are 4000 genes, so there is much more variation in the total number of references across all genes. Suppose we had only 100 genes, all equally cited, and they all gained one reference in a year. Then P(i,t-1) would increase by 1, contributing an extra 0.2 references, but P*(t-1) would increase by 100, contributing an extra 0.028 x 100 = 2.8 references, i.e. over ten times as much.
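
In code, the comparison of the two contributions looks like this (still reading P* as the running total across all genes, as above):

    # Extra references next year when every gene gains one reference this year,
    # split into the two terms of the model (using the paper's coefficients).
    b.gene  <- 0.20    # coefficient on P(i,t-1)
    b.total <- 0.028   # coefficient on P*(t-1)

    contribution <- function(n.genes) {
      c(from.gene  = b.gene * 1,           # P(i,t-1) goes up by 1
        from.total = b.total * n.genes)    # P*(t-1) goes up by n.genes
    }

    contribution(100)    # from.gene = 0.2, from.total = 2.8
    contribution(4000)   # from.gene = 0.2, from.total = 112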

Statistically, the point is that the interpretation of regression coefficients depends on the variation in the covariate: one can change the size of a coefficient simply by switching the measurements between kilometres and nanometres. Substantively, the point is that the dynamics are probably not being driven by single genes, but by the overall rate of growth in yeast genetics. This could be because yeast genetics overall is experiencing growth, or because there has been an inflation in the number of papers in science generally. But it suggests that the dynamics are even more neutral: work on a gene does not beget more work on that same gene. But I am not convinced that we can conclude this, because...
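
The units point is easy to demonstrate with a couple of lines of R (simulated data, and the variable names are just placeholders):

    # The same relationship, but the coefficient changes by a factor of 10^12
    # when the covariate is rescaled from kilometres to nanometres.
    set.seed(1)
    dist.km <- runif(100, 0, 10)
    y <- 2 * dist.km + rnorm(100)

    coef(lm(y ~ dist.km))["dist.km"]   # roughly 2 per kilometre
    dist.nm <- dist.km * 1e12          # 1 km = 10^12 nm
    coef(lm(y ~ dist.nm))["dist.nm"]   # roughly 2e-12 per nanometre: same effect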

My other concern is that it's not at all clear to me that the model is working well. One problem is that the model seems to predict too few intermediate genes. This is their plot:

[Figure from the paper: frequency distribution of the number of references per gene; dots are observed, crosses are simulated.]

The dots are the observed frequencies, the crosses are the simulated frequencies. In the middle of the plot, the dots are all above the crosses, so too few genes are being predicted at intermediate frequencies. I'm not sure what this means: usually one would expect the opposite pattern, as variation in the parameters makes some genes more extreme. My initial guess is that the rate at which genes appear is not homogeneous: if genes appear at a greater rate in the middle of the time period studied, then this is what we would see. The simulation would spread these genes out over the whole time period, so there would be more genes appearing early (and so having more citations) and late (and so having very few citations).
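
That guess could be checked with a quick simulation: give each gene a start year, let its references grow at some rate after that, and compare the spread of final counts when start years are uniform over the period versus bunched in the middle (all the values below are invented):

    # Toy check of the inhomogeneous-appearance guess (invented values).
    set.seed(7)
    n.genes <- 2000
    n.years <- 30
    rate <- 2   # new references per year once a gene has appeared

    final.count <- function(start.year) {
      pmax(n.years - start.year, 0) * rate   # references accumulated by the end
    }

    uniform.starts <- sample(1:n.years, n.genes, replace = TRUE)
    middle.starts  <- pmin(pmax(round(rnorm(n.genes, n.years / 2, 4)), 1), n.years)

    # Uniform starts spread the final counts out towards the extremes;
    # middle-heavy starts pile the genes up at intermediate counts.
    quantile(final.count(uniform.starts))
    quantile(final.count(middle.starts))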

Unfortunately, I'm guessing about this, which brings me on to one of my big Statisticians Are Important points. The way the model is checked is very simple, so it could be missing a lot of problems. My main worry is that the data were collected at the gene level, so I would like to see how the model predicts the behaviour of individual genes. In the first plot above we see a couple plotted, and it looks like the prediction for ACT1 is poor: it badly under-predicts the behaviour from about 1995. However, this is only one simulation: what about the rest? If they all do this, then there is a problem with predicting the popular genes, i.e. the ones that yeast biologists have decided are important.

There is a lot of data, but something as simple as plotting the residuals (the difference between the observed values and those predicted by the model) against the predicted values is often very telling. For example, if the model fit plot above were re-drawn as a residual plot, it would give a curve frowning at us, whereas a good residual plot shouldn't have any visible pattern. This could simply be done for all the data (giving one huge plot!), or for subsets (e.g. sub-setted by popularity). I suspect all sorts of problems would be seen: some important, others trivial and ignorable.
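
The plot itself is a couple of lines of R. The observed and predicted values here are faked, just so the snippet stands alone; in practice they would come from the data and the fitted model:

    # Residuals against predicted values: a good model gives a structureless
    # cloud, a frowning (or smiling) curve means something is being missed.
    set.seed(3)
    predicted <- rexp(4000, rate = 0.1)            # stand-in fitted values
    observed  <- rpois(4000, lambda = predicted)   # stand-in observations

    resids <- observed - predicted
    plot(predicted, resids, xlab = "Predicted references", ylab = "Residual")
    abline(h = 0, lty = 2)

    # The same plot for a subset, e.g. the most popular 5% of genes:
    popular <- rank(-observed) <= 0.05 * length(observed)
    plot(predicted[popular], resids[popular])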

So, OK. I'm not a great fan of the analysis. What would I do? I think the first thing would be to look at the data. There is too much to look at all of it, but it might be possible to cluster the genes into groups according to their publication records, and look at representatives from each group. From that, I could get a feel for how they are behaving: are most increasing exponentially? Do many reach a plateau? Do some decrease in publications (as researchers get bored and move on to other genes)? Do different genes become popular at different times? Or do all the plots look similar? This would guide me in deciding what sort of model to fit.
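
As a sketch of that first look (assuming the counts can be arranged as a genes-by-years matrix, which is my guess at the data layout; the matrix here is faked so the snippet runs), something as crude as k-means on the scaled trajectories would do:

    # Cluster the publication trajectories and plot a representative history
    # for each cluster. 'counts' is assumed to be a genes x years matrix of
    # yearly reference counts; it is faked here.
    set.seed(11)
    counts <- matrix(rpois(500 * 31, lambda = 2), nrow = 500, ncol = 31)

    # Scale each gene's trajectory so the clustering picks up shape, not size
    shapes <- t(apply(counts, 1, function(x) x / max(x, 1)))
    cl <- kmeans(shapes, centers = 5)

    # The cluster centres act as representative publication histories
    matplot(t(cl$centers), type = "l", xlab = "Year",
            ylab = "Relative publication rate")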

Then I would look at fitting a model with different parameters for each gene. I might drop the rare genes first: the ones that only get mentioned once or twice. Any estimates for them will be crap anyway. Now, this leads to a lot of parameters, but we can use an approach called hierarchical modelling: basically, assume that the parameters are related, so that knowing about 20 parameter values tells us about the 21st. This gives us a direct way of measuring how similarly the genes are behaving: if the parameter estimates are almost the same, then the genes are acting the same. We can even ask where the differences are: is it that genes are appearing in the literature at different times? Or are the growth rates different, i.e. do some genes spawn 10 publications per publication whilst others only spawn 1 or 2?
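
As a sketch of the hierarchical idea, using lme4 (the data frame below is faked, and I've used a Gaussian model to keep the sketch short; a count model would be the more natural choice):

    # Hierarchical sketch: gene-specific intercepts and growth rates, shrunk
    # towards common values. The data frame is faked; in practice it would be
    # built from the publication records, with one row per gene per year.
    library(lme4)
    set.seed(5)
    d <- data.frame(gene   = factor(rep(1:100, each = 20)),
                    P.gene = rpois(2000, 5),      # lagged references to the gene
                    P.all  = rpois(2000, 500))    # lagged references to all genes
    d$new.refs <- 0.005 + 0.001 * d$P.all + 0.2 * d$P.gene + rnorm(2000, sd = 0.5)

    fit <- lmer(new.refs ~ P.all + P.gene + (1 + P.gene | gene), data = d)
    VarCorr(fit)   # spread of gene-level slopes: near zero means the genes act alike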

Once this has been done, we can then check if the model actually works, i.e. if it actually predicts the data well. If it doesn't, then we have to do the process again, based on what we have learned. This can be something of an iterative process, but eventually it should converge to a reasonable model. Hopefully.

The advantage of the approach I'm envisioning is that it becomes more of a conversation with the data (or an interrogation, if it doesn't want to behave). The process is one of learning what the data is telling us, using the plots and models to pick out the important information. We then need to interpret this information (correctly!) to be able to say something about the system, i.e. whether the genes are behaving the same, or what the differences in rates of reference are. And also the extent to which work on one gene is driven by previous work on that same gene, or whether yeast genetics works by looking at genes in general. One could even ask whether this has changed over time: has there been a shift from a focus on single genes to a more general, holistic approach (i.e. do the regression parameters in the equation above change over time? More hierarchical models!)? These are all questions that could be tackled: the techniques are there, as long as the data is prepared to cooperate.

Stepping back a bit, and trying not to trip over the cat, this shows a response I have to many papers. The questions are interesting, but the data analysis and interpretation could be improved. For this paper, the problems in the analysis are fairly typical: I suspect the authors are not aware of the techniques that are available. This is understandable, as statistics is its own field, and I would be just as lost in, say, yeast genetics. It's because of this that statisticians are needed: even if it's just to have a chat with over coffee, or a pint of Guinness. The interpretation aspect is more of a problem for me: it does not need great technical knowledge to understand what is wrong, just an appreciation of what the equations mean.

Overall, could do better. But I hope I've shown how to do better. Constructive criticism, and all that.

Reference: Pfeiffer, T. & Hoffmann, R. (2007) Temporal patterns of genes in scientific publications. Proc. Natl. Acad. Sci., 104: 12052-12056. DOI: 10.1073/pnas.0701315104.


* Technical note: they used a general optimisation algorithm (optim in R) to do this. However, given the values of alpha and PS, the model is a generalised linear model with an identity link. So, they could have used this fact, and just written a function that optimises over alpha and PS, fitting the GLM as part of the function. Actually, I think I would have started with a density dependence function with one less parameter: even easier!
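
Something along these lines, as a sketch only: I'm guessing that alpha and PS are the two parameters of the saturation term, and a Poisson family seems the obvious choice for counts with an identity link. The data frame d and its column names are placeholders, not the paper's:

    # Profile out the linear coefficients: for fixed alpha and PS the model is
    # a GLM with an identity link, so fit that with glm() and let optim()
    # search over alpha and PS alone. The saturation form here is my guess.
    profile.deviance <- function(pars, d) {
      alpha <- pars[1]
      PS    <- pars[2]
      sat <- 1 / (1 + (d$P.gene / PS)^alpha)   # assumed damping term
      fit <- glm(new.refs ~ 0 + I(sat * P.all) + I(sat * P.gene) + sat,
                 family = poisson(link = "identity"), data = d,
                 start = c(0.03, 0.2, 0.01))
      deviance(fit)
    }

    # optim(c(1, 500), profile.deviance, d = d)   # d: one row per gene per year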

1 comment:

Anonymous said...

Dear Bob, thanks for featuring our paper in your blog. Just two quick responses to the points you are criticizing: First, P* stands in our paper for the average number of publications per gene, not the overall number. Thus P* and Pi are on the same scale, and a comparison of the two rates makes sense. It is pretty clear from the data that most novel publications are indeed about genes that already appear at high frequency in the literature. Second, you argue that the model does not necessarily work very well. I agree that more complex models, say with individual parameters for each gene or a hierarchical model, can be expected to give a better description of the data. But in my opinion, the important point is that a simple three-parameter model already gives a good description of the data – it captures the major phenomenon we are discussing in the paper, namely the evolution of the frequency distribution of genes in the literature. I hope this clarifies your points. With my very best regards, Thomas
ps – and please keep posting on my papers ;) - good stuff on similar issues will come out soon!!!