By a virtual game of Chinese Whispers (in which no Chinese were involved), I found out about a paper on arΧiv where a poor unsuspecting physicist wanders into a curious part of statistics. I'm actually something of a bystander in this area, but it's not going to stop me commenting on it.

OK, so the paper is by a guy called Bruce Knuteson, from MIT. He's interested in working out the scientific worth of a piece of empirical work, and being a physicist, he wants to measure it.

So, the first problem is to decide what worth is. Knuteson decides to measure it in terms of "surprisal", i.e. how surprised we are by a result. So, if we collected data, and got a result (say, a measurement of a parameter) $x_i$, how shocked would we be by it? From this, Knuteson decides that ...

The scientific merit of a particular result $x_i$ should thus (i) be a monotonically decreasing function of the expectation $p(x_i)$ that the result would be obtained, and (ii) be appropriately additive.

and so suggests $-\log \Pr(x_i)$ as a measurement, as it has these properties. He then suggests that the worth of an experiment can be estimated as the expected value of this, i.e. the sum over outcomes of $-\Pr(x_i) \log \Pr(x_i)$. This is a measure called entropy: something beloved of physicists and engineers, but rather opaque to the rest of us. The idea is that a larger entropy will mean that the experiment is better - we will expect more surprising results.

But is this a good measure? Perhaps a good way of tackling this is to view it as a problem in decision theory. How can we decide what is the best course of action to take when we are uncertain what the results will be? For example, if we have a choice of experiments we can carry out, how can we decide which one to do? To do this we first need to define "best". This has to be measured, and the numerical value for each outcome is called the utility, U. This might, for example, be the financial gain or loss (e.g. if we are gambling), or might be something more prosaic, like one's standing in the scientific community (however that is measured: h-index?). All the effects of each action, both positive and negative, go into this number. So, for example, we would include the gain in prestige from publishing a good paper, and the cost (e.g. financial, or the effect on our notoriety if the results are a turkey). The second part of the decision analysis is to give a probability for each outcome, so for action A the probability might be 0.3 that we get a Nature paper, and 0.7 that we get a Naturens verden paper. For action B it might be 0.9 and 0.1 respectively. We then calculate the average utility for each action, i.e. sum the probability of each result multiplied by the utility for that result.
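As a minimal sketch of that expected-utility sum: the probabilities are the ones from the Nature / Naturens verden example, but the utility values are invented purely for illustration, and the function name is my own.

```python
# Hypothetical utilities (arbitrary units). These numbers are assumptions,
# not values from Knuteson's paper.
UTILITY = {"Nature": 10.0, "Naturens verden": 1.0}

# Outcome probabilities for each action, as given in the text.
ACTIONS = {
    "A": {"Nature": 0.3, "Naturens verden": 0.7},
    "B": {"Nature": 0.9, "Naturens verden": 0.1},
}


def expected_utility(probs, utility):
    """Sum of probability times utility over all outcomes."""
    return sum(p * utility[outcome] for outcome, p in probs.items())


expected = {name: expected_utility(probs, UTILITY)
            for name, probs in ACTIONS.items()}
best_action = max(expected, key=expected.get)
print(expected, best_action)
```

With these made-up utilities action B wins, but the point is only that the answer depends on both ingredients: change the utilities (say, by adding a large cost to B) and the ranking can flip.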

This is what Knuteson does to get his measure. The problem is that his only utility is surprisal, and in general this doesn't make sense. Two things are missing. Firstly, there is no cost element. So, if we want to measure the time it takes an apple to fall on a physicist's head, it makes no difference if we pay a couple of students $1 or £30,000,000 to do it. The second problem is that there is no measure of scientific worth. Finding out if the next toss of a €1 coin comes up heads is treated exactly the same as finding out if the Higgs boson is green.

This leads to clearly nonsensical results. If there are only two possible outcomes of an experiment, then the maximum expected surprisal occurs if the probability of one is 0.5. Therefore the optimal experiment is one with this property. For example, tossing a €1 coin. According to Knuteson, then, we should fund lots of coin tossing experiments (hmm, there's an Academy of Finland application deadline coming up).
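The claim that a 50:50 experiment maximises the expected surprisal is easy to check numerically. This is just a quick sketch (natural logs, a coarse grid of probabilities, and a function name of my own choosing):

```python
import math


def expected_surprisal(probs):
    """Entropy: the sum of -p * log(p) over outcomes (natural log)."""
    return sum(-p * math.log(p) for p in probs if p > 0)


# Scan the probability of one outcome of a two-outcome experiment.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda p: expected_surprisal([p, 1 - p]))
print(best_p, expected_surprisal([best_p, 1 - best_p]))
```

The maximum sits at a probability of 0.5, where the entropy is log 2: exactly the fair coin.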

A further thing that is missing is where the probabilities come from. These are probabilities of outcomes that are not observed, so in general they cannot be measured (without doing the experiment...). Therefore one has to assign them based on one's subjective opinion. Now we are on familiar Bayesian ground, and this is something that has been argued about for years. But here I think Knuteson can use a sneaky trick to sidestep the problems. Put simply, he could argue that in practice the estimation of merit is made by people, so they can assign their own probabilities. If someone else disagrees, fine. This way, it is clearer where the disagreement lies (e.g. which probabilities are being assigned differently).

So, estimating the merit of a piece of work before it is done can be problematic (and I haven't touched on comparing experiments with different numbers of possible outcomes!). But Knuteson develops his ideas even further. How about, he asks, working out the merit of an experiment after it has been done?

Before doing this, Knuteson sorts out a little wrinkle. It is not generally the experimental results themselves that are of interest - it is how they impact on our understanding of the natural world. We can think about this in the way that we have several models, $M_1, M_2, \ldots, M_k$, for the world (these might correspond to theories, e.g. that the world is flat, or that it is a torus). The worth of an experiment could then be measured in terms of how it changes what we learn about these models, i.e. how the probability of $M_j$ changes with the data, $x_i$. This can simply be measured as the entropy of the models, the $M_j$'s, rather than the experimental outcomes.

Knuteson goes through the maths of this, and finds that the appropriate measure of the merit of an experiment is a measure of how far the probabilities of each model are shifted by the experiment. To be precise, it is a measure known as the Kullback-Leibler divergence (I will spare you the equations). Now, this again is something that is familiar. A big problem in statistics is deciding which model is, in some sense, best. This can be done by asking about how well it will predict an equivalent data set to the one being analysed. After going through a few hoops, we find that the appropriate tool is the K-L divergence between the fitted model and the "true" model. Of course, we don't know the true model, but there are several tweaks and approximations that can be made so that we don't need to - it is the relative divergence of different possible models that is important. The result of this is a whole bunch of criteria that are all TLAs with IC at the end - AIC, BIC, DIC, TIC, and several CICs.
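For anyone who does want to see the calculation, here is a small sketch of the K-L divergence between posterior and prior model probabilities. The two models, their prior probabilities and the likelihoods are all hypothetical numbers chosen for illustration, and I am assuming the divergence is taken from the prior to the posterior, in nats.

```python
import math


def posterior(prior, likelihood):
    """Bayes' rule over a finite set of models."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


prior = [0.5, 0.5]        # two models, equally plausible beforehand
likelihood = [0.9, 0.1]   # hypothetical Pr(observed data | model)
post = posterior(prior, likelihood)
gain = kl_divergence(post, prior)
print(post, gain)
```

If the data are equally likely under both models, the posterior equals the prior and the divergence is zero: the experiment has taught us nothing about which model is right.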

The optimal experiment is the one which will maximise the difference between our prior and posterior probabilities of the different models (yes, Bayesian again). The idea is natural - the greater the difference, the more we have learned, and hence the better the experiment is. Of course, we still have the same problems as above, i.e. assigning the probabilities, and getting the utility right, but we are in the right area. Indeed it turns out (after browsing wiki) that the idea is not original - the method proposed by Knuteson is the same as something called Bayesian D-optimality. And (after reading the literature), the idea goes back to 1956!
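Before an experiment is run we don't know the outcome, so the natural design criterion is the expected divergence, averaged over the outcomes we think might happen. The sketch below is my own toy version in the spirit of Lindley's measure, not code from either paper; the two experiments and all their probabilities are invented.

```python
import math


def expected_information_gain(prior, likelihoods):
    """Expected KL divergence from prior to posterior over models.

    likelihoods[j][i] is Pr(outcome i | model j).
    """
    n_outcomes = len(likelihoods[0])
    gain = 0.0
    for i in range(n_outcomes):
        # Marginal probability of outcome i, averaged over the models.
        marginal = sum(p * lik[i] for p, lik in zip(prior, likelihoods))
        if marginal == 0:
            continue
        for p, lik in zip(prior, likelihoods):
            post = p * lik[i] / marginal
            if post > 0:
                gain += marginal * post * math.log(post / p)
    return gain


prior = [0.5, 0.5]
sharp = [[0.9, 0.1], [0.1, 0.9]]   # the models predict different outcomes
blunt = [[0.5, 0.5], [0.5, 0.5]]   # the models predict the same outcomes
print(expected_information_gain(prior, sharp),
      expected_information_gain(prior, blunt))
```

Note that the "blunt" experiment is a fair coin toss under both models: its outcomes have the maximum possible entropy, yet its expected information gain about the models is zero.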

So, does this help? For the general problem of estimating scientific merit, I doubt it. There are too many problems with the measure. It may be useful for structuring thinking about the problem, but in that case it is little different from using a decision analytic framework.

In experimental design, it is of more use, but then the idea is not original. The other area where it might be useful is in summarising the worth of an experiment for estimating a parameter, such as the speed of light. There will be cases where physical constants have to be measured. Previous measurements can then be used to form the prior (there are standard meta-analysis methods for this), and then the K-L divergence of several experiments can be calculated, to see which gives the largest divergence. This is some way from the ideas Knuteson is thinking about (he explicitly rejects estimating parameters as being of merit!). But I think more grandiose schemes will die because of naysayers like me nagging at the details.
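To make the parameter-estimation idea concrete, here is a rough sketch assuming a normal prior and a single normal measurement with known standard deviation (a textbook conjugate set-up, which is my simplification). The prior, the measured values and the names of the two experiments are toy numbers in arbitrary units, not real measurements of anything.

```python
import math


def normal_update(prior_mean, prior_sd, obs, obs_sd):
    """Conjugate update of a normal prior by one normal measurement."""
    prior_prec, obs_prec = prior_sd ** -2, obs_sd ** -2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * obs) / post_prec
    return post_mean, post_prec ** -0.5


def kl_normal(mean_p, sd_p, mean_q, sd_q):
    """KL(p || q) for two univariate normal distributions, in nats."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mean_p - mean_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)


prior = (0.0, 1.0)  # toy prior: mean 0, standard deviation 1
# Hypothetical experiments: (measured value, measurement standard deviation).
experiments = {"precise": (0.5, 0.1), "noisy": (0.5, 2.0)}

divergence = {
    name: kl_normal(*normal_update(*prior, obs, sd), *prior)
    for name, (obs, sd) in experiments.items()
}
print(divergence)
```

Both toy experiments report the same value, but the precise one moves the posterior much further from the prior, so it gets the larger divergence.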

Reference

Lindley, D.V. (1956). On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, 27, 986-1005.

## 2 comments:

Your criticisms of this paper appear to be adequately addressed in the paper itself.

Specifically:

"First, there is no cost element." This paper is one of the first I have read that explicitly factors in cost together with payoff. Note the tables in Section III, which explicitly divide by cost.

"Finding out if the next toss of a €1 coin is treated exactly the same as finding out if the Higgs boson [exists]." The first three paragraphs of Section II B (and subsequent development of that section) note this is not the case, using the very example you quote.

I do not understand what specific points you were trying to make in the rest of your posting. As far as application of the proposed method in practice, it seems pretty clear from the examples given in Section III that the method has real teeth.

Is this truly possible? "...in practice the estimation of merit is made by people, so they can assign their own probabilities. If someone else disagrees, fine."? What is the measure? How is the data obtained? This is rather bewildering...
