Monday, March 22, 2010

Prediction vs Explanation

A commenter on an earlier post raised the question of whether predicting something in advance is different in some important way from explaining it after the fact. I think it is, I think the reason is interesting, hence this post.

Suppose someone does ten experiments and comes up with a theory that is consistent with the results of those ten and predicts the outcome of ten more experiments, none of which has been done yet. The experiments are done and the predictions are correct.

Someone else looks at the results of the experiments and creates a theory consistent with all twenty.

We now do one more experiment, for which the two theories give different predictions. Are they equally likely to be correct, and if not why?

Let me start with the obvious argument to show that the two theories are equally good. There are lots of possible theories to deal with the subject of the experiments. All we know about the two candidate theories is that each is consistent with the first twenty experiments. Hence they are equally likely to be correct.

Imagine that each possible theory is written on a piece of paper, and the pieces of paper are sorted into barrels according to the results they predict for the various experiments. The first theorist restricted himself to the barrels containing theories consistent with the first ten experiments, drew one theory from one of those barrels, and it happened to be from the barrel containing theories also consistent with the next ten experiments. The second theorist went straight to that barrel and drew a theory from it.

What is wrong with this model is the implicit assumption that experimenters are drawing theories at random. Suppose we assume instead, as I think much more plausible, that some people are better at coming up with correct theories, at least on this subject, than others. Only a small fraction of the barrels consistent with the first ten experiments contained theories also consistent with the second ten, so it would be very unlikely for the first theorist to have chosen one of those barrels by chance. It's much more likely if he is someone good at coming up with correct theories. Hence his coming up with a theory in that barrel is evidence that he is such a person; it increases the probability that he is. We have no similar evidence for the second theorist, since he looked at the results of all twenty experiments before choosing a barrel.

Since we have more reason to believe that the first theorist is good at creating correct theories, or at least more nearly correct theories, than that the second one is, we have more reason to believe his theory and so more reason to trust his prediction for the next experiment.

Statisticians may recognize the argument as a version of spurious contagion. Picking the right barrel doesn't make the theorist any better, but the fact that he did pick the right barrel increases the probability that he was (even before picking it) a good theorist.
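
To see the update in numbers, here is a minimal Bayesian sketch of the argument; every figure in it is an assumption chosen purely for illustration, not anything established above.

# A minimal Bayes calculation for the argument above. All numbers are
# illustrative assumptions.
prior_good = 0.20     # assumed prior probability that a theorist is "good"
p_hit_good = 0.60     # assumed chance a good theorist's theory survives the next ten experiments
p_hit_random = 0.02   # assumed chance a randomly drawn theory survives them

# Bayes' rule: P(good | his predictions held up)
posterior_good = (prior_good * p_hit_good) / (
    prior_good * p_hit_good + (1 - prior_good) * p_hit_random
)
print(f"P(good | successful predictions) = {posterior_good:.2f}")  # about 0.88

# The second theorist, who saw all twenty results before choosing his
# barrel, gives us no such evidence; our estimate of him stays at the
# prior (0.20 here).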

12 comments:

Anonymous said...

A slight modification of this argument gives us a separate reason to prefer predictions to explanations. Predictions, as you say, tell us more about whether a theorist is better at creating nearly-correct theories. But for a similar reason, predictions also tell us more about whether the world itself is (correctly) understandable in the particular respect being studied by the theorist.

A bad theorist in a sufficiently easy-to-understand world will be as successful in his theories as a good theorist in a hard-to-understand world. And, just as we don't know whether a particular theorist is good or bad, similarly, we don't know whether the world we are in is easy to understand or hard to understand (in the particular respect under study). Much as prediction gives us better information about the quality of the theorist, so does it give us better information about the comprehensibility of the world itself.

The better the theorist, the more confident we should be in those theories of his that are compatible with the evidence so far. But similarly, the more comprehensible the world, the more confident we should be in theories compatible with the evidence so far.

The comprehensibility of the world isn't something we already know. In every respect about which we are ignorant of the world, one of the areas of our ignorance is probably whether the world is comprehensible in that respect. (There are doubtless exceptions.)

Jon Leonard said...

More concisely, a theory that makes successful predictions is less likely to be a result of overfitting.

You can add exceptions to a theory to make it fit the test data better, but that often reduces its use for prediction; testing against data that wasn't used in setting up the theory avoids this problem.
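
A toy numerical version of the same point, with made-up data (a noisy linear relationship; numpy assumed):

import numpy as np

# Toy data: the true relation is linear, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.1, size=20)

x_old, y_old = x[:10], y[:10]   # the ten experiments already done
x_new, y_new = x[10:], y[10:]   # the ten not yet done

for degree in (1, 7):
    coeffs = np.polyfit(x_old, y_old, degree)            # build the "theory"
    fit_err = np.mean((np.polyval(coeffs, x_old) - y_old) ** 2)
    pred_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: fit error {fit_err:.4f}, prediction error {pred_err:.4f}")

# The degree-7 "theory" (a line plus lots of exceptions) fits the first
# ten results at least as well as the straight line does, but its
# predictions for the next ten are typically far worse: overfitting in
# miniature.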

Matt said...
This comment has been removed by the author.
Matt said...

You've discussed this before. The discussion of your previous post (then on overcomingbias.com, since moved to lesswrong.com) is wonderfully mathy / Bayesian.

Ilíon said...

"What is wrong with this model is the implicit assumption that experimenters are drawing theories at random. Suppose we assume instead, as I think much more plausible, that some people are better at coming up with correct theories, at least on this subject, than others."

The correct word to use is not 'correct,' but rather 'consistent.'

Alex Perrone said...

If a theory correctly predicted an outcome, then that theory is to some extent confirmed. It could have been wrong, but it wasn't. If a theory merely explains past outcomes, there is no confirmation of the theory. It was never given the chance to be confirmed or disconfirmed (or, in Karl Popper's terms, falsified). There is already this difference between the two theories; one doesn't have to resort to assessing how good the theorists are.

Ilíon said...

Ah, but even a confirmed theory may still be incorrect, which is to say, it may yet be wrong.

This problem is due to the very nature of 'modern science' (or, as our ancestors called it, 'natural philosophy'). Science can *never* get us truth; at best, it can be used to identify error. What's left, after we remove the scientifically identified error, can never be said, on scientific grounds, to be more than "possibly true."

Anonymous said...

I'm in the middle of analyzing about fifteen years' worth of grade data on computer science students, to see if particular modes of teaching (of which we've tried several in those fifteen years) are more effective than others. (Grades aren't the best measure, but they're easy to quantify and should give us some useful information.) I have a lot of independent variables, and it's tempting to just put them all into SPSS and go prospecting for anything that reaches statistical significance. If I did that, I would almost certainly find some "significant" correlations, just by chance, because there are so many possible correlations to choose from.

My current plan is to divide the data set in half randomly (e.g. odd and even student numbers), use one half for "prospecting", and the other half to check predictions made on the basis of prospecting.
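
In code, the split looks roughly like this (the file name and column names are hypothetical, and pandas is assumed):

import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("grades.csv")

prospect = df[df["student_id"] % 2 == 1]   # odd student numbers: prospecting half
confirm = df[df["student_id"] % 2 == 0]    # even student numbers: confirmation half

# Explore freely in `prospect`; anything found there is then tested once,
# as a pre-stated hypothesis, against `confirm`.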

Humans, even with the aid of statistics, tend to find patterns whether they exist or not.

Ilíon said...

Of course, Hudebnik, the problem with the approach you're taking is that you have different individuals in the various groups. So, unless the numbers of individuals in the groups are large enough that individual aptitudes tend to cancel, the patterns you derive will be worse than useless; for the patterns will look like knowledge, but will actually be anti-knowledge.

Alex Perrone said...

"My current plan is to divide the data set in half randomly (e.g. odd and even student numbers), use one half for 'prospecting', and the other half to check predictions made on the basis of prospecting."

I wish I knew more about statistics, but it seems to me you should use all of the data because it is a larger sample. One particular problem with your approach is that you are only using one of them for prospecting, so there may be a pattern in the prediction group that you never pick up on because you do not test them for new patterns, only patterns that hold for the prospecting group. You have to flip it around and let each group be the prospecting and prediction half, but then I think you should just use the whole data all the time.

I don't know enough statistics to know whether, if you find a significant pattern in each of two halves of a data set, that pattern will carry over to, or perhaps be even stronger in, the whole group. I also don't know whether it is possible to find effects in the same direction in both halves while the effect doesn't exist at the whole-group level, but I would be interested to hear answers to these questions. Anyone know?

Anonymous said...

... there may be a pattern in the prediction group that you never pick up on because you do not test them for new patterns, only patterns that hold for the prospecting group. You have to flip it around and let each group be the prospecting and prediction half

I'll have to think about that; I'm not sure it's legitimate.

... but then I think you should just use the whole data all the time.

This is exactly what I must not do: it leaves me with no way to test the predictive power of whatever I find in the prospecting stage. It leaves open the very real possibility that I found significant correlations just because I looked for so many.

When somebody says a statistical result is "significant at the 10% level," it means that in truly random data, a pattern at least that strong would appear by chance only 10% of the time. Which means that if I go prospecting among fifty possible correlations, I would expect about five of them to show up as "significant at the 10% level" even if the data were truly random, with no real patterns at all. If I then report those five correlations as statistically proven, I've read meaning into something that actually has none. I must have a way to check them against data that weren't involved in the prospecting expedition.
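
A quick simulation makes the point concrete (numpy and scipy assumed; every variable below is pure noise by construction, so any "significant" correlation is a false positive):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_students, n_vars = 200, 50
outcome = rng.normal(size=n_students)                  # random "grades"
predictors = rng.normal(size=(n_students, n_vars))     # fifty random candidate variables

false_hits = 0
for j in range(n_vars):
    _, p = stats.pearsonr(predictors[:, j], outcome)   # correlation test for each candidate
    if p < 0.10:
        false_hits += 1
print(f"{false_hits} of {n_vars} correlations look significant at the 10% level")
# On average about five do, even though none of the correlations is real.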

David Friedman said...

Hudebnik correctly describes the problem with what is sometimes described as a specification search, but unfortunately his approach doesn't solve it. If he first uses half the data to find specifications that fit that data at the 10% level and then eliminates all of those that don't fit the other half at the 10% level, he is following a procedure that has a 1% chance of producing a false positive for each specification he started with. He could have done that more easily and more efficiently by fitting the whole sample and looking for specifications that worked at the 1% level.
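
A simulation along these lines bears out the arithmetic, under the assumption of truly random data and independent halves (numpy and scipy assumed; since there is no real relationship, every "hit" is a false positive):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, n = 20000, 200
two_stage = whole_sample = 0

for _ in range(n_trials):
    x, y = rng.normal(size=n), rng.normal(size=n)       # no real relationship
    _, p1 = stats.pearsonr(x[: n // 2], y[: n // 2])    # prospecting half, 10% level
    _, p2 = stats.pearsonr(x[n // 2 :], y[n // 2 :])    # confirmation half, 10% level
    two_stage += (p1 < 0.10) and (p2 < 0.10)
    _, p = stats.pearsonr(x, y)                         # whole sample, 1% level
    whole_sample += p < 0.01

print(f"two-stage false-positive rate: {two_stage / n_trials:.3f}")
print(f"whole-sample 1% false-positive rate: {whole_sample / n_trials:.3f}")
# Both come out close to 0.01, which is the point made above.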