This post carries on from my previous excursion into Bayesian statistics.
Bayesian Science
A mathematician friend once told me that Bayesian inference is the type of inference that fits most readily with the scientific method (that being the method I am most prepared to use in the majority of situations). It is true that a Bayesian inference, if done properly, represents a mathematical version of an idealised scientific inference: we have some explicitly stated prior beliefs, based on previous evidence, and we collect data, in the form of experiments or observations, and the two are combined to form an inference. Lovely.
And it is lovely, if it can be properly applied. However, in practice it is virtually impossible to accurately translate our prior beliefs about the world into probability distributions. To inflict a sudden explosion of reality onto my unfortunate reader, let us take a look at what might have formed our prior beliefs about the knocked-out disease gene in the previous post.
We have a single gene that we believe may be involved in the immune system. We might have noticed that it is consistently present throughout all mammals, but that it has accumulated more changes than other genes in the region, suggesting that it is under selection to change a lot; this tends to hint at immune system function, since the immune system has to change in response to different types of disease. Perhaps we try to predict what protein the gene produces and look at the sequence of amino acids that makes it up, finding that it has some sections similar to bits of proteins found on the surface of white blood cells. However, neither of these is definite evidence; they may well be false alarms. We can look to see whether mice with certain disease disorders have different amounts of the gene being made into protein, but then microarrays can be damn fiddly; if we don't find any evidence of differential expression, it is entirely possible we missed it by chance, and if we do find some it might just be a non-causative correlation.
I am sure you can see that trying to turn all this information into a prior would be extremely difficult, and even if you managed it you would have to make so many ad-hoc decisions about the probabilities of various things that you would have very little confidence in the answer. Plus, complex priors make for complex calculations, so the computational cost adds up too. In practice, when we apply Bayesian inference we try to find a simple prior that is 'good enough', and that can be manipulated easily enough to do inference.
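To make concrete what a 'good enough' simple prior buys you, here is a minimal sketch (the experiment and all the numbers are invented purely for illustration, not taken from any real knockout study): with a conjugate Beta prior on a proportion, the posterior after observing binomial data is another Beta, so the whole inference is a two-line update.

```python
def beta_binomial_update(a, b, successes, failures):
    """Return posterior Beta parameters after observing binomial data.

    With a Beta(a, b) prior on a proportion, the posterior after
    observing the data is simply Beta(a + successes, b + failures).
    """
    return a + successes, b + failures

# A deliberately simple prior: Beta(1, 1) is uniform on [0, 1].
a0, b0 = 1.0, 1.0

# Hypothetical data: 7 of 10 knockout mice develop the disease.
a1, b1 = beta_binomial_update(a0, b0, successes=7, failures=3)

posterior_mean = a1 / (a1 + b1)  # (1 + 7) / (1 + 7 + 1 + 3) = 8/12
print(posterior_mean)            # ~0.667
```

This is exactly the kind of tractable-but-crude prior the paragraph above describes: nobody believes their actual prior knowledge is a Beta(1, 1), but it makes the maths trivial.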
The Root of the Problem
The argument about Bayesian statistics and science really breaks down into two parts (I am unfortunately at a loss to think up a third, which disappoints me). The first part is uncontroversial: neatly laying down our prior beliefs and assumptions, and showing how evidence modifies them, are Good Things when doing science. The second part is the contentious one: the claim that we should be using the mathematics developed by the Bayesian school of statistics to do our scientific inference.
The Bayesian project is to attempt to produce a consistent, reliable underpinning for statistics. This is a very laudable aim, but it is not the aim of science. It is to misunderstand the role of statistics in science to state that the theoretical and philosophical niceties of the Bayesian school justify the use of Bayesian methods in biology. A scientist uses an algorithm if it can be shown to work with real data; the underlying theory is important for generating the algorithms, but ultimately it is their successful application that justifies their use.
Yes, we want to infer things from data, and incorporate what is already known, but we do that by the process of science. Statistics has a very important part to play in that, but it is not the only or even the main tool we use. The classical school, full of hypothesis tests and likelihood ratio tests, is popular among scientists because of its relative objectivity; the tests tend to rest on foundations that scientists can agree on. The actual 'Bayesian' part, where the subjective judgments take place, occurs after the statistics have been done: in Discussion sections, lab meetings and tea rooms. We might not be able to form a mathematical prior from the previous evidence we've seen, but we are trained in deciding what to believe, and what future experiments to choose, based on that information.
Bayesian statistics certainly has a lot to contribute to science, in the form of new algorithms for extracting information from data and making inferences. However, Bayesian inference can answer questions if and only if we have a reliable method of generating priors that can be generally agreed upon; in other cases, we use whatever statistical test has been shown to do the job, and if no such test can be found, we propose further experiments or observations to answer the question in other ways. Subjective judgment is important, but Bayesian tests that can outperform scientific judgment in this arena are few and far between.
Exactly! An interesting take on this is http://www.jstor.org/stable/2291752 — a catalogue of attempts to form an ‘objective’ uninformative prior. It’s well worth a read, but to summarise:
Many people think a uniform prior is uninformative. They are usually wrong. There are better ideas, but most experts believe there is no such thing as ‘the least informative prior’. These ideas are useful, and can solve problems caused by choosing the wrong prior. However, they have two major drawbacks: many of the priors are constructed by taking expectations over the data, which breaks the likelihood principle; and the priors are often improper, which can cause all sorts of issues.
At least, that’s how things were in 1996. I don’t know if the field has advanced since then?
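The first point (that a uniform prior is not actually uninformative) can be illustrated numerically. This is just a quick sketch, not something taken from the paper: a prior that is flat on a parameter p is not flat on a transformation of it, so 'uniform' cannot mean 'uninformative' in any parameterisation-independent sense. Here p is uniform on (0, 1), but q = p² piles up near 0: P(q < 0.25) = P(p < 0.5) = 0.5, not the 0.25 a uniform q would give.

```python
import random

random.seed(0)
N = 100_000

# Draw p uniformly on (0, 1) and transform to q = p**2.
samples_q = [random.random() ** 2 for _ in range(N)]

# If q were uniform this would be ~0.25; it is actually ~0.5,
# because a flat prior on p is an informative prior on p**2.
frac_below = sum(q < 0.25 for q in samples_q) / N
print(frac_below)
```

So being 'uninformative' about p means being quite opinionated about p², which is the same trap as the gene-network example below: flatness in one parameterisation is bias in another.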
I don’t know about inference rules in general, but there has been a lot of talk about how to pick priors for biological models. In general people want uninformative priors, but there have been freak-outs recently over the fact that you cannot pick priors that are ‘uninformative’ with respect to all features of biological systems. For instance, attempted non-informative distributions across gene networks tend to imply decidedly informative priors on certain regulatory motifs (i.e. they make certain patterns of causation more likely than others a priori, which is pretty annoying if you are trying to figure out things about regulatory motifs in the first place).
People want to move towards using priors based on biologically plausible mechanisms, e.g. attempting to generate priors for species trees based on evolutionary birth-death processes and coalescence models. I don’t know how well that works; to be honest, I think it might just be one of those ‘the way forward is to do this’ suggestions that people always make in their conclusions but no-one ever follows up, though to be fair I’ve never looked it up.
I may well read that paper. If I have time[1], I’ll take a look at who[2] has cited it and where things have gone since.
[1] I won’t have time
[2] over 300 people according to Google Scholar. See [1]