Bayesian inference is useful. It also provides a ‘quick and dirty’ route to thinking about statistical inference, since it just uses basic probability theory. This is, to me, both its biggest strength and its biggest weakness: the ‘everything is probability theory’ view.
Why isn’t everything probability theory? In short:
- Not everything is a ‘sum to one’ game
- Both uncertainty (‘randomness’) and structure are important
Regarding the first point, I think it makes sense to have a notion of ‘observable’ for which mutually exclusive possibilities must ‘sum to one’ – you either observe a head or observe a tail, for example.
On the other hand, I think it makes sense to include a notion of quantity for which the mutually exclusive possibilities do not need to sum to one. Two distinct models can be equally consistent with given observations. Introducing a third distinct model, also equally consistent with the observations, shouldn’t change the ‘possibility’ value of the first two. It does change the probability of the first two, however.
These ‘purely possibilistic’ quantities are what I would call parameters. Probabilistic quantities are observables or ‘data’.
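The effect of adding a third model can be sketched numerically. This is only an illustration under assumptions of mine: the model names and the common likelihood value 0.2 are made up, and the ‘probability’ of each model is taken to be a posterior under a uniform prior.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical likelihood values: every model assigns the same probability
# to the observed data, so all are equally consistent with it.
likelihood_two = {"model_a": 0.2, "model_b": 0.2}
likelihood_three = {**likelihood_two, "model_c": 0.2}

# Under a uniform prior the posterior is just the normalized likelihood.
posterior_two = normalize(likelihood_two)
posterior_three = normalize(likelihood_three)

# The likelihood of model_a is untouched by the arrival of model_c ...
print(likelihood_two["model_a"], likelihood_three["model_a"])
# ... but its posterior probability drops from about 1/2 to about 1/3.
print(posterior_two["model_a"], posterior_three["model_a"])
```

The ‘sum to one’ constraint is what forces the first two models to give up probability to the newcomer; the likelihood values carry no such constraint.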
Interestingly, the key difference between likelihood and probability is that the former need not sum to one. Probability applies to data, likelihood to parameters. In certain special cases we can strengthen from likelihood to probability and regain Bayes (or perhaps confidence/fiducial) – identifiable models being one requirement. In less well-constrained problems I prefer likelihood.
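To make the ‘need not sum to one’ point concrete, here is a small sketch using a binomial model. The choice of model, the numbers (n = 10, 3 successes, theta = 0.3) and the function name are my own assumptions for illustration. The same function is summed over data with the parameter fixed, and then integrated over the parameter with the data fixed.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n = 10

# Fix the parameter and sum over all possible data: a probability
# distribution over observables, so the total is one.
prob_total = sum(binomial_pmf(k, n, 0.3) for k in range(n + 1))

# Fix the data and integrate over the parameter (midpoint rule on [0, 1]):
# a likelihood function, with no reason to have unit area.
k_obs = 3
steps = 10_000
width = 1.0 / steps
likelihood_area = sum(
    binomial_pmf(k_obs, n, (i + 0.5) * width) * width for i in range(steps)
)

print(prob_total)       # close to 1
print(likelihood_area)  # close to 1 / (n + 1), not 1
```

For the binomial the area under the likelihood happens to be 1/(n + 1) whatever the observed count, but nothing in the likelihood approach relies on that number.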
The second issue is how to represent structure in addition to uncertainty. For example, a probability distribution is assigned in a given context. How does that probability distribution change when we change context? This is in a sense ‘external’ or ‘structural’ information. You can kludge it within Bayes using ‘conditioning on background information’ but this background information typically does not require a probability distribution. It is instead usually more akin to a ‘possibilistic’ quantity under analyst control or subject to assumptions external to the probability distribution. That is, it is ‘prior information’ but it does not take the form of a probability distribution. This is more common than you would think from the usual Bayesian story.
For example, Pearl – a self-described ‘half-Bayesian’ (perhaps even less these days) – uses ‘do’ notation to distinguish some types of structural assumption from the merely ‘seen’ observables described by probability. Likelihood can also incorporate these assumptions somewhat more naturally than Bayes as it allows for non-probabilistic ‘possibilistic’ quantities.
For example, we can move from the Bayesian notation p(y|x, z) to the likelihood (and frequentist) notation p(y|x; z).
In this case z, and the dependence of each p(y|x) on z, lies outside the probability model. That is, the above represents an indexed family of probability distributions p(y|x; z), one for each value of z, with no prior over z required.
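One way to picture an indexed family is as a function that takes z and hands back a conditional density for y given x. The sketch below is hypothetical: the normal form with mean z * x and unit variance, and the names `make_model` and `family`, are assumptions chosen only to have something concrete to run.

```python
from math import exp, pi, sqrt

def make_model(z):
    """Return a conditional density for y given x, for a fixed index z.

    Hypothetical form: y is normal with mean z * x and unit variance.
    Note that z is an ordinary argument, not a random variable.
    """
    def density(y, x):
        return exp(-0.5 * (y - z * x) ** 2) / sqrt(2 * pi)
    return density

# The family is just a collection of distributions, one per value of z.
# No distribution over z appears anywhere.
family = {z: make_model(z) for z in (0.5, 1.0, 2.0)}

# Evaluate each member at the same observation (y = 2, x = 1).
values = {z: p(y=2.0, x=1.0) for z, p in family.items()}
print(values)
```

Members of the family can be compared at the observed data without ever asking how probable a given z is.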
An even more subtle ‘structural’ issue is the question of how ‘raw’ data obtains semantics or meaning. This to me is a point where both likelihood and probability/Bayes are open to criticism. Luckily, some of Fisher’s original ideas on, and motivations for, sufficiency and ancillarity can be used to improve the likelihood approach. I think. But that’s a topic for another day!
These might seem like purely philosophical concerns but to me they are quite practical. In short – and contra Lindley – I don’t think Bayesian inference works very well in the presence of non-identifiability. I’ll try to illustrate with some examples at some point.
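As a reminder of what non-identifiability looks like, here is a toy sketch, not one of the promised examples. It assumes a made-up model in which the data depend on two parameters a and b only through their sum; the data values are invented.

```python
from math import log, pi

def log_likelihood(a, b, data):
    """Log-likelihood for a hypothetical model: y ~ Normal(a + b, 1)."""
    mean = a + b
    return sum(-0.5 * (y - mean) ** 2 - 0.5 * log(2 * pi) for y in data)

data = [2.7, 3.1, 3.4, 2.9]  # made-up observations

# Three parameter pairs with the same sum a + b = 3: the data cannot
# tell them apart, so they share one log-likelihood value.
ridge = [(1.0, 2.0), (0.0, 3.0), (-2.0, 5.0)]
ridge_values = [log_likelihood(a, b, data) for a, b in ridge]

# A pair with a different sum is distinguishable from the ridge.
off_ridge_value = log_likelihood(1.0, 1.0, data)

print(ridge_values)
print(off_ridge_value)
```

The likelihood is flat along the whole line a + b = 3, so anything that ranks points on that line differently has to come from somewhere other than the data.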
It’s also worth noting that one of Neyman’s goals (see the discussions at the end of this) was “to construct a theory of mathematical statistics independent of the conception of likelihood…entirely based on the classical theory of probability”, albeit in a different manner to Bayesian inference. This led him to “the basic conception…of frequency of errors in judgment”. I think such ‘error statistical’ frequentist inference also suffers from similar issues in dealing with non-identifiable problems.