# Likelihood, plausibility and extended likelihood

Version 1.8. (Mostly) written on a phone while Easter shopping.

Likelihood and plausibility

The usual definition of likelihood takes a probability model family

$p(y;\theta)$

and defines the likelihood as

$\mathcal{L}(\theta;y) = C p(y;\theta)$

for some arbitrary constant $C$. Note that the data and parameter have switched roles: the likelihood is considered a function of the parameter for fixed data.

A minor modification is to define the ‘unconstrained’ or ‘joint’ likelihood

$\mathcal{L}(\theta,y) := \frac{p(y;\theta)}{\text{sup}_{y, \theta}(p(y;\theta))}$

which is now considered a function of both the parameter and data. Furthermore, a normalisation condition is now a consequence of our definition, rather than somewhat informally added on as in the usual approach. To make this useful, and capable of getting us back to the usual approach as a special case, we define the operation of constraining the likelihood. Importantly, by starting from the joint likelihood, this can be defined in the same manner for both parameter and data.

Firstly, constraining on the data gives the usual (normed) likelihood

$\mathcal{L}(\theta || y) := \frac{p(y;\theta)}{\text{sup}_{\theta}(p(y;\theta))}$

where we use the notation || to indicate the ‘constraining’ operation. Note also the similarity to the definition of conditional probability (which would require theta to be probabilistic)

$p(\theta|y) :=\frac{p(y,\theta)}{p(y)} = \frac{p(y,\theta)}{\int p(y,\theta)d\theta}$

The only difference being the use of the ‘sup’ operation vs the integral operation and whether theta is probabilistic or non-probabilistic (in the former case we consider a joint distribution, in the latter a family of distributions).

The advantage of this slight generalisation is that we can now consider the joint likelihood constrained on the parameter, giving

$\mathcal{L}(y || \theta) := \frac{p(y;\theta)}{\text{sup}_{y}(p(y;\theta))}$

This is in fact Barndorff-Nielsen’s plausibility function. It can be considered as a measure of self-consistency for a single parameter value (single model instance from the family) and is strongly related to p-value type tests. It does not need alternative parameter values to be considered, rather it needs alternative data to be considered. Clearly it is related in spirit to frequentist concerns of the form ‘what if the data were different?’.

As argued by Barndorff-Nielsen, both aspects give us insight into the problem under consideration – they ask (and hopefully help answer) different questions.

Finally, given the above ‘constraining’ operation, note that our unconstrained likelihood really is just what it says it is – it’s what you get when you ‘constrain on none of the quantities’:

$\mathcal{L}(\theta,y) := \mathcal{L}(\theta,y || ) := \frac{p(y;\theta)}{\text{sup}_{y, \theta}(p(y;\theta))}$

So everything works together as expected.
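To make the three normalisations concrete, here is a minimal sketch. It assumes a binomial model with ten trials and a coarse parameter grid; both choices are mine, purely for illustration, and the function names are hypothetical. The only thing that changes between the three functions is which quantities the sup is taken over.

```python
from math import comb

N_TRIALS = 10  # assumed number of Bernoulli trials (illustrative)
THETAS = [i / 100 for i in range(1, 100)]  # parameter grid, avoiding 0 and 1
YS = list(range(N_TRIALS + 1))  # all possible data values


def p(y, theta):
    """Binomial probability mass p(y; theta)."""
    return comb(N_TRIALS, y) * theta**y * (1 - theta) ** (N_TRIALS - y)


def lik_given_data(theta, y):
    """L(theta || y): normalise by the sup over theta for fixed data y."""
    return p(y, theta) / max(p(y, t) for t in THETAS)


def plaus_given_param(y, theta):
    """L(y || theta): normalise by the sup over y for a fixed parameter."""
    return p(y, theta) / max(p(z, theta) for z in YS)


def joint_lik(theta, y):
    """L(theta, y): normalise by the sup over both theta and y."""
    return p(y, theta) / max(p(z, t) for z in YS for t in THETAS)


print(lik_given_data(0.3, 3))     # theta = 0.3 is the grid maximiser for y = 3
print(plaus_given_param(3, 0.3))  # y = 3 is the mode when theta = 0.3
print(plaus_given_param(9, 0.3))  # y = 9 is implausible under theta = 0.3
```

Note that the plausibility of y = 9 under theta = 0.3 is computed without reference to any other parameter value; only alternative data are considered.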

Extended likelihood and prediction

A related notion, with roots going back quite a few years, clearly explained recently in various places by Pawitan, Lee and Nelder (see e.g. here, here or here as well as references therein for the full history) is extended likelihood. The key twist is allowing random data to be treated as unknown parameters. This helps, for example, for defining a notion of likelihood prediction.

But we already have all the ingredients needed, using the above! Our plausibility function allows the data to be treated as a parameter in a likelihood. It is in essence already a predictive likelihood.

The above is a slightly different way of looking at these issues than that given by e.g. Pawitan, Lee and Nelder – it is instead based on Barndorff-Nielsen’s ideas mentioned above (which in turn derive from Barnard’s). See also here. On the other hand, the basic idea of extended likelihood is to start from the model family considered as a joint likelihood, so the approach considered here is essentially equivalent (but see future posts on nuisance parameters).

To see how we might incorporate past data, consider a model family

$p(y,x;\theta)$

where we suppose x is observed (known) and y is to be predicted (is unknown). So we want to constrain on x and consider y and theta. Consider then

$\mathcal{L}(\theta,y || x) := \frac{p(y,x;\theta)}{\text{sup}_{\theta ,y}(p(y,x;\theta))}$

For a fixed choice of theta we would take

$\mathcal{L}(y || x,\theta) := \frac{p(y,x;\theta)}{\text{sup}_{y}(p(y,x;\theta))}$

Interestingly, in this latter case we learn nothing from past iid samples from the same model instance. That is, for x and y iid from the same fixed model instance (same fixed parameter value) we have

$\mathcal{L}(y || x,\theta) = \frac{p(y;\theta)p(x;\theta)}{\text{sup}_{y}(p(y;\theta))p(x;\theta)}$

which reduces to (cancelling the x terms)

$\mathcal{L}(y || x,\theta) = \mathcal{L}(y || \theta) := \frac{p(y;\theta)}{\text{sup}_{y}(p(y;\theta))}$

This follows because we are taking theta as fixed and known. We don’t need to use x to estimate theta since we assume it. If instead we use the first expression we get

$\mathcal{L}(\theta,y || x) = \frac{p(y;\theta)p(x;\theta)}{\text{sup}_{\theta}(\text{sup}_{y}(p(y;\theta))p(x;\theta))}$

This allows us to account for our uncertainty in theta and our gain in knowledge about theta from observing x. Now, if the family has constant mode – i.e. if

$\text{sup}_{y}(p(y;\theta))$

is the same for all theta, then we have

$\mathcal{L}(\theta,y || x) = \frac{p(y;\theta)p(x;\theta)}{\text{sup}_{y}(p(y;\theta))\text{sup}_{\theta}(p(x;\theta))} = \mathcal{L}(y||\theta)\mathcal{L}(\theta||x)$
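As a quick numerical check of this factorisation, here is a minimal sketch. It assumes a normal location family with known standard deviation, which has constant mode since the peak height of the density does not depend on theta. The family, the value of `SIGMA` and the function names are my own illustrative assumptions.

```python
from math import exp, pi, sqrt

SIGMA = 1.0  # assumed known standard deviation (illustrative)
PEAK = 1.0 / (SIGMA * sqrt(2 * pi))  # sup_y p(y; theta), the same for every theta


def p(y, theta):
    """Normal density p(y; theta) with mean theta and known SIGMA."""
    return PEAK * exp(-0.5 * ((y - theta) / SIGMA) ** 2)


def plaus(y, theta):
    """L(y || theta) = p(y; theta) / sup_y p(y; theta)."""
    return p(y, theta) / PEAK


def lik(theta, x):
    """L(theta || x) = p(x; theta) / sup_theta p(x; theta); the sup is at theta = x."""
    return p(x, theta) / p(x, x)


def pred_lik(theta, y, x):
    """L(theta, y || x) for iid x and y.

    For this family the sup over (theta, y) is attained at theta = x, y = x.
    """
    return p(y, theta) * p(x, theta) / (p(x, x) * p(x, x))


x, theta, y = 1.2, 0.7, 2.0  # arbitrary illustrative values
lhs = pred_lik(theta, y, x)
rhs = plaus(y, theta) * lik(theta, x)
print(abs(lhs - rhs) < 1e-12)  # True: the two sides agree up to rounding
```

For a family without constant mode the sup over y would depend on theta and could not be pulled outside the sup over theta, so the product form would not hold in general.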

Note again the similarity to probabilistic updating – the difference is simply that instead of multiplying our model for y by a posterior over theta (based on x) we instead multiply it by a likelihood over theta (based on x). Relatedly, note that our prediction function is the product of two terms. The analogous probabilistic prediction would be based on something like

$p(\theta,y|x) = \frac{p(y,\theta)p(x,\theta)}{\int p(y,\theta)\,dy\int p(x,\theta)\,d\theta} = p(y|\theta)p(\theta|x)$

where instead of the constant mode assumption we need to use something like

$p(y|x,\theta) =p(y|\cdot, \theta)$

i.e. the above is independent of x. Following this aside a bit further, note that if we further marginalised we would get the posterior predictive distribution

$p(y|x) =\int \frac{p(y,\theta)p(x,\theta)}{\int p(y,\theta)\,dy\int p(x,\theta)\,d\theta}d\theta = \int p(y|\theta)p(\theta|x) d\theta$

but that this is a further reduced form of our (probabilistic) prediction function (Barndorff-Nielsen gives a similar definition for the likelihood prediction function, replacing the integration by maximisation over theta). This indicates to me that there is something to Murray Aitkin’s comments on predictive distributions and his approach summarised here.

Stepping back, note that, in general, all of our main expressions depend on all of the original quantities – no reduction automatically takes place via our constraining operation. We are not removing quantities e.g. via marginalisation or even via maximisation (cf. the comments on posterior predictive distributions etc). Furthermore, no guidance is offered on when or how to choose any particular candidate or set of candidates (like just choosing the max likelihood or set of candidates within some distance from the max likelihood). I’ll look at these issues and some concrete examples next time. These will illustrate the importance of the concepts of independence and orthogonality – whether exact or approximate – for model and/or inferential reduction. I’ll also try to touch on some EDA issues at some point.

# Some teaching material

…can be found here: https://github.com/omaclaren/open-learning-material

As mentioned there

Like all such material much of it is either shamelessly (or shamefully!) plagiarised, borrowed, edited etc from other sources, especially courses I’ve taken before or notes I’ve inherited from past lecturers. I will try to add some credits as I go. This is essentially impossible for some material, however, as it was effectively picked up from the aether.

If you have any objections to me making this material (or some parts of it) available then let me know. My intention is simply to make it available for anyone who wants to learn about these topics.

Also, this material should be assumed to contain at least some gross errors or major misconceptions!

At the moment it includes material on partial differential equations, probability, Markov processes and qualitative analysis of differential equations.

I’m working on improving the material (and adding more) so expect it to change. In particular I’m currently updating the material on qualitative analysis of differential equations. I’m hoping to better balance the (interesting!) underlying theory with more emphasis on, and examples of, simple and direct applications of the theory.

# Converting sample statistics to parameters: passive observation vs active control

As mentioned in my previous post I think it makes sense to distinguish parameters from observables (‘data’). What happens if you want to convert between them, however? For example,

what if you want to treat an observable as a fixed or controllable parameter?

The first, intuitive answer is that we should just condition on it using standard probability theory. But we’ve already claimed that observables and parameters should be treated differently – it seems then that there may be a difference between conditioning on an observable, in the sense of probability theory, and treating it as a parameter.

Here then is an unusual, alternative proposal (it is not without precedent, however). As mentioned above, the motivation and distinctions are perhaps subtle – let’s just write some things down instead of worrying too much about these.

Firstly, a non-standard definition to capture what we want to do:

Definition: an ancillary statistic is a statistic (function of the observable data) that is treated as a fixed or controlled parameter.

This is a bit different to the usual definition. No matter.

For one thing

Ancillaries are loved or hated, accepted or rejected, but typically ignored

Much of the literature is concerned with difficulties that can arise using this third Fisher concept, third after sufficiency and likelihood…However, little in the literature seems focused on the continued evolution and development of this Fisher concept, that is, on what modifications or evolution can continue the exploration initiated in Fisher’s original papers

So why not mix things up a bit?

Now, given an arbitrary model (family) with parameter $\theta$,

$p(y; \theta)$

we can calculate the distribution of various statistics $t(y)$, $a(y)$ and associated joint or conditional distributions in the usual way(s). The conditional distribution of $t(y)$ given $a(y)$ is defined here as

$p(t(y)|a(y); \theta) := \frac{p(t(y),a(y); \theta)}{p(a(y); \theta)}$.

I’ll ignore all measure-theoretic issues.

Now, the above can already be considered a function of $a(y)$ and $\theta$. The question, though, is: is this the right sort of function for treating a statistic as a parameter? Suppose it isn’t. What else can we define? We want to define

$p(t(y); a(y), \theta)$

as something similar to, but not quite the same as

$p(t(y)| a(y); \theta)$.

I propose

$p(t(y); a(y), \theta) := \begin{cases} p(t(y)| a(y); \theta), \text{ if } p(a(y); \theta) = 1\\ 0, \text{ otherwise} \end{cases}$

This ensures that our ancillary statistic $a(y)$ is, conceptually at least, actively held fixed for arbitrary $\theta$, rather than merely coincidentally taking its value. That is, it is an observable treated as a controlled parameter rather than simply a passive, uncontrolled observable.
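Here is a minimal sketch of how this proposed definition behaves on a finite example. The two-member family, its probabilities and the function names are hypothetical, chosen only to show the difference between the ordinary conditional and the restricted version; it implements the definition literally and nothing more.

```python
from math import isclose

# Hypothetical family: for each theta, a joint pmf over (t, a) pairs.
FAMILY = {
    "theta_1": {(0, 0): 0.7, (1, 0): 0.3},  # a = 0 with probability 1
    "theta_2": {(0, 0): 0.2, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.4},  # a varies
}


def p_a(a, theta):
    """Marginal probability p(a; theta) of the statistic a."""
    return sum(pr for (_, a_val), pr in FAMILY[theta].items() if a_val == a)


def p_t_given_a(t, a, theta):
    """Ordinary conditional p(t | a; theta); requires p(a; theta) > 0."""
    return FAMILY[theta].get((t, a), 0.0) / p_a(a, theta)


def p_t_restricted(t, a, theta):
    """Restricted version p(t; a, theta): nonzero only if a is held fixed under theta."""
    if isclose(p_a(a, theta), 1.0):
        return p_t_given_a(t, a, theta)
    return 0.0


for theta in FAMILY:
    print(theta, p_t_given_a(0, 0, theta), p_t_restricted(0, 0, theta))
```

Under `theta_2` the conditional is perfectly well defined, but the restricted function is zero because a = 0 is only a coincidental outcome there, not a fixed one.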

If $t(y) | a(y)$ and $t(y) ; a(y)$ are taken to be distinct then what should the latter be called?

Well, I believe this is essentially the same distinction between conditioning and intervening made by Pearl and others. On the other hand, it might be nice to have a term that is a bit more neutral.

Furthermore, this is a distinction that has long been advocated for by a number of likelihood-influenced (if not likelihoodist) and/or frequentist statisticians. It appears Bayesians have been the quickest to blur the distinction – I came across a great example of this recently in an old IJ Good paper. Here he is saying he is unable to see the merit in this distinction:

The present post of course claims the distinction is a valuable one.

Now, given the relation to the mathematical notion of a restriction, perhaps this could simply be called restricted to a, t restricted along a or, to go Latin, restrictus/restrictum/restricta a?

Rather than conditional inference, should this use of ancillaries instead be called restrictional inference? Or just constrained inference?

You might also wonder how this connects back to the usual likelihood theory. For one, as mentioned above, this distinction between types of given has long been made in the likelihood literature. In addition it might be possible to connect the above even more closely to the usual likelihood definitions by considering a generalisation of the above along the lines of (?)

$\mathcal{L}_{a,C} := \begin{cases} p(t(y)| a(y); \theta), \text{ if } p(a(y); \theta) = C, C \in (0,1] \\ 0, \text{ otherwise} \end{cases}$

which leads to something like the idea of comparing models along curves of constant ancillary probability. If you are willing to entertain even more structural assumptions, for example a pivotal quantity

$a(y,\theta)$

which is a function of both $y$ and $\theta$ you can potentially go even further. Such quantities provide an even more explicit bridge between the world of parameters and the world of statistics. Is this further extension, mixing parameters and data even more intimately, necessary or a good idea? I’m not really sure. It seems to me, rather, that ancillaries can get you most of the way without needing to impose or require additional structure on the parameter space.

Bayesian inference is useful. It also provides a ‘quick and dirty’ route to thinking about statistical inference since it just uses basic probability theory. This is to me its biggest strength and weakness – the ‘everything is probability theory’ view.

Why isn’t everything probability theory? In short:

• Not everything is a ‘sum to one’ game
• Both uncertainty (‘randomness’) and structure are important

Regarding the first point, I think it makes sense to have a notion of ‘observable’ for which mutually exclusive possibilities must ‘sum to one’ – you either observe a head or observe a tail, for example.

On the other hand, I think it makes sense to include a notion of quantity for which the mutually exclusive possibilities do not need to sum to one. Two distinct models can be equally consistent with given observations. Introducing a third distinct model, also equally consistent with observations, shouldn’t change the ‘possibility’ value of the first two. It does change the probability of the first two, however.
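A tiny sketch of the third-model point, using made-up fit scores (the numbers and model labels are hypothetical). Scaling by the maximum leaves the first two models untouched when a third is added; scaling by the sum does not.

```python
def sup_normalise(scores):
    """Possibility-style scaling: divide by the maximum, so the best value is 1."""
    top = max(scores.values())
    return {k: v / top for k, v in scores.items()}


def sum_normalise(scores):
    """Probability-style scaling: divide by the total, so the values sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


# Hypothetical fit scores p(y0; model) for two models that fit equally well.
two = {"M1": 0.2, "M2": 0.2}
# Add a third, distinct model that fits the same data equally well.
three = {**two, "M3": 0.2}

print(sup_normalise(two)["M1"], sup_normalise(three)["M1"])  # unchanged at 1.0
print(sum_normalise(two)["M1"], sum_normalise(three)["M1"])  # drops from 0.5 to about 1/3
```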

These ‘purely possibilistic’ quantities are what I would call parameters. Probabilistic quantities are observables or ‘data’.

Interestingly, the key difference between likelihood and probability is that the former need not sum to one. Probability applies to data, likelihood to parameters. In certain special cases we can strengthen from likelihood to probability and regain Bayes (or perhaps confidence/fiducial) – identifiable models being one requirement. In less well-constrained problems I prefer likelihood.

The second issue is how to represent structure in addition to uncertainty. For example, a probability distribution is assigned in a given context. How does that probability distribution change when we change context? This is in a sense ‘external’ or ‘structural’ information. You can kludge it within Bayes using ‘conditioning on background information’ but this background information typically does not require a probability distribution. It is instead usually more akin to a ‘possibilistic’ quantity under analyst control or subject to assumptions external to the probability distribution. That is, it is ‘prior information’ but it does not take the form of a probability distribution. This is more common than you would think from the usual Bayesian story.

For example, Pearl – a self-described ‘half-Bayesian’ (perhaps even less these days) – uses ‘do’ notation to distinguish some types of structural assumption from the merely ‘seen’ observables described by probability. Likelihood can also incorporate these assumptions somewhat more naturally than Bayes as it allows for non-probabilistic ‘possibilistic’ quantities.

Compare Pearl’s

$p(y|x,\text{do}(z))$

to the likelihood (and frequentist) notation

$p(y|x;z)$

In this case z, and the dependence of each p(y|x) on z, lies outside the probability model. That is, the above represents an indexed family of probability distributions

$z \mapsto p(y|x)$,

no prior over z required.
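In code, an indexed family is just a function from the index to a conditional probability table. The sketch below is a hypothetical example (the binary variables and the formula for `agree` are my own assumptions) meant only to show that z enters as an argument, not as a random variable with its own distribution.

```python
def family(z):
    """Return the conditional pmf table p(y | x) selected by the index z.

    Hypothetical rule: z in [0, 1] controls how strongly y tracks x.
    """
    agree = 0.5 + 0.4 * z  # probability that y equals x
    return {
        x: {y: (agree if y == x else 1 - agree) for y in (0, 1)}
        for x in (0, 1)
    }


# Each choice of z gives a complete, properly normalised p(y | x).
for z in (0.0, 0.5, 1.0):
    table = family(z)
    print(z, table[0], table[1])
```

Nothing in `family` assigns a probability to z itself; changing z changes which distribution we are talking about.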

An even more subtle ‘structural’ issue is the question of how ‘raw’ data obtains semantics or meaning. This to me is a point where both likelihood and probability/Bayes are open to criticism. Luckily, some of Fisher’s original ideas on, and motivations for, sufficiency and ancillarity can be used to improve the likelihood approach. I think. But that’s a topic for another day!

Postscript
These might seem like purely philosophical concerns but to me they are quite practical. In short – and contra Lindley – I don’t think Bayesian inference works very well in the presence of non-identifiability. I’ll try to illustrate with some examples at some point.

It’s also worth noting that one of Neyman’s goals (see the discussions at the end of this) was “to construct a theory of mathematical statistics independent of the conception of likelihood…entirely based on the classical theory of probability.” Albeit in a different manner to Bayesian inference. This led him to “the basic conception…of frequency of errors in judgment”. I think such ‘error statistical’ frequentist inference also suffers some similar issues in dealing with non-identifiable problems.

# Substitutional vs objectual quantification

Overview/motivation
The interpretation of logical quantifiers such as ‘there exists’ and ‘for all’ and the associated ontological implications of these interpretations is (apparently – or at least it was) an important topic in philosophy. I encountered these interpretations a few years ago when reading Haack’s ‘Philosophy of Logics’, but didn’t pay much attention. Quine is a central figure here – e.g. his famous (within philosophy, anyway) saying ‘to be is to be the value of a variable’ concerns this issue.

I realised recently that I’ve been thinking about somewhat similar issues, albeit in a more ‘applied’ context, e.g. when talking about the interpretation of ‘for all’ in formulating ‘schematic’ model closure assumptions (see here). So, here are a few notes on the topic. The obvious disclaimer applies – I am not a philosopher. I’m just hoping to get a few basic conceptual ideas straightened out in my head, so that I may better formalise some arguments useful in science and statistics. I am not aiming to ‘solve’ the general philosophical problems! Corrections or comments welcome.

The problem
Here is a brief sketch of the issue as it arises for the existential quantifier. The question is: how should we interpret statements of quantified logic of the form

$\exists x P(x)$

We have (or there exists??), in fact, two options.

Objectual: There exists an object x such that it has property P.

Substitutional: There exists an instance of a statement having the general form P(x), obtained by substituting some name, term or expression etc for x, that is true.

In the former, the emphasis is placed on objects and their possession of properties, in the latter, the emphasis is placed on statement forms and the truth of particular statement instances.

In particular, in the latter, substitutional, case truth is a property of statements ‘as a whole’ and need not relate to ‘actual objects’ occurring in the sentence.

The classic example is that, on the objectual reading,

(S): Pegasus is a flying horse

can be taken to mean

(S via Obj.): “There exists an object (e.g. Pegasus) which is both a horse and can fly”

We would normally take this as false, since no such object ‘really exists’. On the other hand, on the substitutional reading we may take this to mean

(S via Subs.): There is a true statement of the form ‘x is a flying horse’ (e.g. Pegasus is a flying horse)

The justification for taking this as true is that, given our knowledge of mythology (certainly a real subject itself), we may take this to express a true statement without further commitment to (or even ‘attention to’) the existence of the ‘objects’ or ‘properties’ involved.

Thus the substitutional interpretation refers to the truth or falsity of resultant sentences/statements ‘as a whole’ (and the forms of such sentences/statements), while the objectual interpretation refers to the existence of objects with properties, and hence in a sense gives a more ‘granular’ interpretation.

Both seem to me to involve subtle issues of context, however – e.g. we can presumably only interpret the above statement instance as ‘true’ in the substitutional interpretation given the context of mythology.

Marcus and Kripke offered defenses of the substitutional interpretation while Quine advocated the objectual interpretation (hence ‘to be is to be the value of a variable’).

There is obviously much more to this topic – see e.g. Haack’s book or the SEP. For now, I note that I find myself reasonably sympathetic to the substitutional interpretation (or perhaps both interpretations, depending on the circumstances). This appears to be roughly consistent with what I was attempting to express here.

There also seems to be something here that depends on whether, given the ‘function’ P(x), we focus on the ‘domain of naming’ or on the ‘codomain of statements’. These issues hence also seem to connect with the issue of how to interpret (proper) names e.g. as ‘mere tags’ (Marcus), ‘rigid designators’ (Kripke), ‘definite descriptions’ (Russell) or as ‘predicates’ (Quine). The substitutional interpretation is generally allied with the view of proper names as ‘mere tags’ or as ‘rigid designators’, and I have become quite fond of (what I understand by) this idea. It would be too much to go into this in any detail at the moment, however.

# The tacking ‘paradox’ revisited – notes on the dimension and ordering of ‘propositional space’

Another short (and simple) note on the so-called tacking paradox from the philosophy of science literature. Continuing on from here and related to a recent blog comments exchange here. See those links for the proper background.

[Disclaimer: written quickly and using wordpress latex haphazardly with little regard for aesthetics…]

Consider a scientific theory with two ‘free’ or ‘unknown’ parameters, a and b say. This theory is a function $f(a,b)$ which outputs predictions $y$. I will assume this is a deterministic function for simplicity.

Suppose further that each of the parameters is discrete-valued and can take values in $\{0,1\}$. Assuming that there is no other known constraint (i.e. they are ‘variation independent’ parameters) then the set of possible values is the set of all pairs of the form

$(a,b) \in \{(0,0), (0,1), (1,0), (1,1)\}$

That is, $(a,b) \in \{0,1\}\times \{0,1\}$. Just to be simple-minded let’s arrange these possibilities in a matrix giving

$\begin{pmatrix} (0,0), & (0,1)\\(1,0), & (1,1) \end{pmatrix}$

This leads to a set of predictions for each possibility, again arranged in a matrix

$\begin{pmatrix} f(0,0), & f(0,1)\\ f(1,0), & f(1,1) \end{pmatrix}$

Now our goal is to determine which of these cases are consistent with, supported by and/or confirmed by some given data (measured output) $y_0$.

Suppose we define another function of these two parameters to represent this and call it $C(a,b;y_0)$ for ‘consistency of’ or, if you are more ambitious, ‘confirmation of’ any particular pair of values $(a,b)$ with respect to the observed data $y_0$.

For simplicity we will suppose that $f(a,b)$ outputs a definite $y$ value which can be definitively compared to the given $y_0$. We will then require $C(a,b;y_0) = 1$ iff $f(a,b) = y_0$, and $C(a,b;y_0) = 0$ otherwise. That is, it outputs 1 if the prediction for the given $a$ and $b$ values matches the data, and 0 if it does not. Since $y_0$ will be fixed here I will drop $y_0$, i.e. I will use $C(a,b)$ without reference to $y_0$.

Now suppose that we find the following results for our particular case

$\begin{pmatrix} C(0,0) = 1, & C(0,1) = 1\\ C(1,0) = 0, & C(1,1) = 0 \end{pmatrix}$

How could we interpret this? We could say e.g. $(0,0)$ and $(0,1)$ are ‘confirmed/consistent’ (i.e. $C(0,0) = C(0,1) = 1$), or we could shorten this to say $(0,\cdot)$ is confirmed for any replacement of the second argument. Clearly this corresponds to a case where the first argument is ‘doing all the work’ in determining whether or not the theory matches observations.

Now the ‘tacking paradox’ argument is essentially:

$C(0,0) = 1$

so

$(0,0)$

is confirmed, i.e. ‘a=0 & b=0’ is confirmed. But ‘a=0 & b=0’ logically implies ‘b=0’ so we should want to say ‘b=0’ is confirmed. But we saw

$C(0,1) =1$

and so

$(0,1)$

is also confirmed, which under the same reasoning gives that ‘b=1’ is confirmed!

There are a number of problems with this argument, that I would argue are particularly obscured by the slip into simplistic propositional logic reasoning.

In particular, we started with a clearly defined function of two variables $C(a,b)$. Now, we found that in our particular case we could reduce some statements involving $C(a,b)$ to an ‘essentially’ one argument expression of the form ‘$C(0,\cdot) = 1$’ or ‘$(0,\cdot)$ is confirmed’, i.e. we have confirmation for a=0 and b ‘arbitrary’. This is of course just ‘quantifying’ over the second argument – we of course can’t leave any free (cf. bound) variables. But then we are led to ask

What does it mean to say ‘b is confirmed’ in terms of our original givens?

Is this supposed to refer to $C(b)$? But this is undefined – $C$ is of course a function of two variables. Also, b is a free (unbound) variable in this expression. Our previous expression had one fixed and one quantified variable, which is different to having a function of one variable.

OK – what about trying something similar to the previous case then? That is, what about saying $C(\cdot,0) = 1$? But this is short for the claim that both $C(0,0) = 1$ and $C(1,0) = 1$ hold (or that their conjunction is confirmed, if you must). This is clearly not true. Similarly for $C(\cdot,1)$.

So we can clearly see that when our theory and hence our ‘confirmation’ function is a function of two variables we can only ‘localise’ when we spot a pattern in the overall configuration, such as our observation that $C(0,\cdot) = 1$ holds.
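The whole setup fits in a few lines of code, which makes the ‘localisation’ point easy to check. This is a minimal sketch: the particular theory `f` (which ignores its second argument) and the observed value `Y0` are hypothetical choices of mine that happen to reproduce the table of results above.

```python
from itertools import product

VALUES = (0, 1)
Y0 = 0  # hypothetical observed output


def f(a, b):
    """Hypothetical theory in which only the first parameter affects the output."""
    return a


def C(a, b):
    """Consistency function: 1 iff the prediction f(a, b) matches Y0."""
    return 1 if f(a, b) == Y0 else 0


# The full table of C values over the two-dimensional parameter space.
table = {(a, b): C(a, b) for a, b in product(VALUES, VALUES)}


def holds_for_all_b(a):
    """'C(a, .) = 1': quantify over the second argument."""
    return all(C(a, b) == 1 for b in VALUES)


def holds_for_all_a(b):
    """'C(., b) = 1': quantify over the first argument."""
    return all(C(a, b) == 1 for a in VALUES)


print(table)
print(holds_for_all_b(0))                      # True: the pattern C(0, .) = 1
print(holds_for_all_a(0), holds_for_all_a(1))  # False False: no such pattern in b
```

Trying to evaluate `C(0)` with a single argument raises a `TypeError`, which is about as literal an illustration as one could ask for of asking whether ‘b=0’ alone is confirmed.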

So, while the values of the C function (i.e. the outputs of 0 and 1) are ordered (or can be assumed to be), this does not guarantee a total order when it is ‘pulled back’ to the parameter space. That is, $C^{-1}\{1\}$ does not guarantee an ordering on the parameter space that doesn’t already admit an ordering! It also doesn’t allow us to magically reduce a function of two variables to a function of one without explicit further assumptions. Without these we are left with ‘free’ (unbound) variables.

This is essentially a type error – I take a scientific theory (here) to be a function of the form

$f: A \times B \rightarrow Y$,

i.e. a function from a two-dimensional ‘parameter’ (or ‘proposition’) space to a (here) one dimensional ‘data’ (or ‘prediction’, ‘output’ etc) space. The error (or ‘paradox’) occurs when taking a scientific theory to be simply a pair

$A \times B$,

rather than a function defined on this pair.

That is, the paradox arises from a failure to explicitly specify how the parameters of the theory are to be evaluated against data, i.e. a failure to give a ‘measurement model’.

(Note: Bayesian statistics does of course allow us to reduce a function of two variables to one via marginalisation, and given assumptions on correlations, but this process again illustrates that there is no paradox; see previous posts).

One objection is to say – “well this clearly shows a ‘logic’ of confirmation is impossible”. Staying agnostic with respect to this response, I would instead argue that what it shows is that:

The ‘logic’ of scientific theories cannot be a logic only of ‘one-dimensional’ simple propositions. A scientific theory is described at the very, very minimum by a ‘vector’ of such propositions (i.e. by a vector of parameters), which in turn lead to ‘testable’ predictions (outputs from the theory). That is, scientific theories are specified by multivariable functions. To reduce such functions of collections of propositions, e.g. a function f(a,b) of a pair (a,b) of propositions, to functions of fewer propositions, e.g. ‘f(a)’, requires the use, again at the very, very minimum, of quantifiers over the ‘removed’ variables, e.g. ‘f(a,b) = f(a,-) for all choices of b’.

Normal probability theory (e.g. the use of Bayesian statistics) is still a potential candidate in the sense that it extends to the multivariable case and allows function reduction via marginalisation. Similarly, pure likelihood theory involves concepts like profile likelihood to reduce dimension (localise inferences). While standard topics of discussion in the statistical literature (e.g. ‘nuisance parameter elimination’), this all appears to be somewhat overlooked in the philosophical discussions I’ve seen.
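For concreteness, here is a minimal sketch of the two reductions on a two-by-two grid. The likelihood values are hypothetical and the uniform prior over b is an assumption made only so that the marginal version has something to average with; the point is that each reduction is an explicit operation over the removed variable.

```python
VALUES = (0, 1)

# Hypothetical likelihood values L(a, b) on the grid {0, 1} x {0, 1}.
L = {(0, 0): 0.8, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.2}


def profile_a(a):
    """Profile likelihood for a: maximise over the eliminated parameter b."""
    return max(L[(a, b)] for b in VALUES)


def marginal_a(a, prior_b=None):
    """Marginal likelihood for a: average over b under a prior on b.

    Defaults to a uniform prior over b (an illustrative assumption).
    """
    if prior_b is None:
        prior_b = {b: 1 / len(VALUES) for b in VALUES}
    return sum(L[(a, b)] * prior_b[b] for b in VALUES)


for a in VALUES:
    print(a, profile_a(a), marginal_a(a))
```

Either way, going from a function of (a, b) to a function of a alone requires saying what is done with b: maximise it out, or average it out against a stated prior.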

So this particular argument is not, to me, a good one against Bayes/Likelihood approaches.

(I am, however, generally sympathetic to the idea that $C$ functions like that above are better considered as consistency functions rather than as confirmation functions – in this case the, still fundamentally ill-posed, paradox ‘argument’ is blocked right from the start since it is ‘reasonable’ for both ‘b=0’ and ‘b=1’ to be consistent with observations. On the other hand it is still not clear how you are supposed to get from a function of two variables to a function of one. Logicians may notice that there are also interesting similarities with intuitionistic/constructive logic, see here or here, and/or modal logics, see here – I might get around to discussing this in more detail someday…)

To conclude: the slip into the language of simple propositional logic, after starting from a mathematically well-posed problem, allows one to ‘sneak in’ a ‘reduction’ of the parameter space, but leaves us trying to evaluate a mathematically undefined function like $C(b=0)$.

The tacking ‘paradox’ is thus a ‘non-problem’ caused by unclear language/notation.

Addendum – recently, while searching to see if people have made similar points before, I came across this nice post ‘Probability theory does not extend logic’.

The basic point is that while probability theory uncontroversially ‘extends’ what I have called simple ‘one-dimensional’ propositional logic here, it does not uncontroversially extend predicate logic (i.e. the basic logical language required for mathematics, which uses quantifiers) nor logic involving relationships between quantities requiring considerations ‘along different dimensions’.

While probability theory can typically be made to ‘play nice’ with predicate logic and other systems of interest it is important to note that it is usually the e.g. predicate logic or functional relationships – basically, the rest of mathematical language – doing the work, not the fact that we replace atomic T/F with real number judgements. Furthermore the formal justifications of probability theory as an extension of logic used in the propositional case do not translate in any straightforward way to these more complicated logical or mathematical systems.

Interestingly for the Cox-Jaynesians, while (R.T.) Cox appears to have been aware of this, and he even considered extensions involving ‘vectors of propositions’ – leading to systems which no longer satisfy all the Boolean logic rules (see e.g. the second chapter of his book) – Jaynes appears to have missed the point (see e.g. Section 1.8.2 of his book). As hinted at above, some of the ambiguities encountered are potentially traceable – or at least translatable – into differences between classical and constructive logic. Jaynes also appears to have misunderstood the key issues here, but again that’s a topic for another day.

Now all of this is not to say that Bayesian statistics as practiced is either right or wrong but that the focus on simple propositional logic is the source of numerous confusions on both sides.

Real science and real applications of probability theory involve much more than ‘one-dimensional’ propositional logic. Addressing these more complex cases involves numerous unsolved problems.

# Hierarchical Bayes

This is ‘Not a Research Blog’, but nevertheless some thoughts on, and application of, hierarchical Bayes that are related to what I’ve been posting about here can be found in my recent preprint:

A hierarchical Bayesian framework for understanding the spatiotemporal dynamics of the intestinal epithelium

A few comments. I actually wrote essentially all of this about a year ago. The quirks of interdisciplinary research mean, however, that I have only just recently been able to post even a preprint of this work online (data/other manuscript availability issues etc). Some of my views may have changed slightly since then – but probably not overly much (and most of the more different ideas would relate to alternative frameworks rather than modifications of the present approach). Of course the usual delays of publication mean this happens fairly often – yet another reason for using preprints. This was also my first bioRxiv submission (bioRxiv is essentially arXiv targeted specifically at biology and biological applications) – it was extremely easy to use and went through screening in less than a day.

This manuscript was also a first attempt to pull together a lot of ideas I’d been playing around with relating to hierarchical models, statistical inference, prediction, evidence, causality, discrete vs continuum mechanistic models, model checking etc, and apply them to a real problem with real data. As such it’s reasonably long, but I think readable enough. In some ways it probably reads more like a textbook, but some might find that useful so I’ve tried to frame that as a positive.

# Linear or nonlinear with respect to what?

Overview
I’m teaching a partial differential equations (PDEs) course in the mathematics department at the moment. A typical ‘gimme’ question for assignments and tests is to get the students to classify a given equation as linear or nonlinear (most of the theory we develop in the course is for linear equations so we need to know what this means). Since we aim to introduce the students to a bit of operator theory we often switch back and forward between talking about linear/nonlinear PDEs and linear/nonlinear operators.

One of the students noticed that this introduced some ambiguity into our classification problem and asked a great question. I think it illustrates a useful general point about terminology like linear vs nonlinear and how these terms can be misleading or ambiguous. So here’s the question and my attempt at clarifying the ambiguity.

The question
It’s my understanding that a PDE is linear if we can write it in the form $Lu = f(x,t)$, where L is a linear differential operator.

If we are given a PDE that looks like $Au = 0$ for some differential operator A and asked to show that the PDE is nonlinear, I can (probably) show that A is not a linear differential operator. However this doesn’t necessarily imply that you cannot rearrange the equation in such a way to make it linear.

For example the operator A defined by

$Au = (u^2+1)u_t+(u^2+1)u_{xx}$

is not a linear differential operator. However the equation Au = 0 is the same as $u_t + u_{xx} = 0$, and the differential operator B defined by $Bu = u_t + u_{xx}$ is linear.

So (I believe I’m correct in saying this), the original PDE is linear, because it can be rewritten in this form Lu = f(x,t) for some linear differential operator L and function f(x,t).

My question is what sort of working are we expected to show, if we aim to prove the PDE Au=0 is not linear? For the purposes of the assignment does it suffice to prove that A is not linear?

My response
Here was my response and attempt to clarify (corrections/comments welcome!).

Great question!

As you’ve noticed there is some ambiguity when we move back and forward between talking about equations and operators. This is to be expected since a function (e.g. an operator) is a different type of mathematical object to an equation.

For example the function $f:x \mapsto x^2$ is a different ‘object’ to the equation $x^2 = 0$.

You’ve correctly noticed that if we can write a differential equation as Lu = f where L is some linear operator then the differential equation is also called linear. Unfortunately, again as you’ve noticed, this definition makes it hard to decide when an equation is nonlinear as you may be able to write a linear equation in terms of a nonlinear operator with the right choice of f. This is because the negation of ‘there exists’ a linear operator is ‘there doesn’t exist a linear operator’.

So proving that an equation is linear is easy using the operator definition – we just find any linear operator that works.

On the other hand, proving that an equation is nonlinear is harder using this definition – it would require showing that every operator A for which the equation can be written as Au = f is nonlinear.

This seems too hard to do directly, so let’s reformulate it in an equivalent but easier-to-use way.

We want to keep our definitions of linear and nonlinear as close as possible for the two cases of operators and equations.

Improved definitions

An operator acting on u is linear iff L(au+bv) = aL(u) + bL(v) for any u and v in the operator’s domain and constants a, b.

and

Given an equation written in the form Au = f for some operator A and forcing function f, the equation is linear iff A(au+bv) = aA(u) + bA(v) for any two solutions u, v to the equation Au = f.

I think this definition should cover your example (try it! Note that it is slightly subtle how this makes a difference! But, basically, we get to use the f = 0 in the equation case now).

Also note that:

The function definition now explicitly talks about linearity with respect to how it operates on objects in its domain while the equation definition talks explicitly about behaviour with respect to solutions to that equation. This seems natural given the different ‘nature’ of ‘functions’ and ‘equations’.

Does that make sense?
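For readers who want to ‘try it’ mechanically, here is a minimal sketch of the check for the student’s example $Au = (u^2+1)u_t+(u^2+1)u_{xx}$ with f = 0. It is only an illustration under stated assumptions: functions are restricted to polynomials in x and t with exact rational coefficients, and the helper names (`add`, `mul`, `d_t`, `d_xx`, `defect`) and the chosen test functions are hypothetical, not part of the original discussion. The two solutions used are $x^2 - 2t$ and $x^3 - 6xt$, both of which satisfy $u_t + u_{xx} = 0$.

```python
from collections import defaultdict
from fractions import Fraction

# A polynomial in (x, t) is stored as {(i, j): c}, meaning c * x**i * t**j.

def clean(p):
    """Drop zero coefficients so that the zero polynomial is the empty dict."""
    return {k: c for k, c in p.items() if c != 0}

def add(p, q):
    out = defaultdict(Fraction)
    for poly in (p, q):
        for k, c in poly.items():
            out[k] += c
    return clean(out)

def scale(c, p):
    return clean({k: c * v for k, v in p.items()})

def mul(p, q):
    out = defaultdict(Fraction)
    for (i, j), c in p.items():
        for (k, l), d in q.items():
            out[(i + k, j + l)] += c * d
    return clean(out)

def d_t(p):
    """First derivative with respect to t."""
    return clean({(i, j - 1): j * c for (i, j), c in p.items() if j > 0})

def d_xx(p):
    """Second derivative with respect to x."""
    return clean({(i - 2, j): i * (i - 1) * c for (i, j), c in p.items() if i > 1})

ONE = {(0, 0): Fraction(1)}

def A(u):
    """The example operator: A u = (u^2 + 1) * (u_t + u_xx)."""
    return mul(add(mul(u, u), ONE), add(d_t(u), d_xx(u)))

def defect(op, a, b, u, v):
    """op(a u + b v) - a op(u) - b op(v); the empty dict means 'linear here'."""
    lhs = op(add(scale(a, u), scale(b, v)))
    rhs = add(scale(a, op(u)), scale(b, op(v)))
    return add(lhs, scale(-1, rhs))

# Two solutions of A u = 0 (equivalently u_t + u_xx = 0).
u_sol = {(2, 0): Fraction(1), (0, 1): Fraction(-2)}   # x^2 - 2t
v_sol = {(3, 0): Fraction(1), (1, 1): Fraction(-6)}   # x^3 - 6xt

# Two arbitrary functions in the operator's domain that are NOT solutions.
u_any = {(0, 1): Fraction(1)}                         # t
v_any = {(2, 0): Fraction(1)}                         # x^2

on_solutions = defect(A, Fraction(3), Fraction(-5), u_sol, v_sol)
on_domain = defect(A, Fraction(2), Fraction(1), u_any, v_any)

print("defect on solutions:", on_solutions)  # empty dict: the equation passes
print("defect on domain:   ", on_domain)     # non-empty: the operator fails
```

The defect vanishes on solutions because the factor $u_t + u_{xx}$ is zero there, whatever the nonlinear prefactor does. On arbitrary functions the prefactor survives and the defect is nonzero. A check on a couple of test functions is of course not a proof, but it does show how the ‘with respect to what?’ changes the answer.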

Morally speaking
I think the broader lesson is that terms like linear/nonlinear are relative to the specific mathematical representation chosen and how we interact with that representation. A ‘system’ is not really intrinsically linear or nonlinear, rather an ‘action’ (or function or operator or process) is linear or nonlinear with respect to a specific set of ‘objects’ or ‘measurements’ or ‘perturbations’ or whatever. This needs to be made explicit for an unambiguous classification to be carried out.

Generalisation
Perhaps generalising too far, something like this came up in some recent ‘philosophical’ discussions I’ve been having over at Mayo’s blog (and was also at the heart of another scientific disagreement I once had with an experimentalist about interpreting aquaporin knockout experiments…).

For example, it has been pointed out that while ‘chaos’ is typically associated with (usually finite-dimensional) nonlinear systems, there are examples of infinite-dimensional linear systems that exhibit all the hallmarks of chaos – see e.g. ‘Linear vs nonlinear and infinite vs finite: An interpretation of chaos’ by Protopopescu for just one example. So, changing the underlying ‘objects’ used in the representation changes the classification as ‘linear’ or ‘nonlinear’ or, as Protopopescu states:

Linear and nonlinear are somewhat interchangeable features, depending on scale and representation…chaotic behavior occurs… when we have to deal with infinite amounts of information at a finite level of operability. In this sense, even the most deterministic system will behave stochastically due to unavoidable and unknown truncations of information.

This theme appears again and again at various levels of abstraction – e.g. we saw it in a high-school math problem where a singularity (a type of ‘lack of regularity’) arose (which we interpreted as) due to an incompatibility between a regular higher-dimensional system and a constraint restricting that system to a lower-dimensional space. (Compare the abstract operator itself with the operator + equating it to zero to get an equation.) We were faced with the choice of a regular but underdetermined system that required additional information for a unique solution or a ‘unique’ but singular (effectively overdetermined) system. Similarly other ‘irregular’ behaviour like ‘irreversibility’ can often be thought of as arising due to a combination of ‘reversible’ (symmetric/regular etc) microscopic laws + asymmetric boundary conditions/incomplete measurement constraints. Similar connections between ‘low/high dimensional’ systems and ‘stable/unstable’ systems are discussed by Kuehn in ‘The curse of instability’.

To me this presents a helpful heuristic decomposition of models of the world into two-level decompositions like ‘irregular nature’ -> ‘regular, high-dimensional nature’ + ‘limited accessibility to nature’ (h/t Plato) or ‘internal dynamics’ + ‘boundary conditions’, ‘reversible laws’ + ‘irreversible reductions/coarse-graining’ etc. Note also that, on this view, ‘infinite’ and ‘finite’ are effectively ‘relative’, ‘structural’ concepts – if our ‘access’ to the ‘real world’ is always and intrinsically limited it leads us to perceive the world as effectively infinite (in some sense) regardless of whether the world is ‘actually’ infinite. You still can’t really avoid ‘structural infinities’ – e.g. continuous transformations – though.

It seems clear that this also inevitably introduces ‘measurement problems’ that aren’t that dissimilar to those considered to be intrinsic to quantum mechanics into even ‘classical’ systems, and leads to ideas like conceiving of ‘stochastic’ models as ‘chaotic deterministic’ systems and vice-versa.

# Recent reading: a miscellany of slightly obscure things

Sometimes I forget which things I’m currently reading (i.e. dipping in and out of). So, here are a few notes, mainly to myself and mainly about books and more obscure sources than the usual current research papers.

A couple of things on category theory: Category Theory for the Sciences by Spivak and Sets for Mathematics by Lawvere and Rosebrugh. (Also Mathematical Physics by Geroch, but that is more of a broad coverage of essential mathematics using category theory than a book introducing/studying category theory itself.) Really enjoying both. Would like to code up some of the content of Spivak to illustrate the main ideas.

A few things on mathematical biology/physiology etc (mainly for work/background I should know but have either forgotten or not learned). Mathematical Physiology by Keener and Sneyd (the latter being my old PhD supervisor). Free Energy Transduction and Biochemical Cycle Kinetics by Hill (as well as the older, longer version). An underrated book, I need to summarise the best bits at some point. Basic Principles of Membrane Transport by Schultz. Another great classic, helped me a lot during my PhD. Both a bit old but the main thing that seems to have changed is that we have actually identified a lot of the proteins behind the mechanisms originally predicted based on coarse information and largely theoretical modelling!

Stochastic Modelling for Systems Biology by Wilkinson, Chemical Biophysics by Qian and Beard and Stochastic Processes in Physics and Chemistry by van Kampen. Good complements to the above books, generally more focused on stochastic aspects, but still similar concepts. See also the papers Entropy Production in Mesoscopic Stochastic Thermodynamics: Nonequilibrium Kinetic Cycles Driven by Chemical Potentials, Temperatures, and Mechanical Forces by Qian et al. as well as Contact Geometry of Mesoscopic Thermodynamics and Dynamics by Grmela. Also, the book Statistical Thermodynamics of Nonequilibrium Processes by Keizer. Should summarise the various key concepts and how to think about ‘mesoscopic’ processes in biology.

A few references on mechanics: some point particle stuff (want to use in some applications), also differential geometry, symmetry etc. Introduction to Physical Modelling by Wellstead (mainly interested in the ‘mobility analogy’). The Variational Principles of Mechanics by Lanczos (a classic!). Analytical Dynamics by Udwadia and Kalaba. Nonholonomic Mechanics and Control by Bloch et al. First Steps in Differential Geometry: Riemannian, Contact, Symplectic by McInerney. Discrete Differential Geometry: An Applied Introduction by Grinspun et al. Foundations of Mechanics by Abraham and Marsden. Introduction to Mechanics and Symmetry by Ratiu and Marsden. Mathematical Foundations of Elasticity by Marsden and Hughes. Also the paper: ‘On the Nature of Constraints for Continua Undergoing Dissipative Processes’ by Rajagopal and Srinivasa.

Dynamical systems (research and teaching – solution and analysis methods): Numerical Continuation Methods for Dynamical Systems by Krauskopf, Osinga and Galan-Vioque. Recipes for Continuation by Dankowicz and Schilder. Stability, Instability and Chaos by Glendinning. Nonlinear Systems by Drazin. Elements of Applied Bifurcation Theory by Kuznetsov. Applications of Lie Groups to Differential Equations by Olver. Scaling by Barenblatt. Renormalization Methods: A Guide For Beginners by McComb. Multiple Time Scale Dynamics by Kuehn.

Measure, Integral and Probability by Capinski and Kopp, Integral, Measure and Derivative by Shilov and Gurevich and Hilbert Space Methods in Probability and Statistical Inference by Small and McLeish (see also Functional Analysis by Muscat). Probability via Expectation by Whittle. Functional Analysis for Probability and Stochastic Processes: An Introduction by Bobrowski. Trying to decide on my preferred abstract framework for thinking about these topics. Each presents a slightly different perspective, each has its strengths and weaknesses. Will have to write a ‘compare and contrast’ to help me decide. I’ve pretty well decided on the functional analysis point of view. Update: see also Differential Geometry and Statistics by Amari and Differential Geometry and Statistics by Murray and Rice. So basically: functional analysis + differential geometry seems to be the way to go. Same as for mechanics.

Related to the above, a few books (and a paper or two) on inverse problems, parameter estimation, Bayesian inference and numerical approximation. Data Assimilation: A Mathematical Introduction by Law, Stuart and Zygalakis. Inverse Problems: A Bayesian Perspective by Stuart. Mapping Of Probabilities by Tarantola (as well as his classic book Inverse Problem Theory). Statistical and Computational Inverse Problems by Kaipio and Somersalo. PTLoS by Jaynes (Ch. 18; I keep reinventing something similar to this but don’t quite understand it. I think it might correspond to reinventing the functional analysis approach?). Data Analysis and Approximate Models by Davies. Moore, Kearfott and Cloud Introduction to Interval Analysis. Measuring Statistical Evidence Using Relative Belief by Evans. Theoretical Numerical Analysis: A Functional Analysis Framework by Atkinson and Han. Moore and Cloud Computational Functional Analysis. Discrete and Continuous Boundary Problems by Atkinson. Fletcher Computational Galerkin Methods. Functional Data Analysis by Ramsay and Silverman.

Teaching PDEs: Partial Differential Equations for Scientists and Engineers by Farlow. Applied Mathematics by Logan. Partial Differential Equations by Evans. Advanced Engineering Mathematics by Greenberg. Green’s functions and boundary value problems by Stakgold. Principles and Techniques of Applied Mathematics by Friedman. Partial Differential Equations of Applied Mathematics by Zauderer. A First Course in Continuum Mechanics by Gonzalez and Stuart. Physical Foundations of Continuum Mechanics by Murdoch. Nonlinear Partial Differential Equations by Debnath. Mathematical Methods for Engineers and Scientists 3: Fourier Analysis, Partial Differential Equations and Variational Methods by Tang. Methods of Mathematical Physics II by Courant and Hilbert. Ames Nonlinear PDEs in Engineering. Ern and Guermond Theory and Practice of Finite Elements. Still need to find a book I really like that balances mathematical, numerical and physical concepts at the right level. The short article Generalized Solutions by Tao is nice.

# Conditional probability as the basic notion of probability theory

Overview
A number of conceptual debates in both applications and philosophy of statistics and probability implicitly or explicitly depend on which concept of conditional probability is used. In particular there are two main conceptions floating about – ‘ratio’ based, which takes unconditional probability as basic and conditional probability as derived (and corresponds to Kolmogorov’s approach), and the reverse case which takes conditional probability as basic. I will call this the ‘conditionalist’ view. I point to a few arguments in favor of this latter view, how it relates to ‘model closure’ and a hierarchical/structural view of theories, and why it is popular among certain Bayesians as a resolution of the ‘catchall’ problem.

Disclaimer
Obviously I am only one of many to make this point. I still find it useful to record my agreement with the ‘conditionalists’. Very rough for now. Many more examples to come. Version 0.2

[Edit: I have a new appreciation for Kolmogorov’s approach after teaching it recently. If we consider a Kolmogorov ‘probability model’ to be a full probability space/probability triple, rather than just the measure, then we effectively get the same thing as advocated here. We have to imagine that we have – in principle – a sufficiently large (and generally ‘inaccessible’) background probability space to work ‘within’. Each concrete probability space, i.e. particular model, is then a restriction of this ‘global universe’. This makes the fact explicit that we are always using at least some restriction conditions, and forces us to give the form these restriction conditions take (e.g. orthogonality conditions?).

A rough idea occurs to me – we trade-off the fact that our ‘background universe’ grows exponentially as we relax closure assumptions (i.e. make fewer details irrelevant and hence more details relevant and hence more unique possibilities) with the expectation that as we include more details our models will become more deterministic. Hence our distributions ‘shrink’ relative to the new domains even if they are ‘bigger’ than their restriction to the old domains. So each ‘point’ in a higher dimension contains a whole universe of lower dimension. Think power sets. An interesting starting point for thinking more about the mathematical ‘universe(s)’ we work in (and how we get a hierarchy of sub-universes) is https://ncatlab.org/nlab/show/universe. See also https://en.wikipedia.org/wiki/Universe_mathematics]. See also the more recent blog post on the meaning of the terms linear/nonlinear and infinite/finite.

What conditional probability could not be
The above heading is the title of a paper by Alan Hájek (2003, Synthese) see here.

…the ratio analysis of conditional probability…has become so entrenched that it is often referred to as the definition of conditional probability. I argue that it is not even an adequate analysis of that concept…I marshal many examples from scientific and philosophical practice against the ratio analysis. I conclude more positively: we should reverse the traditional direction of analysis. Conditional probability should be taken as the primitive notion, and unconditional probability should be analyzed in terms of it.

The article is a good read and is in agreement with the positions of a number of Bayesians, as well as my own sort-of/occasional-Bayesian-but-there-are-probably-deeper-issues-to-worry-about view.

In and of itself I find the above article fairly convincing; the point has been reinforced to me however by reading a number of similar arguments for and against, as well as my own thinking about the nature of mathematical modelling.

I will collect some of these below, and then present my own main motivations for adopting a conditionalist view, which are somewhat independent of the Bayesian/Frequentist divide.

A collection of examples
[To fill in]
Bayesian classics
– de Finetti
– Jaynes
– ?

Internet arguments
– Gelman, Mayo, Wasserman and other characters
– Pearl, causality and conditioning

My motivation – hierarchies, closure, contradiction, expansion and invariant structure
Catchall vs conditional closure
My first post used conditional probability statements to formulate the basic idea of ‘model closure’ and argue against needing a ‘catchall’ (see the post for details). You’ll notice, however, that this argument only makes sense if you accept (as I did somewhat implicitly) that conditional probability is a basic notion and can be defined even in the absence of a joint distribution.

So, for the record, I take conditional probability as basic and definable even in the absence of unconditional distributions. Thus a ‘catchall’ unconditional distribution is not required for closure.

[Another disclaimer re: the following – these views of mine have been motivated by a number of authors, from Jaynes to Gelman, to my own lecturers in mathematical modeling and physics. So while I present it as my own perspective, it is inevitably strongly derivative of a number of others’ views. Perhaps the most original part is relating this view to the ideas of structural invariance, but this concept has itself been advocated by many.]

Conditional contradiction, hierarchies, regularisation, model expansion and invariant structure
Models always use temporary, approximate closures.

We generally need to begin work by ‘fixing’ (conditioning on) ‘external’ variables and working ‘within’ a system. As illustrated in the previous post, however, we often (inevitably?) reach contradictions or inconsistencies within our models as we approach the ‘boundary’ of our model closures. This leads to the idea (for one example) of ‘singular limits’.

Again as illustrated in the previous post, the way (or one way) to resolve this inconsistency is to ‘expand’ our model by embedding it in a larger model which relaxes a constraint implicit in the smaller model. This naturally leads to greater underdetermination due to the additional degrees of freedom. This larger model is also often structurally isomorphic to the original model (at least in some respects), however, and thus gives us a ‘hierarchical’ and ‘structural’ – if not absolutely fixed – foundation to reason from. [Shades of Gödel.]

So my perspective is thus ‘conditionalist’, ‘hierarchicalist’ and ‘structuralist’.

A (slightly) more concrete example
Consider a model of the form

$p(a|c) = \int p(a|b)\,p(b|c)\,db$

where we have used the closure condition $p(a|b,c) = p(a|b)$ to make $p(a|b)$ ‘internal’ (invariant) relative to the ‘external’ variable (last conditioning variable) $c$.
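One way to see where this form comes from, assuming only the usual marginalisation and product rules (which are also available when conditional probability is taken as primitive), is the chain

$p(a|c) = \int p(a,b|c)\,db = \int p(a|b,c)\,p(b|c)\,db = \int p(a|b)\,p(b|c)\,db$

The first equality marginalises over $b$, the second applies the product rule, and only the third uses the closure condition. If closure fails, the chain stops at the second expression and $c$ remains inside the ‘internal’ term.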

We reason as follows – we want a model (directly) independent of our controlled variables c, with only boundary values b depending in a known manner on c.

If we reach an internal contradiction – identified for example by $p(a|b,c) \neq p(a|b)$ – we can (hopefully) expand our model to resolve this by moving previously controlled or ignored variables into the set of explanatory variables (i.e. expanding the state space) and then rewriting things so as to recover a model of the same schematic/structural ‘causal’ form via the redefinitions

$p(a|c'') = \int p(a|b,c')\,p(b,c'|c'')\,db\,dc'$

or, equivalently,

$p(a|c'') = \int p(a|b')\,p(b'|c'')\,db'$

where we have split $c$ into $(c', c'')$ and defined $b'$ as $(b, c')$.
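To make the expansion step a little more tangible, here is a rough sketch with discrete (binary) variables, so that the integrals become sums. Everything specific in it is an assumption made up for illustration: the probability tables, the names `closure_gap_original`, `closure_gap_expanded` and `p_a_given_c2`, and in particular the table for p(c’|c”), which the expanded model needs because c’ has moved from ‘external’ to ‘internal’.

```python
from itertools import product

VALUES = (0, 1)  # every variable is binary in this toy example

# Made-up mechanism: a depends on b and on c1 (= c'), but not on c2 (= c'').
P_A1_GIVEN_B_C1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9}  # P(a=1 | b, c1)
P_B1_GIVEN_C1_C2 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.4}  # P(b=1 | c1, c2)
P_C1_1_GIVEN_C2 = {0: 0.3, 1: 0.8}                                       # P(c1=1 | c2)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def p_a_given_b_c(a, b, c1, c2):
    """p(a | b, c) with c = (c1, c2); c2 is unused because a ignores it."""
    return bern(P_A1_GIVEN_B_C1[(b, c1)], a)

def closure_gap_original():
    """How much p(a | b, c) varies over c = (c1, c2) at fixed (a, b)."""
    gaps = []
    for a, b in product(VALUES, repeat=2):
        vals = [p_a_given_b_c(a, b, c1, c2) for c1, c2 in product(VALUES, repeat=2)]
        gaps.append(max(vals) - min(vals))
    return max(gaps)

def closure_gap_expanded():
    """How much p(a | b', c'') varies over c'' = c2 at fixed (a, b' = (b, c1))."""
    gaps = []
    for a, b, c1 in product(VALUES, repeat=3):
        vals = [p_a_given_b_c(a, b, c1, c2) for c2 in VALUES]
        gaps.append(max(vals) - min(vals))
    return max(gaps)

def p_a_given_c2(a, c2):
    """p(a | c'') = sum over b' = (b, c1) of p(a | b') * p(b' | c'')."""
    total = 0.0
    for b, c1 in product(VALUES, repeat=2):
        p_bprime = bern(P_B1_GIVEN_C1_C2[(c1, c2)], b) * bern(P_C1_1_GIVEN_C2[c2], c1)
        total += bern(P_A1_GIVEN_B_C1[(b, c1)], a) * p_bprime
    return total

print("closure gap, original split:", closure_gap_original())  # > 0: closure fails
print("closure gap, expanded split:", closure_gap_expanded())  # 0.0: closure holds
for c2 in VALUES:
    print("p(a=1 | c''=%d) =" % c2, p_a_given_c2(1, c2))
```

With the original split the ‘internal’ term still varies with the conditioning variable, so the closure condition is violated. After moving c’ into b’ the gap is zero and the model has the same two-factor form as before, at the price of having to say something about how c’ is distributed given c”.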

We now have an expanded theory having a different partition of variable classes. This leads to greater indeterminacy in the (internal/explanatory) variables, but gives a corresponding theory which possesses the same (invariant) structure as before. By prioritising the theory form I am taking a structuralist view of the essence of mathematical and scientific theories. Variable indeterminacy is the price we pay for removing inconsistency and maintaining structure at a higher level, but it is very often worth it (and exciting) – it corresponds in many cases to ‘new’ or ‘novel’ phenomena appearing. [Bifurcations].

Again, see the previous post for a simple example of expanding a model to remove a singularity and hence introducing indeterminacy.

Observations
Note that we make crucial use of a ‘conditionalist’ and hierarchical view of model structure. Yet another reason to take conditional probability (and conditional thinking) as basic, instead of unconditional probability.

Note also that what was previously a non-probabilistic variable can always become probabilistic as we ‘shift’ where we are in the hierarchy. The position of a variable in the structure is more important than the nature of the variable itself. Another reason to not dismiss Bayesian modelling for allowing us to treat variables as probabilistic (internal to the theory) if and when we choose to – or are forced to.

A possible point of agreement with the frequentist view, however, is that we always maintain some ‘conditioned on’ but non-probabilistic variables (controlled or ‘external’ variables) as temporary scaffolding.