# ‘For all’ is not ‘catch all’: closure, model schema and how a Bayesian can be a Falsificationist

Disclaimer
This is the first of (what should be) a few posts which aim to connect some basic puzzles in the philosophy and methodology of science to the practice of mathematical and computational modelling. They are not intended to be particularly deep philosophically or to be (directly) practical scientifically. Nor are they fully complete expositions. Still, I find thinking about these puzzles in this context to be an interesting exercise which might provide a conceptual guide for better understanding (and perhaps improving?) the practice of mathematical and computational modelling. These are written by a mathematical modeller grappling with philosophical questions, rather than by a philosopher, so bear that in mind! Comments, criticisms and feedback of course welcome! [Current version: 8.0]

Introduction
This particular post is a quick draft based on some exchanges with the philosopher Deborah Mayo here on statistical inference frameworks. The post is only rough for now; I’ll try to tidy it up a little later, including making it more self-contained and with less waffling. For now I’ll assume you’ve read that post. You probably don’t need to bother with my train-of-thought (and at times frustrated!) comments there as I’ve tried to make them clearer here. The notation is still a bit sloppy in what follows.

The basic problem is to do with closure of mathematical and/or theoretical frameworks. Though here the debate is over closure of statistical inference frameworks, the same issues arise everywhere in mathematical models. For example, in statistical mechanics it’s possible to begin from Liouville’s equation for the full phase-space distribution over all particles. To derive anything more immediately tractable from this – such as the equations of kinetic theory or the equations of continuum mechanics – one needs to reduce the information contained in the full set of equations. This can be done by ‘coarse graining’ – throwing information away – and/or by identifying a ‘lossless reduction’ using implicit constraints or symmetries. Note that one may also simply postulate and test a reduced model without deriving it from a more ‘basic’ model. Regardless of how the derivation proceeds, the ultimate result is the identification of a reduced set of variables giving a self-contained set of equations. Here is a nice little article by the materials scientist Hans Christian Öttinger with the great title ‘On the stupendous beauty of closure’. The following passage reiterates the above ideas:

In its widest sense, closure is associated with the search for self-contained levels of description on which time-evolution equations can be formulated in a closed, or autonomous, form. Proper closure requires the identification of the relevant structural variables participating in the dominant processes in a system of interest, and closure hence is synonymous with focusing on the essence of a problem and consequently with deep understanding.

Overview of the ‘problem(s)’ and my ‘solutions’
The questions raised by Mayo are ‘how do Bayesians deal with the problem of normalising probabilities to one when there are always background, unspecified alternatives which should come with some amount of probability attached?’ and, relatedly, ‘can a Bayesian be a Falsificationist?’ My answers are, respectively: by closure via ‘for all’ conditional probability statements at the level of model schema/model structure; and yes, again via ‘for all’ conditional probability statements at the level of model schema/model structure.

I give a (sketchy) elaboration of my answers in the next section but first, consider the prototypical example of a falsifiable theory given by Popper: ‘all swans are white’. As he pointed out, the quantifier ‘for all’ plays a key role in this [need some Popper refs here]. That is, a single counterexample – a ‘there exists’ (a non-white swan) statement – can falsify a ‘for all’ statement. He further noted that part of the appeal of the ‘for all’ theory is that it is sharp and bold, as compared to the trivially true but less-useful ‘some swans are white’.

Popper’s example is relevant for the closure problem of Bayesian inference as follows. In her post, Mayo quotes a classic exchange in which some famous Bayesians (Savage, Lindley) propose “tacking on a catch-all ‘something else’” hypothesis (the negation of the main set of hypotheses considered), which is given a ‘small lump of prior probability’. This is to avoid having to explicitly ‘close off’ the model. That is, since the ‘catchall’ is of the form ‘or something else happens’, it evades (or tries to evade) the seeming need for Bayesians to have all possibilities explicitly known in advance. Knowing ‘all possible hypotheses’, whether explicitly or implicitly via a ‘catchall’, is (it is argued) required by Bayesians to normalise their probability distributions.

I think this is misguided, and prefer a set of explicit, falsifiable closure assumptions.

The ‘for all’ closure assumptions
The closure statements I argue for instead are ‘for all’ statements, which give ‘sharp closure’ a la Popper, but are at the level of model structure. These describe what a self-contained – closed – theory should look like, if it exists; they do not guarantee that we can always find one, however. Hence I say ‘for all is not catch all’. These statements can be explained as follows (originally based on my comments on Mayo’s blog, but modified quite a bit). First I need to emphasise the ambiguity over ‘parameter values’ vs ‘models’ vs ‘hypotheses’:

Since here we are constructing a mathematical model capturing the process of inference itself, each parameter value within a model structure (following a model schema) corresponds to a particular instance of a ‘mechanistic’ model. It is in the higher-level model of inference that we formulate closure conditions applying to model structures.

For related reasons, I prefer to use point parameter values within a model to refer to ‘simple hypotheses’, rather than ‘compound hypotheses’ which may include multiple parameter values. Philosophers often refer to the latter which can cause much miscommunication. Michael Lew’s comments on Mayo’s post make the same point from a Likelihoodist point of view. This is an interesting topic to return to.

Consider a model structure where we predict a quantity y as a function of x in background context b. As mentioned above, each value of x should be considered a possible parameter value within a mathematical model. Grant the existence of p(y|x,b) and p(x|b). Note b is only ever on the right hand side so need not be considered probabilistic (notation can/will be formalised further).

My first two closure assumptions are
(1) p(y|x,b) = p(y|x) for all x,y,b
(2) p(x|b) is given/known for all b

These establish a boundary between the explanatory variables x and their effect on y (for a class of models) and the external/environmental variables b and their effect on x. If these model schemata are satisfied by the model structure of interest then it’s fine to apply the usual methods of Bayesian parameter inference within this model structure. Each possible parameter value corresponds to one hypothetical model possibility. Note that these conditional probabilities only involve b on the right-hand side of the conditioning and integrate to one over the possible values on the left-hand side of the conditioning. This includes both integrating over y for statement (1) and integrating over x for statement (2), so is Bayesian to the extent that the parameter(s) x come with a probability distribution. No ‘or something else’ hypothesis is required for x, at least not one with any probability attached.
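As a minimal numerical sketch of inference inside such a closed structure, consider a toy discrete model. Everything in it is a hypothetical illustration rather than part of the argument: the three parameter values, the two contexts `lab_A` and `lab_B`, and all the probabilities are invented. The likelihood table has no dependence on b (assumption (1)), the prior is specified for each context b (assumption (2)), and normalisation is a sum over x alone.

```python
from fractions import Fraction

# Hypothetical toy structure: x in {0, 1, 2} indexes model instances,
# y in {"low", "high"} is the predicted quantity.

# Closure (1): the likelihood p(y | x) carries no dependence on the context b.
LIKELIHOOD = {
    0: {"low": Fraction(9, 10), "high": Fraction(1, 10)},
    1: {"low": Fraction(1, 2), "high": Fraction(1, 2)},
    2: {"low": Fraction(1, 5), "high": Fraction(4, 5)},
}

# Closure (2): p(x | b) is given for every context b under consideration.
# Note that b itself never receives a probability; it is only conditioned on.
PRIOR_GIVEN_CONTEXT = {
    "lab_A": {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)},
    "lab_B": {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)},
}


def posterior(y, b):
    """Return p(x | y, b), normalised over the parameter values x only."""
    prior = PRIOR_GIVEN_CONTEXT[b]
    unnormalised = {x: LIKELIHOOD[x][y] * prior[x] for x in prior}
    evidence = sum(unnormalised.values())  # p(y | b): a sum over x alone
    return {x: weight / evidence for x, weight in unnormalised.items()}


for context in PRIOR_GIVEN_CONTEXT:
    post = posterior("high", context)
    print(context, {x: str(p) for x, p in post.items()}, "total:", sum(post.values()))
```

Exact fractions are used so that the total over x is exactly one in each context, with no lump of probability held back for a ‘something else’ hypothesis.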

It helps to further assume a separation of environmental variables into ‘strictly irrelevant’ (not in x or experimentally-controlled) variables b” and ‘experimentally-controlled’/‘experimental boundary’ variables b’. These are defined via p(x|b’,b”) = p(x|b’), where b” are the ‘irrelevant’ variables in (the vector) b and b’ are the experimentally controlled variables in b. This sharp division is useful to maintain unless/until we reach a contradiction. This is an idealisation, a mathematical assumption, and a crucial part of model building. It is likely not true but we will try to get away with it until we get caught. We are being bold by claiming ‘x are my theory variables, b’ are my controlled variables and all other variables b” are explicitly irrelevant’. The experimentally-controlled bs – b’ – are ‘weakly relevant’ or ‘boundary’ variables in that they affect x but not y and are controlled. They allow us to say what p(x|b’) is.

We can make this another explicit closure condition by stating

(3) p(x|b) = p(x|b’,b”) = p(x|b’) for all b” called ‘irrelevant or fully external variables’ of the background vector b

The difference with the Bayesian catchall is that we don’t have a probability distribution over the background variables b’ and b” making up b; we only condition on them. Thus we don’t violate any laws of probability by not leaving a ‘lump of probability’ behind. If we put forward a new model in which a previously ‘irrelevant’ variable is considered ‘relevant’, the new model is not related to the old model by any probability statements unless a mapping between the models is given. Functions with different domains are different functions and should not be (directly) compared.
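To spell out where each closure condition enters, here is a short derivation sketch of the within-structure posterior. It assumes only Bayes’ theorem applied conditionally on b together with conditions (1) to (3); the primed integration variable x' is introduced purely for the normalising integral.

```latex
\begin{align*}
p(x \mid y, b)
  &= \frac{p(y \mid x, b)\, p(x \mid b)}{\int p(y \mid x', b)\, p(x' \mid b)\, dx'}
     && \text{Bayes' theorem, conditional on } b \\
  &= \frac{p(y \mid x)\, p(x \mid b)}{\int p(y \mid x')\, p(x' \mid b)\, dx'}
     && \text{by (1)} \\
  &= \frac{p(y \mid x)\, p(x \mid b')}{\int p(y \mid x')\, p(x' \mid b')\, dx'}
     && \text{by (3), with } p(x \mid b') \text{ known by (2)}
\end{align*}
```

At every step b appears only to the right of the conditioning bar, and the final expression integrates to one over x for each fixed b’, so no measure over b is ever required.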

An analogy for the parameter estimation within a model structure is a conservation of mass differential equation (where mass plays the role of probability; one could also simply directly consider a Master equation expressing conservation of probability) within a given domain, boundary conditions at the edge of the domain and all variables that aren’t ‘inside’ or ‘on the boundary’ assumed irrelevant. If the closure conditions are not satisfied then the model structure is misspecified, i.e. the problem is not well-posed, just as with a differential equation model lacking boundary conditions. The inference problem is then to see how probability ‘redistributes’ itself within the domain (over parameter values/model instances of interest) given new observations – again imagine a ‘probability fluid’ for example – subject to appropriate boundary and initial conditions and independence from the external environment. A good model structure has a large domain of applicability – the domain of b/set of values satisfying the model schema (1) & (2), as well as (3) if necessary – and we can only investigate this by varying b and seeing if the conditions still hold. This is Bayesian within the model since the model parameters x have probability distributions.

What is the domain of the ‘for all’?
A further clarification is needed [see the comment section for the origins of this]: the closure conditions are schematic/structural and only implicitly determine the domain of validity B for a given theory. That is, in the general scheme, b and B are placeholders; for a particular proposed theory we need to find particular b and B such that the closure conditions are satisfied. This has an affinity with the ideas of mathematical structuralism (without necessarily committing to endorsing the entire position, at least for now). For example, Awodey (2004, An Answer to Hellman’s Question), describes:

the idea of specifying, for a given…theory only the required or relevant degree of information or structure, the essential features of a given situation, for the purpose at hand, without assuming some ultimate knowledge, specification, or determination of the ‘objects’ involved…The statement of the inferential machinery involved thus becomes a…part of the mathematics…the methods of reasoning involved in different parts of mathematics are not ‘global’ and uniform across fields…but are themselves ‘local’ or relative…[we make] schematic statement[s] about a structure…which can have various instances

This lack of specificity or determination is not an accidental feature of mathematics, to be described as universal quantification over all particular instances in a specific foundational system as the foundationalist would have it…rather it is characteristic of mathematical statements that the particular nature of the entities involved plays no role, but rather their relations, operations, etc. – the ‘structures’ that they bear – are related, connected, and described in the statements and proofs of the theorems.

This can be seen as following in the (in this case, algebraic) ‘structuralist’ tradition of Hilbert (1899, in a letter to Frege):

it is surely obvious that every theory is only a scaffolding or schema of concepts together with their necessary relations to one another, and that the basic elements can be thought of in any way one likes…

…the application of a theory to the world of appearances always requires a certain measure of good will and tactfulness: e.g., that we substitute the smallest possible bodies for points and the longest possible ones, e.g., light-rays, for lines. At the same time, the further a theory has been developed and the more finely articulated its structure, the more obvious the kind of application it has to the world of appearances

So, here we are defining a model schema capturing the idea of the ‘closure of a model’ or, alternatively, a ‘closed model structure’, and meant to capture some notion of induction ‘within’ a model structure and falsification ‘outside’ a model structure. Hilbert’s last paragraph captures this second point.

Suppose we have a background of interest for which we want to create a theory. It may be/almost certainly is the case that there are (many) possible contexts/backgrounds for which we cannot find ‘good’ theories satisfying the closure conditions – e.g. the theories are either much too general or much too specific. This is why psychology is in some ways ‘harder’ than physics – it is very difficult to partition the large number of possibly relevant variables for predicting ‘target’ variables y into a small number of invariant theoretical constructs x, a small set of ‘controllable’ variables b’ and a large set of ‘irrelevant’ variables b”. If we wish to retain an ability to ‘realistically represent’ the phenomenon of interest captured by y, then most things will be ‘explanatory variables’ needing to be placed in x and/or controlled in b’. That is, we will have a very descriptive theory, as opposed to a very ‘causal’ theory. Note that the division (3) into ‘controlled’ and ‘irrelevant’ variables b’ and b”, respectively, tries to help with this, to some extent, but means that controlled lab experiments can be quite reproducible within a lab yet fail to generalise outside it.

Still, the closure conditions mean that we know what a theory should look like, if it exists, and this helps with the search.

Further interpretation, testing and relation to ‘stopping rules’
We see that

(1) is an assumption on mechanism ‘inside’ a domain – i.e. ‘x determines y regardless of context b’
(2) is an assumption on experimental manipulation – i.e. boundary conditions of a sort
(3) is a further division into ‘controlled’ and ‘irrelevant’ background/boundary variables, meaning all background effects pass through and are summarised by knowledge of the boundary manipulations

As emphasised, these sorts of assumptions are ‘meta-statistical’ closure assumptions but testable to the extent we can explore/consider different contexts (values of b). Another ‘structural’ analogy used here is how, in formal logic, axiom schemata are used as a way to express higher-order logic (e.g. second-order logic) formulae as a collection of axioms within a lower-order logic (e.g. first-order logic). In fact this is one way of deductively formalising the other form of inductive inference within first-order logic – mathematical induction. Here, though, we likely have to work much harder to find good instances of the closure assumptions for particular domains of interest.

The analogy to physics problems with divisions of ‘inside the system’, the ‘boundary of the system’ and the ‘external environment of the system’ is clear. Closed systems are defined similarly in that context.

Statistically, these conditions can be checked to some extent by the analogue of so-called ‘pure significance testing’, that is without alternatives lying ‘outside of’ b’s domain. These essentially ask – ‘can I predict y given x (to acceptable approximation) without knowing the values of other variables?’ and ‘do I know how and which of my interventions/context/experimental set up affect my predictor x?’.
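One way such a check might look in practice is a permutation test of condition (1), with no explicit alternative hypothesis. This is only a sketch under stated assumptions: y is taken to be binary, records are (b, x, y) triples, and the statistic, the function names and both data sets are invented for illustration. Context labels b are shuffled within each value of x; if p(y|x) really is invariant across contexts, the observed statistic should look like a typical shuffled one.

```python
import random
from collections import defaultdict


def invariance_statistic(records):
    """Sum over (x, b) cells of |freq(y=1 | x, b) - freq(y=1 | x)|."""
    by_x = defaultdict(list)
    by_xb = defaultdict(list)
    for b, x, y in records:
        by_x[x].append(y)
        by_xb[(x, b)].append(y)
    return sum(
        abs(sum(ys) / len(ys) - sum(by_x[x]) / len(by_x[x]))
        for (x, _b), ys in by_xb.items()
    )


def closure_p_value(records, n_permutations=999, seed=0):
    """Permutation p-value for 'p(y | x) does not depend on the context b'."""
    rng = random.Random(seed)
    observed = invariance_statistic(records)
    strata = defaultdict(list)  # record indices grouped by x
    for index, (_b, x, _y) in enumerate(records):
        strata[x].append(index)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = list(records)
        for indices in strata.values():
            labels = [records[i][0] for i in indices]
            rng.shuffle(labels)  # permute context labels within this x only
            for i, b in zip(indices, labels):
                shuffled[i] = (b, records[i][1], records[i][2])
        if invariance_statistic(shuffled) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_permutations)


# Hypothetical data. In `holds`, both contexts give identical frequencies at
# each x; in `violated`, y at x = 0 is fixed entirely by the context.
same_in_both = [0] * 10 + [1] * 30
holds = (
    [("A", 0, y) for y in [0] * 30 + [1] * 10]
    + [("B", 0, y) for y in [0] * 30 + [1] * 10]
    + [("A", 1, y) for y in same_in_both]
    + [("B", 1, y) for y in same_in_both]
)
violated = (
    [("A", 0, 0)] * 40
    + [("B", 0, 1)] * 40
    + [("A", 1, y) for y in same_in_both]
    + [("B", 1, y) for y in same_in_both]
)

print("closure holds:    p =", closure_p_value(holds))
print("closure violated: p =", closure_p_value(violated))
```

A small p-value counts against the closure assumption for the contexts examined; a large one says only that no violation was detected among them.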

Things such as ‘stopping rules’ may be included as part of the variable b, so could affect the validity of assumption (1) and/or assumption (2). For example, a particular stopping rule may be construed as preserving (1) while requiring modification of (2) i.e. a different prior. Here the stopping rule is part of b’, the experimentally-controlled variables having an effect on x. Other stopping rules may be irrelevant and hence lie in b”. This point has been made by numerous Bayesians – I first came across it in Gelman et al.’s book and/or Bernardo and Smith’s book (hardly unknown Bayesians). Similar points to this (and others made in this post) can be found on the (slightly more polemical) blog here by the mysterious internet character ‘Laplace’.
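A standard textbook illustration of a stopping rule that would fall in b” is the comparison between binomial and negative binomial sampling. The sketch below is not drawn from the sources cited above and rests on assumptions: the parameter (written x, following the notation here) is a success probability, the prior is uniform on a grid and held fixed across both rules, and the counts are invented. The two likelihoods differ only by a constant factor, so the posteriors agree.

```python
from math import comb

successes, failures = 3, 9  # hypothetical data
trials = successes + failures


def binomial_likelihood(x):
    """Stopping rule A: stop after a fixed number of trials (12)."""
    return comb(trials, successes) * x**successes * (1 - x) ** failures


def negative_binomial_likelihood(x):
    """Stopping rule B: stop at the third success (which took 12 trials)."""
    return comb(trials - 1, successes - 1) * x**successes * (1 - x) ** failures


def grid_posterior(likelihood, grid):
    """Posterior over grid points under a uniform prior on the grid."""
    weights = [likelihood(x) for x in grid]
    total = sum(weights)
    return [w / total for w in weights]


grid = [i / 100 for i in range(1, 100)]
post_a = grid_posterior(binomial_likelihood, grid)
post_b = grid_posterior(negative_binomial_likelihood, grid)
max_gap = max(abs(a - b) for a, b in zip(post_a, post_b))
print(f"largest difference between the two posteriors: {max_gap:.2e}")
```

If instead the stopping rule is judged to carry information about x, then in the terms used here it belongs in b’ and enters through a different p(x|b’), which this sketch does not attempt to model.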

A slightly subtle, but interesting, point is that if the model structure is misspecified then it may be corrected on that data in that context but this may invalidate its application in other contexts (a more formal explication can be given). Invariance of the relationship between y and x for all contexts b is crucial here. So, again, it’s really the closure assumptions doing most of the ‘philosophical work’ – this is elaborated on more below.

Recap so far
I think this is a fairly defensible sketch (note the word sketch!) of how a Bayesian may be able to be a Falsificationist. They provisionally accept two/three conditional probability statements which involve conditioning on (dividing with) a ‘boundary’ background domain of validity. The ‘background’ variables do not need a probability distribution over their domain as they are only ever conditioned on. To emphasise: probabilities (which are all conditional) integrate to one within (conditional on) a model structure/schema but the background variables do not need a probability distribution and the closure assumptions can be falsified.

As I see it then, the goal of a scientist is hence a ‘search’ problem [a la Glymour?] to find (e.g. by guessing, whatever) theories, the form of which satisfies these closure conditions for a desirable range of background contexts/divisions, along with more specific estimates of the quantities within these theories under more specific conditions of immediate interest. When the closure conditions are not satisfied for a given background then the theory is false (-ified) for that domain and any quantities estimated within that theory are meaningless.

Haven’t I seen this idea before?
If you’re a philosopher of science then this sounds very ‘Conjectures and Refutations’, no? Shades of the Kuhnian normal science/paradigm shift structure, too (as Gelman has noted on many occasions). If you’re a ‘causal modeller’ then you might think about Pearl and the concept of ‘surgery’ describing (possibly hypothetical) experimental interventions, as well as some related causal inference work by Glymour et al. (though I need to read more of this literature). If you’ve read any Jaynes/Cox you might recognise some kinship with Cox’s theorem and the derivation of probability theory from given axioms expressed as functional equations; see e.g. p. 19 of Jaynes’ PT:LoS where he mentions ‘interface conditions’ required to relate the behaviour of an ideal ‘reasoning robot’ – i.e. model of the inference process in the terms used here – to the ‘real world’. (Also, given my affinity for functional equations and ‘model schema’ I should really go back over this in more detail.) In fact, Jaynes explicitly states essentially the central point made here, e.g. p. 326 of PT:LoS –

The function of induction is to tell us not which predictions are right, but which predictions are indicated by our present knowledge. If the predictions succeed, then we are pleased and become more confident of our present knowledge; but we have not learned much…it is only when our inductive inferences are wrong that we learn new things about the real world.

It is clear that Jaynes is saying the same thing as expressed here – use inductive reasoning (e.g. Bayesian parameter inference) inside a ‘closed’ model structure (see ‘interface conditions’ from PT:LoS cited above) until a contradiction is reached. At this point the closure conditions – the model structure conditions – are ‘inadequate’ and must be ‘respecified’ before the ‘within-model’ inference can be considered sound. Finally, as is apparent from the opening examples, if you come from a physical science background then it’s clear that many analogous ideas are present in the statistical mechanics/thermodynamics literature (Jaynes shows up here again, along with many others; I’d like to write more on this at some point as well).

Interestingly, many of these ideas also seem quite similar to ‘best practice’ ‘Frequentist’ methods. For example Spanos’ version of Mayo’s ‘Error Statistical’ perspective [in my understanding – see comment section] requires an adequate model structure, established with the help of general (Fisherian-style) tests, before (Neyman-Pearson/severe test) parameter estimation can be soundly carried out. We seem to differ mostly on specific formalisation and on the parameter estimation methods used within a structure. I know Glymour has written something on relating ‘Error Statistical’ ideas to the causal inference literature, though I haven’t looked at it in detail.

Finally, of note from an epistemological perspective, these are not ‘knowledge is closed under entailment’ assumptions. I’m generally against this a la Nozick. The closure here is different to that in the epistemological literature dealing with knowledge closure, though is perhaps related; it would also be interesting to look into this [update: see here for a start]. Note that Nozick’s proposed solution to that problem was effectively to go to a ‘higher level’ by relativising knowledge to methods, in a manner very similar to Mayo’s, and similar to the present approach in that I use higher-level model structures.

A brief example and why one might be only a ‘half-Bayesian’ – closure does the work!
As an example, Newton’s law f-ma=0 is a general scheme characterising an invariant relationship parameterised by ‘context’. Feynman’s lectures give a great discussion of this [insert link]. When that context involves knowing ‘gravity is present and the relevant masses are known’ and I want to predict acceleration, then the expression for f is determined by background knowledge and is used to predict acceleration. When acceleration and mass are known relative to a background reference frame then the net force can be predicted. The rest of the background is assumed irrelevant. This relationship is nice because we can satisfy the two conditions I gave under a wide range of circumstances.

A Bayesian would typically express what is known (given a model structure) – e.g. a range of reasonable mass values – in terms of a prior and then report predictions – e.g. the acceleration – in terms of predictive distributions. This is not really the central issue, however:

These closure assumptions don’t really have anything to do with being Bayesian or not – I believe Glymour and Pearl have said things along these lines (see ‘Why I am not a Bayesian’ and ‘Why I am only a half Bayesian’, respectively) – but are still perfectly compatible with a Bayesian approach.

If you don’t want to use Bayesian parameter estimation, fine, but the argument that it cannot be compatible with a Falsificationist approach to doing science is clearly wrong (to me anyway). Bayesian and Likelihoodist methods also happen to have particularly intuitive interpretations for parameter estimation within a model structure defined conditionally w.r.t. a background context. Furthermore, there are ‘Bayesian analogues’ of Fisherian tests (see BDA for examples) which are particularly useful for graphical exploration, so this does not present too much difficulty in principle.
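The Newtonian example above can be sketched in a few lines. The numbers, the function names and the three-point prior over mass are all hypothetical; the context is assumed to supply a known net force, and everything else in the background is treated as irrelevant. The same schema f - ma = 0 is solved for whichever quantity the context leaves unknown.

```python
def acceleration(net_force, mass):
    """Solve the schema f - m*a = 0 for a, given f and m."""
    return net_force / mass


def inferred_net_force(mass, observed_acceleration):
    """Solve the same schema for f, given m and a."""
    return mass * observed_acceleration


# Hypothetical context: the net force is known from background knowledge.
net_force_newtons = 12.0

# Hypothetical prior over plausible masses (kg), given the model structure.
mass_prior = {2.0: 0.25, 3.0: 0.5, 4.0: 0.25}

# Predictive distribution for acceleration: push the prior through the schema.
predictive = {}
for mass, probability in mass_prior.items():
    a = acceleration(net_force_newtons, mass)
    predictive[a] = predictive.get(a, 0.0) + probability

for a, probability in sorted(predictive.items()):
    print(f"a = {a:.1f} m/s^2 with probability {probability}")
```

The Bayesian part is only the prior and the predictive distribution; the work of deciding what counts as f, m and ‘irrelevant background’ is done before any probabilities are assigned.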

Another recap
As I have said above, it is the scientist’s job to find particular theories with a structure satisfying the closure conditions, determine the range of backgrounds over which these conditions are satisfied and then estimate quantities within these conditional model structures. They may also seek to relate different theories by allowing background variables of one model to be primary variables of another and carrying out some sort of reduction and/or marginalisation/coarse-graining process.

There is no ‘catchall’ however! There are, instead, schematic ‘for all’ statements for which we need to determine (find) the truth sets – the range of values for which the quantifications hold – and hence determine the explanatory variables and domain(s) for which our theory/model structure is applicable. This defines the ‘closure’ of the model structure (paradigm) and allows us to proceed to the ‘normal science’ of parameter estimation. At any point we can work with a ‘temporary closure’ of B, i.e. a subset of B, that captures the range of conditions we are currently interested in or able to explore. The background variables b are usually further (assumed to be) divided into manipulable/boundary and irrelevant/fully external, and can be taken to parameterise various subsets of B.

And here seems a good place to close this post, for now.

Postscript
Mayo replies on her blog:
“The bottom line is that you don’t have inference by way of posteriors without a catchall. The issue of falsification is a bit different. You don’t have falsification without a falsification rule. It will not be deductive, that’s clear. So what’s your probabilistic falsification rule? I indicated some possible avenues.”