Disclaimer
This is the first of (what should be) a few posts which aim to connect some basic puzzles in the philosophy and methodology of science to the practice of mathematical and computational modelling. They are not intended to be particularly deep philosophically or to be (directly) practical scientifically. Nor are they fully complete expositions. Still, I find thinking about these puzzles in this context to be an interesting exercise which might provide a conceptual guide for better understanding (and perhaps improving?) the practice of mathematical and computational modelling. These are written by a mathematical modeller grappling with philosophical questions, rather than by a philosopher, so bear that in mind! Comments, criticisms and feedback of course welcome! [Current version: 8.0]
Introduction
This particular post is a quick draft based on some exchanges with the philosopher Deborah Mayo here on statistical inference frameworks. The post is only rough for now; I’ll try to tidy it up a little later, including making it more self-contained and with less waffling. For now I’ll assume you’ve read that post. You probably don’t need to bother with my train-of-thought (and at times frustrated!) comments there as I’ve tried to make them clearer here. The notation is still a bit sloppy in what follows.
The basic problem is to do with closure of mathematical and/or theoretical frameworks. Though here the debate is over closure of statistical inference frameworks, the same issues arise everywhere in mathematical modelling. For example, in statistical mechanics it’s possible to begin from Liouville’s equation for the full phase-space distribution over all particles. To derive anything more immediately tractable from this – such as the equations of kinetic theory or the equations of continuum mechanics – one needs to reduce the information contained in the full set of equations. This can be done by ‘coarse graining’ – throwing information away – and/or by identifying a ‘lossless reduction’ using implicit constraints or symmetries. Note that one may also simply postulate and test a reduced model without deriving it from a more ‘basic’ model. Regardless of how the derivation proceeds, the ultimate result is the identification of a reduced set of variables giving a self-contained set of equations. Here is a nice little article by the materials scientist Hans Christian Öttinger with the great title ‘On the stupendous beauty of closure’. The following passage reiterates the above ideas:
In its widest sense, closure is associated with the search for self-contained levels of description on which time-evolution equations can be formulated in a closed, or autonomous, form. Proper closure requires the identification of the relevant structural variables participating in the dominant processes in a system of interest, and closure hence is synonymous with focusing on the essence of a problem and consequently with deep understanding.
Overview of the ‘problem(s)’ and my ‘solutions’
The questions raised by Mayo are ‘how do Bayesians deal with the problem of normalising probabilities to one when there are always background, unspecified alternatives which should come with some amount of probability attached?’ and, relatedly, ‘can a Bayesian be a Falsificationist?’ My answers are: closure is achieved via ‘for all’ conditional probability statements at the level of model schema/model structure; and yes, a Bayesian can be a Falsificationist, via those same ‘for all’ statements.
I give a (sketchy) elaboration of my answers in the next section but first, consider the prototypical example of a falsifiable theory given by Popper: ‘all swans are white’. As he pointed out, the quantifier ‘for all’ plays a key role in this [need some Popper refs here]. That is, a single counterexample – a ‘there exists’ (a non-white swan) statement – can falsify a ‘for all’ statement. He further noted that part of the appeal of the ‘for all’ theory is that it is sharp and bold, as compared to the trivially true but less-useful ‘some swans are white’.
Popper’s example is relevant for the closure problem of Bayesian inference as follows. In her post, Mayo quotes a classic exchange in which some famous Bayesians (Savage, Lindley) propose “tacking on a catch-all ‘something else’” hypothesis (the negation of the main set of hypotheses considered), which is given a ‘small lump of prior probability’. This is to avoid having to explicitly ‘close off’ the model. That is, since the ‘catchall’ is of the form ‘or something else happens’, it evades (or tries to evade) the seeming need for Bayesians to have all possibilities explicitly known in advance. Knowing ‘all possible hypotheses’, whether explicitly or implicitly via a ‘catchall’, is (it is argued) required by Bayesians to normalise their probability distributions.
I think this is misguided, and prefer a set of explicit, falsifiable closure assumptions.
The ‘for all’ closure assumptions
The closure statements I argue for instead are ‘for all’ statements, which give ‘sharp closure’ a la Popper, but are at the level of model structure. These describe what a self-contained – closed – theory should look like, if it exists; they do not guarantee that we can always find one, however. Hence I say ‘for all is not catch all’. These statements can be explained as follows (originally based on my comments on Mayo’s blog, but modified quite a bit). First I need to emphasise the ambiguity over ‘parameter values’ vs ‘models’ vs ‘hypotheses’:
Since here we are constructing a mathematical model capturing the process of inference itself, each parameter value within a model structure (following a model schema) corresponds to a particular instance of a ‘mechanistic’ model. It is in the higher-level model of inference that we formulate closure conditions applying to model structures.
For related reasons, I prefer to use point parameter values within a model to refer to ‘simple hypotheses’, rather than ‘compound hypotheses’ which may include multiple parameter values. Philosophers often refer to the latter, which can cause much miscommunication. Michael Lew’s comments on Mayo’s post make the same point from a Likelihoodist point of view. This is an interesting topic to return to.
Consider a model structure where we predict a quantity y as a function of x in background context b. As mentioned above, each value of x should be considered a possible parameter value within a mathematical model. Grant the existence of p(y|x,b) and p(x|b). Note b is only ever on the right hand side so need not be considered probabilistic (notation can/will be formalised further).
My first two closure assumptions are
1) p(y|x,b) = p(y|x) for all x,y,b
2) p(x|b) is given/known for all b
These establish a boundary between the explanatory variables x and their effect on y (for a class of models) and the external/environmental variables b and their effect on x. If these model schemata are satisfied by the model structure of interest then it’s fine to apply the usual methods of Bayesian parameter inference within this model structure. Each possible parameter value corresponds to one hypothetical model possibility. Note that these conditional probabilities only involve b on the right-hand side of the conditioning and integrate to one over the possible values on the left-hand side of the conditioning. This includes both integrating over y for statement (1) and integrating over x for statement (2), so is Bayesian to the extent that the parameter(s) x come with a probability distribution. No ‘or something else’ hypothesis is required for x, at least not one with any probability attached.
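To spell out why no catch-all is needed, here is the within-structure Bayesian update written in the notation above (a routine manipulation, using only conditions (1) and (2)):

p(x|y,b) = p(y|x,b) p(x|b) / p(y|b) = p(y|x) p(x|b) / ∫ p(y|x′) p(x′|b) dx′

The normalising integral runs only over the parameter values x within the structure, and b appears only to the right of the conditioning bar, so no probability ever needs to be spread over ‘other’ backgrounds or unarticulated hypotheses.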
It helps to further assume a separation of environmental variables into ‘strictly irrelevant’ (not in x or experimentally-controlled) variables b” and ‘experimentally-controlled’/’experimental boundary’ variables b’. These are defined via p(x|b’,b”) = p(x|b’), where b” are the ‘irrelevant’ variables in (the vector) b and b’ are the experimentally controlled variables in b. This sharp division is useful to maintain unless/until we reach a contradiction. It is an idealisation, a mathematical assumption, and a crucial part of model building. It is likely not true but we will try to get away with it until we get caught. We are being bold by claiming ‘x are my theory variables, b’ are my controlled variables, and all other variables b” are explicitly irrelevant’. The experimentally-controlled bs – b’ – are ‘weakly relevant’ or ‘boundary’ variables in that they affect x but not y and are controlled. They allow us to say what p(x|b’) is.
We can make this another explicit closure condition by stating
(3) p(x|b) = p(x|b’,b”) = p(x|b’) for all b”, where b” are the ‘irrelevant or fully external’ variables of the background vector b
The difference with the Bayesian catchall is that we don’t have a probability distribution over the background variables b’ and b” making up b; we only condition on them. Thus we don’t violate any laws of probability by not leaving a ‘lump of probability’ behind. If we put forward a new model in which a previously ‘irrelevant’ variable is considered ‘relevant’, the new model is not related to the old model by any probability statements unless a mapping between the models is given. Functions with different domains are different functions and should not be (directly) compared.
An analogy for parameter estimation within a model structure is a conservation-of-mass differential equation (where mass plays the role of probability; one could also simply consider a Master equation directly expressing conservation of probability) on a given domain, with boundary conditions at the edge of the domain, and with all variables that aren’t ‘inside’ or ‘on the boundary’ assumed irrelevant. If the closure conditions are not satisfied then the model structure is misspecified, i.e. the problem is not well-posed, just as with a differential equation model lacking boundary conditions. The inference problem is then to see how probability ‘redistributes’ itself within the domain (over parameter values/model instances of interest) given new observations – again, imagine a ‘probability fluid’ – subject to appropriate boundary and initial conditions and independence from the external environment. A good model structure has a large domain of applicability – the domain of b/set of values satisfying the model schema (1) & (2), as well as (3) if necessary – and we can only investigate this by varying b and seeing if the conditions still hold. This is Bayesian within the model since the model parameters x have probability distributions.
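As a toy illustration of this ‘probability redistributing within the domain’ picture, here is a minimal sketch (the model, grid and observations are all invented for illustration, not part of any real analysis): a grid of parameter values x plays the role of the domain, the prior p(x|b’) is taken as given by the experimental setup, and observations simply redistribute probability over the grid – nothing outside the grid ever needs probability.

```python
import numpy as np

# Toy closed structure: y | x ~ Normal(x, sigma), with p(y|x,b) = p(y|x) by assumption (1).
sigma = 1.0
x_grid = np.linspace(-5, 5, 201)            # parameter values x: the 'domain'
prior = np.exp(-0.5 * (x_grid / 2.0) ** 2)  # p(x|b'): assumed given by the setup, condition (2)
prior /= prior.sum()                        # normalises over the domain only

def update(current, y_obs):
    """Redistribute probability over the grid given a new observation y_obs."""
    lik = np.exp(-0.5 * ((y_obs - x_grid) / sigma) ** 2)  # p(y|x), b-independent by (1)
    post = current * lik
    return post / post.sum()                # renormalise over x only -- no catch-all term

posterior = prior
for y_obs in [1.2, 0.8, 1.5]:               # made-up observations
    posterior = update(posterior, y_obs)

print(x_grid[np.argmax(posterior)])         # posterior mode within the closed structure
```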
What is the domain of the ‘for all’?
A further clarification is needed [see the comment section for the origins of this]: the closure conditions are schematic/structural and only implicitly determine the domain of validity B for a given theory. That is, in the general scheme, b and B are placeholders; for a particular proposed theory we need to find particular b and B such that the closure conditions are satisfied. This has an affinity with the ideas of mathematical structuralism (without necessarily committing to endorsing the entire position, at least for now). For example, Awodey (2004, ‘An Answer to Hellman’s Question’) describes:
the idea of specifying, for a given…theory only the required or relevant degree of information or structure, the essential features of a given situation, for the purpose at hand, without assuming some ultimate knowledge, specification, or determination of the ‘objects’ involved…The statement of the inferential machinery involved thus becomes a…part of the mathematics…the methods of reasoning involved in different parts of mathematics are not ‘global’ and uniform across fields…but are themselves ‘local’ or relative…[we make] schematic statement[s] about a structure…which can have various instances
This lack of specificity or determination is not an accidental feature of mathematics, to be described as universal quantification over all particular instances in a specific foundational system as the foundationalist would have it…rather it is characteristic of mathematical statements that the particular nature of the entities involved plays no role, but rather their relations, operations, etc. – the ‘structures’ that they bear – are related, connected, and described in the statements and proofs of the theorems.
This can be seen as following in the (in this case, algebraic) ‘structuralist’ tradition of Hilbert (1899, in a letter to Frege):
it is surely obvious that every theory is only a scaffolding or schema of concepts together with their necessary relations to one another, and that the basic elements can be thought of in any way one likes…
…the application of a theory to the world of appearances always requires a certain measure of good will and tactfulness: e.g., that we substitute the smallest possible bodies for points and the longest possible ones, e.g., light-rays, for lines. At the same time, the further a theory has been developed and the more finely articulated its structure, the more obvious the kind of application it has to the world of appearances
So, here we are defining a model schema capturing the idea of the ‘closure of a model’ or, alternatively, a ‘closed model structure’, meant to capture some notion of induction ‘within’ a model structure and falsification ‘outside’ a model structure. Hilbert’s last paragraph captures this second point.
Suppose we have a background of interest for which we want to create a theory. It may be/almost certainly is the case that there are (many) possible contexts/backgrounds for which we cannot find ‘good’ theories satisfying the closure conditions – e.g. the theories are either much too general or much too specific. This is why psychology is in some ways ‘harder’ than physics – it is very difficult to partition the large number of possibly relevant variables for predicting ‘target’ variables y into a small number of invariant theoretical constructs x, a small set of ‘controllable’ variables b’ and a large set of ‘irrelevant’ variables b”. If we wish to retain an ability to ‘realistically represent’ the phenomenon of interest captured by y, then most things will be ‘explanatory variables’ needing to be placed in x and/or controlled in b’. That is, we will have a very descriptive theory, as opposed to a very ‘causal’ theory. Note that the division (3) into ‘controlled’ and ‘irrelevant’ variables b’ and b”, respectively, helps with this to some extent, but also means that controlled lab experiments can be quite reproducible within a lab yet fail to generalise outside it.
Still, the closure conditions mean that we know what a theory should look like, if it exists, and this helps with the search.
Further interpretation, testing and relation to ‘stopping rules’
We see that
(1) is an assumption on mechanism ‘inside’ a domain – i.e. ‘x determines y regardless of context b’
(2) is an assumption on experimental manipulation – i.e. boundary conditions of a sort
(3) is a further division into ‘controlled’ and ‘irrelevant’ background/boundary variables, meaning all background effects pass through and are summarised by knowledge of the boundary manipulations
As emphasised, these sorts of assumptions are ‘meta-statistical’ closure assumptions, but they are testable to the extent that we can explore/consider different contexts (values of b). Another ‘structural’ analogy used here is how, in formal logic, axiom schemata are used as a way to express higher-order logic (e.g. second-order logic) formulae as a collection of axioms within a lower-order logic (e.g. first-order logic). In fact this is one way of deductively formalising, within first-order logic, that other form of inductive inference – mathematical induction. Here, though, we likely have to work much harder to find good instances of the closure assumptions for particular domains of interest.
The analogy to physics problems with divisions of ‘inside the system’, the ‘boundary of the system’ and the ‘external environment of the system’ is clear. Closed systems are defined similarly in that context.
Statistically, these conditions can be checked to some extent by the analogue of so-called ‘pure significance testing’, that is without alternatives lying ‘outside of’ b’s domain. These essentially ask – ‘can I predict y given x (to acceptable approximation) without knowing the values of other variables?’ and ‘do I know how and which of my interventions/context/experimental set up affect my predictor x?’.
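A minimal sketch of what such a check might look like in code (the data, grouping and test statistic are invented; in practice the discrepancy measures would be problem-specific and, as discussed below, preferably assessed graphically): fit the within-structure model ignoring b, then ask whether the residuals look the same across the recorded background contexts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: y depends on x; b labels two background contexts (two labs, say).
n = 200
x = rng.normal(size=n)
b = rng.integers(0, 2, size=n)                 # background label: only ever conditioned on
y = 2.0 * x + rng.normal(scale=0.5, size=n)    # closure (1) holds here by construction

# Fit the within-structure model y ~ x, ignoring b entirely.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Pure-significance-style check of (1): do the residuals differ across contexts b?
def discrepancy(labels):
    return abs(resid[labels == 0].mean() - resid[labels == 1].mean())

observed = discrepancy(b)
perms = [discrepancy(rng.permutation(b)) for _ in range(2000)]
p_value = np.mean([d >= observed for d in perms])
print(p_value)  # a very small value would cast doubt on closure condition (1)
```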
Things such as ‘stopping rules’ may be included as part of the variable b, so could affect the validity of assumption (1) and/or assumption (2). For example, a particular stopping rule may be construed as preserving (1) while requiring modification of (2) i.e. a different prior. Here the stopping rule is part of b’, the experimentally-controlled variables having an effect on x. Other stopping rules may be irrelevant and hence lie in b”. This point has been made by numerous Bayesians – I first came across it in Gelman et al.’s book and/or Bernardo and Smith’s book (hardly unknown Bayesians). Similar points to this (and others made in this post) can be found on the (slightly more polemical) blog here by the mysterious internet character ‘Laplace’.
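To make the stopping-rule point concrete, here is the standard binomial versus negative-binomial comparison, written in the notation above (a textbook example, not anything specific to Mayo’s post). Suppose the parameter x is a success probability θ and the data y are a sequence of trials with s successes in n trials. Then

p(y|θ, ‘stop after n trials’) ∝ θ^s (1−θ)^(n−s) and p(y|θ, ‘stop after s successes’) ∝ θ^s (1−θ)^(n−s)

The two stopping rules give likelihoods proportional as functions of θ, so with the same p(θ|b’) the posterior is unchanged – this kind of stopping rule behaves like a b” variable. A stopping rule whose use reflects information about θ, or changes what is known about the setup, instead modifies p(θ|b’) and so belongs in b’.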
A slightly subtle, but interesting, point is that if the model structure is misspecified then it may be corrected using the data from that context, but this correction may invalidate its application in other contexts (a more formal explication can be given). Invariance of the relationship between y and x for all contexts b is crucial here. So, again, it’s really the closure assumptions doing most of the ‘philosophical work’ – this is elaborated on more below.
Recap so far
I think this is a fairly defensible sketch (note the word sketch!) of how a Bayesian may be able to be a Falsificationist. They provisionally accept two/three conditional probability statements which involve conditioning on (dividing by) a ‘boundary’ background domain of validity. The ‘background’ variables do not need a probability distribution over their domain as they are only ever conditioned on. To emphasise: probabilities (which are all conditional) integrate to one within (conditional on) a model structure/schema, but the background variables do not need a probability distribution and the closure assumptions can be falsified.
As I see it then, the goal of a scientist is hence a ‘search’ problem [a la Glymour?] to find (e.g. by guessing, whatever) theories, the form of which satisfies these closure conditions for a desirable range of background contexts/divisions, along with more specific estimates of the quantities within these theories under more specific conditions of immediate interest. When the closure conditions are not satisfied for a given background then the theory is false (-ified) for that domain and any quantities estimated within that theory are meaningless.
Haven’t I seen this idea before?
If you’re a philosopher of science then this sounds very ‘Conjectures and Refutations’, no? Shades of the Kuhnian normal science/paradigm shift structure, too (as Gelman has noted on many occasions). If you’re a ‘causal modeller’ then you might think about Pearl and the concept of ‘surgery’ describing (possibly hypothetical) experimental interventions, as well as some related causal inference work by Glymour et al. (though I need to read more of this literature). If you’ve read any Jaynes/Cox you might recognise some kinship with Cox’s theorem and the derivation of probability theory from given axioms expressed as functional equations; see e.g. p. 19 of Jaynes’ PT:LoS where he mentions ‘interface conditions’ required to relate the behaviour of an ideal ‘reasoning robot’ – i.e. model of the inference process in the terms used here – to the ‘real world’. (Also, given my affinity for functional equations and ‘model schema’ I should really go back over this in more detail.) In fact, Jaynes explicitly states essentially the central point made here, e.g. p. 326 of PT:LoS –
The function of induction is to tell us not which predictions are right, but which predictions are indicated by our present knowledge. If the predictions succeed, then we are pleased and become more confident of our present knowledge; but we have not learned much…it is only when our inductive inferences are wrong that we learn new things about the real world.
It is clear that Jaynes is saying the same thing as expressed here – use inductive reasoning (e.g. Bayesian parameter inference) inside a ‘closed’ model structure (see ‘interface conditions’ from PT:LoS cited above) until a contradiction is reached. At this point the closure conditions – the model structure conditions – are ‘inadequate’ and must be ‘respecified’ before the ‘within-model’ inference can be considered sound. Finally, as is apparent from the opening examples, if you come from a physical science background then it’s clear that many analogous ideas are present in the statistical mechanics/thermodynamics literature (Jaynes shows up here again, along with many others; I’d like to write more on this at some point as well).
Interestingly, many of these ideas also seem quite similar to ‘best practice’ ‘Frequentist’ methods. For example Spanos’ version of Mayo’s ‘Error Statistical’ perspective [in my understanding – see comment section] requires an adequate model structure, established with the help of general (Fisherian-style) tests, before (Neyman-Pearson/severe test) parameter estimation can be soundly carried out. We seem to differ mostly on specific formalisation and on the parameter estimation methods used within a structure. I know Glymour has written something on relating ‘Error Statistical’ ideas to the causal inference literature, though I haven’t looked at it in detail.
Finally, of note from an epistemological perspective, these are not ‘knowledge is closed under entailment’ assumptions. I’m generally against this a la Nozick. The closure here is different to that in the epistemological literature dealing with knowledge closure, though is perhaps related; it would also be interesting to look into this [update: see here for a start]. Note that Nozick’s proposed solution to that problem was effectively to go to a ‘higher level’ by relativising knowledge to methods, in a manner very similar to Mayo’s, and similar to the present approach in that I use higher-level model structures.
A brief example and why one might be only a ‘half-Bayesian’ – closure does the work!
As an example, Newton’s law f - ma = 0 is a general scheme characterising an invariant relationship parameterised by ‘context’. Feynman’s lectures give a great discussion of this [insert link]. When the background context involves knowing ‘gravity is present and the relevant masses are known’ and I want to predict acceleration, then the expression for f is determined by background knowledge and is used to predict acceleration. When acceleration and mass are known relative to a background reference frame then the net force can be predicted. The rest of the background is assumed irrelevant. This relationship is nice because we can satisfy the two conditions I gave under a wide range of conditions.
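A minimal sketch of the kind of within-structure prediction this licenses (the force value, the prior over mass and the whole numerical setup are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Closed Newton structure: a = f / m. The net force f is taken as known from the
# background b' ('gravity is present, the applied force is such-and-such');
# everything else in the background is assumed irrelevant (b'').
f_known = 10.0                                      # net force, fixed by b'
mass_draws = rng.normal(2.0, 0.1, size=10_000)      # p(m|b'): invented prior over mass
mass_draws = mass_draws[mass_draws > 0]             # keep physically sensible masses

# Predictive distribution for acceleration within the closed structure.
a_pred = f_known / mass_draws
print(np.percentile(a_pred, [2.5, 50, 97.5]))       # predictive interval for a
```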
A Bayesian would typically express what is known (given a model structure) – e.g. a range of reasonable mass values – in terms of a prior and then report predictions – e.g. the acceleration – in terms of predictive distributions. This is not really the central issue, however:
These closure assumptions don’t really have anything to do with being Bayesian or not – I believe Glymour and Pearl have said things along these lines (see ‘Why I am not a Bayesian’ and ‘Why I am only a half Bayesian’, respectively) – but are still perfectly compatible with a Bayesian approach.
If you don’t want to use Bayesian parameter estimation, fine, but the argument that it cannot be compatible with a Falsificationist approach to doing science is clearly wrong (to me anyway). Bayesian and Likelihoodist methods also happen to have particularly intuitive interpretations for parameter estimation within a model structure defined conditionally w.r.t. a background context. Furthermore, there are ‘Bayesian analogues’ of Fisherian tests (see BDA for examples) which are particularly useful for graphical exploration, so this does not present too much difficulty in principle.
Another recap
As I have said above, it is the scientist’s job to find particular theories with a structure satisfying the closure conditions, determine the range of backgrounds over which these conditions are satisfied and then estimate quantities within these conditional model structures. They may also seek to relate different theories by allowing background variables of one model to be primary variables of another and carrying out some sort of reduction and/or marginalisation/coarse-graining process.
There is no ‘catchall’ however! There are, instead, schematic ‘for all’ statements for which we need to determine (find) the truth sets – the range of values for which the quantifications hold – and hence determine the explanatory variables and domain(s) for which our theory/model structure is applicable. This defines the ‘closure’ of the model structure (paradigm) and allows us to proceed to the ‘normal science’ of parameter estimation. At any point we can work with a ‘temporary closure’ of B, i.e. a subset of B, that captures the range of conditions we are currently interested in or able to explore. The background variables b are usually further (assumed to be) divided into manipulable/boundary and irrelevant/fully external, and can be taken to parameterise various subsets of B.
And here seems a good place to close this post, for now.
Postscript
Mayo replies on her blog:
“The bottom line is that you don’t have inference by way of posteriors without a catchall. The issue of falsification is a bit different. You don’t have falsification without a falsification rule. It will not be deductive, that’s clear. So what’s your probabilistic falsification rule? I indicated some possible avenues.”
A short reply (for now):
Some of us are happy to use conditional probability as basic and posteriors where useful. Here posteriors do come in – over the parameters within the model structure. I haven’t shown this explicitly as it follows from the usual Bayesian parameter estimation procedures – as long as the closure conditions are assumed.
These closure conditions also allow you to pass from the prior predictive to posterior predictive distributions (see Gelman et al. or Bernardo and Smith for definitions) so do also allow (predictive) inference using a posterior. This was actually my original motivation for making these conditions explicit, as they were (to me) implicit in a number of Bayesian arguments. That this requires accepting two conditional probability statements is neither here nor there to me as far as ‘being Bayesian’ or not is concerned. As I mentioned in the original post I am not a ‘complete Bayesian’ for similar reasons – additional, but entirely compatible, assumptions are needed to complete the usual Bayesian account. I am hardly the first person to point this out. I also note that in principle it may give a motivated person some (philosophical) wiggle room to replace using Bayesian parameter estimation with another parameter estimation method, for a number of reasons I won’t go into here. All ‘power’ to them.
In terms of the need for a ‘falsification rule’: in my account these are needed for saying when the closure conditions fail to hold for a particular model/context. I briefly indicated which of the avenues suggested by Mayo I follow: essentially ‘pure significance’ tests (Fisherian style, without an alternative). I prefer to do these graphically, as done in Gelman et al.’s BDA. Fisher also recommended that these tests be ‘informal’ rather than governed by formal criteria like p < 0.05 (so, note to reproducibility people: you can’t [really] blame him for the widespread abuse of p-values!).
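For concreteness, a minimal sketch of the kind of graphical check meant here, loosely in the spirit of the posterior predictive checks described in BDA (the data, model and test quantity are invented, and the ‘posterior’ is approximated very crudely): simulate replicated data under the fitted model and see whether a chosen data feature of the observed data looks unsurprising among the replications.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Invented observed data (heavier-tailed than the working model assumes).
y_obs = rng.standard_t(df=3, size=100)

# Working model: y ~ Normal(mu, s), with a crude approximate posterior for mu.
s = y_obs.std()
mu_draws = rng.normal(y_obs.mean(), s / np.sqrt(len(y_obs)), size=1000)

# Replicate data under the model and compare a test quantity (here, the maximum).
T_rep = [np.max(rng.normal(mu, s, size=len(y_obs))) for mu in mu_draws]
plt.hist(T_rep, bins=30)
plt.axvline(np.max(y_obs), color="red")   # observed max vs replicated maxima
plt.xlabel("max of replicated data")
plt.show()
# If the observed value sits far out in the tail, the model structure (and hence
# the closure assumptions behind it) looks suspect for these data features.
```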
Rather than aiming to reject a particular model – as in the notorious ‘NHST’ procedure – the goal is usually to find a structure, characterising an ensemble of models, that satisfies closure. This is entirely analogous to Spanos’ (who is also an ‘Error Statistician’ along with Mayo) requirement of establishing model structure adequacy before parameter estimation is carried out. After these Fisherian-style tests of model structure, he uses Neyman-Pearson-style estimation; I use Fisherian-style tests of model structure in combination with Bayesian/Likelihood estimation.
There are a number of details left implicit or even completely absent here and I don’t have the time/motivation/ability to fill them all in right now. My advice for the moment to anyone interested would be to read more applied Bayesian work and look for where the closure conditions come in and how they are checked.
Fin.
Interesting post! Thanks for sending me the link.
You write:
> My closure assumptions are
> 1) p(y|x,b) = p(y|x) for all x,y,b
> 2) p(x|b) is given/known for all b
It seems that 1) will almost never be satisfied. To take a trivial example, the probability distribution over possible results y in a reaction time experiment under some experimental manipulation x will always depend on many background factors b, e.g. whether or not the experimental subject is sober.
Am I missing something?
Thanks for the response!
Let me see if I understand. So y is reaction time, x are a set of explanatory factors and b is background. To construct a theory of human reaction speed surely you either need x to include sobriety level or you can (hypothetically) construct a theory in which reaction time is independent of sobriety and then this goes into b?
So the theory connecting x to y might start from a certain point in the signalling pathways and run until the reaction event, and be independent of b (ie everything in the ‘causal chain’ x->y from some point in the signalling pathway to muscle twitch only depends on the starting state and not how the starting state was brought about from b->x). Your p(x|b) then describes the effect of the background context (drunk people) on the starting state values of x, ie b->x.
Does that make any sense?
Outside of relatively fundamental physics, I doubt that you’ll be able to pack everything that might make a difference into b. (The walls might cave in, a bird might fly in through the window, gravity might stop working, etc.) Instead of “for all,” you need something like “within this population, as long as no unusual ‘outside influences’ interfere,” which will unfortunately have to be rather vague.
[Edited]
Hi Greg,
I’ve added a further distinction between (assumed to be) manipulable b’ and irrelevant b” background variables in the vector b giving
(3) p(x|b) = p(x| b’,b”) = p(x|b’)
This is a ‘bold’ assumption in that it could be falsified by unusual events. We then have a choice of
– developing a (probably ad hoc) theory for unusual events by moving some b” variables into b’ or x and having a more complicated theory structure (in terms of number of theoretically-relevant variables)
– having no theory for unusual events (for the moment) and focusing on those which satisfy the closure. These are the ‘simple but general’ theories like the ideal gas.
Thus different theories have different divisions of x, b’ and b”. Whether the closure conditions are satisfied depends on your willingness to accept them in a given situation, which in turn depends on how you define your observable y (for example).
I have more to say on this measurement issue at some point in the future, but suffice to say a ‘coarse’ y makes it easier to accept that the conditions are satisfied.
Thanks for taking the time to provide some feedback.
Hi Greg, see also here: https://omaclaren.wordpress.com/2015/10/06/model-schema-and-the-structuralist-interpretation-of-for-all/
That helps. Thanks!
Thanks for the link to your blog.
You say your falsification rules are Fisherian simple significance tests without alternatives: OK, that’s the answer I suspected (when I gave the list of choices on my blog). They’re not deductive but have various justifications in terms of error probabilities (whether inferentially or behavioristically justified). So the falsification part of “falsificationist-Bayesian” comes from the frequentist (or error statistical) side. Now with simple significance tests, you can at most infer a genuine discrepancy or problem somewhere, and what you detect depends upon the test statistic or directional departure chosen. Small P-value underdetermines the inference. Fisher had intuitions about the appropriate test statistics based on things like sufficiency, but Neyman and Pearson were right* that, without additional constraints, a probabilistic rejection rule is radically underdetermined. (*More than that, Neyman proved this.) They supplied criteria to remedy this, but one needn’t and shouldn’t adopt the extreme behavioristic interpretation many erroneously attribute to NP.
But, there’s one thing, you cannot go directly from a statistical falsification to an alternative model, insofar as there remain many, many rivals that could do about as well on your test. The severe tester deals with this in a manner formally akin to NP tests.
(In many popular uses of tests, the so-called NHST animal, it’s thought the alternative can be a substantive rival or research hypothesis. Going directly from statistical significance to those rivals is unwarranted and unreliable).
One other thing: I noticed you prefer to say your analysis is merely graphical and not really to perform a significance test, but the thing is, it’s an inference, however weak. (And a falsification is anything but weak.) It’s an “analysis” and you wouldn’t allow just any kind of eyeballing to warrant falsifying. There have to be criteria for the reasoning, graphical or not, unless maybe in cases where no statistics is needed at all. I discuss some of these points in my commentary on Gelman and Shalizi:
Click to access Comment%20onGelman&Shalizi_pub.pdf
Dear Mayo,
Thanks for your response. I don’t see the goal of the ‘tests without an alternative’ as being to infer an alternative.
I see them as part of establishing a ‘statistically adequate’ model class within which models of interest can be compared.
Their interpretation is in terms of how surprising the given data is for a given model or model class – a sort of self-consistency check. I also think ideas of ‘regularisation’ of the model class come in here but will leave that for another day.
I agree that there are many model classes of interest that are ‘adequate’ on a given dataset. That is why one should
– only compare parameters within a given, common, adequate structure
– accept that the judgement of which ‘data features’ are used to assess the adequacy of a class of models is a fundamental starting point/assumption that is not (necessarily) justified in terms of other notions
Is this not the view of Spanos? He has said very similar things in his book and papers.
Where do you see the role of ‘severity’ with respect to misspecification testing?
I find it ambiguous whether you restrict severity in statistical practice to parameter estimation within an adequate model structure or see a role for it in misspecification testing too. Could you clarify?
You can’t establish a statistically adequate model without inferring a statistically adequate model, at some point. If you just infer there’s something wrong, it won’t do. And if you have to have a “closed system” as you say, then you won’t break out of it very readily. On severity and m-s tests, we, but mostly Spanos on his own, have written quite a lot on severity within m-s testing. I’m sure you can find it.
Hi Mayo,
This seems to be a central point of difference/misunderstanding/miscommunication, worth exploring. It seems to come up in most of our exchanges and also those I’ve seen between you and others such as Gelman.
How do you define an ‘adequate’ model structure? What do you see as the goal of establishing an ‘adequate structure’? Do you yourself distinguish between estimation/testing ‘inside’ a model structure and testing ‘outside’ a model?
I mentioned Spanos’ work because I have read it and it seems to be in line with what I’m saying – Fisherian ‘outside’ and NP ‘inside’. Would you disagree with this? I may be wrong of course!
Oliver: Thanks for reminding me to get back to you.
“Their interpretation is in terms of how surprising the given data is for a given model or model class – a sort of self-consistency check. I also think ideas of ‘regularisation’ of the model class come in here…”
Every data set is “surprising” in some way or other, so that’s not enough. What more is required? Moreover, if you’re going to use p-value reasoning for m-s testing, you cannot reject the justification of p-values for inferring discrepancies or systematic effects in general. So a “falsificationist Bayesian” who uses p-values (or analogous graphical checks) who denies the relevance of p-values, or claims they’re only good for a long-run behavior justification, is being inconsistent, at least on the face of it.
“I see them as part of establishing a ‘statistically adequate’ model class within which models of interest can be compared.”
If you’re merely comparing models, then it sounds like more of a comparative claim, or maybe a model selection activity, and the concern over exhausting the space rears its head. Bayesians turn to P-values, as I understand it, when they want to check model adequacy without a specific alternative.
For m-s testing and SEV, maybe look at Mayo and Spanos (2004), off my blog publication list on the left hand column.
Hi Mayo,
Thanks for your response.
Like I said – we have a scheme defining what is required of a ‘closed’ model class. Before we choose a model within this class – yes I interpret Bayesian statistical inference as a comparative account within a model class – we must find a particular instance of a ‘structural’ model that satisfies closure. Just as Spanos* requires.
To decide if a given structure is ‘adequate’ we need to decide if it satisfies the structural closure scheme I gave. To do this requires something like a pure significance test of the closure assumptions using ‘test statistics’ or ‘data features’ or ‘discrepancy measures’. So, yes, p-value-style reasoning. But this is ultimately an informal judgement that a set of assumptions about a model are ‘adequately satisfied’. Just like for Spanos*.
So I don’t reject p-value reasoning, but I do see deciding if the axioms of a mathematical model are adequately satisfied (via p-values, say, or some other informal judgement) as generally different from working with the mathematical model itself. I didn’t say anything about the ‘long-run’.
Each to their own – see Hilbert’s quote in this post maybe:
https://omaclaren.wordpress.com/2015/10/06/model-schema-and-the-structuralist-interpretation-of-for-all/
and the topological metaphor here:
https://omaclaren.wordpress.com/2015/10/02/model-closure-and-formalism-in-economics-leading-to-a-topology-analogy/
*See the next comment for a few quotes from Spanos, who I take to be representative of an Error Statistician.
*Re: the comment above where I referred to similarities with Spanos’ (Error Statistical!) approach, here are some quotes from him:
In ‘Error and Inference’ (2010, co-edited with you) p. 324, discussing two exchanges between you and (Sir) David Cox in an earlier chapter of the same book and Cox’s influence on him:
> on a personal note, I ascertained the crucial differences between testing within (N-P) and testing outside the boundaries (M-S) of a statistical model and ramifications thereof, after many years of puzzling over what renders Fisher’s significance testing different from N-P testing…I also came to appreciate the value of preliminary data analysis and graphical techniques in guiding and enhancing the assessment of model adequacy…
Similarly, here is Spanos in his econometrics textbook ‘Probability Theory & Statistical Inference’ (1999) p. 720-721:
> In view of such a comparison, it is generally accepted that the Neyman–Pearson formulation has added some rigor and coherence to the Fisher formulation and in some ways it has superseded the latter. The Fisher approach is rarely mentioned in statistics textbooks (a notable exception is Cox and Hinkley (1974)). However, a closer look at the argument that the main difference between the Fisher and the Neyman–Pearson approaches is the presence of an alternative hypothesis in the latter, suggests that this is rather misleading.
> The line of argument adopted in this book is that the Neyman–Pearson method constitutes a different approach to hypothesis testing which can be utilized to improve upon some aspects of the Fisher approach. However, the intended scope of the Fisher approach is much broader than that of the Neyman–Pearson approach. Indeed,…the Fisher approach is more germane to misspecification testing.
> As argued above, in the context of the Neyman–Pearson approach the optimality of a test depends crucially on the particular… power function… The search for answering question (14.49) begins and ends within the boundaries of the postulated statistical model. In contrast, a Fisher search…allows for a much broader scouring.
> At this stage the reader might object to the presence of an alternative hypothesis in the context of the Fisher specification. After all, Fisher himself denied the existence of the concept, as defined by Neyman and Pearson, in the context of his approach. However, even Fisher could not deny the fact that for every null hypothesis in his approach there is the implicit alternative hypothesis: the null is not valid. The latter notion is discernible in all of Fisher’s discussions on testing (see in particular Fisher (1925a,1956)).
> We can interpret Fisher’s objections as being directed toward the nature of the Neyman– Pearson alternative, and in particular the restriction that the alternative should lie within the boundaries of the postulated model. Hence, the crucial difference between the two approaches is not the presence or absence of an alternative hypothesis but its nature.
> In the case of a Fisher test the implicit alternative is much broader than that of a N–P test. This constitutes simultaneously a strength and a weakness of a Fisher test.
> An alternative but equivalent way to view the crucial difference between the Fisher and Neyman–Pearson approaches is in terms of the domain of search for the “true” statistical model in the two cases. The Neyman–Pearson approach can be viewed as testing within the boundaries demarcated by the postulated statistical model.
So, OK, ‘Fisherian’ tests of course have a ‘minimal’ alternative that the null is not valid. However it is important to note that this testing is of a qualitatively different nature to NP testing and is based on testing ‘outside’ a model structure rather than testing (or estimating) ‘within’ a model structure. In the terminology I used in the first post, it is based on testing the closure assumptions themselves.