The interesting thing, which I’ve discussed with other mathematicians, is the existence of various subcultures within mathematics even beyond the pure/applied division. In fact it is often pure mathematicians who appreciate what mathematical biologists do, while some applied mathematicians think we don’t prove enough ‘quantitative theorems’ about things like convergence rates, etc.

All types are necessary, I think. Biology is often so messy that choosing the right (abstract!) ‘test statistic’ is crucial and helps set the level of abstraction.

In ‘Error and Inference’ (2010, co-edited with you), p. 324, discussing two exchanges between you and (Sir) David Cox in an earlier chapter of the same book, and Cox’s influence on him, Spanos writes:

> on a personal note, I ascertained the crucial differences between testing within (N-P) and testing outside the boundaries (M-S) of a statistical model and ramifications thereof, after many years of puzzling over what renders Fisher’s significance testing different from N-P testing…I also came to appreciate the value of preliminary data analysis and graphical techniques in guiding and enhancing the assessment of model adequacy…

Similarly, here is Spanos in his econometrics textbook ‘Probability Theory & Statistical Inference’ (1999), pp. 720–721:

> In view of such a comparison, it is generally accepted that the Neyman–Pearson formulation has added some rigor and coherence to the Fisher formulation and in some ways it has superseded the latter. The Fisher approach is rarely mentioned in statistics textbooks (a notable exception is Cox and Hinkley (1974)). However, a closer look at the argument that the main difference between the Fisher and the Neyman–Pearson approaches is the presence of an alternative hypothesis in the latter, suggests that this is rather misleading.

> The line of argument adopted in this book is that the Neyman–Pearson method constitutes a different approach to hypothesis testing which can be utilized to improve upon some aspects of the Fisher approach. However, the intended scope of the Fisher approach is much broader than that of the Neyman–Pearson approach. Indeed,…the Fisher approach is more germane to misspecification testing.

> As argued above, in the context of the Neyman–Pearson approach the optimality of a test depends crucially on the particular… power function… The search for answering question (14.49) begins and ends within the boundaries of the postulated statistical model. In contrast, a Fisher search…allows for a much broader scouring.

> At this stage the reader might object to the presence of an alternative hypothesis in the context of the Fisher specification. After all, Fisher himself denied the existence of the concept, as defined by Neyman and Pearson, in the context of his approach. However, even Fisher could not deny the fact that for every null hypothesis in his approach there is the implicit alternative hypothesis: the null is not valid. The latter notion is discernible in all of Fisher’s discussions on testing (see in particular Fisher (1925a,1956)).

> We can interpret Fisher’s objections as being directed toward the nature of the Neyman–Pearson alternative, and in particular the restriction that the alternative should lie within the boundaries of the postulated model. Hence, the crucial difference between the two approaches is not the presence or absence of an alternative hypothesis but its nature.

> In the case of a Fisher test the implicit alternative is much broader than that of a N–P test. This constitutes simultaneously a strength and a weakness of a Fisher test.

> An alternative but equivalent way to view the crucial difference between the Fisher and Neyman–Pearson approaches is in terms of the domain of search for the “true” statistical model in the two cases. The Neyman–Pearson approach can be viewed as testing within the boundaries demarcated by the postulated statistical model.

So, OK, ‘Fisherian’ tests of course have a ‘minimal’ alternative: that the null is not valid. However, it is important to note that this testing is of a qualitatively different nature from N–P testing, being based on testing ‘outside’ a model structure rather than testing (or estimating) ‘within’ a model structure. In the terminology I used in the first post, it is based on testing the closure assumptions themselves.

Thanks for your response.

Like I said – we have a scheme defining what is required of a ‘closed’ model class. Before we choose a model within this class – yes, I interpret Bayesian statistical inference as a comparative account within a model class – we must find a particular instance of a ‘structural’ model that satisfies closure. Just as Spanos* requires.

To decide if a given structure is ‘adequate’ we need to decide if it satisfies the structural closure scheme I gave. Doing this requires something like a pure significance test of the closure assumptions, using ‘test statistics’ or ‘data features’ or ‘discrepancy measures’. So, yes, p-value-style reasoning. But this is ultimately an informal judgement that a set of assumptions about a model is ‘adequately satisfied’. Just as for Spanos*.
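To make the above concrete, here is a minimal sketch of a simulation-based pure significance test of a closure assumption. Everything specific (the normality assumption, the skewness discrepancy, the function names) is an illustration I’ve made up, not anything from the discussion:

```python
import numpy as np

rng = np.random.default_rng(0)

def pure_significance_test(data, statistic, simulate_null, n_sim=2000):
    """How surprising is the observed value of a discrepancy measure
    under the postulated model, with no explicit parametric alternative?"""
    observed = statistic(data)
    sims = np.array([statistic(simulate_null()) for _ in range(n_sim)])
    # p-value with the +1 correction so it is never exactly zero
    return (1 + np.sum(sims >= observed)) / (1 + n_sim)

# Illustrative closure assumption: IID normal errors.
# Illustrative discrepancy measure ('data feature'): absolute sample skewness.
def abs_skewness(x):
    z = (x - x.mean()) / x.std()
    return abs(np.mean(z ** 3))

data = rng.normal(size=100)  # stand-in for model residuals
p = pure_significance_test(
    data,
    statistic=abs_skewness,
    simulate_null=lambda: rng.normal(size=len(data)),
)
print(f"p = {p:.3f}")  # a small p would cast doubt on the closure assumption
```

The point of the sketch is only that the ‘alternative’ never appears: the test asks whether the data feature is surprising under the postulated structure, full stop.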

So I don’t reject p-value reasoning, but I do see deciding whether the axioms of a mathematical model are adequately satisfied (via p-values, say, or some other informal judgement) as generally different from working with the mathematical model itself. I didn’t say anything about the ‘long-run’.

Each to their own – see Hilbert’s quote in this post maybe:

and the topological metaphor here:

*See the next comment for a few quotes from Spanos, who I take to be representative of an Error Statistician.

“Their interpretation is in terms of how surprising the given data is for a given model or model class – a sort of self-consistency check. I also think ideas of ‘regularisation’ of the model class c”

Every data set is “surprising” in some way or other, so that’s not enough. What more is required? Moreover, if you’re going to use p-value reasoning for m-s testing, you cannot reject the justification of p-values for inferring discrepancies or systematic effects in general. So a “falsificationist Bayesian” who uses p-values (or analogous graphical checks) yet denies the relevance of p-values, or claims they’re only good for a long-run behavior justification, is being inconsistent, at least on the face of it.

“I see them as part of establishing a ‘statistically adequate’ model class within which models of interest can be compared.”

If you’re merely comparing models, then it sounds like more of a comparative claim, or maybe a model selection activity, and the concern over exhausting the space rears its head. Bayesians turn to P-values, as I understand it, when they want to check model adequacy without a specific alternative.
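The sort of Bayesian adequacy check mentioned here is often cast as a posterior predictive p-value. A minimal sketch, assuming an illustrative conjugate normal model (the model, data, and choice of discrepancy are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data and a simple conjugate model:
# y_i ~ N(mu, 1) with prior mu ~ N(0, 10^2)
y = rng.normal(size=50)
n, sigma2, tau2 = len(y), 1.0, 100.0

# Posterior for mu is normal (standard conjugate update)
post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
post_mean = post_var * y.sum() / sigma2

# Discrepancy chosen without a specific alternative in mind: the sample maximum.
def T(x):
    return x.max(axis=-1)

# Posterior predictive p-value: draw mu from the posterior, replicate data,
# and ask how often the replicated discrepancy exceeds the observed one.
n_rep = 5000
mus = rng.normal(post_mean, np.sqrt(post_var), size=n_rep)
y_rep = rng.normal(mus[:, None], np.sqrt(sigma2), size=(n_rep, n))
p = np.mean(T(y_rep) >= T(y))
print(f"posterior predictive p = {p:.3f}")
```

As with a pure significance test, no alternative model is specified: an extreme p flags that the postulated model reproduces this data feature poorly.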

For m-s testing and SEV, maybe look at Mayo and Spanos (2004), in my blog publication list in the left-hand column.