Overview/disclaimer: Can you do (semi-formal) statistics without assuming, even ‘temporarily’, that a model is ‘true’? Can estimation be done without implying an equivalent hypothesis testing formulation? Here’s a very rough sketch of one attempt. The account is, I hope, mostly a ‘positive’ proposal rather than a critique of existing practice. You have to look for the notes that aren’t played to see the ‘negative’ side/critique…
What is (formal/semi-formal) statistics? A modification of Fisher: Statistics is the study of data, measurements and/or individuals in aggregation.
We call these things that statistics studies ‘aggregates’, ‘datasets’ or ‘populations’ for short. Data in aggregation, i.e. populations, can possess properties distinct from unaggregated data, i.e. individuals. This is an important, if often neglected, feature of statistics: see e.g. Simpson’s paradox, ecological fallacies etc. Also note that in our case a finite sample is a valid aggregate or population in and of itself (but adding more data of course produces a new aggregate distinct from the original).
Statistics aims to both summarise and quantify aggregates and to account for the variation in these summaries across different aggregates. In particular, a statistic, estimator or learning map (we consider these to be different names for the same thing) is a function designed with the purpose of reducing aggregates (lying in the function’s domain) to a summary (lying in the function’s range).
Here we call the space within which these functions take values parameter space, and the domain containing populations/aggregates the data space.
We use the term ‘parameter space’ regardless of whether the output values of the estimator map are ‘model labels’/parameters of models, whether they represent values of an estimator evaluated at a ‘true population’ (which we don’t assume exists) or neither of these. They do not need to be real-valued either – they can be e.g. function-valued, density-valued, image-valued etc.
A common example of an estimator is mapping a set of non-collinear (x,y) pairs to a straight line summary. The line can be obtained by a formal process, such as minimising a least-squares criterion, or more informally, such as by drawing it by hand. Either way it should, I would argue, give the same output given the same input (i.e. be deterministic; stochastic estimators can be represented by deterministic but measure-valued functions). A less typical example would be to map a picture of a human face to an emoji, in which case the estimator is emoji-valued.
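As a minimal sketch of the straight-line example, here is a least-squares estimator written as a plain function from a dataset to a summary. The name `line_estimator` and the toy data are illustrative assumptions, not anything from the text above; the only point is that the map is deterministic and takes a whole dataset as its input.

```python
import numpy as np

def line_estimator(xy):
    """Map a dataset of (x, y) pairs to a (slope, intercept) summary."""
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    # Design matrix with columns [x, 1]; lstsq minimises the sum of squared residuals.
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(slope), float(intercept)

# Hypothetical toy dataset: four points that do not lie on a single line.
data = [(0.0, 1.0), (1.0, 3.5), (2.0, 4.5), (3.0, 7.0)]
estimate = line_estimator(data)

# Same input, same output: evaluating the map again gives an identical summary.
same_again = line_estimator(data)
```

Here the ‘parameter space’ is just pairs of real numbers, but nothing in the function signature requires that; the return value could equally be a fitted curve or any other summary object.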
The important thing in practice is that the resulting ‘parameter’ should represent a reduction of the data population and be an informative or interesting summary of the data. What is ‘interesting’ is judged by the scientists studying the subject but, because of the nature of the aggregation process, this often takes a common form across particular discipline boundaries: e.g. a mean or mode (of a population of numerical measurements) can be interesting in a variety of situations. A related phenomenon is what physicists call ‘universality’.
Again, the domains of these functions are populations (aggregations) and the ranges are interesting summaries (often, but not always, real-valued).
The purpose of an estimator, considered as a function of datasets, is to
a) provide a useful/informative/interesting (to the analyst) reduction of a given dataset/aggregation (e.g. (x,y) pairs to a straight line) and to
b) be a stable reduction in the sense that when evaluated on ‘similar’ datasets (input) it gives a ‘similar’ output (estimate value).
‘Similar’ is a semi-formal concept typically defined by examples and procedures. A common way to illustrate the meaning of a ‘similar dataset’ is via resampling and related procedures such as bootstrapping, jackknifing, data splitting, data perturbation etc. In each of these the analyst gives a constructive procedure that produces new example datasets from a starting example dataset, and this procedure illustrates what they mean by ‘similar dataset’. One could potentially define this slightly more formally as an (often stochastic/probabilistic) mapping from example datasets to new example datasets, but the mapping must be provided by the analyst. A dataset may also include ‘regime indicators’ representing partially recorded information, and similarity measures can include differences in these.
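To make the ‘mapping from example datasets to new example datasets’ concrete, here is a rough sketch of two such analyst-supplied procedures: a bootstrap resample (stochastic) and the jackknife leave-one-out collection (deterministic). The function names and the five-number dataset are hypothetical, chosen only for illustration.

```python
import numpy as np

def bootstrap_similar(dataset, rng):
    """Produce one 'similar' dataset: resample with replacement, same size."""
    dataset = np.asarray(dataset)
    idx = rng.integers(0, len(dataset), size=len(dataset))
    return dataset[idx]

def jackknife_similar(dataset):
    """Produce all leave-one-out 'similar' datasets."""
    dataset = np.asarray(dataset)
    return [np.delete(dataset, i, axis=0) for i in range(len(dataset))]

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
original = np.array([2.1, 3.4, 1.9, 5.0, 4.2])

boot = bootstrap_similar(original, rng)  # one new example dataset
jack = jackknife_similar(original)       # five new example datasets
```

Neither procedure is ‘the’ definition of similarity; each one simply encodes a particular judgement about which other datasets count as being like the one in hand.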
The notion of similar is also made more concrete by introducing an explicit metric, distance and/or topology on examples: examples are more similar the closer they are in distance and more dissimilar the further they are away in distance. The notion of distance should be chosen to accurately reflect the analyst’s judgements of ‘similar’ as above; however, as before, some distances have wide applicability due to the nature of aggregations: e.g. certain statistical distances like the Kolmogorov metric capture useful notions of probabilistic convergence.
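As one hedged illustration of a distance on data space, the sketch below computes the Kolmogorov distance between the empirical distribution functions of two finite datasets of real numbers, i.e. the largest vertical gap between the two step functions. The function name and the example datasets are assumptions made for the illustration.

```python
import numpy as np

def kolmogorov_distance(a, b):
    """Largest gap between the empirical CDFs of two 1-D datasets."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Both empirical CDFs are right-continuous step functions, so the
    # supremum of their difference is attained at an observed data point.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

base = [1.0, 2.0, 3.0, 4.0]
d_same = kolmogorov_distance(base, base)                      # identical data
d_near = kolmogorov_distance(base, [1.1, 2.1, 3.1, 4.1])      # slightly shifted
d_far = kolmogorov_distance(base, [11.0, 12.0, 13.0, 14.0])   # no overlap
```

Note that this particular distance only looks at the ordering of the values, so a small shift and a slightly larger one can be equally ‘far’; whether that matches the analyst’s own sense of ‘similar’ is exactly the judgement the text asks them to make.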
A stable estimator is one which produces estimates (summaries) of ‘similar’ datasets that are also ‘similar’ in a quantitative (or qualitative) sense. This requires another notion of ‘similar’ for parameter space (the output/range space of estimators). This is perhaps most usefully carried out by defining two metrics and/or topologies – one for data space and one for parameter space. In this way, stable means, in essence, that the estimator is a continuous function between these two spaces.
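One way to write this down, as a sketch only: the symbols below are my own notational assumptions, with $T$ the estimator, $\mathcal{D}$ the data space with distance $d_{\mathcal{D}}$, and $\Theta$ the parameter space with distance $d_{\Theta}$. Stability at a dataset $D$ is then ordinary continuity of $T$ at $D$:

```latex
T : (\mathcal{D}, d_{\mathcal{D}}) \to (\Theta, d_{\Theta}),
\qquad
\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \text{such that} \;\;
d_{\mathcal{D}}(D, D') < \delta
\;\Longrightarrow\;
d_{\Theta}\bigl(T(D), T(D')\bigr) < \varepsilon .
```

Read informally: however tight a tolerance $\varepsilon$ is demanded on the summaries, there is some degree of similarity $\delta$ between datasets that guarantees it.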
Directly plotting the estimator values (when possible to plot) obtained from a variety of ‘similar’ datasets is a useful way to visualise its stability/instability. This is similar to a ‘sampling distribution’ in frequentist statistics but need not have this interpretation. Instead I suggest the term distribution of variation.
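A rough sketch of how such a distribution of variation might be computed, assuming the analyst has chosen bootstrap resampling as their notion of ‘similar’. The helper name, the dataset and the choice of mean versus median are all hypothetical; the resulting arrays are what one would hand to a histogram or strip plot.

```python
import numpy as np

def distribution_of_variation(dataset, estimator, n_similar=500, seed=0):
    """Evaluate an estimator on many 'similar' (bootstrap) datasets."""
    rng = np.random.default_rng(seed)
    dataset = np.asarray(dataset, dtype=float)
    n = len(dataset)
    values = []
    for _ in range(n_similar):
        similar = dataset[rng.integers(0, n, size=n)]  # one similar dataset
        values.append(estimator(similar))
    return np.array(values)

# Hypothetical measurements; the last value is a deliberate outlier.
data = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 3.9,
                 2.5, 4.6, 3.1, 3.7, 2.9, 4.0, 40.0])

mean_values = distribution_of_variation(data, np.mean)
median_values = distribution_of_variation(data, np.median)

# A crude one-number description of each distribution of variation.
mean_spread = float(mean_values.std())
median_spread = float(median_values.std())
```

With the same seed both estimators see exactly the same collection of similar datasets, so any difference in spread is down to the estimators themselves. For this dataset the median varies much less than the mean, because the mean moves a lot depending on how many copies of the outlier a resample happens to contain.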
Stability has the ‘predictive’ consequence that, if future datasets are ‘similar’ to the current dataset in the sense defined above, then the estimator evaluated on the present dataset is, by definition, similar to the estimate that would be found by evaluating the estimator on future datasets.
If, however, future datasets are not similar in the sense considered by the analyst then there is no guarantee of stability and/or prediction. Stability/continuity wrt a chosen notion of similarity is the feature that dictates predictive guarantees.
‘Overfitting’ is a form of instability that results from inadequate attention to defining stability wrt ‘similar’ datasets and focusing too much on one particular dataset. Instability implies overfitting and overfitting implies instability. Prediction can only be reliably guaranteed when the future is ‘similar’ to the past in a specified sense and estimators are designed with this similarity in mind.
Overfitting is prevented, to the extent that it is possible, by ensuring the estimator is stable with respect to similar datasets. Of course, it must also provide an ‘interesting’ summary of the present dataset – e.g. mapping everything to zero is stable but uninteresting.
This leads to a trade-off which here is just another form of what is called the bias-variance trade-off in statistical machine learning. The particular trade-off between ‘bias’ and ‘variance’ in SML theory is just one example of what is a somewhat more general and ‘universal’ phenomenon, however.
In short: in general the amount of information retained by an estimator when evaluated on a given dataset ‘conflicts with’ or ‘trades off against’ the stability of this estimator when evaluated on other ‘similar’ but distinct datasets. Statistics is about determining data reductions (estimators) that balance retaining interesting information and stability.
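The trade-off can be sketched numerically. Below, three estimators map (x, y) data to a fitted curve: the map-everything-to-zero estimator, a straight line, and a degree-9 polynomial. Everything here is an illustrative assumption: the simulated data, the use of misfit to the present dataset as a stand-in for ‘information retained’, and the use of small additive perturbations as the notion of ‘similar’.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
# Hypothetical 'present' dataset: a noisy sine curve.
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

def zero_estimator(x, y):
    """Maximally stable, maximally uninteresting: ignore the data."""
    return np.zeros_like(x)

def poly_estimator(degree):
    """Least-squares polynomial of the given degree, returned as fitted values."""
    def fit(x, y):
        return Polynomial.fit(x, y, degree)(x)
    return fit

def misfit_and_spread(estimator, n_similar=200, noise=0.3):
    # Proxy for information retained: how closely the summary tracks this dataset
    # (smaller misfit means more of the present dataset is retained).
    misfit = float(np.mean((estimator(x, y) - y) ** 2))
    # Proxy for stability: spread of the summary over perturbed datasets
    # (smaller spread means a more stable estimator).
    fits = np.array([estimator(x, y + rng.normal(0.0, noise, size=x.size))
                     for _ in range(n_similar)])
    spread = float(np.mean(fits.std(axis=0)))
    return misfit, spread

results = {
    "zero": misfit_and_spread(zero_estimator),
    "degree 1": misfit_and_spread(poly_estimator(1)),
    "degree 9": misfit_and_spread(poly_estimator(9)),
}
```

Going from zero to degree 1 to degree 9, the misfit to the present dataset falls while the spread across similar datasets grows. Which point on that path is the right balance is not something the code can decide; it depends on what the analyst finds interesting and which datasets they regard as similar.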
A grab bag of somewhat related reading
Tukey, J.W. (1997). More honest foundations for data analysis. Journal of Statistical Planning and Inference, 57(1), 21-28. (h/t C. Hennig).
Tukey, J.W. (1993). Issues relevant to an honest account of data-based inference, partially in the light of Laurie Davies’ paper. Princeton University, Princeton.
Davies, P.L., (2014). Data analysis and approximate models: Model choice, Location-Scale, analysis of variance, nonparametric regression and image analysis. CRC Press.
Poggio, T., Rifkin, R., Mukherjee, S., & Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428(6981), 419.
Liu, K., & Meng, X. L. (2016). There Is Individualized Treatment. Why Not Individualized Inference?. Annual Review of Statistics and Its Application 3:1, 79-111.