## 1. Introduction

### 1.1. The problem

In their discussion of observational data analysis, Mosteller and Tukey (1977), page 328, said standard errors ‘cannot be expected to show us the indeterminacies and uncertainties we face’. More recently, a prize-winning paper by Maclure and Schneeweiss (2001) described how random error is but one component in a long sequence of distortive forces leading to epidemiologic observations, and is often not the most important. Yet conventional analyses of observational data in the health sciences (as reviewed, for example, in Rothman and Greenland (1998), chapters 12–17) can be characterized by a two-step process that quantifies only random error:

- (a) employ frequentist statistical methods based on the following assumptions, which may be grossly violated in the application but are not testable with the data under analysis:
    - (i) the study exposure is randomized within levels of controlled covariates (sometimes replaced by a practically equivalent assumption of ‘no unmeasured confounders’ or ‘ignorability of treatment assignment’);
    - (ii) selection, participation and missing data are random within levels of controlled covariates;
    - (iii) there is no measurement error (occasionally, an unrealistically restrictive error model is used to make a correction, which can do more harm than good; see Wacholder *et al.* (1993));

- (b) address possible violations of assumptions (i)–(iii) with speculative discussions of how each might have biased the statistical results. If they like the statistical results from the first step, researchers will argue that these biases are inconsequential, rarely offering evidence to that effect (Jurek *et al.*, 2004). However, if they dislike their results, they may focus on possible biases and may even write whole articles about them (e.g. Hatch *et al.* (2000)).

In practice, the second step is often skipped or fails to address more than one or two assumptions (Jurek *et al.*, 2004). The assumptions in the first step can be replaced by the slightly weaker assumption that any biases from violations of (i)–(iii) cancel, but appeal to such cancellation seems wishful thinking at best.

Paul Meier (personal communication) and others have defended conventional results (derived under step (a)) as ‘best case’ scenarios that show the absolute minimum degree of uncertainty that we should have after analysing the data. Unfortunately, the above assumptions are far too optimistic, in that they produce misleadingly narrow interval estimates precisely when caution is most needed (e.g. in meta-analyses and similar endeavours with potentially large policy impact, as illustrated below). Worse, the illusory precision of conventional results is rarely addressed by more than intuitive judgments based on flawed heuristics; see Section 4.3.

Another defence is that conventional results merely quantify random error. This defence overlooks the fact that such quantification is hypothetical and hence questionable when no random sampling or randomization has been employed and no natural random mechanism has been documented. Conventional (frequentist) statistics are still often touted as ‘objective’, even though in observational epidemiology and social science they rarely meet any criterion for objectivity (such as derivation from a mechanism that is known to be operative in the study). This belief has resulted in an unhealthy obsession with random error in both statistical theory and practice. A prime example, which is often lamented but still very much a problem, is the special focus that most researchers give to ‘statistical significance’—a phrase whose very meaning in observational studies is unclear, owing to the lack of justification for conventional sampling distributions when random sampling and randomization are absent.

The present paper concerns the formalization of the second step, to free inferences from dependence on the highly implausible assumptions used in the first step and from the often misleading intuitions that guide the second. Although I limit the discussion to observational studies, the bias problems that I discuss often if not usually arise in clinical trials, especially when non-compliance or losses occur, and the methods described below can be brought to bear on those problems.

### 1.2. An overview of solutions

An assessment of uncertainty due to questionable assumptions (uncertainty analysis) is an essential part of inference. Formal assessments require a model with parameters that measure departures from those assumptions. These parameters govern the bias in methods that rely on the original assumptions; hence I shall call the parameters *bias parameters*, the model for departures a *bias model* and departures from a particular assumption a *bias source*.
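As a concrete illustration of a bias parameter at work, consider the classic external-adjustment formula for a single unmeasured binary confounder, which expresses the multiplicative bias in an observed risk ratio in terms of the confounder–disease risk ratio and the confounder prevalences among the exposed and unexposed. The sketch below is not taken from the analyses in this paper; all numerical values are hypothetical, chosen only to show how a bias model maps bias parameters to a correction.

```python
# External-adjustment bias model for one unmeasured binary confounder U:
#   bias factor = (rr_ud*p1 + 1 - p1) / (rr_ud*p0 + 1 - p0),
# where rr_ud is the U-disease risk ratio and p1, p0 are the prevalences of U
# among the exposed and unexposed (assumes no effect modification by U).
def confounding_bias_factor(rr_ud, p1, p0):
    return (rr_ud * p1 + 1 - p1) / (rr_ud * p0 + 1 - p0)

# Hypothetical values for illustration only:
observed_rr = 1.7
bias = confounding_bias_factor(rr_ud=3.0, p1=0.4, p0=0.2)
adjusted_rr = observed_rr / bias   # divide out the modelled bias; about 1.32 here
print(round(bias, 3), round(adjusted_rr, 2))
```

Here (rr_ud, p1, p0) are the bias parameters; a sensitivity analysis would tabulate `adjusted_rr` over a grid of such values, while a Bayesian or Monte Carlo analysis would place a prior distribution on them.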

Statistical literature on bias models remains fragmented; most of it deals with just one bias source, and the bias model is often used only for a sensitivity analysis (which displays bias as a function of the model parameters), although occasionally it becomes part of a Bayesian analysis. In contrast, the literature on risk assessment and decision analysis has focused on accounting for all major sources of uncertainty (Morgan and Henrion, 1990; Crouch *et al.*, 1997; Vose, 2000; Draper *et al.*, 2000). Most notable in the health sciences are the confidence profile method (Eddy *et al.*, 1992), which incorporates bias models into the likelihood function, analyses based on non-ignorable non-response models with unknown bias parameters (Little and Rubin, 2002), and Monte Carlo sensitivity analysis (MCSA), which samples bias parameters and then inverts the bias model to provide a distribution of ‘bias-corrected’ estimates (Lash and Silliman, 2000; Powell *et al.*, 2001; Lash and Fink, 2003; Phillips, 2003; Greenland, 2003a, 2004a; Steenland and Greenland, 2004).
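To fix ideas, MCSA for a single bias source can be sketched in a few lines: repeatedly draw the bias parameters from a prior, invert the bias model to obtain a ‘bias-corrected’ estimate, and summarize the resulting distribution. The sketch below corrects a hypothetical case-control table for nondifferential exposure misclassification; the counts and the uniform prior ranges for sensitivity and specificity are invented for illustration only and do not come from the studies analysed later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical case-control counts (exposed, unexposed); illustrative only.
a, b = 50, 450    # cases
c, d = 30, 470    # controls

corrected_or = []
for _ in range(20000):
    # Draw bias parameters from priors (assumed ranges, for illustration):
    se = rng.uniform(0.75, 0.95)   # sensitivity of exposure classification
    sp = rng.uniform(0.90, 0.99)   # specificity
    # Invert the nondifferential-misclassification model in each group:
    #   observed exposed = se*true_exposed + (1 - sp)*(total - true_exposed)
    A = (a - (1 - sp) * (a + b)) / (se + sp - 1)   # corrected exposed cases
    C = (c - (1 - sp) * (c + d)) / (se + sp - 1)   # corrected exposed controls
    B, D = (a + b) - A, (c + d) - C
    if min(A, B, C, D) <= 0:       # discard draws implying impossible counts
        continue
    corrected_or.append((A * D) / (B * C))

corrected_or = np.array(corrected_or)
print(np.percentile(corrected_or, [2.5, 50, 97.5]))
```

Even this toy example exhibits the behaviour discussed below: a modest shift in assumed specificity produces a large shift in the corrected odds ratio, and draws implying negative corrected counts must be rejected. A full MCSA would also incorporate random error, e.g. by adding a resampling or normal-error step to each draw.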

### 1.3. Outline of paper

The next section gives some general theory for bias modelling that encompasses frequentist (sensitivity analysis), Bayesian and MCSA approaches. The theory gives a formal perspective on MCSA and suggests ways to bring it closer to posterior sampling. In particular, it operationalizes the sequential bias factor approach (Maclure and Schneeweiss, 2001) in a form that approximates Gibbs sampling under certain conditions, and it explains the similarity of Bayesian and MCSA results seen in published examples (Greenland, 2003a; Steenland and Greenland, 2004). Section 3 analyses 14 studies of magnetic fields and childhood leukaemia, extending a previous analysis (Greenland, 2003a) by adding new data, providing a more detailed illustration and extending the bias model to include classification error. Classification error proves a large source of uncertainty, owing to the absence of data on which to base priors and to the extreme sensitivity of results over reasonable ranges for the bias parameters. Section 4 discusses some problems in interpreting bias-modelling exercises, along with objections to them; it argues that many of the criticisms apply with even more force to conventional analyses, and that the status of the latter as expected and standard practice in observational research is unwarranted. Section 4 can be read without covering Sections 2 and 3, and I encourage readers who are uninterested in details to skim those two sections and to focus on Section 4.