An asymptotic threshold of sufficient randomness for causal inference

For sensitivity analysis with stochastic counterfactuals, we introduce a methodology to characterize uncertainty in causal inference from natural experiments. Our sensitivity parameters are standardized measures of variation in propensity and prognosis probabilities, and one minus their geometric mean is an intuitive measure of randomness in the data generating process. Within our latent propensity-prognosis model, we show how to compute, from contingency table data, a threshold, T, of sufficient randomness for causal inference. If the actual randomness of the data generating process is greater than this threshold, then causal inference is warranted. We demonstrate our methodology with two example applications.

constrained optimization problem over the space of variable propensity-prognosis distributions, in order to compute a threshold, T, of sufficient randomness for causal inference. If the actual randomness of the data generating process exceeds that threshold, that is, if η > T, then causal inference is warranted. Remarkably, full randomization is not required.
In Section 3, we show how to compute T from observed contingency table data. In particular, we show that T can be determined from a measure of association known as the φ coefficient; see Theorem 1. Two example applications are described in Section 4. In Section 5, we further discuss our introduced methodology and future directions.

1.1 | Comparison to previous work
Sensitivity analysis with our randomness threshold T is most similar to the sensitivity analysis of Rosenbaum (2002, Chap. 4). For pairs of apparently identical individuals, Rosenbaum (2020) considers the odds ratio of their propensity probabilities for treatment, and his sensitivity parameter Γ is the least upper bound of these odds ratios. While Rosenbaum utilizes a sup-norm, our methodology utilizes a 2-norm. Our proposed methodology is complementary, and our emphasis on randomness is consistent with Fisher's view of randomization as the "reasoned basis" for causal inference (Fisher, 1935; Rosenbaum, 2020, p. 37).
Methods for dual and simultaneous sensitivity analysis are presented in Gastwirth et al. (1998). A simultaneous sensitivity analysis considers how an unmeasured covariate relates to both treatment and outcome. The propensity score summarizes how measured covariates relate to treatment, while a prognostic score summarizes how measured covariates relate to potential outcomes. Prognostic scores are discussed in Hansen (2008) and Leacy and Stuart (2013). Our propensity probabilities summarize how measured and unmeasured covariates relate to treatment, and our potential, prognosis probabilities summarize how measured and unmeasured covariates relate to the outcome.
We have published similar methodologies for sensitivity analysis of continuous data (Knaeble & Dutter, 2017; Knaeble et al., 2020), and there are related publications describing additional methods of sensitivity analysis in the multiple regression setting (Cinelli & Hazlett, 2019; Frank, 2000; Hosman et al., 2010; Oster, 2017). The seed of the proposed methodology of this paper was the mathematical symmetry in the analysis of Knaeble et al. (2020). Based on a series of interviews with practicing epidemiologists, the methodologies of Knaeble and Dutter (2017) and Knaeble et al. (2020) have been adapted and modified for practical application during analysis of categorical data, resulting in the methodology of this paper.
The methodology of Knaeble and Dutter (2017) deals with two coefficients of determination, while the methodology of Ding and VanderWeele (2016) deals with two relative risks (risk ratios). In VanderWeele and Ding (2017), the E-value is introduced as the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away a specific treatment-outcome association. While the E-value is a minimum strength of association, the threshold T is a maximum value for the randomness of the data generating process in the absence of causality.

2 | METHODS
Here, we describe the details of our model of the data generating process. In particular, we define a summary parameter, which we refer to as the randomness of the data generating process. We then formally state our null hypothesis of no causality. Then, under that null hypothesis, we proceed to define a threshold of sufficient randomness for causal inference. That threshold can be computed from observed data. We conclude this methods section with a description of what we call causality testing.

2.1 | Our model of the data generating process
We denote the treatment or exposure with e and the outcome, response, or disease with d. It is understood that the exposure e is an indicator for any general event thought to be causal and that the disease d is an indicator for any subsequent event possibly caused by the exposure. We assume an infinite population of independent but nonidentical individuals. Each individual is characterized by a triple (π_i, r_0i, r_1i), where for each i we have 0 < π_i, r_0i, r_1i < 1. For each individual we model their exposure with e_i ~ Bernoulli(π_i), their potential outcome in the absence of exposure with d_i(e_i = 0) ~ Bernoulli(r_0i), and their potential outcome in the presence of exposure with d_i(e_i = 1) ~ Bernoulli(r_1i). The expected prognosis for individual i is given by

r_i = π_i r_1i + (1 − π_i) r_0i. (1)

We assume for each i that

e_i ⫫ (d_i(e_i = 0), d_i(e_i = 1)), (2)

that is, conditional on any given individual, potential outcomes are independent of natural, random exposure assignment. The assumptions in (2) are more related to the stable unit treatment value assumption (SUTVA) (Imbens & Rubin, 2015, p. 10) than to the assumption of ignorable treatment assignment conditional on a covariate (Rosenbaum, 2020, p. 71). We discuss how (2) relates to SUTVA and the exclusion restriction (Angrist et al., 1996, Section 3.2) in Section 5.2. We write μ for the real but latent distribution of (π_i, r_0i, r_1i) values, and drop the i subscripts when analyzing μ. The distribution μ is a representation of the data generating process; see Figure 1.

2.2 | Randomness functional on a space of probability distributions
Define the unit cube C = {(π, r_0, r_1) : 0 < π, r_0, r_1 < 1}. We write P(C) to denote the space of probability distributions on C. We write m for a generic probability distribution on C. We define

r = π r_1 + (1 − π) r_0; (3)

compare with (1).
For any fixed m ∈ P(C), we define the means p̄ = ∫_C π dm and r̄ = ∫_C r dm, along with the variances σ²_π and σ²_r of π and r under m. We then define the functionals

R²_π(m) = σ²_π / (p̄(1 − p̄)), (4)

R²_r(m) = σ²_r / (r̄(1 − r̄)), (5)

and their geometric mean

R²(m) = √(R²_π(m) R²_r(m)) = R_π(m) R_r(m). (6)

We refer to η(m) = 1 − R²(m) as the randomness of any m ∈ P(C). We have 0 ≤ R²_π(m), R²_r(m) < 1 and 0 < η(m) ≤ 1. Intuitively, if a distribution m ∈ P(C) has (π, r) values concentrated near a point, then its randomness η(m) will be larger. Our use of the geometric mean is explained in Section 5.5.
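As a concrete illustration, the randomness functional can be evaluated directly for any discrete distribution on the unit square. The sketch below assumes our reading of definitions (4)-(6), namely R²_π(m) = Var_m(π)/(E_m[π](1 − E_m[π])), the analogous expression for r, and η(m) = 1 − √(R²_π R²_r); the helper name `randomness` is ours, not the paper's.

```python
import numpy as np

def randomness(points, weights):
    """Randomness eta(m) of a discrete distribution m on the unit square,
    given support points (pi, r) and their probability weights.

    Assumes the definitions (as we read (4)-(6)):
      R2_pi(m) = Var(pi) / (E[pi] (1 - E[pi])),
      R2_r(m)  = Var(r)  / (E[r]  (1 - E[r])),
      eta(m)   = 1 - sqrt(R2_pi * R2_r)   # one minus the geometric mean
    """
    pts = np.asarray(points, float)
    w = np.asarray(weights, float)
    pi, r = pts[:, 0], pts[:, 1]
    m_pi, m_r = w @ pi, w @ r                  # means of pi and r under m
    var_pi = w @ (pi - m_pi) ** 2              # variance of pi under m
    var_r = w @ (r - m_r) ** 2                 # variance of r under m
    R2_pi = var_pi / (m_pi * (1 - m_pi))
    R2_r = var_r / (m_r * (1 - m_r))
    return 1 - np.sqrt(R2_pi * R2_r)

# A single point mass has zero variance, hence maximal randomness:
print(randomness([(0.5, 0.5)], [1.0]))  # 1.0
```

Consistent with the intuition above, spreading mass away from a single point lowers η: a balanced two-point distribution at (0.2, 0.3) and (0.8, 0.7) gives η = 0.76.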

2.3 | The randomness of the data generating process
Here, we apply the functionals of Section 2.2 to the distribution μ, which represents the data generating process as described in Section 2.1. The resulting parameters R²_π(μ) and R²_r(μ) are coefficients of determination in the sense that they represent proportions of e- and d-variances explainable by measured and unmeasured covariates. We refer to

η(μ) = 1 − R_π(μ) R_r(μ) (7)

as the randomness of the data generating process.
FIGURE 1 On an infinite population the relative frequencies are determined from the distribution, μ, of (π, r_0, r_1) values; see Section 2.1

In Section 5.4, we describe one way to specify upper bounds (u²_π, u²_r) that satisfy 1 > u²_π ≥ R²_π(μ) and 1 > u²_r ≥ R²_r(μ). We then have η(μ) ≥ 1 − u_π u_r > 0. It is thus plausible to demonstrate the presence of randomness in the data generating process of an observational study, but specifying (u²_π, u²_r) or η(μ) is best left to subject matter experts. Here we are focused primarily on quantifying what it means to have sufficient randomness for causal inference. We will compute a threshold of sufficient randomness from the observed data of a contingency table. That objective threshold can then be used as a sensitivity parameter.

2.4 | Our null hypothesis of no causality
For each individual i the parameters π_i, r_0i, and r_1i are fixed and unknown. The population distribution of (π_i, r_0i, r_1i) values is represented by μ. Our sharp null hypothesis of no causality asserts that the proportion of individuals with r_0 = r_1 is 100%. More formally,

H_0: μ({(π, r_0, r_1) ∈ C : r_0 = r_1}) = 1. (8)

2.5 | A threshold of sufficient randomness for causal inference

Under our null hypothesis of no causality (see Section 2.4), the definition of (3) reduces to r := r_0 = r_1. Define the unit square S = {(π, r) : 0 < π, r < 1}. We write P(S) for the space of probability distributions on S.
For j, k ∈ {0, 1}, we write p_jk = P(e = j, d = k) for the proportion of observations with e = j and d = k. We assume 0 < p_jk < 1 for each j, k ∈ {0, 1}.
We write v = (p_01, p_11, p_00, p_10) for the vector of observed relative frequencies. We write P_v(S) for the space of those probability distributions m on S satisfying

∫_S (1 − π) r dm = p_01, (9a)

∫_S π r dm = p_11, (9b)

∫_S (1 − π)(1 − r) dm = p_00, (9c)

∫_S π (1 − r) dm = p_10. (9d)

We refer to any probability distribution m ∈ P_v(S) as a noncausal explanation of the observed data in v. Because P_v(S) ⊂ P(C), the definitions of Section 2.2 can be applied to any m ∈ P_v(S). The randomness of a noncausal explanation m ∈ P_v(S) is given by η(m) = 1 − R²(m). In Figure 2, we have plotted four examples of noncausal explanations of an arbitrary example of data v = (0.10, 0.23, 0.24, 0.43), and for each, we have computed its randomness.
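Whether a candidate distribution qualifies as a noncausal explanation can be checked by evaluating its four mixed moments against v. The sketch below assumes the moment constraints we take (9) to be; the function name is our illustrative choice.

```python
import numpy as np

def is_noncausal_explanation(points, weights, v, tol=1e-9):
    """Check whether a discrete distribution m on the unit square
    reproduces the observed cell frequencies v = (p01, p11, p00, p10),
    that is, whether m satisfies the constraints we take (9) to be:
      E[(1-pi) r] = p01,       E[pi r] = p11,
      E[(1-pi)(1-r)] = p00,    E[pi (1-r)] = p10.
    """
    p01, p11, p00, p10 = v
    pts = np.asarray(points, float)
    w = np.asarray(weights, float)
    pi, r = pts[:, 0], pts[:, 1]
    moments = (w @ ((1 - pi) * r), w @ (pi * r),
               w @ ((1 - pi) * (1 - r)), w @ (pi * (1 - r)))
    return all(abs(a - b) < tol for a, b in zip(moments, (p01, p11, p00, p10)))

# When the observed table is exactly independent (p11 = pe * pd), a single
# point mass at (pe, pd) is itself a noncausal explanation:
print(is_noncausal_explanation([(0.5, 0.5)], [1.0], (0.25, 0.25, 0.25, 0.25)))  # True
```

For an associated table such as v = (0.10, 0.23, 0.24, 0.43), the same point mass fails the check, so any noncausal explanation must place its mass on more than one point.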
The threshold, T(v), of sufficient randomness for causal inference is

T(v) := sup { η(m) : m ∈ P_v(S) }. (10)

The threshold, T(v), is the maximum possible randomness of any noncausal explanation of the observed data.

2.6 | Causality testing
We now have a framework to conduct what we call a causality test. In Section 2.1, we defined our representation of the data generating process, μ, and in Section 2.3, we defined the randomness of the data generating process, η(μ). In Section 2.4, we stated our null hypothesis of no causality. In Section 2.5, under that null hypothesis of no causality, we defined a threshold, T(v), of sufficient randomness. That threshold is computed from the observed relative frequencies in v. The threshold is the maximum possible randomness of any noncausal explanation of the observed data. To conduct a causality test a subject matter expert need not know the fine details of the data generating process as represented by μ. It is sufficient for a subject matter expert to identify sufficient randomness for causal inference. It is sufficient to specify upper bounds (u²_π, u²_r) for (R²_π(μ), R²_r(μ)) (see Section 5.4) so that 1 − u_π u_r > T(v) warrants causal inference. The condition η(μ) > T(v) warrants causal inference because it holds when the true randomness, η(μ), of the data generating process exceeds the maximum possible noncausal randomness, T(v), given the observed data. Limitations of that warrant for causal inference are discussed in Section 5.2.
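Once the expert bounds and the threshold are in hand, the causality test reduces to a single comparison. A minimal sketch (the function name and the illustrative bound values are hypothetical; the value 0.75 matches the vaccination example discussed in Section 5.1):

```python
def causality_test(u2_pi, u2_r, T):
    """Sketch of the causality test: expert-specified upper bounds
    u2_pi >= R2_pi(mu) and u2_r >= R2_r(mu) imply the lower bound
    eta(mu) >= 1 - sqrt(u2_pi * u2_r); causal inference is warranted
    when that lower bound exceeds the threshold T(v)."""
    eta_lower = 1 - (u2_pi * u2_r) ** 0.5
    return eta_lower > T

# Hypothetical bounds: at most 4% of e-variance and 9% of d-variance
# explainable by covariates, so eta(mu) >= 1 - 0.06 = 0.94 > 0.75.
print(causality_test(0.04, 0.09, 0.75))  # True
```

Note that the test is one-sided: failing it does not establish the absence of causality, only the absence of a warrant under these bounds.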

3 | MAIN RESULT
The following theorem gives an analytic solution to the optimization problem in (10) and hence an explicit formula for the threshold T(v) of sufficient randomness. In the formula, the quantity φ is a measure of association known as the φ coefficient. The φ coefficient is an analog of Pearson's correlation coefficient, ρ, but φ measures association between two dichotomous, categorical variables.
Theorem 1. Suppose P(e = 1) ∈ (0, 1) and P(d = 1) ∈ (0, 1). Let

φ = (p_11 p_00 − p_01 p_10) / √(p_e (1 − p_e) p_d (1 − p_d)),

where p_e = P(e = 1) and p_d = P(d = 1). The threshold, T(v), of sufficient randomness for causal inference can be computed with the following formula:

T(v) = 1 − |φ|. (11)

Proof. Write P(S) for the space of distributions on S = {(π, r) : 0 < π, r < 1} and P_v(S) ⊂ P(S) for the class of feasible distributions satisfying the constraints in (9). For a feasible distribution m ∈ P_v(S), the latent covariance is given by

σ_πr = ∫_S πr dm − (∫_S π dm)(∫_S r dm) = ∫_S πr dm − p_e p_d = p_11 − p_e p_d = p_11 − (p_10 + p_11)(p_01 + p_11) = p_11 p_00 − p_01 p_10 = σ_ed. (12)

Note that in (12c) we used the constraints of (9) and in (12e) we used the fact that p_01 + p_11 + p_10 + p_00 = 1. By the Cauchy-Schwarz inequality and (12), any m ∈ P_v(S) satisfies

σ_π σ_r ≥ |σ_πr| = |σ_ed|.

By the constraints in (9), any m ∈ P_v(S) satisfies

∫_S π dm = p_e and ∫_S r dm = p_d,

where p_e = P(e = 1) and p_d = P(d = 1). Thus, by the definitions of (4) and (5), we have that any m ∈ P_v(S) satisfies

R_π(m) R_r(m) ≥ |σ_ed| / √(p_e (1 − p_e) p_d (1 − p_d)) = |φ|. (13)

It remains to show that there exists a feasible probability distribution, m* ∈ P_v(S), that attains the lower bound in (13). Write the signum function sgn(z) = 1 for z > 0 and sgn(z) = −1 for z < 0, and define the support points (π_1, r_1) and (π_2, r_2) and the weight λ ∈ (0, 1) accordingly. We claim that the lower bound in (13) is attained by the two point-mass distribution

m* = λ δ(π_1, r_1) + (1 − λ) δ(π_2, r_2),

where δ(π, r) is the Dirac delta distribution at (π, r). We first claim that (π_1, r_1) ∈ S and (π_2, r_2) ∈ S, so that m* ∈ P(S). To see this, we first establish

− min{p_e p_d, (1 − p_e)(1 − p_d)} < σ_πr < min{p_e, p_d} − p_e p_d. (14)

Because (π, r) ∈ (0, 1)², we have that ∫_S πr dm < min{p_e, p_d}, from which the claimed upper bound in (14) follows. We also have that ∫_S πr dm > 0, so that σ_πr > −p_e p_d. Finally, from the identity σ_πr = ∫_S (1 − π)(1 − r) dm − (1 − p_e)(1 − p_d) and the observation that ∫_S (1 − π)(1 − r) dm > 0, we have that σ_πr > −(1 − p_e)(1 − p_d). Putting these two inequalities together gives the claimed lower bound in (14). With σ_ed ≠ 0, from (14) we then have that (π_1, r_1) and (π_2, r_2) lie in S. We now verify that m* satisfies the constraints in (9). We first compute ∫_S πr dm* = p_11, which verifies (9b). We then compute ∫_S π dm* = p_e and ∫_S r dm* = p_d. This gives us that ∫_S (1 − π) r dm* = p_d − p_11 = p_01, ∫_S π (1 − r) dm* = p_e − p_11 = p_10, and ∫_S (1 − π)(1 − r) dm* = 1 − p_e − p_d + p_11 = p_00, which verifies (9a), (9d), and (9c). ∎
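Assuming the closed form T(v) = 1 − |φ| (which is what the Cauchy-Schwarz bound in the proof suggests for Theorem 1), the threshold can be computed from a contingency table in a few lines; the function name `threshold` is our illustrative choice.

```python
import math

def threshold(v):
    """Threshold T(v) of sufficient randomness for causal inference,
    assuming the closed form T(v) = 1 - |phi| suggested by the
    Cauchy-Schwarz bound in the proof, for v = (p01, p11, p00, p10)."""
    p01, p11, p00, p10 = v
    pe = p10 + p11                    # P(e = 1)
    pd = p01 + p11                    # P(d = 1)
    # phi coefficient of the 2x2 table of relative frequencies
    phi = (p11 * p00 - p01 * p10) / math.sqrt(pe * (1 - pe) * pd * (1 - pd))
    return 1 - abs(phi)

# The arbitrary example data v from Section 2.5:
print(round(threshold((0.10, 0.23, 0.24, 0.43)), 3))  # 0.945
```

A weak observed association (small |φ|) yields a threshold near 1, so only a nearly randomized data generating process could rule out all noncausal explanations; a strong association lowers the bar.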

4 | EXAMPLE APPLICATIONS
Here, we demonstrate how to compute T(v) from contingency table data using Theorem 1. Our first example application addresses the question of whether vaccination for Covid-19 saved lives. The second example application addresses the question of whether marijuana has been a gateway drug leading to more dangerous drug use. These topics were selected to demonstrate certain aspects of our methodology in settings that are familiar to many readers. In Appendix A, we describe how we obtained and prepared the data. The results of these example applications are not meant to be interpreted as providing support for or against any scientific conclusions.

4.1 | Vaccination and Covid-19
The randomized controlled trial of the BNT162b2 mRNA Covid-19 vaccine demonstrated effectiveness of the vaccine to prevent cases of Covid-19, including severe cases, but the sample size was not large enough to determine whether or not the vaccine saved lives (Polack et al., 2020). Later and larger observational studies over a set time period have demonstrated that vaccinated individuals are at lower risk of death than unvaccinated individuals (Scobie et al., 2021), but in those later studies, vaccination was no longer randomly assigned. Data from a susceptible population are shown in Table 1.

4.2 | Marijuana and hard drugs
The Population Assessment of Tobacco and Health (PATH) Study in the United States recorded various habits that individuals engaged in over their lifetimes (US NIH et al., 2016). The study recorded for each individual whether or not they had ever used various drugs. A strong association was observed between marijuana use and the use of harder drugs, such as cocaine, methamphetamine, speed, and heroin. Some relevant data are shown in Table 2.

5 | DISCUSSION
We have reduced the problem of causal inference to a problem of specifying the randomness of the data generating process. Specifying the randomness, η(μ), of the data generating process is a refinement of the act of specifying whether a study is a randomized experiment or an observational study. In a randomized experiment, we have η(μ) = 1, and any association may warrant causal inference. At the other extreme, in an observational study with a highly deterministic data generating process, we have η(μ) ≈ 0, and causal inference is complicated (Pearl, 2015).
Between those extremes is a continuum of studies parameterized by η(μ). We have shown how to objectively compute a threshold, T(v), of sufficient randomness from contingency table data. A warrant for causal inference is provided by η(μ) > T(v).

5.1 | Finite-population adjustment
Our main result of Section 3 is an asymptotic result. On an infinite population, for a noncausal explanation to be valid, the equality constraints of (9) should hold exactly, under H_0, by the law of large numbers. However, on a finite population, a noncausal explanation may be valid if the equations of (9) are only approximately satisfied. To determine whether a finite population size is sufficiently large, we recommend resampling with replacement from the observed data in v to produce a large number w of synthetic samples {v_s}, s = 1, …, w, each of size n. From each of those synthetic samples, we may compute T(v_s) with the formula of Section 3. The distribution of the resulting T(v_s) values gives insight into whether n is sufficiently large. Within the context of our example application of Section 4.1, after computing T(v_s) from w = 10,000 synthetic samples, the 95% quantile of the distribution of resulting T(v_s) values was 0.76, which is only 0.01 above our previously computed T(v) = 0.75 value. Within the context of our example application of Section 4.2, after computing T(v_s) from w = 10,000 synthetic samples, the 95% quantile of the distribution of resulting T(v_s) values was 0.60, which is only 0.02 above our previously computed T(v) = 0.58 value.
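The recommended resampling can be sketched as a multinomial bootstrap of the cell counts. This is our illustrative implementation, and the inlined `threshold` helper assumes the closed form T(v) = 1 − |φ| suggested by the proof of Theorem 1.

```python
import numpy as np

def threshold(v):
    # T(v) = 1 - |phi|, the assumed closed form from Theorem 1
    p01, p11, p00, p10 = v
    pe, pd = p10 + p11, p01 + p11
    return 1 - abs(p11 * p00 - p01 * p10) / np.sqrt(pe*(1-pe)*pd*(1-pd))

def bootstrap_q95(v, n, w=10_000, seed=0):
    """95% quantile of T(v_s) over w multinomial resamples of size n
    drawn with replacement from the observed relative frequencies v."""
    rng = np.random.default_rng(seed)
    ts = []
    for counts in rng.multinomial(n, v, size=w):
        vs = counts / n                              # synthetic frequencies
        if (vs > 0).all() and (vs < 1).all():        # require 0 < p_jk < 1
            ts.append(threshold(vs))
    return float(np.quantile(ts, 0.95))
```

If the 95% quantile sits close to the point estimate T(v), as in the two examples above, the sample size n is plausibly large enough for the asymptotic threshold to be trusted.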

5.2 | Limitations of our warrant for causal inference
Recall that e_i ~ Bernoulli(π_i), d_i(e_i = 0) ~ Bernoulli(r_0i), and d_i(e_i = 1) ~ Bernoulli(r_1i). We have assumed independent individuals, and for each individual i, we have assumed also that e_i ⫫ (d_i(e_i = 0), d_i(e_i = 1)); see (2). These assumptions imply the stable unit treatment value assumption (SUTVA), which has two parts (Imbens & Rubin, 2015, p. 10). The first part requires that the potential outcomes for any individual do not vary with the treatments assigned to other individuals. The second part requires that there are no different forms or versions of each exposure that lead to different potential outcomes. Our e_i ⫫ (d_i(e_i = 0), d_i(e_i = 1)) assumption is valid if nature does not randomly assign multiple exposures. We are thinking here of so-called natural, natural experiments (Rosenzweig & Wolpin, 2000) that exploit naturally random events as instrumental variables. The concept of excludability (Gerber & Green, 2012, Section 2.7.1) is related, as is the exclusion restriction of Angrist et al. (1996, Section 3.2). The exclusion restriction requires that an instrument has no effect on the outcome except through exposure. There is a notion of true randomness (Pironio et al., 2010), and some say that a random event can occur without a cause (Robins et al., 2015, p. 334). We argue that the exclusion restriction is more likely to be satisfied when the process leading to exposure is more random.
We recommend that researchers first identify the random process and then define the exposure to be "that which was randomly assigned." We cannot be sure which component of a compound exposure is causal.However, this problem of compound exposures can occur also within a randomized experiment in the form of compound treatments (Hernán & VanderWeele, 2011).
TABLE 2 A contingency table showing an observed association (RR = 11.4) between marijuana use and hard drug use (columns: hard drug use, No/Yes).

5.3 | Timing
Analysts are free to specify the times at which propensity and prognosis probabilities are defined. For a given individual, there should be a well-defined interval of time during which their exposure could occur, and their propensity probability for the exposure should be defined at some time prior to the start of that time interval. Likewise, there should be a well-defined interval of time during which the disease (or outcome) could occur.
Individual prognosis probabilities should be defined before exposure. Note that these times may vary across individuals. The randomness can be increased by pushing the times of definition for propensity and prognosis probabilities far into the past. Also, a long time interval between an exposure and an outcome may increase the randomness, as in, for instance, studies of the relative age effect or birthdate effect (Muller & Page, 2015). Increased randomness supports causal inference, but there may be a tradeoff. Increasing the lengths of the time frames may call into question the assumptions in (2); see Section 5.2.

5.4 | Transporting upper bounds of coefficients of determination from studies of identical twins
We may estimate (u²_π, u²_r) to satisfy u²_π ≥ R²_π and u²_r ≥ R²_r by analyzing data from studies of monozygotic twins. One twin study of the exposure e allows us to estimate u²_π to satisfy u²_π ≥ R²_π. Another twin study of the outcome d allows us to estimate u²_r to satisfy u²_r ≥ R²_r. In each study, the basic idea is the same: within any twin pair, discordant exposures (or outcomes) are evidence for randomness in the data generating process. Because monozygotic twins are indistinguishable at the time of zygotic splitting, it is not possible at that time (see Section 5.3) to predict which twin of a discordant pair will have which exposure (or outcome). Among a population of monozygotic twin pairs, the higher the proportion of discordant pairs, the lower u²_π (or u²_r).

5.5 | Why the randomness is defined with the geometric mean
The space of observable, relative frequencies of a contingency table is parametrizable by the proportions P(e = 1) and P(d = 1) and the observed covariance σ_ed. We may therefore require of a noncausal explanation m ∈ P_v(S) (see Section 2.5) that ∫_S π dm = P(e = 1) and ∫_S r dm = P(d = 1). Given those fixed means, spurious σ_ed is a monotonic function of the covariance between π and r, which we denote with σ_πr. Under our null hypothesis of no causality, we thus have σ_πr fixed as well. That fixed covariance factors as

σ_πr = ρ_πr σ_π σ_r,

where ρ_πr is the correlation between π and r. With σ_e and σ_d fixed at their observed values, we see that spurious association is parametrized by the product R_π R_r ρ_πr (cf. Knaeble et al., 2020; Knaeble & Dutter, 2017). Knowledge of bounds on ρ_πr may rule out certain noncausal explanations that have infeasible ρ_πr values and thus facilitate more powerful causality testing. However, we think that subject matter experts may be less certain about ρ_πr and more certain about R²_π and R²_r. Thus, η(μ) = 1 − R_π(μ) R_r(μ) can be more easily compared with η(m) = 1 − R_π(m) R_r(m) for any noncausal explanation m ∈ P_v(S).
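The factorization can be checked numerically on a small discrete example. The two-point distribution below is hypothetical, and the check uses the Bernoulli identities σ²_e = p_e(1 − p_e) and σ²_d = p_d(1 − p_d) along with R_π = σ_π/σ_e and R_r = σ_r/σ_d.

```python
import numpy as np

# A hypothetical two-point distribution of (pi, r) values:
pts = np.array([(0.2, 0.3), (0.8, 0.7)])
w = np.array([0.5, 0.5])
pi, r = pts[:, 0], pts[:, 1]

pe, pd = w @ pi, w @ r                               # means: P(e=1), P(d=1)
cov = w @ ((pi - pe) * (r - pd))                     # latent covariance sigma_pir
s_pi = np.sqrt(w @ (pi - pe) ** 2)                   # sigma_pi
s_r = np.sqrt(w @ (r - pd) ** 2)                     # sigma_r
s_e, s_d = np.sqrt(pe * (1 - pe)), np.sqrt(pd * (1 - pd))

R_pi, R_r = s_pi / s_e, s_r / s_d                    # standardized variation
rho = cov / (s_pi * s_r)                             # correlation rho_pir

# sigma_pir = R_pi * R_r * rho_pir * sigma_e * sigma_d
assert np.isclose(cov, R_pi * R_r * rho * s_e * s_d)
print(cov, R_pi * R_r * rho)                          # spurious association pieces
```

The point of the section is visible here: the spurious covariance depends on ρ_πr only through the product R_π R_r ρ_πr, so uncertainty about ρ_πr can be absorbed into the single product.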
Recall that the threshold, T(v), is defined as the maximum possible randomness of any noncausal explanation of the observed data in v; see Equation (10). We note here that Equation (10) does not have a unique solution. In our proof of Theorem 1, we constructed a solution consisting of a distribution with two point-masses. That solution can be modified to produce additional solutions for any specified and feasible (R²_π, R²_r) values that satisfy 1 − R_π R_r = T(v). Thus, information is not lost by using the single parameter R_π R_r.

5.6 | Scientific context
Our proposed methodology for sensitivity analysis with stochastic counterfactuals is flexible, and it has been designed to be applicable. However, it has not been designed to stand alone. We recognize that causal inference requires mutually supporting strands of evidence (Rosenbaum, 2015). Our proposed methodology has been designed to complement classic approaches to causal inference with demonstrated utility; see Pearce et al. (2019).
To improve reader intuition, we describe here a concrete example of a class of data generating processes where our proposed methodology readily applies. There have been numerous studies of the birthday effect, also known as the relative age effect (Muller & Page, 2015). Students or athletes are often divided into arbitrary cohorts, and observational studies have discovered that the relatively older children born just after the cutoff date have higher chances of success later in life when compared with the relatively younger children born just before the cutoff date (Fukunaga et al., 2013; Musch & Grondin, 2001). Here, the treated individuals are those who are relatively older, and the outcome is success later in life. Because much time passes between birth and success later in life, it is appropriate to apply our model with its stochastic counterfactuals.
We may define our propensity probabilities shortly after conception on a population of individuals conceived roughly nine months prior to the cutoff date. In this setting, it is reasonable to assume smaller R²_π(μ) and R²_r(μ) values, and thus a larger η(μ) value; see Equation (7). With known η(μ), we may compute T(v) from the observed data and check whether η(μ) > T(v) warrants causal inference. This is a novel form of causality testing that is similar to traditional hypothesis testing based on p values. In traditional hypothesis testing, the size of a test is the probability of falsely rejecting the null hypothesis, and the size is specified before looking at any data. Likewise, we recommend prospective specification of η(μ) before looking at the data to compute T(v). Specifying η(μ) for observational causality testing is loosely analogous to specifying the size of a traditional hypothesis test. Although the former is based on knowledge of the data generating process and the latter is arbitrary, both are subjective acts. However, like computation of a test statistic, the computation of T(v) from the observed data is objective. Here, in this paper, we have emphasized the objective act of computing the threshold T(v) from observational data; see Theorem 1 of Section 3.

5.7 | Beyond the sharp null hypothesis of no causality
The condition η(μ) > T(v) is not a necessary condition for causal inference. With a highly deterministic process the randomness η(μ) is low, and the condition η(μ) > T(v) may be difficult to satisfy even though causality may be present. Also, if we have r_1 > r_0 for some individuals and r_1 < r_0 for other individuals, then association may be lacking and T(v) could be larger, and again the condition η(μ) > T(v) may be difficult to satisfy. It is possible to conduct causality tests of more general null hypotheses (Caughey et al., 2021).

5.8 | Future work
There are many ongoing aspects to this work on causality testing. We are working to incorporate the principle of maximum entropy for more powerful tests of causality. Also, there are conservative, finite-population corrections that support warrants for causal inference with (1 − α) × 100% confidence. Those corrections are detailed refinements of the resampling described here in Section 5.1. Additionally, there are ways to generalize the methodology to analyze continuous response data. Finally, there are ways to incorporate measured covariate data into the analysis.
The problem of covariate selection (Ding et al., 2017;Pimentel et al., 2016;Wooldridge, 2016) can be viewed through the framework of our propensity-prognosis model for novel insight.

5.9 | Conclusion
Here, we have utilized a propensity-prognosis model of the data generating process to define a measure of stochasticity of the data generating process, which we refer to as the randomness of the data generating process, η(μ). We have shown how to compute the maximum possible randomness of any noncausal explanation for the observed data, which we refer to as a threshold of sufficient randomness for causal inference, T(v).
We have thus introduced a novel condition, η(μ) > T(v), that warrants causal inference from observational data. The summary parameter η(μ)/T(v) may be utilized to summarize the evidence for causality in an observational study.