On optimal rerandomization designs

Blocking is commonly used in randomized experiments to increase efficiency of estimation. A generalization of blocking removes allocations with imbalance in covariate distributions between treated and control units, and then randomizes within the remaining set of allocations with balance. This idea of rerandomization was formalized by Morgan and Rubin (Annals of Statistics, 2012, 40, 1263–1282), who suggested using Mahalanobis distance between treated and control covariate means as the criterion for removing unbalanced allocations. Kallus (Journal of the Royal Statistical Society, Series B: Statistical Methodology, 2018, 80, 85–112) proposed reducing the set of balanced allocations to the minimum. Here we discuss the implication of such an ‘optimal’ rerandomization design for inferences to the units in the sample and to the population from which the units in the sample were randomly drawn. We argue that, in general, it is a bad idea to seek the optimal design for an inference because that inference typically only reflects uncertainty from the random sampling of units, which is usually hypothetical, and not the randomization of units to treatment versus control.


MAHALANOBIS-BASED RERANDOMIZATION
Following the notation in Morgan and Rubin (2012), consider an RCT with n units in the sample, indexed by i, with n_1 to be assigned to treatment and n_0 to be assigned to control; for simplicity, n_1 = n_0. Let W_i = 1 or W_i = 0 if unit i is assigned treatment or control, respectively, and define W = (W_1, …, W_n)^T. Furthermore, let X be the n × K matrix of fixed covariates in the sample (x_i, i = 1, …, n), with observed sample covariance cov(X).
Because n_1 = n_0, there are \binom{n}{n_1} = A possible treatment allocation (assignment) vectors W_j = (W_{j1}, …, W_{jn})^T, j = 1, …, A, where card(𝕎) = A, that is, the cardinality of the set 𝕎 of possible allocations. The Mahalanobis distance for allocation j is

M_j = (n/4) (X̄_{T,j} − X̄_{C,j})^T cov(X)^{−1} (X̄_{T,j} − X̄_{C,j}),    (1)

where X̄_{T,j} and X̄_{C,j} are the covariate mean vectors of the treated and control units under allocation j. Morgan and Rubin (2012) proposed accepting the jth allocation when its treatment assignment vector W_j satisfies

M_j ≤ a,

where a is a positive constant. By the central limit theorem, and supported by experience with real examples for moderate n, the sample means of the covariates will be approximately normally distributed across random samples, so that M_j ∼ χ²_K (Morgan & Rubin, 2012). Letting

p_a = Pr(M_j ≤ a) = Pr(χ²_K ≤ a),

we see that a is determined from the choice of p_a. Because the number of rerandomizations is geometrically distributed, the expected number of randomizations needed to obtain an acceptable allocation is 1/p_a, which means that, for instance, with p_a = 0.001, the average number of randomizations before drawing an allocation that fulfils the criterion is 1000. Morgan and Rubin (2012) show that when M_j ∼ χ²_K, then, due to the spherical symmetry of the multivariate normal distribution,

cov(X̄_{T,j} − X̄_{C,j} | M_j ≤ a) = v_a cov(X̄_{T,j} − X̄_{C,j}),    (2)

with

v_a = Pr(χ²_{K+2} ≤ a) / Pr(χ²_K ≤ a),  0 < v_a < 1.    (3)

This result implies that the variance of the covariate mean differences across allocations in 𝕎_a is reduced relative to its variance across the allocations in 𝕎 by the factor v_a, and the percent reduction in variance of each of the covariates in X (or of any linear combination of them) is equal to 100(1 − v_a).
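The acceptance rule above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the papers; the function names `mahalanobis_distance` and `rerandomize` are ours.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_distance(W, X):
    """M_j = (n/4) (Xbar_T - Xbar_C)' cov(X)^{-1} (Xbar_T - Xbar_C), n_1 = n_0."""
    n = len(W)
    diff = X[W == 1].mean(axis=0) - X[W == 0].mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return (n / 4) * diff @ S_inv @ diff

def rerandomize(X, p_a, rng):
    """Redraw complete randomizations until M_j <= a, with a the p_a
    quantile of chi2_K, as in Morgan and Rubin (2012)."""
    n, K = X.shape
    a = chi2.ppf(p_a, df=K)
    W = np.zeros(n, dtype=int)
    W[: n // 2] = 1
    draws = 0
    while True:
        draws += 1                       # expected number of draws is 1 / p_a
        rng.shuffle(W)
        if mahalanobis_distance(W, X) <= a:
            return W, draws
```

Because each draw is accepted with probability p_a, the returned `draws` is geometrically distributed with mean 1/p_a, matching the expected-number-of-randomizations calculation above.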
Let Y_i(w) be the potential outcome under treatment w for unit i. Under the stable unit treatment value assumption (SUTVA, Rubin, 1980), the observed outcome when i is assigned W_i is equal to Y_i^{obs} = W_i Y_i(1) + (1 − W_i) Y_i(0). The mean difference estimator is

τ̂ = Ȳ_1 − Ȳ_0,    (4)

where Ȳ_1 and Ȳ_0 are the observed outcome means of the treated and control units. Let τ̂_CR and τ̂_RR be the estimators defined in Equation (4) under complete randomization (i.e. when the W_i are randomly drawn from 𝕎) and Mahalanobis-based rerandomization (i.e. when the W_i are randomly drawn from 𝕎_a), respectively. These estimators are unbiased for the sample average treatment effect (SATE) and also for the population average treatment effect (PATE) under random sampling of the n units from the population. For convenience, we also define the corresponding estimators for a specific sample s: τ̂_CR^s and τ̂_RR^s, respectively. Let Y(w) = (Y_1(w), Y_2(w), …, Y_n(w))^T, w = 0, 1, and let R² be the squared multiple correlation between Y(0) and X. Under the assumptions that (i) the residual in the linear projection of Y(0) on X is normally distributed and (ii) treatment effects are additive (so that R² is also the squared multiple correlation between Y(1) and X), the percentage reduction in variance (PRIV) of τ̂_RR^s and τ̂_RR versus the corresponding estimators under complete randomization is

PRIV = 100 [V_n(τ̂_CR) − V_n(τ̂_RR)] / V_n(τ̂_CR) = 100 (1 − v_a) R²,

where V_n(·) denotes the variance of the estimators (Morgan & Rubin, 2012). From this expression together with Equations (1) and (3), it becomes clear that the variance reduction from Mahalanobis-based rerandomization relative to complete randomization is decreasing in p_a, the strictness of the rerandomization criterion, and non-increasing in K, the dimension of X. For additional properties of Mahalanobis-based rerandomization, see Morgan and Rubin (2012) and Li et al. (2018).
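The quantities v_a and PRIV are direct chi-square computations. A short sketch using SciPy (function names ours):

```python
from scipy.stats import chi2

def v_a(p_a, K):
    """Variance reduction factor v_a = Pr(chi2_{K+2} <= a) / Pr(chi2_K <= a),
    where a is the p_a quantile of chi2_K (Morgan & Rubin, 2012)."""
    a = chi2.ppf(p_a, df=K)
    return chi2.cdf(a, df=K + 2) / chi2.cdf(a, df=K)

def priv(p_a, K, R2):
    """Percentage reduction in variance: 100 * (1 - v_a) * R^2."""
    return 100 * (1 - v_a(p_a, K)) * R2
```

Evaluating `v_a` for decreasing p_a at fixed K shows the factor shrinking towards zero, i.e. a stricter criterion gives a larger variance reduction, while `priv` caps the achievable reduction at 100 R².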

THE SPECIAL CASE OF MINIMIZING THE VARIANCE
The mean difference estimate for sample s and allocation j is

τ̂_j^s = (2/n) Σ_{i=1}^n W_{ji} Y_i^{obs} − (2/n) Σ_{i=1}^n (1 − W_{ji}) Y_i^{obs}

(Imbens & Rubin, 2015; Morgan & Rubin, 2012). For sample s, the variances of the estimators, V_n(τ̂_CR^s) and V_n(τ̂_RR^s), depend only on the treatment assignment mechanism, that is, the experimental design.
For the super population, as assumed in Kallus (2018),

PATE = E(SATE),

where the expectation is over all random samples with fixed sample sizes. The variances of the estimators for inference to PATE are

V(τ̂_CR) = E[V_n(τ̂_CR^s)] + V(SATE)    (6)

and

V(τ̂_RR) = E[V_n(τ̂_RR^s)] + V(SATE),    (7)

respectively. The first term of the variance decomposition (7) is the expected variance of the estimator, and the second term is simply the variance of SATE across random samples. Clearly, only the first term differs between the two designs. From these results it follows that, in line with Kallus (2018), an optimal rerandomization design for inference to PATE should minimize the first term in Equation (7). Optimal designs are obtained by minimizing the maximum conditional variance of the mean difference estimator under the assumption that the conditional means of the outcomes under treatment and control can be estimated (Kallus, 2018). Under this assumption, the pure strategy optimal design (PSOD) finds a single optimal allocation; that is, the allocation in the second stage is deterministic unless there are many allocations achieving the minimum. Inspired by Cochran and Cox (1957, section 4.36), Kallus (2018) constructed his example 1 to illustrate theorem 1 in Wu (1981), or theorem 1 in Kallus (2018), which are the basis for his optimal design. These theorems show that complete randomization is minimax if the covariates X are independent of the outcome. In this example, X and Y are deterministically generated with

X = (2^0, 2^1, 2^2, …, 2^{n/2−1}, −(2^0), −(2^1), −(2^2), …, −(2^{n/2−1}))^T,

where the ith element is unit i's covariate value, i = 1, …, n.† Note that with this data generating process, X has a different distribution for each n; that is, a larger n not only gives a larger sample but the larger sample arises from a distribution with larger variance.
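The decomposition in Equation (7) is an instance of the law of total variance, which can be verified by exhaustive enumeration on a toy finite population. The potential outcomes below are made-up illustrative numbers, and the design is complete randomization; the script enumerates every sample and every allocation and checks that V(τ̂) = E[V_n(τ̂^s)] + V(SATE) exactly.

```python
import itertools
import numpy as np

# Toy finite population with made-up potential outcomes (Y(0), Y(1)).
Y0 = np.array([1.0, 2.0, 4.0, 3.0, 0.0, 5.0])
Y1 = np.array([2.0, 2.5, 7.0, 3.0, 1.0, 9.0])   # heterogeneous effects
N, n = len(Y0), 4                               # sample size n, n/2 treated

estimates, cond_vars, sates = [], [], []
for s in map(list, itertools.combinations(range(N), n)):   # every sample
    sates.append((Y1[s] - Y0[s]).mean())                   # SATE of sample s
    tau_hats = []
    for t in itertools.combinations(range(n), n // 2):     # every allocation
        treated = np.zeros(n, dtype=bool)
        treated[list(t)] = True
        tau_hats.append(Y1[s][treated].mean() - Y0[s][~treated].mean())
    estimates.extend(tau_hats)
    cond_vars.append(np.var(tau_hats))          # V_n(tau_hat^s), exact

total_var = np.var(estimates)                    # V(tau_hat) over (s, alloc)
decomposed = np.mean(cond_vars) + np.var(sates)  # E[V_n(tau_hat^s)] + V(SATE)
```

Because the difference-in-means estimator is conditionally unbiased for SATE under complete randomization, `total_var` and `decomposed` agree up to floating-point error.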
Let τ̂_B and τ̂_P be the mean difference estimators under blocking and pairwise-matched allocations (see Kallus, 2018, p. 97, for details), respectively. Under this data generating process, Kallus (2018) shows results that are in agreement with Cochran and Cox (1957, Equation 4.3).
Using Mahalanobis-based rerandomization, the optimal design is obtained by minimizing the variance of the observed covariates X (Equation 2), which is achieved by letting a ≡ min_{W_j ∈ 𝕎} M(W_j, X), which, in large samples, implies a ≃ 0. With an additive treatment effect, this criterion implies that the PRIV of τ̂_RR is equal to 100 × R².
Including only X (i.e. no transformations of it) in the Mahalanobis distance criterion, Kallus (2018) implies that V_n(τ̂_RR | p_a = 0) ≡ 4, where p_a = 0 is the minimal limiting acceptance criterion; that is, the variance when restricting the allocations to those with M_j = 0 only. The claim that V_n(τ̂_RR | p_a = 0) ≡ 4 for all n is not correct, as will be shown below. This mistake, however, provides a useful basis for discussing the foundation for inference to the experiment's sample under randomization inference and the implication this has for an optimal design for inference to the population.

Inference to the sample in the experiment
The mistake in Kallus (2018) stems from the incorrect assumption that the allocation W = (0, 1, 0, 1, …, 0, 1)^T uniquely minimizes the Mahalanobis distance for all n; the variance calculations are therefore, incorrectly, based solely on this allocation. In fact, in an experiment with n_1 = n_0 using the Mahalanobis distance balance measure, card(𝕎_Opt) ≥ 2; that is, there exist at least two allocations with the smallest Mahalanobis distance, not a unique one. This follows because, under the Mahalanobis distance, every allocation has a mirror allocation with 1's and 0's exchanged but with the same imbalance. Thus, the minimum number of allocations with the smallest imbalance is two (a pair of mirror allocations). For any balance measure that fulfils the mirror property (Johansson & Schultzberg, 2019; Kapelner et al., 2019), there is no single unique best allocation. As is shown in the online Appendix, with n = 2 the only pair of allocations in this example has M_1 = M_2 = 4, and therefore V_2(τ̂_RR | p_a = 0) is not defined for n = 2. With n = 4 there exists one pair of allocations with M_j = 0, for which V_4(τ̂_RR | p_a = 0) = 4. For n > 4 there is more than one pair of allocations with Mahalanobis distance equal to zero, and the variance decreases in n. With n = 8 and n = 16, V_8(τ̂_RR | p_a = 0) = 4/3 and V_16(τ̂_RR | p_a = 0) = 4/7, respectively. The convergence rate is thus slower than that of V_n(τ̂_CR), but this is in contrast to the claim that V_n(τ̂_RR | p_a = 0) ≡ 4 for all n.

† The notation on the data generation is ambiguous in Kallus (2018). The data generating process presented here is chosen to reproduce the results in Kallus (2018).
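The mirror property, and the growth in the number of zero-imbalance allocations, can be checked by brute force for the powers-of-two covariate in this example. A sketch (function name ours); since the covariate values sum to zero and n_1 = n_0, a zero mean difference is equivalent to the treated covariates summing to zero:

```python
import itertools
import numpy as np

def zero_balance_allocations(n):
    """Treated index sets with zero treated-control covariate mean difference
    for X = (2^0, ..., 2^{n/2-1}, -2^0, ..., -2^{n/2-1})."""
    half = n // 2
    x = np.array([2.0 ** k for k in range(half)]
                 + [-(2.0 ** k) for k in range(half)])
    zero = []
    for t in itertools.combinations(range(n), half):
        treated = np.zeros(n, dtype=bool)
        treated[list(t)] = True
        if np.isclose(x[treated].mean(), x[~treated].mean()):
            zero.append(frozenset(t))
    return zero
```

For n = 4 this finds exactly one mirror pair, and for n = 8 three pairs; each zero-imbalance allocation's complement is itself zero-imbalance, illustrating why card(𝕎_Opt) is always even and at least two.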
In this special case, the only source of randomness is the randomization mechanism for assigning treatments. Under the sharp null hypothesis, for example, that the treatment effect is zero for all units, the value of any test statistic is known for all possible random allocations. The exact p-value associated with any test statistic is a simple function of the percentile of the observed test statistic in the histogram of the statistic's values across all possible allocations. This implies that theoretical asymptotic variances are helpful tools only for comparing efficiency in designs where treatment assignment is randomized within a set of allocations sufficiently large for such an asymptotic argument to be appropriate. In this case, for inference with level α = 0.05, a comparison of variances is only useful with n ≥ 8. With n = 8, it follows that card(𝕎) = \binom{8}{4} = 70, which implies that the smallest possible p-value in a two-sided test is less than 0.05. A variance comparison between, for example, rerandomization and complete randomization should be performed for a rerandomization design where a is chosen such that 2/card(𝕎_a) ≤ α, where α is the desired level of the inference. This is the reason why Morgan and Rubin (2012) state that one should not use too small a value of p_a.

Kallus (2018, p. 94) refers to the rerandomization design proposed by Morgan and Rubin (2012) as the 'historically haphazard practice of rerandomization'. To the best of our understanding, the argument for this statement seems to be based on the belief that Mahalanobis-based rerandomization minimizes the linear projection of Y on X due to a structural assumption of a linear relation between Y and X. However, as pointed out in Morgan and Rubin (2012), by including interactions and non-linear functions of covariates in the Mahalanobis distance, non-linear dependencies can also be accommodated in Mahalanobis-based rerandomization.
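The exact randomization p-value described above can be computed by enumerating 𝕎, which also makes the floor of 2/card(𝕎) visible: the observed allocation and its mirror always attain the observed statistic. A sketch (function name ours, two-sided mean difference statistic):

```python
import itertools
import numpy as np

def frt_p_value(W_obs, Y):
    """Exact two-sided randomization p-value for the mean difference under
    the sharp null of zero treatment effect for every unit."""
    n = len(Y)
    n1 = int(W_obs.sum())

    def stat(treated_idx):
        treated = np.zeros(n, dtype=bool)
        treated[list(treated_idx)] = True
        return abs(Y[treated].mean() - Y[~treated].mean())

    t_obs = stat(np.flatnonzero(W_obs))
    # Proportion of allocations with a statistic at least as extreme.
    stats = [stat(t) for t in itertools.combinations(range(n), n1)]
    return np.mean([s >= t_obs - 1e-12 for s in stats])
```

Because the observed allocation and its mirror are always counted, the returned p-value can never fall below 2/card(𝕎), which is why the cardinality of the acceptance set bounds the attainable significance level.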
To exemplify the potential importance of including transformations, we introduce both X and X² into the Mahalanobis distance criterion in the example above.
With both X and X² in the Mahalanobis distance criterion, there are no allocations with M_j = 0 for n ≤ 16, and the optimal criterion is instead p* ≡ min{p_a : card(𝕎_Opt) > 0}. With p_a = p* as the criterion, card(𝕎_Opt) = 2 for n ≤ 16, and for these single pairs of allocations, V_2(τ̂_RR | p_a = p*) = 4. Thus, these 'optimal' designs have smaller variance than all the other designs. However, restricting randomization to one single pair implies that randomization inference has essentially no power. To enable inference, the rerandomization criterion must be increased, thereby allowing for 'non-optimal' allocations. For example, when the inclusion criterion is set to p_a = 0.1, the number of allowed allocations equals 2, 8, and 1,288 for n = 4, 8, and 16, respectively. For these values of n, we get V_4(τ̂_RR | p_a = 0.1) = 0 and V_8(τ̂_RR | p_a = 0.1) = 0.5.

Li et al. (2018) show that the asymptotic distribution of the mean difference estimator after Mahalanobis-based rerandomization is generally non-normal; rather, the asymptotic distribution is a linear combination of a normally distributed variable and a truncated normal variable. Furthermore, the asymptotic sampling variances and quantile ranges of the mean difference estimator are reduced relative to when estimation is based on complete randomization. The result in Li et al. (2018) has the important implication that using standard asymptotic inference with a rerandomization design can lead to highly conservative inference, with possible gains in power when using appropriate asymptotic results. To illustrate these points in a finite sample setting, a small simulation was conducted (see the online Appendix for details). Data are generated from model (8), in which the outcome depends on three covariates x_ij, j = 1, 2, 3, i = 1, …, n, that are independent, identically and either normally distributed with mean 0 and variance 1 (i.e. x_ij ∼ N(0, 1), ∀ j, i) or exponentially distributed with rate 1 (i.e. x_ij ∼ exp(1), ∀ j, i).
The sampling error ε_i, i = 1, …, n, is iid N(0, σ²), where σ² is chosen to obtain R² = 0.2 and 0.5 in the data generating process (8). The sample size is varied: n = 50, 100, 200 and 400.
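The choice of σ² for a target R² follows from R² = signal variance / (signal variance + σ²). The sketch below assumes the systematic part of model (8) is a unit-coefficient linear combination of the three standard normal covariates, giving signal variance 3; this is our assumption for illustration, since the exact form of (8) is relegated to the online Appendix.

```python
def noise_variance(signal_var, R2):
    """sigma^2 solving R^2 = signal_var / (signal_var + sigma^2)."""
    return signal_var * (1 - R2) / R2

# Assumed systematic part of model (8): y = x_1 + x_2 + x_3 with iid N(0, 1)
# covariates, so the signal variance is 3.
sigma2_low, sigma2_high = (noise_variance(3.0, R2) for R2 in (0.2, 0.5))
```

Under this assumption, R² = 0.2 requires σ² = 12 and R² = 0.5 requires σ² = 3, so the low-R² design is dominated by noise.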

Inference to the population
Based on the mean difference estimator, the power and size (nominal level 5%) of the Fisher randomization test (FRT) and the asymptotic test derived in Li et al. (2018) (denoted LDR below) under Mahalanobis-based rerandomization (with p_a = 0.01) are compared to the t-test, both under complete randomization and under rerandomization, with size evaluated under the null and power under the alternative. The results are similar across distributions of the covariates. As expected, the t-test under rerandomization is highly conservative, whereas the LDR test has too large a size for all n < 400. Using the FRT and the LDR (when n = 400), we see that with R² = 0.5 there is roughly a 50% increase in power for both types of covariates; with R² = 0.2 the improvement in power is about 20%.
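The LDR test is built on the limiting distribution derived in Li et al. (2018), which, after standardization and omitting the overall scale factor, is √(1 − R²) ε₀ + √(R²) L_{K,a}, where ε₀ ∼ N(0, 1) and L_{K,a} is the first coordinate of D ∼ N(0, I_K) conditional on DᵀD ≤ a. A Monte Carlo sketch of sampling from this limit (our code, illustrative only); empirical quantiles of the draws then give non-conservative critical values:

```python
import numpy as np

def sample_ldr_limit(R2, K, a, size, rng):
    """Draw from sqrt(1-R2)*eps0 + sqrt(R2)*L_{K,a}, with
    L_{K,a} = D_1 | D'D <= a for D ~ N(0, I_K) (Li et al., 2018)."""
    L = np.empty(size)
    filled = 0
    while filled < size:
        D = rng.normal(size=(size, K))
        # Keep the first coordinate of draws satisfying the truncation.
        accepted = D[(D ** 2).sum(axis=1) <= a, 0]
        take = min(len(accepted), size - filled)
        L[filled:filled + take] = accepted[:take]
        filled += take
    eps0 = rng.normal(size=size)
    return np.sqrt(1 - R2) * eps0 + np.sqrt(R2) * L
```

The truncation shrinks the variance of the L_{K,a} component below 1, which is why normal-theory critical values are conservative under rerandomization.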
Using both simulated and real data, Kallus (2018) compares the empirical variances of the mean difference estimator under different designs. Given that the asymptotic sampling distribution of the estimator is known only under the Mahalanobis criterion, it is not obvious that these comparisons of empirical variances are valid procedures for evaluating the relative efficiency of the different designs. Furthermore, Kallus (2018) allows only the raw covariates in the Mahalanobis criterion. It is likely that, by including interactions and non-linear terms of the covariates, the variances under Mahalanobis-based rerandomization would have been reduced, as was the case in the previous example.
Schultzberg and Johansson (2020) show that when the experimental units are randomly sampled from a super population, it is possible to draw inference to the units in this super population when choosing the best pair of allocations, despite there being no possibility of drawing meaningful inference to the units in the sample (as illustrated in the previous example). Moreover, if the Mahalanobis criterion is used to find the best allocations, the asymptotic sampling distribution is known. However, in experiments on people, the units usually choose whether or not to participate, or they are selectively chosen, which means that when valid inference is the goal, it is a bad idea to choose the best pair of allocations, because the inference then only reflects uncertainty from random sampling. In other words, the ability to introduce a fully known stochastic mechanism in the design, on which exact inference can be based, should not be sacrificed for the usually small, often negligible, gain in efficiency achieved by choosing the best pair of allocations rather than choosing randomly from a small set of the nearly best allocations.

DISCUSSION

Rerandomization designs restrict the randomization to allocations with acceptable balance on observed covariates. Morgan and Rubin (2012) were the first to formalize a procedure for rerandomization, suggesting the Mahalanobis distance as the criterion. Kallus (2018) suggests finding the 'optimal' design for inference to the PATE.
The Kallus (2018) optimal designs are obtained by minimizing the maximum conditional variance of the mean difference estimator under the assumption that conditional means of the outcomes under treatment and control can be estimated. Under this assumption the PSOD aims to find a single optimal allocation from which the mean difference estimate is obtained.
For inference to the SATE, the cardinality of 𝕎_a should be large enough to allow the exact Fisher randomization test (FRT) to have non-trivial power. A variance comparison between, for example, rerandomization and complete randomization is only meaningful when 2/card(𝕎_a) ≤ α, where α is the desired level of the inference.
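The condition 2/card(𝕎_a) ≤ α translates directly into a minimum cardinality for the acceptance set; a trivial helper (name ours) makes the design constraint explicit:

```python
import math

def min_cardinality(alpha):
    """Smallest card(W_a) with 2 / card(W_a) <= alpha, i.e. the smallest
    acceptance set for which a two-sided FRT can reject at level alpha."""
    return math.ceil(2 / alpha)
```

For instance, a level-0.05 two-sided FRT needs at least 40 acceptable allocations, so p_a must not be chosen so small that 𝕎_a falls below this size.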
With random sampling in the experiment, asymptotic inference to the PATE is in theory possible even when card(𝕎_a) = 1. However, the only criterion for which the sampling distribution of the mean difference estimator is known is the Mahalanobis criterion. Also, it is in general a bad idea to use rerandomization designs with a minimum rerandomization criterion and/or to select the final assignment deterministically, as suggested in Kallus (2018), as such designs only reflect uncertainty from potential random sampling. Instead, designs 'optimal' for inference to the SATE should also be used for inferences to the units of the population. In other words, the ability to introduce a fully known stochastic mechanism in the design, on which exact inference can always be based, should not be sacrificed for the often negligible gain in efficiency achieved by choosing the best allocation(s) rather than choosing randomly from a smaller set of the nearly best allocations, as implied by a well-chosen rerandomization criterion.
As an illustration of the problems with standard asymptotic theory under rerandomization designs, and of the potential of the Fisher randomization test and the correct asymptotics (Li et al., 2018), a small Monte Carlo study was conducted. We find that, given correct inferential methods, substantial gains in power can be made using rerandomization in comparison to complete randomization.