Doubly Robust Uniform Confidence Band for the Conditional Average Treatment Effect Function

In this paper, we propose a doubly robust method to present the heterogeneity of the average treatment effect with respect to observed covariates of interest. We consider a situation where a large number of covariates are needed for identifying the average treatment effect but the covariates of interest for analyzing heterogeneity are of much lower dimension. Our proposed estimator is doubly robust and avoids the curse of dimensionality. We propose a uniform confidence band that is easy to compute, and we illustrate its usefulness via Monte Carlo experiments and an application to the effects of smoking on birth weights.


Introduction
In this paper, we propose a doubly robust method to estimate the heterogeneity of the average treatment effect with respect to observed covariates of interest. To describe our methodology, we consider the potential outcome framework. Let Y 1 and Y 0 be potential individual outcomes in two states, with treatment and without treatment, respectively. For each individual, the observed outcome Y is Y = DY 1 + (1 − D)Y 0 , where D denotes an indicator variable for the treatment, with D = 0 if an individual is not treated and D = 1 if an individual is treated. We assume that independent and identically distributed observations {(Y i , D i , Z i ) : i = 1, . . . , n} of (Y, D, Z) are available, where Z ∈ R p denotes a p-dimensional vector of covariates.
Suppose that a researcher is interested in evaluating the average treatment effect conditional on only a subset of covariates X, which is of a substantially lower dimension d. That is, we are interested in a case where d ≪ p.
The main object of interest in this paper is the conditional average treatment effect function (CATEF); namely:

g(x) ≡ E[Y_1 − Y_0 | X = x].    (1.1)

When d ≥ 3, it is difficult to plot g(x), not to mention the low precision due to the curse of dimensionality. Hence, for practical reasons, we focus on the case where d = 1 or d = 2, while p is often of a much higher dimension.
To achieve identification of the CATEF, we assume that Y_1 and Y_0 are independent of D conditional on Z (known as the unconfoundedness assumption):

(Y_1, Y_0) ⊥ D | Z,    (1.2)

where ⊥ denotes independence. For (1.2) to be plausible in applications, applied researchers tend to condition on a large number of covariates Z. Note that in our setup, the treatment may be confounded in the sense that the treatment assignment may not be independent of the potential outcome variables given X only. To satisfy the unconfoundedness condition, a much larger set of conditioning variables Z needs to be employed. The different roles played by X and by the remaining covariates V in Z are noted in the recent literature.
For example, Ogburn, Rotnitzky, and Robins (2015) consider a similar issue in the context of the local average treatment effect (LATE) of Imbens and Angrist (1994). Ogburn, Rotnitzky, and Robins (2015) emphasize that conditioning on a large number of covariates Z may be required to make it plausible that the binary instrument is valid. In their empirical example, Ogburn, Rotnitzky, and Robins (2015) revisit the analyses of Poterba, Venti, and Wise (1995) and Abadie (2003) to examine whether participation in 401(k) pension plans increases household savings. In their example, the vector of covariates Z for the identifying assumption consists of income, age, marital status, and family size, whereas the variable of interest X is income. Abrevaya, Hsu, and Lieli (2015) also consider the case of investigating the effect of smoking during pregnancy on birth weights. They are interested in estimating (1.1) with X being the age of mother; however, as noted in Abrevaya, Hsu, and Lieli (2015), it is unlikely that conditioning only on the age of mother would achieve the unconfoundedness assumption with nonexperimental data. As a result, it is necessary to consider a high-dimensional Z, including the age of the mother.
The fact that a high-dimensional Z needs to be employed for (1.2) to be plausible in an application makes a fully nonparametric estimation approach impractical because of the curse of dimensionality. For example, the propensity score is not nonparametrically estimable in moderately sized samples, if the dimension of Z is high. One obvious alternative is to use a parametric model for the propensity score; however, it may lead to misleading results if the parametric model is misspecified.
With the aim of providing a practical method and, at the same time, reducing sensitivity to model misspecification, we propose to use a doubly robust method based on parametric regression and propensity score models. Our estimator of the CATEF is doubly robust in the sense that it is consistent when at least one of the regression model and the propensity score model is correctly specified. Specifically, we first estimate CATEF(Z) using a doubly robust procedure: we estimate a parametric regression model of the outcome on Z for each treatment status and a parametric model for the probability of selection into the treatment given Z; we then combine the parametric estimation results in a doubly robust fashion to construct an estimate of CATEF(Z). We then obtain an estimate of CATEF(X) by local linear smoothing of the estimated CATEF(Z) with respect to X. As a result, we avoid high-dimensional smoothing with respect to Z while mitigating the problem of misspecification through both the doubly robust estimation and the low-dimensional nonparametric smoothing with respect to X.
We emphasize that we are willing to assume parametric specifications for the propensity score and regression models as functions of Z to avoid the curse of dimensionality, but not for CATEF(X). One may consider parametric estimation of CATEF(X), as Ogburn, Rotnitzky, and Robins (2015) estimate their LATE parameter using least squares approximations. However, note that even if the parametric specification of CATEF(Z) is correct, the resulting specification of CATEF(X) may not be correctly specified since, for example, E[Z|X] is possibly highly nonlinear. To avoid this misspecification, we estimate CATEF(X) nonparametrically.
Because the CATEF is a functional parameter, as a tool of inference, we propose to use a uniform confidence band for the CATEF. Our construction of the uniform confidence band is based on an analytic approximation of the supremum of a Gaussian process using arguments built on Piterbarg (1996), combined with a Gaussian approximation result of Chernozhukov, Chetverikov, and Kato (2014) and an empirical process result of Ghosal, Sen, and van der Vaart (2000). Our method is simple to implement and does not rely on resampling techniques. This paper contributes to the literature on doubly robust estimation by demonstrating that doubly robust procedures are useful for estimating the CATEF. In this paper, we focus on the so-called augmented inverse probability weighting estimator that was originally proposed by Robins, Rotnitzky, and Zhao (1994) for the estimation of the mean (see also Robins and Rotnitzky, 1995; Scharfstein, Rotnitzky, and Robins, 1999). Their estimator appears to be the first estimator to be recognized as being doubly robust. Since then, many other alternative doubly robust estimators have been proposed in the literature. For example, the inverse probability weighting regression adjustment estimator (Kang and Schafer, 2007; Wooldridge, 2007, 2010) is widely known and has been implemented in statistical software packages. See the introduction of Tan (2010) for a comprehensive summary of other doubly robust estimators. Doubly robust estimators have been advocated for use in many different areas of application: see, for example, Lunceford and Davidian (2004) for medicine, Glynn and Quinn (2010) for political science, Wooldridge (2010) for economics, and Kang (2008) for psychology. There are also doubly robust estimators available for different settings including instrumental variables estimation (Tan, 2006; Okui, Small, Tan, and Robins, 2012) and estimation under multivalued treatments (Uysal, 2015).
It would not be difficult to extend our method to allow these other doubly robust estimators and to consider different settings. However, to keep the analysis simple, in this paper, we focus on the augmented inverse probability weighting estimator of the CATEF.
The CATEF is mathematically equivalent to the "V-adjusted variable importance" of van der Laan (2006), who proposes it as a measure of variable importance in prediction. van der Laan (2006) proposes a doubly robust estimator of V-adjusted variable importance. In contrast to our approach, he considers the projection of the V-adjusted variable importance onto a parametric working model and does not consider nonparametric estimation. Moreover, a uniform confidence band is not examined in van der Laan (2006).
In a recent paper, Abrevaya, Hsu, and Lieli (2015) consider the estimation of the CATEF; however, there are two main differences between this paper and Abrevaya, Hsu, and Lieli (2015). First, we propose a doubly robust procedure to estimate the CATEF, whereas Abrevaya, Hsu, and Lieli (2015) consider the inverse probability weighting estimator, which suffers from model misspecification when the propensity score model is misspecified and from the curse of dimensionality when the propensity score is estimated nonparametrically. Second, we present a method to construct a uniform confidence band, whereas Abrevaya, Hsu, and Lieli (2015) only provide pointwise confidence intervals.
The remainder of the paper is organized as follows. Section 2 presents the doubly robust estimation method, Section 3 gives an informal description of how to construct a two-sided, symmetric uniform confidence band when the dimension of X is one, and Section 4 deals with a general case and provides formal theoretical results. In Section 5, the results of Monte Carlo simulations demonstrate that in finite samples, our doubly robust estimator works well, and the proposed confidence band has desirable coverage properties. Section 6 gives an empirical application, and Section 7 concludes.
The proofs are contained in Appendix B.

Doubly Robust Estimation of the Average Treatment Effect Conditional on Covariates of Interest
In this section, a doubly robust method for estimating the CATEF is proposed. We first estimate the treatment effect given all the covariates using a doubly robust method. We then obtain the CATEF for the covariates of interest using a nonparametric approach. Define:

ψ(W, θ) ≡ µ_1(Z) − µ_0(Z) + D(Y − µ_1(Z))/π(Z) − (1 − D)(Y − µ_0(Z))/(1 − π(Z)),

where π(z) ≡ P(D = 1 | Z = z) is the propensity score and µ_j(z) ≡ E[Y | D = j, Z = z] for j = 0, 1 are called the regression functions.
The following lemma gives regularity conditions under which g(x) is identified.
Remark 1. In this paper, we focus on cases in which X is continuous. When X is discrete, the CATEF can be estimated by the sample average of ψ(W, θ̂) over the subsample for each possible value of X, where θ̂ is an estimator of θ_0. Moreover, constructing a confidence band is standard when X takes a finite number of values.
2.1. Parametric Estimation of θ. For concreteness, we consider the following estimation procedure for θ_0; however, how θ_0 is estimated does not alter our results provided that the rate of convergence is sufficiently fast so that Assumption 1(7) given below is satisfied. For each j = 0, 1, we estimate α_j by least squares using the subsample with D = j. We estimate β by maximum likelihood (e.g., probit or logit).

Remark 2. When the dimension of Z is not too high, an alternative to parametric estimation of ψ(W, θ_0) is to estimate its nonparametric counterpart via local polynomial estimators as in Rothe and Firpo (2016). However, this would not work when the dimension of Z is sufficiently high (see related remarks in Rothe and Firpo (2016)).
The latter is the case we focus on in the paper.
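To fix ideas, the first stage described above can be sketched in a few lines. This is a minimal illustration of the augmented inverse probability weighting construction under linear regression and logit propensity score specifications; it is not the authors' implementation, and all function names are ours.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def fit_logit(X, d, iters=100):
    """Logit MLE by Newton-Raphson (intercept included)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        w = p * (1.0 - p)
        step = np.linalg.solve(
            X1.T @ (X1 * w[:, None]) + 1e-8 * np.eye(X1.shape[1]),
            X1.T @ (d - p),
        )
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def aipw_pseudo_outcome(Y, D, Z):
    """psi(W, theta_hat): AIPW combination of regression and propensity fits.

    Consistent for the conditional treatment effect if either the
    regression models or the propensity score model is correct.
    """
    mu1 = predict_ols(fit_ols(Z[D == 1], Y[D == 1]), Z)   # mu_1(Z; alpha_1_hat)
    mu0 = predict_ols(fit_ols(Z[D == 0], Y[D == 0]), Z)   # mu_0(Z; alpha_0_hat)
    b = fit_logit(Z, D)                                   # beta_hat
    pi = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(len(Z)), Z]) @ b)))
    return mu1 - mu0 + D * (Y - mu1) / pi - (1 - D) * (Y - mu0) / (1 - pi)
```

Averaging these pseudo-outcomes over the whole sample recovers the unconditional average treatment effect; smoothing them against X, as in the next subsection, recovers the CATEF.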
2.2. Local Linear Estimation of g. We consider a local linear estimator of g(x).
Assume that g(x) is twice continuously differentiable. For each x = (x_1, . . . , x_d), the local linear estimator of g(x) is obtained by minimizing:

S_n(γ) ≡ Σ_{i=1}^n {ψ(W_i, θ̂) − γ_0 − (X_i − x)′γ_1}² K((X_i − x)/h_n),

where K(·) is a kernel function on R^d and h_n is a sequence of bandwidths. More specifically, let ĝ(x) = e_1′γ̂(x), where γ̂(x) ≡ argmin_{γ∈R^{d+1}} S_n(γ), γ = (γ_0, γ_1′)′, and e_1 is a column vector whose first entry is one and whose remaining entries are zero.
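The local linear step can be sketched as follows for scalar X. This is a minimal illustration assuming a Gaussian kernel and taking the pseudo-outcomes ψ(W_i, θ̂) as already computed (passed in as the array `psi`); the function name is ours.

```python
import numpy as np

def local_linear(x_grid, X, psi, h):
    """Local linear estimate g_hat(x) of E[psi | X = x] for scalar X.

    At each grid point, solves the kernel-weighted least-squares problem
    S_n(gamma) and returns the intercept e_1' gamma_hat(x).
    """
    g = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        u = (X - x) / h
        w = np.exp(-0.5 * u * u)            # Gaussian kernel weights
        Xd = np.column_stack([np.ones_like(X), X - x])
        A = Xd.T @ (Xd * w[:, None])        # weighted normal equations
        b = Xd.T @ (w * psi)
        g[j] = np.linalg.solve(A, b)[0]     # intercept = g_hat(x)
    return g
```
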
2.3. Effect of First Stage Estimation. In our setting, we can carry out inference as if θ_0 were known. This result is not surprising given that our first-stage estimation is parametric and our second-stage estimation is nonparametric: the rate of convergence in the first stage is faster than that of the second stage.
This feature of no first-order effect of the first-stage estimation on the second stage turns out to be more general than our setup. It is indeed closely related to double robustness.
If we model g(x) parametrically or, more generally, approximate g(x) by a linear projection, it can be estimated by OLS regression of ψ(W, θ̂) on X. Because of the built-in double robustness, it can be shown that the limiting distribution of the OLS estimator from regressing ψ(W, θ̂) on X is equivalent to that of the infeasible OLS estimator from regressing ψ(W, θ_0) on X. Furthermore, even if we estimate π(·) and µ_j(·) (j = 0, 1) nonparametrically when the dimension of Z is moderate, there is still no first-order estimation effect from the first stage. For example, see Chen, Hong, and Tarozzi (2008), Rothe and Firpo (2016), and Chernozhukov, Escanciano, Ichimura, and Newey (2016), among others, for related results.

An Informal Description of a Uniform Confidence Band
In this section, we provide an informal description of how to construct a two-sided, symmetric uniform confidence band. For simplicity, we focus on the leading case where d = 1. Let I ≡ [a, b] denote an interval of interest for which we build a uniform confidence band. Assume that I is a subset of the support of X. We use nonbold x to mean that x is one-dimensional.
Algorithm. Carry out the following steps to construct a (1 − α) uniform confidence band.
(1) Obtain ĝ(x) using a local linear estimator with a bandwidth h_n such that:

h_n = ĥ × n^{1/5} × n^{−2/7},

where ĥ is a commonly used optimal bandwidth in the literature (for example, the plug-in bandwidth of Ruppert, Sheather, and Wand (1995), which is explained in Appendix A). We use the Gaussian kernel in our simulations and empirical application.
(2) Obtain the pointwise standard error ŝ(x)/(nh_n)^{1/2} of ĝ(x) by constructing a feasible version of the asymptotic standard error formula:

ŝ²(x) = σ̂²(x) ∫ K(u)² du / f̂_X(x),

where f̂_X is the kernel density estimator

f̂_X(x) = (nĥ)^{−1} Σ_{i=1}^n K((X_i − x)/ĥ),

and σ̂²(x) is the conditional variance function estimator

σ̂²(x) = [Σ_{i=1}^n K((X_i − x)/ĥ){ψ(W_i, θ̂) − ĝ(X_i)}²] / [Σ_{i=1}^n K((X_i − x)/ĥ)].

(3) To compute a critical value c(1 − α), define:

λ ≡ ∫ K′(t)² dt / ∫ K(t)² dt.

Note that λ = 0.5 if K(·) is the Gaussian kernel. Let:

a_n ≡ a_n(I) = {2 log(h_n^{−1}(b − a)) + 2 log(λ^{1/2}/(2π))}^{1/2}.

Now set the critical value for the two-sided symmetric uniform confidence band by:

c(1 − α) = {a_n² + 2 log(2 / log((1 − α)^{−1}))}^{1/2}.

(4) For each x ∈ I, we set the two-sided symmetric confidence band:

[ĝ(x) − c(1 − α)ŝ(x)/(nh_n)^{1/2}, ĝ(x) + c(1 − α)ŝ(x)/(nh_n)^{1/2}].

We make some remarks on the proposed algorithm. In step (1), the factor n^{1/5} × n^{−2/7} is included in the definition of h_n to ensure that the bias is asymptotically negligible by undersmoothing. In step (2), one can estimate f_X and σ²(x) ≡ Var[ψ(W, θ_0)|X = x] using the standard kernel density and regression estimators with the same kernel function K(·) and the same bandwidth ĥ, together with an estimator of θ_0. In step (3), we may restrict the bandwidth such that h_n < b − a (which is satisfied asymptotically), thereby imposing the condition that log(h_n^{−1}(b − a)) is positive. The critical value proposed in step (3) is strictly positive if α is not too close to one or if n is large enough.
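The critical-value computation can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel (λ = 0.5), a_n² = 2 log(h_n⁻¹(b − a)) + 2 log(λ^{1/2}/(2π)), and a two-sided critical value solving exp(−2(b − a)h_n⁻¹λ^{1/2}(2π)⁻¹ exp(−c²/2)) = 1 − α; these exact constants are our reading of the algorithm, and the function names are ours.

```python
import numpy as np

def a_n_squared(h_n, a, b, lam=0.5):
    """a_n^2 = 2 log(h_n^{-1}(b - a)) + 2 log(lam^{1/2} / (2 pi)), with
    lam = int K'(t)^2 dt / int K(t)^2 dt (= 0.5 for the Gaussian kernel)."""
    return 2.0 * np.log((b - a) / h_n) + 2.0 * np.log(np.sqrt(lam) / (2.0 * np.pi))

def critical_value(h_n, a, b, alpha, lam=0.5):
    """Two-sided critical value c(1 - alpha) solving the expansion-based
    equation exp(-2 (b-a) h_n^{-1} lam^{1/2} (2 pi)^{-1} e^{-c^2/2}) = 1 - alpha."""
    an2 = a_n_squared(h_n, a, b, lam)
    return np.sqrt(an2 + 2.0 * np.log(2.0 / np.log(1.0 / (1.0 - alpha))))

def gumbel_critical_value(h_n, a, b, alpha, lam=0.5):
    """Gumbel-based alternative: a_n + a_n^{-1} log(2 / log(1/(1 - alpha)))."""
    an = np.sqrt(a_n_squared(h_n, a, b, lam))
    return an + np.log(2.0 / np.log(1.0 / (1.0 - alpha))) / an

def uniform_band(g_hat, s_hat, n, h_n, a, b, alpha=0.05):
    """Band g_hat(x) +/- c(1 - alpha) s_hat(x) / (n h_n)^{1/2} on a grid."""
    half = critical_value(h_n, a, b, alpha) * s_hat / np.sqrt(n * h_n)
    return g_hat - half, g_hat + half
```

As the algorithm predicts, the resulting critical value exceeds the pointwise normal value 1.96 at the 5% level but is smaller than the Gumbel-based value.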
Remark 3. It is straightforward to modify the algorithm above to construct one-sided uniform confidence bands. Define a new critical value by:

c^{one}(1 − α) = {a_n² + 2 log(1 / log((1 − α)^{−1}))}^{1/2}.

Then, for each x ∈ I, we set the one-sided confidence bands (−∞, ĝ(x) + c^{one}(1 − α)ŝ(x)/(nh_n)^{1/2}] and [ĝ(x) − c^{one}(1 − α)ŝ(x)/(nh_n)^{1/2}, ∞).

Remark 4. When X is more than one-dimensional, the algorithm may be revised as follows. Obviously, we need to use multivariate kernels, and the pointwise standard errors should be adjusted because the rate of convergence becomes (nh_n^d)^{1/2}. The value of λ stays the same when we use a product kernel; for example, if K is the product Gaussian kernel, then λ = 0.5. The formulas for a_n and c(1 − α) need to be changed: a_n is the largest solution to the equation

mes(I) h_n^{−d} λ^{d/2} (2π)^{−(d+1)/2} a_n^{d−1} exp(−a_n²/2) = 1,

where mes(I) is the Lebesgue measure of I. When d = 2 or d = 3, the critical value has the form c(1 − α) ≡ a_n + c/a_n, where c is the smallest value satisfying the corresponding d-dimensional version of the expansion given in Section 4. We note that in this paper, we assume that d < 4 (see Assumption 1).
Remark 5. We may compare our proposal with the critical value based on the (1 − α) quantile of the Gumbel distribution, which is given by:

c_∞(1 − α) = a_n + a_n^{−1} log(2 / log((1 − α)^{−1})).

Note that:

c_∞(1 − α) − c(1 − α) = a_n + a_n^{−1} log(2 / log((1 − α)^{−1})) − {a_n² + 2 log(2 / log((1 − α)^{−1}))}^{1/2} ≥ 0,

which is strictly positive for small α but converges to zero as a_n diverges. Hence, we expect that in finite samples, the confidence band based on c_∞(1 − α) is too wide and has a higher coverage probability than the nominal level. It is shown in the next section that the critical value based on the Gumbel distribution is accurate only up to a logarithmic rate, whereas our proposed critical value is accurate at a polynomial rate. This is because our proposal uses a higher-order expansion of Piterbarg (1996), whose approximation error is of a polynomial rate. See Theorem 2 in Section 4 for details.
Remark 6. Our construction of critical values is based on a simple analytic method that is easy to compute. Alternatively, one may rely on bootstrap methods to compute critical values for the uniform confidence band. For example, see Claeskens and Van Keilegom (2003) for smoothed bootstrap confidence bands and Chernozhukov, Lee, and Rosen (2013) for multiplier bootstrap confidence bands. Chernozhukov, Chetverikov, and Kato (2013) show that in general settings including high dimensional models, Gaussian multiplier bootstrap methods yield critical values for which the approximation error decreases polynomially in the sample size. Roughly speaking, both our simple analytic correction and multiplier bootstrap methods yield critical values that are accurate at polynomial rates. A refined theoretical analysis is necessary to determine which type of critical value is better asymptotically.
Remark 7. The proposed confidence band can be used to test whether the CATEF is constant. Suppose that our null hypothesis is that g(x) is constant in I. This null hypothesis can be written as g(x) = g_I for all x ∈ I, where g_I = E[g(X)|X ∈ I]. Since g_I can be estimated at the parametric (√n) rate, so that the estimator converges faster than that of g(x), we can ignore the estimation error in g_I. We reject the constancy of g(x) if the confidence band does not include the estimate of g_I for some x ∈ I.
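The constancy test of Remark 7 can be sketched as follows; `g_lower` and `g_upper` are the band evaluated on a grid in I, `psi` are the pseudo-outcomes, and the function name is ours.

```python
import numpy as np

def constancy_test(g_lower, g_upper, psi, X, a, b):
    """Reject H0: g(x) = g_I for all x in I = [a, b] if the root-n estimate
    of g_I = E[g(X) | X in I] (the sample mean of the pseudo-outcomes over
    observations with X in I) lies outside the band at some grid point."""
    in_I = (X >= a) & (X <= b)
    g_I_hat = psi[in_I].mean()
    return bool(np.any((g_I_hat < g_lower) | (g_I_hat > g_upper)))
```
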

Asymptotic Theory
In this section, we establish the asymptotic theory. Let U ≡ ψ(W, θ_0) − g(X) and let U_i ≡ ψ(W_i, θ_0) − g(X_i) for i = 1, . . . , n. Let ŝ²(x) be the estimator of the asymptotic variance of ĝ(x), and let s_n²(x) denote the population version of the asymptotic variance of the estimator. Assume that the d-dimensional kernel function is the product of d univariate kernels, one for each coordinate. We make the following assumptions.
(1) I ≡ [a_1, b_1] × · · · × [a_d, b_d], where a_j < b_j, j = 1, . . . , d, and I is a strict subset of the support of X.
(2) The distribution of X has a bounded Lebesgue density f X (·) on R d . Furthermore, f X (·) is bounded below from zero with continuous derivatives on I.
(5) The kernel function K(·) has compact support, is symmetric around zero, and is six times differentiable.
(7) inf n≥1 inf x∈I s n (x) > 0 and s n (x) is continuous for each n ≥ 1. Furthermore, for some constant c > 0.
Most of the assumptions are standard. Condition (2) of Assumption 1 rules out discrete covariates. If all regressors are discrete, then the estimation problem reduces to a parametric estimation problem. In this case, one may consider a multiple testing approach as in Lee and Shaikh (2014) by defining subpopulations with observed cells of discrete covariates. If some covariates are discrete and others are continuous, then one may use a smoothing approach proposed in Li and Racine (2004).
Condition (5) assumes that the kernel function has compact support. This assumption is made for simplicity and can be dropped at the expense of more complicated proofs. Condition (5) also assumes that the kernel function is differentiable. This assumption is crucial and excludes, for example, the uniform kernel. One of the bandwidth conditions on h_n (namely, η > 1/(d + 4) in condition (6)) imposes undersmoothing, so that we can ignore the bias asymptotically. The rule-of-thumb bandwidth proposed in Section 3 satisfies the required rate conditions.
Remark 8. Note that d < 4 is necessary to ensure that η < 1/(2d) and η > 1/(d + 4) can hold jointly. It is possible to extend our asymptotic theory to the case that d ≥ 4 using a higher-order local polynomial estimator under stronger smoothness conditions. In this paper, we limit our attention to the local linear estimator since we are mainly interested in low dimensional x.
Remark 9. An estimator ŝ²(x) of s_n²(x) is readily available. For example, we may use the estimator described in Section 3; see (4.2) for its concrete form in the one-dimensional case. Alternatively, one may use a direct sample analogue of s_n²(x). For either estimator, it is straightforward to verify condition (8) of Assumption 1 using standard results in kernel estimation.
Let a_n ≡ a_n(I) be the largest solution to the following equation:

mes(I) h_n^{−d} λ^{d/2} (2π)^{−(d+1)/2} a_n^{d−1} exp(−a_n²/2) = 1,

where mes(I) is the Lebesgue measure of I, that is, mes(I) = ∏_{j=1}^d (b_j − a_j), and:

λ ≡ ∫ K_1′(t)² dt / ∫ K_1(t)² dt,

with K_1 the univariate kernel. The following is the main theoretical result of our paper.
Theorem 2. Let Assumption 1 hold. Then there exists κ > 0 such that, uniformly in t on any finite interval:

P( sup_{x∈I} (nh_n^d)^{1/2} |ĝ(x) − g(x)| / ŝ(x) ≤ a_n + t/a_n ) = exp(−2e^{−t}) + O(n^{−κ}).    (4.5)

Notice that the approximation error is of a polynomial rate. As a result, a critical value based on the leading term on the right-hand side of (4.5) provides a better approximation than one based on the Gumbel approximation. The result in Theorem 2 may be of independent interest for constructing uniform confidence bands in nonparametric regression beyond the scope of estimating the CATEF in our context.
Remark 11. In a setting different from ours, Lee, Linton, and Whang (2009) propose analytic critical values based on Piterbarg (1996) in order to test for stochastic monotonicity, compare their performance with bootstrap critical values in Monte Carlo experiments, and find that both perform well in finite samples.
The conclusion of the theorem can be simplified for special cases. In particular, if d = 1, the conclusion of Theorem 2 holds with a_n the largest solution to mes(I) h_n^{−1} λ^{1/2} (2π)^{−1} exp(−a_n²/2) = 1. Also, if d = 2, it holds with a_n the largest solution to mes(I) h_n^{−2} λ (2π)^{−3/2} a_n exp(−a_n²/2) = 1.
Remark 12. It is standard to obtain pointwise confidence intervals based on normal approximations. Recall that our two-sided symmetric uniform confidence band has the form ĝ(x) ± c(1 − α)ŝ(x)/(nh_n^d)^{1/2}, where c(1 − α) is obtained from Theorem 2. To obtain two-sided symmetric pointwise confidence intervals, we just need to replace c(1 − α) with the usual normal critical value Φ^{−1}(1 − α/2), where Φ(·) is the standard normal cumulative distribution function:

ĝ(x) ± Φ^{−1}(1 − α/2)ŝ(x)/(nh_n^d)^{1/2}.    (4.6)

The pointwise confidence interval given in (4.6) is different from the one resulting from Abrevaya, Hsu, and Lieli (2015, Theorem 2) in terms of the formula for ŝ(x) in (4.2); in their case, σ²(x) needs to be estimated differently.

Remark 13. A one-sided version of the uniform confidence band is readily available.
Combining Theorems 14.1 and 14.2 of Piterbarg (1996) with arguments identical to those used in the proof of Theorem 2 yields the following proposition. Under Assumption 1, there exists κ > 0 such that, uniformly in t on any finite interval:

P( sup_{x∈I} (nh_n^d)^{1/2} (ĝ(x) − g(x)) / ŝ(x) ≤ a_n + t/a_n ) = exp(−e^{−t}) + O(n^{−κ})    (4.7)

as n → ∞. Note that the only differences between (4.5) and (4.7) are that (i) there is no absolute value on the left-hand side of the equation in (4.7) and (ii) there is no factor 2 inside the exponential function in (4.7). Hence, for example, if d = 1, the one-sided critical value can be taken as c^{one}(1 − α) = {a_n² + 2 log(1 / log((1 − α)^{−1}))}^{1/2}.

4.1. Construction of critical values. We use the leading term on the right-hand side of (4.5) as a distribution-like function to construct a uniform confidence band.
For example, if d = 1, we may construct a critical value c(1 − α) that satisfies:

F_{n,1}(c(1 − α)) = 1 − α, where F_{n,1}(t) ≡ exp(−2 mes(I) h_n^{−1} λ^{1/2} (2π)^{−1} exp(−t²/2)).

This yields the critical value presented in the Algorithm of Section 3. Similarly, if d = 2, we can use:

F_{n,2}(c(1 − α)) = 1 − α, where F_{n,2}(t) ≡ exp(−2 mes(I) h_n^{−2} λ (2π)^{−3/2} t exp(−t²/2)).

In finite samples, it might be useful to impose monotonicity of F_{n,j}(·) by rearrangement (see, e.g., Chernozhukov, Fernández-Val, and Galichon (2009)).
Remark 14. Theorem 2 implies that:

P( a_n ( sup_{x∈I} (nh_n^d)^{1/2} |ĝ(x) − g(x)| / ŝ(x) − a_n ) ≤ t ) → exp(−2e^{−t})

as n → ∞. Thus, one may construct analytical critical values based on the Gumbel distribution.
However, this approximation is accurate only up to the logarithmic rate in view of Theorem 2.

Monte Carlo Experiments
In this section, we conduct Monte Carlo experiments to examine the finite sample performance of the proposed procedures. The data generating process is the following. The vector of covariates Z = (X_1, . . . , X_p)′ is generated by Z ∼ N(0, I_p), where I_p is the p-dimensional identity matrix. The potential outcomes are generated by:

Y_1 = 10 + Σ_{k=1}^p X_k/√p + v and Y_0 = 0,

where v ∼ N(0, 1) and v is independent of Z. The treatment status D is generated by:

D = 1{U ≤ Λ( Σ_{k=p/2+1}^p X_k / √(p/2) )},

where U ∼ U[0, 1], U is independent of (Z, v), and Λ is the logistic function. Thus, the propensity score is π(Z) = Λ( Σ_{k=p/2+1}^p X_k / √(p/2) ). The observed outcome is Y = DY_1.
The parameter of interest is the CATEF for X = X_1. In our specification, the CATEF can be written as: CATEF(x_1) = 10 + x_1/√p.
We examine the performance of various statistical procedures regarding this CATEF.
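The design above can be simulated in a few lines. This is a sketch under our reading of the displays (Y_0 = 0 and a logistic propensity index built from the second half of the covariates); the function name is ours.

```python
import numpy as np

def simulate(n, p, rng):
    """Draw one Monte Carlo sample; p should be even.

    Z ~ N(0, I_p); Y1 = 10 + sum_k Z_k / sqrt(p) + v with v ~ N(0, 1);
    Y0 = 0; D = 1{U <= Lambda(sum of second-half covariates / sqrt(p/2))}.
    """
    Z = rng.normal(size=(n, p))
    v = rng.normal(size=n)
    Y1 = 10.0 + Z.sum(axis=1) / np.sqrt(p) + v
    index = Z[:, p // 2:].sum(axis=1) / np.sqrt(p / 2)  # second-half covariates
    pi = 1.0 / (1.0 + np.exp(-index))                   # logistic function Lambda
    D = (rng.uniform(size=n) < pi).astype(float)
    Y = D * Y1                                          # Y0 = 0, so Y = D * Y1
    return Y, D, Z
```

Since E[X_k] = 0 for all k, this design implies CATEF(x_1) = 10 + x_1/√p, matching the display above.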

5.2. Model specification.
To estimate and conduct statistical inference on CATEF(x_1) using our doubly robust procedure, we need to specify a model for the regression functions µ_j(z), j = 0, 1, and a model for the propensity score π(z). We consider two regression models and two propensity score models. In each pair, one model is correctly specified and the other is misspecified. We note that our doubly robust procedure is expected to work well provided that at least one of the regression model and the propensity score model is correctly specified.
We first discuss the model specifications for the regression part. The first regression model is linear in all the covariates; the coefficients are estimated by OLS using (1, X_1, . . . , X_p) as explanatory variables. This model is correctly specified. The second regression model, which is misspecified, is linear in the first half of the covariates only and is estimated by OLS using (1, X_1, . . . , X_{p/2}) as explanatory variables. This model is misspecified because it suffers from the sample selection bias introduced by omitting the second half of the regressors, which affect the treatment status.
We also consider two models for the propensity score. The correctly specified model uses (1, X_1, . . . , X_p) as explanatory variables, whereas the misspecified model uses only (1, X_1, . . . , X_{p/2}); as in the regression part, misspecification is introduced by omitting the second half of the regressors. The models for the propensity score are estimated by maximum likelihood.
We estimate CATEF(x_1) for x_1 ∈ {−1, −0.5, 0, 0.5, 1} and compute the mean bias ("MEAN"), the standard deviation ("SD"), the average of the standard errors for CATEF(x_1) ("ASE"), and the root mean squared error ("RMSE"). The local linear regression is conducted with the Gaussian kernel, and the preliminary bandwidth (ĥ in step (1) of the Algorithm) is chosen by the method of Ruppert, Sheather, and Wand (1995). We also compute the "BIAS", "SE" and "RMSE" of the corresponding inverse probability weighting estimators and regression adjustment estimators. Note that the difference between the proposed method and these alternative methods arises only in the estimation of ψ(W, θ_0); the other steps are the same.
We examine the coverage probability of the uniform confidence band for CATEF(x_1) over the range −1 ≤ x_1 ≤ 1. The nominal coverage probabilities that we consider are 99%, 95% and 90%. We compute the empirical coverage ("CP"), the mean critical value ("Mcri"), and the standard deviation of the critical value ("Sdcri"). We also compute the coverage probabilities of the confidence band based on the critical values computed by the Gumbel approximation ("GCP").

5.3. Results. Tables 1 and 2 summarize the results on the properties of the estimators. In both tables, DR refers to our doubly robust method, whereas IPW and RA correspond to the inverse probability weighting and regression adjustment methods, respectively. The proposed doubly robust estimator of the CATEF works well in finite samples. As the theory indicates, the proposed estimator exhibits small bias provided that at least one of the regression model and the propensity score model is correctly specified. We find that the regression adjustment estimator is very precise when the regression model is correctly specified. However, it suffers from substantial bias when the regression model is misspecified. The inverse probability weighting estimator also suffers from model misspecification. Moreover, its standard deviation is much larger than those of the doubly robust and regression adjustment estimators.
When both models are misspecified, all three estimators suffer from heavy bias. The inverse probability weighting estimator has the largest RMSE because its distribution is more dispersed than those of the other two estimators. All the estimators have larger standard deviations at x = 1 and x = −1 than at other points. This is because the number of observations around x = 1 or x = −1 is expected to be smaller than that around, for example, x = 0, which is the center of the distribution.
On the other hand, the magnitude of the bias does not vary much across evaluation points.
The standard error for the proposed doubly robust estimator is slightly smaller than the standard deviation, but the difference is not large.
Tables 3 and 4 summarize the finite sample properties of the proposed uniform confidence band. The results show that our uniform confidence band has reasonably good coverage properties provided that one of the models is correctly specified. When both models are misspecified, the size distortion is heavy. We observe that the size distortion is heavier when the regression model is misspecified than when the propensity score model is misspecified. The confidence band based on the Gumbel approximation is very conservative.
The results of the Monte Carlo simulation confirm that the proposed doubly robust estimator indeed works well in finite samples provided that one of the regression and propensity score models is correctly specified. The proposed uniform confidence band also has good coverage properties.

An Empirical Application
We apply our uniformly valid confidence band to the CATEF of the effect of maternal smoking on birth weight, where the argument of the CATEF is the mother's age. Our aim here is to illustrate our confidence band in comparison with alternative confidence bands. We first discuss the background of this application and the data sets used; we use two different data sets, one from Pennsylvania and one from North Carolina. We then compute various confidence bands for the CATEF and discuss the results.
While the purpose of this application is to illustrate our uniformly valid confidence band and not to present new insights on the effect of smoking, it is still informative to discuss the background of this application. Many studies document that low birth weight is associated with prolonged negative effects on health and educational or labor market outcomes throughout life, although there has been a debate over its magnitude. See, e.g., Almond and Currie (2011) for a review. Maternal smoking is considered to be the most important preventable negative cause of low birth weight (Kramer, 1987). There are many studies that evaluate the effect of maternal smoking on low birth weight (Almond and Currie, 2011). The program evaluation approach is employed by, for example, Almond, Chay, and Lee (2005), da Veiga and Wilder (2008) and Walker, Tekin, and Wallace (2009), and panel data analysis is carried out by Abrevaya (2006) and Abrevaya and Dahl (2008). Here, we are interested in how the effect of smoking changes across different age groups of mothers. Walker, Tekin, and Wallace (2009) examine whether the effect of smoking is larger for teen mothers than for adult mothers and find mixed evidence. Abrevaya, Hsu, and Lieli (2015) also consider this problem in their application.
6.1. Pennsylvania data. The first dataset consists of observations from white mothers in Pennsylvania in the USA. The dataset is an excerpt from Cattaneo (2010) and is obtained from the STATA website ("http://www.stata-press.com/data/r13/cattaneo2.dta"). Note that the dataset was originally used in Almond, Chay, and Lee (2005). We restrict our sample to white and non-Hispanic mothers, and the sample size is 3754. The outcome of interest (Y) is infant birth weight measured in grams. The treatment variable (D) is a binary variable that is equal to 1 if the mother smokes and 0 otherwise. The set of covariates Z includes the mother's age, an indicator variable for alcohol consumption during pregnancy, an indicator for the first baby, the mother's educational attainment, an indicator for the first prenatal visit in the first trimester, the number of prenatal care visits, and an indicator for whether there was a previous birth where the newborn died. We are interested in how the effect of smoking varies across different values of the mother's age. Therefore, X is the mother's age in this application.
To estimate the CATEF, we use linear regression models for the regression part and a logit model for propensity score. The explanatory variables used in the regression models and the logit model consist of all the elements of Z, the square of the mother's age, and the interaction terms between the mother's age and all other elements of Z.
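The specification above can be sketched in code. The following is a minimal illustration, not the paper's exact implementation: it builds the doubly robust (AIPW) pseudo-outcomes from fitted outcome regressions and a logit propensity score; the CATEF estimate is then a local linear regression of these pseudo-outcomes on X. All function and variable names are our own, and the construction of the design matrix Z (squares and interactions) is left to the user.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_pseudo_outcomes(Y, D, Z):
    """Doubly robust (AIPW) pseudo-outcomes: psi_i estimates Y1 - Y0 for unit i.

    The CATEF g(x) can then be estimated by a local linear regression of
    psi on X. OLS outcome models and a logit propensity score mirror the
    specification used in the application; this is an illustrative sketch.
    """
    # Outcome regressions fitted separately on treated and control units.
    mu1 = LinearRegression().fit(Z[D == 1], Y[D == 1]).predict(Z)
    mu0 = LinearRegression().fit(Z[D == 0], Y[D == 0]).predict(Z)
    # Propensity score via a logit model.
    pi = LogisticRegression(max_iter=1000).fit(Z, D).predict_proba(Z)[:, 1]
    pi = np.clip(pi, 1e-3, 1 - 1e-3)  # guard against extreme weights
    psi1 = mu1 + D * (Y - mu1) / pi
    psi0 = mu0 + (1 - D) * (Y - mu0) / (1 - pi)
    return psi1 - psi0
```

Averaging the pseudo-outcomes gives the ATE estimate; smoothing them against X gives the CATEF estimate.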
We estimate the CATEF in the interval between ages 15 and 35.
We compute the following three 95% confidence bands for the CATEF. "Our CB" is the confidence band proposed in this paper. Because X is univariate in this application, we follow the algorithm in Section 3. We use the Gaussian kernel. The preliminary bandwidth (ĥ) is chosen by the method of Ruppert, Sheather, and Wand (1995). "Gumbel CB" is the confidence band in which c(1 − α) in the algorithm is replaced by that based on the Gumbel approximation (see Remark 5). "PW CB" is a pointwise valid confidence band where we replace c(1 − α) in the algorithm by the corresponding value from the standard normal distribution (i.e., 1.96). This provides a valid confidence interval for each point of the CATEF. However, its uniform coverage rate would be smaller than 95%. Figure 1 plots the estimated CATEF and the three confidence bands. The confidence band we propose lies between "Gumbel CB" and "PW CB": while this band is wider than "PW CB", it is much narrower than "Gumbel CB" and is uniformly asymptotically valid. We see from this figure that our confidence band is informative while being uniformly valid.
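The difference between the three bands is only the critical value multiplying the pointwise standard error. The paper's Section 3 algorithm is not reproduced here; the following sketch shows a generic Gaussian-multiplier bootstrap for a sup-t critical value, which replaces the pointwise 1.96. All names and the interface are illustrative assumptions.

```python
import numpy as np

def sup_t_critical_value(infl, se, alpha=0.05, n_boot=2000, seed=0):
    """Multiplier-bootstrap critical value c(1 - alpha) for a sup-t band.

    infl : (n, m) matrix; column j holds the estimated influence-function
           contributions of the n observations to ghat(x_j).
    se   : (m,) pointwise standard errors of ghat(x_j).
    A pointwise band uses 1.96; the uniform band replaces it with the
    (1 - alpha) quantile of the supremum of the simulated t-process.
    """
    n, m = infl.shape
    rng = np.random.default_rng(seed)
    sups = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)            # Gaussian multipliers
        t_process = (xi @ infl) / (n * se)     # simulated t-statistics
        sups[b] = np.abs(t_process).max()
    return np.quantile(sups, 1 - alpha)
```

The band at each grid point x_j is then ghat(x_j) ± c(1 − α) · se[j]; the resulting c(1 − α) exceeds 1.96, which is why the uniform band is wider than the pointwise one.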
The estimated CATEF is decreasing from 15 to around 25 years of age. It is rather stable for the range above 25 years of age. All confidence bands indicate that the CATEF is estimated imprecisely near the ends of the range. Nonetheless, the estimated CATEF indicates that smoking may not have a strong impact when the mother is young. The CATEF is estimated relatively precisely in the middle of the range. For the range between 20 and 30 years of age, even the band based on the Gumbel approximation, which is the widest, does not contain 0. This result provides robust evidence that smoking has a negative impact on birth weight at least for mothers who are 20 to 30 years old. In this particular dataset, the statistical evidence against a constant smoking effect is somewhat weak. The confidence band that is valid only in a pointwise sense may give the impression that the smoking effect depends on the mother's age. However, the uniformly valid confidence band that we propose marginally contains the straight line that is equal to the ATE estimate. This result illustrates that there is a caveat when we use pointwise confidence intervals, as well as the importance of using uniformly valid confidence bands.
6.2. North Carolina data. The second dataset consists of birth records from North Carolina. It is the dataset used in Abrevaya, Hsu, and Lieli (2015) and obtained from Robert Lieli's website ("http://www.personal.ceu.hu/staff/Robert_Lieli/cate-birthdata.zip"). We restrict our sample to white and first-time mothers, and the sample size is 433,558. As in the case of the Pennsylvania data, the outcome is infant birth weight measured in grams and the treatment variable is an indicator for smoking status. The set of covariates Z includes those used in the analysis of the Pennsylvania data, except an indicator for the first baby because we focus on first-time mothers; in addition, it includes indicators for gestational diabetes, hypertension, amniocentesis and ultrasound exams. Again, X is the mother's age in this application.
The specification for the estimation of the CATEF is the same as before.
The purpose of using this much larger dataset is to examine the effect of the sample size. Our method involves nonparametric kernel regression, which might require a large sample size to yield reliable results. For example, the result from the Pennsylvania data indicates that the effect of smoking is very small for very young mothers. One might argue that such a result is an artifact of small sample size. Another issue is that the confidence bands obtained using the Pennsylvania data are somewhat wide.
We hope that using this larger dataset provides us with narrower confidence bands and more informative statistical results. Figure 2 plots the estimated CATEF and the three 95% confidence bands for the range between 15 and 35 years of age. Note that the scale of the vertical axis is different from Figure 1. We now obtain much narrower confidence bands. The widths of the three (uniform, pointwise and Gumbel) confidence bands are still different.
The estimated CATEF for young mothers is negative and statistically different from 0. The previous result that it is close to 0 may be an artifact of the small sample size. The estimated CATEF is decreasing from around 17 to around 29 years of age. For the range above 30 years of age, we obtain relatively wide confidence bands.
We reject the null hypothesis of no effect of smoking on birth weights uniformly over 15-35 years of age. These confidence bands do not support the hypothesis that the CATEF is constant, because the horizontal line at the ATE estimate falls outside the confidence bands over part of the range.
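The comparison with the constant-effect hypothesis amounts to a simple geometric check: a constant CATEF is compatible with a confidence band exactly when some horizontal line fits inside the band over the whole range. A minimal sketch, with illustrative names:

```python
import numpy as np

def band_admits_constant(lower, upper):
    """A constant CATEF is compatible with a confidence band iff some
    horizontal line stays inside it everywhere, i.e. iff
    max(lower) <= min(upper)."""
    return bool(np.max(lower) <= np.min(upper))
```

For the Pennsylvania uniform band this check passes marginally (the ATE line is contained), while for the North Carolina bands it fails, matching the rejection of a constant effect.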
One might argue that the difference in the results may stem from the fact that the North Carolina data contains richer information and we use a larger set of covariates.
We reexamine the North Carolina data based on the same set of covariates as that for the Pennsylvania data, except an indicator for the first baby. Figure 3 plots the estimated CATEF and confidence bands obtained using this set of covariates. The results in Figure 3 are qualitatively very similar to those in Figure 2. We thus believe that the difference between the results from the Pennsylvania data and the North Carolina data is not due to the difference in the covariates but to the difference in the sizes of these two samples.
We thus interpret our findings as indicating that the different results come from the difference in sample size, and that our confidence bands reasonably quantify the uncertainty stemming from a small sample. While the two datasets yield different estimates of the CATEF, the confidence bands from the Pennsylvania data contain both the estimated CATEF and the confidence bands from the North Carolina data.
While we use the same dataset as that used in Abrevaya, Hsu, and Lieli (2015), it is somewhat difficult to compare their results with ours because of differences in the implementations. In particular, the bandwidths are very different. Our choice of bandwidth is around 0.2, while theirs are between 1.4 and 11.2. Nonetheless, we make several remarks. Using small bandwidths (1.4 and 2.8), Abrevaya, Hsu, and Lieli (2015) observe almost no effect for young mothers and a large negative effect for mothers who are 25 to 30 years old. We do not observe such a large difference in the effect across different age groups. Our confidence band is as tight as their confidence band obtained with bandwidth 11.2 even though we use a much smaller bandwidth and our confidence band is uniform. This is possibly because we use an AIPW method, which yields a more efficient estimate than an IPW method does.
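The efficiency gain of AIPW over IPW can be illustrated in a small simulation. This is a stylized sketch with made-up models and known nuisance functions, not the paper's data: when the outcome model is informative, the AIPW pseudo-outcomes have much smaller variance than the IPW ones, which translates into tighter confidence bands.

```python
import numpy as np

# Stylized comparison of IPW and AIPW pseudo-outcome variances.
rng = np.random.default_rng(0)
n = 20000
Z = rng.normal(size=n)
pi = 1 / (1 + np.exp(-0.5 * Z))          # true propensity score
D = (rng.uniform(size=n) < pi).astype(float)
Y = 2.0 * Z + 1.0 * D + rng.normal(size=n)  # true ATE is 1

mu1, mu0 = 2.0 * Z + 1.0, 2.0 * Z        # true outcome regressions
ipw = D * Y / pi - (1 - D) * Y / (1 - pi)
aipw = (mu1 - mu0) + D * (Y - mu1) / pi - (1 - D) * (Y - mu0) / (1 - pi)

# Both are (approximately) unbiased for the ATE, but AIPW's variance
# is typically far smaller because the outcome model absorbs most of Y.
print(ipw.var(), aipw.var())
```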

Conclusion
In this paper, we propose a doubly robust method for estimating the CATEF. We consider the situation where a high-dimensional vector of covariates is needed for identifying the average treatment effect but the covariates of interest are of much lower dimension. Our proposed estimator is doubly robust and does not suffer from the curse of dimensionality. We propose a uniform confidence band that is easy to compute, and we illustrate its usefulness via Monte Carlo experiments and an application to the effects of smoking on birth weights.
There are a few topics to be explored in the future. First, it would be useful to consider the issue of asymptotic biases of the proposed estimator without relying on undersmoothing. For example, it might be possible to extend the approach of Hall and Horowitz (2013) that avoids undersmoothing for our purposes. Second, it would be an interesting exercise to develop a method for estimating the quantile treatment effects conditional on covariates. Third, it is possible to extend our approach to the local average treatment effect. As mentioned in the Introduction, Ogburn, Rotnitzky, and Robins (2015) consider conditioning on Z to achieve identification, but they estimate the local average treatment effect, say LATE(X), as a function of X. However, their specification of LATE(X) is parametric. Our approach can be adapted to specify LATE(X) nonparametrically and to develop a corresponding uniform confidence band. Fourth, this paper does not cover marginal treatment effects that can be identified using the method of local instrumental variables developed by Heckman and Vytlacil (1999, 2005). It would be interesting to develop a uniform confidence band for the marginal treatment effects.
Appendix A. The direct plug-in bandwidth selector of Ruppert, Sheather, and Wand (1995)
In this section, we give a brief description of the direct plug-in bandwidth selector of Ruppert, Sheather, and Wand (1995) for local linear regression. We focus on the case of the Gaussian kernel and a univariate regressor. Note that this bandwidth can be computed with the "dpill" function in the "KernSmooth" package for R (Wand, 2015).
In the following, we denote the dependent variable by $\psi_i$ and the regressor by $X_i$. We consider estimating $E[\psi|X = x]$ for $x \in [a, b]$ for some $a$ and $b$. In our implementation, we use $a = \min_{1\le i\le n} X_i$ and $b = \max_{1\le i\le n} X_i$.
Step 1: We divide the sample into $N$ blocks and estimate a quartic regression model for each block. The number of blocks is chosen by minimizing Mallows' $C_p$:
$$C_p(N) = \frac{\mathrm{RSS}(N)}{\mathrm{RSS}(N_{\max})/(n - 5N_{\max})} - (n - 10N),$$
where $\mathrm{RSS}(N)$ is the residual sum of squares based on a blocked quartic fit over $N$ blocks, and $N_{\max} = \max\{\min(\lfloor n/20 \rfloor, 5), 1\}$.
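Step 1 can be sketched as follows. The blocking scheme and the exact $C_p$ constants follow our reading of the dpill implementation and should be treated as assumptions; the function name is our own.

```python
import numpy as np

def choose_blocks(x, y, n_max):
    """Step 1 sketch: pick the number of blocks N by Mallows' C_p.

    RSS(N) is the residual sum of squares from separate quartic
    (degree-4) polynomial fits on N roughly equal-count blocks of the
    data ordered by x. C_p constants mirror the dpill implementation
    (an assumption, not taken from the paper).
    """
    n = len(x)
    order = np.argsort(x)
    x, y = x[order], y[order]

    def rss(N):
        total = 0.0
        for block in np.array_split(np.arange(n), N):
            coef = np.polyfit(x[block], y[block], 4)
            total += np.sum((y[block] - np.polyval(coef, x[block])) ** 2)
        return total

    sigma2 = rss(n_max) / (n - 5 * n_max)  # variance from the largest fit
    cps = [rss(N) / sigma2 - (n - 10 * N) for N in range(1, n_max + 1)]
    return int(np.argmin(cps)) + 1
```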

Let $\hat m_Q^{(2)}$ and $\hat m_Q^{(4)}$ be the estimates of the second and fourth derivatives of the regression function from the blocked quartic fit, where $\mathcal{X}_j$ denotes the set of the $X_i$ belonging to the $j$-th block, and let $\hat m_Q$ be the estimated regression curve from the blocked quartic fit. From these we compute the functional estimate $\hat\theta_{24} = n^{-1}\sum_{i=1}^{n} \hat m_Q^{(2)}(X_i)\,\hat m_Q^{(4)}(X_i)$ and the variance estimate $\hat\sigma_Q^2 = \mathrm{RSS}(\hat N)/(n - 5\hat N)$.

Step 2: We estimate a local cubic regression model using a pilot bandwidth $\hat g_1$ computed from $\hat\theta_{24}$ and $\hat\sigma_Q^2$. Let $\hat m_C^{(2)}$ be the estimate of the second derivative of the regression function from the local cubic regression, and let $\hat\theta_{22} = n^{-1}\sum_{i=1}^{n} \hat m_C^{(2)}(X_i)^2$. We then estimate a local linear regression model using a second pilot bandwidth $\hat g_2$. Let $\hat m_L$ be the estimated regression curve from this local linear regression, and let $\hat\sigma^2$ be the associated variance estimate, where $w_{ij}$ is the $(1, j)$-th element of $(X_{1,i}' W_i X_{1,i})^{-1} X_{1,i}' W_i$, $X_{1,i}$ is the $n \times 2$ matrix whose first column is a vector of ones and whose second column has $j$-th element $X_j - X_i$, $W_i$ is the diagonal matrix whose $j$-th diagonal element is $K\{(X_j - X_i)/\hat g_2\}/\hat g_2$, and $K$ is the kernel function.
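The hat weights $w_{ij}$ described above can be computed directly. A minimal sketch for a univariate regressor and the Gaussian kernel, evaluating the weights at a single point (function and argument names are our own):

```python
import numpy as np

def local_linear_weights(x_eval, x_data, h):
    """Hat weights w_j of a local linear fit at x_eval with a Gaussian
    kernel: the (1, j) entries of (X1' W X1)^{-1} X1' W, so that
    ghat(x_eval) = sum_j w_j * psi_j."""
    u = (x_data - x_eval) / h
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h)  # kernel weights
    X1 = np.column_stack([np.ones_like(x_data), x_data - x_eval])
    XtW = X1.T * k                       # X1' W (W diagonal, kernel weights)
    beta_map = np.linalg.solve(XtW @ X1, XtW)
    return beta_map[0]                   # first row: weights for the intercept
```

The weights sum to one and reproduce linear functions exactly, which is the defining property of the local linear smoother.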
Step 3: The final bandwidth $\hat h$ is computed from $\hat\theta_{22}$ and $\hat\sigma^2$ by the direct plug-in formula of Ruppert, Sheather, and Wand (1995).

Appendix B. Proofs

Proof of Lemma 1. Because $DY = DY_1$, and $Y_1$ and $D$ are independent of each other conditional on $Z$, we can write $E[\psi_1(W, \alpha_1, \beta)]$ in the form given in (B.1). Suppose that $\beta_0$ satisfies $E[D|Z] = \pi(Z, \beta_0)$ almost surely. Then the right-hand side of (B.1) reduces to $E[Y_1]$. Suppose now that $\alpha_{10}$ satisfies $E[Y_1|Z] = \mu_1(Z, \alpha_{10})$ almost surely. Then the right-hand side of (B.1) again reduces to $E[Y_1]$. Analogously, we have $E[\psi_0(W, \alpha_{00}, \beta_0)] = E[Y_0]$.

The remainder of the appendix gives the proof of Theorem 2. We first establish the linear expansion of the local linear estimator.
for some positive constant c > 0.
Proof of Lemma 3. Let $\hat\Psi$ and $\Psi_0$ denote the $n$-dimensional vectors of estimated pseudo-outcomes and of the regression function evaluated at the data points, respectively, and let $U := (U_i)_{i=1}^{n}$ denote the $n$-dimensional vector of regression errors. Let $e_1$ denote the $(d + 1) \times 1$ vector whose first element is one and all others are zeros. We write $\hat g(x) - g(x)$ as the sum of a leading stochastic term $T_{n1}(x)$, a bias term $T_{n2}(x)$, and a remainder $R_n(x)$.

We first consider the leading stochastic term $T_{n1}(x)$. As in equation (2.10) of Ruppert and Wand (1994), and by the standard empirical process results (see, e.g., Pollard (1984) or van der Vaart and Wellner (1996)) combined with the usual change of variables, we obtain an expansion of $T_{n1}(x)$ that holds uniformly in $x \in I$. Throughout the remainder of the proof, we let $r_n(x) = O_p(n^{-c})$, uniformly in $x$, denote a sequence that can be different in different places for some constant $c > 0$. Then, as in (2.11) of Ruppert and Wand (1994), we obtain (B.3), where $I_d$ is the $d$-dimensional identity matrix. The $o_p(\cdot)$ terms in equation (2.11) of Ruppert and Wand (1994) are pointwise; however, (B.3) holds uniformly in $x \in I$ with polynomially decaying terms $r_n(x)$ under our assumptions.
Again using the standard empirical process result and the method of change of variables, the expansion holds uniformly in $x \in I$; this establishes (B.4). We now move on to the other remainder terms. The proof of Theorem 2.1 (in particular, equation (2.3)) of Ruppert and Wand (1994) implies that $T_{n2}(x) = O(h_n^2)$ uniformly in $x \in I$. The condition that $nh_n^{d+4} \to 0$ at a polynomial rate in $n$ corresponds to the undersmoothing condition. It is straightforward to show that $(nh_n^d)^{1/2} R_n(x) = O(n^{-c})$ uniformly in $x$ for some constant $c > 0$, due to Assumption 1(9), which requires that $\max_{i=1,\dots,n} (nh_n^d)^{1/2} |\psi(W_i, \hat\theta) - \psi(W_i, \theta_0)| = O_p(n^{-c})$ for some constant $c > 0$. Note that by conditions (7) and (8) of Assumption 1, we have $\inf_{n\ge1} \inf_{x\in I} s_n(x) > 0$ and $\sup_{x\in I} |\hat s^2(x) - s_n^2(x)| = O_p(n^{-c})$. Hence, the lemma follows from (B.4) immediately.

We now use the results of Chernozhukov, Chetverikov, and Kato (2014) to obtain Gaussian approximations of the supremum statistic $W_n$. Chernozhukov, Chetverikov, and Kato (2014) established an approximation of $W_n$ by a sequence of suprema of Gaussian processes. For each $n \ge 1$, let $\tilde B_{n,1}$ be a centered Gaussian process indexed by $I$ whose covariance function matches that of the studentized process. Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) establishes the following approximation result.
Proof of Lemma 4. To apply Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014), we first note that Assumption 1 implies that all the regularity conditions for Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) are satisfied. In particular: (2) $K(\cdot)$ is a bounded and continuous kernel function on $\mathbb{R}^d$; (3) the distribution of $X$ has a bounded Lebesgue density $p(\cdot)$ on $\mathbb{R}^d$; (5) $C_I \equiv \sup_{n\ge1} \sup_{x\in I} |c_n(x)| < \infty$, and moreover, for every fixed $n \ge 1$ and every sequence $x_m \in I$ with $x_m \to x \in I$, $c_n(x_m) \to c_n(x)$.
Then the desired result is an immediate consequence of Proposition 3.2 of Chernozhukov, Chetverikov, and Kato (2014) with a singleton set G = {U } and with q = 4 (using their notation) in verifying condition (B1)' of Chernozhukov, Chetverikov, and Kato (2014).
We now show that the Gaussian field obtained in Lemma 4 can be further approximated by a stationary Gaussian field.
Lemma 5. Let Assumption 1 hold. Then for every $n \ge 1$, there is a tight Gaussian random variable $\tilde B_{n,2}$ in $\ell^\infty(I_n)$ with mean zero and covariance function $\rho_d(s - s')$ for $s, s' \in I_n \equiv h_n^{-1} I$, and there is a sequence of random variables $\tilde W_{n,2}$ such that $\tilde W_{n,2} =_d \sup_{x \in I} \tilde B_{n,2}(h_n^{-1} x)$ and, as $n \to \infty$, $|\tilde W_{n,1} - \tilde W_{n,2}| = O_p(n^{-c})$ for some constant $c > 0$.

Proof of Lemma 5. This lemma can be proved as in the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000). As in Remark 8.3 and the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000), we can regard the Gaussian processes $\tilde B_{n,1}$ and $\tilde B_{n,2}$ as Brownian bridges $B_n(\phi_{n,x})$ and $B_n(\varphi_{n,x})$, respectively, in the sense that $E B_n(g) = 0$ and $E[B_n(g) B_n(g')] = \mathrm{cov}(g, g')$ for $g = \phi_{n,x}, g' = \phi_{n,x'}$ or $g = \varphi_{n,x}, g' = \varphi_{n,x'}$.
Define $\delta_n(x) := B_n(\phi_{n,x}) - B_n(\varphi_{n,x})$. Note that $\delta_n(x)$ is also a mean-zero Gaussian process. In addition, we can show that there exists a constant $C > 0$ such that the variance of $\delta_n(x)$ is suitably bounded, uniformly in $x$. Then arguments similar to those used in the proof of Lemma 3.4 of Ghosal, Sen, and van der Vaart (2000) yield the desired result.
Proof of Theorem 2. First note that $a_n = O(\sqrt{\log n})$ because $h_n = C n^{-\eta}$. Lemmas 4 and 5 together imply that $\hat W_n$ can be approximated by $\tilde W_{n,2}$. Note that $\tilde B_{n,2}$, defined in Lemma 5, is a homogeneous Gaussian field with zero mean and covariance function $\rho_d(s)$. Because of the assumption on $K(\cdot)$, the covariance function $\rho_d(s)$ has finite support and is six times differentiable. The latter property implies that the Gaussian process $\tilde B_{n,2}$ is three times differentiable in the mean square sense (see, e.g., Chapter 4 of Rasmussen and Williams (2006)). Then, by Theorem 14.3 of Piterbarg (1996) and also by Theorem 3.2 of Konakov and Piterbarg (1984), there exists $\kappa > 0$ such that the Gumbel approximation holds uniformly in $t$ on any finite interval as $n \to \infty$, where $a_n$ is obtained as the largest solution to the equation $a_n^{d-1} e^{-a_n^2/2} = 1$, $\Lambda_2$ is the covariance matrix of the vector of first derivatives of the Gaussian field $\tilde B_{n,2}$, and $[\cdot]$ is the integer part of a number. A simple calculation yields $\sqrt{\det \Lambda_2} = \lambda^{d/2}$, with $\lambda$ defined in (4.4).

Note: "CATEF" = the estimated CATEF; "our CB" = the uniformly valid confidence band proposed in this paper; "PW CB" = the confidence band that is valid only in a pointwise sense; "Gumbel CB" = the uniformly valid confidence band based on the Gumbel approximation; "ATE" = the estimated value of the average treatment effect.