Estimation of the optimal surrogate based on a randomized trial

A common scientific problem is to determine a surrogate outcome for a long‐term outcome so that future randomized studies can restrict themselves to only collecting the surrogate outcome. We consider the setting that we observe n independent and identically distributed observations of a random variable consisting of baseline covariates, a treatment, a vector of candidate surrogate outcomes at an intermediate time point, and the final outcome of interest at a final time point. We assume the treatment is randomized, conditional on the baseline covariates. The goal is to use these data to learn a most‐promising surrogate for use in future trials for inference about a mean contrast treatment effect on the final outcome. We define an optimal surrogate for the current study as the function of the data generating distribution collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter. We show that this optimal surrogate is a conditional mean and present super‐learner and targeted super‐learner based estimators, whose predicted outcomes are used as the surrogate in applications. We demonstrate a number of desirable properties of this optimal surrogate and its estimators, and study the methodology in simulations and an application to dengue vaccine efficacy trials.


Introduction
A common scientific problem is to determine a surrogate outcome for a long-term outcome so that future randomized studies can restrict themselves to only collecting the surrogate outcome. We consider a study where we observe n independent and identically distributed observations of a random variable consisting of baseline covariates, a treatment, a vector of candidate surrogate outcomes measured at or before an intermediate time point, and the outcome of interest at a final time point. We assume that the treatment is randomized, conditional on the baseline covariates. The goal is to use these data to produce a candidate surrogate that is maximally promising for use in future trials for estimation and testing of a mean contrast treatment effect on the final outcome. We define an optimal surrogate for the current study as the function of the true data generating distribution and of the data collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter. In Section 2, we show the highly desirable property that the optimal surrogate automatically satisfies the Prentice definition. One appealing consequence is that the disastrous "surrogate paradox" [defined as (i) the effect of the treatment on the surrogate is positive, (ii) the surrogate and outcome are strongly positively correlated, but (iii) the effect of the treatment on the outcome is negative] (VanderWeele, 2013) cannot occur. In addition, the average causal effect on the optimal surrogate equals the average causal effect on the clinical endpoint, so the surrogate effect has the same interpretation as the clinical effect.
In Section 3, we give conditions under which the optimality of the surrogate (and thus its Prentice-validity) is invariant to changes in the joint distribution of the covariates, treatment, and intermediate outcomes. This describes "transportability assumptions" under which the average treatment effect on the optimal surrogate in the new trial (optimized in the current trial and applied in the new trial) equals the average treatment effect on the final outcome in the new trial. Consequently, in a thought experiment where the current trial has infinite sample size such that the optimal surrogate itself is measurable and is used as the surrogate in the new trial, a (1 − α)% confidence interval for the optimal surrogate treatment effect parameter is also a (1 − α)% confidence interval for the clinical treatment effect parameter.
In practice, an estimate of the optimal surrogate must be used as the actual surrogate endpoint. In Section 4, we present a super-learner estimator of the optimal surrogate, thereby incorporating the state of the art in machine learning and nonparametric estimation in an asymptotically optimal way. The cross-validated mean squared error can be used as an objective measure of performance of the surrogate in predicting the final outcome. The literature provides a confidence interval for the true mean squared error of the super-learner estimator when applied to the training samples in the cross-validation scheme (e.g., van der Laan et al., 2013), and this is implemented in the SuperLearner R package. In Section 5 we further propose to update the super-learner fit of the optimal surrogate to solve an estimating equation [via targeted minimum loss-based estimation (TMLE)] that ensures that the estimator of the effect of treatment on this targeted estimated optimal surrogate is an asymptotically linear and efficient estimator of the average causal effect of treatment on the outcome of interest in the current trial. The TMLE update is advantageous compared to the untargeted super-learner estimator of the optimal surrogate because of its asymptotic efficiency for the clinical parameter of interest $\theta_0$. It does not, however, improve the ability to generalize inferences to new settings, so the super-learner alone is a sound strategy for generating promising candidate surrogate endpoints.
Our objective is to develop a most-promising surrogate outcome based on a clinical outcome study with possibly high-dimensional candidate surrogates. In future work we plan to address the related important objective of using the developed surrogate outcome as an endpoint in a future study to make inference (i.e., construct confidence intervals) on the causal effect of treatment in that setting without measuring the clinical outcome; future work is needed because inference based on nonparametric super-learning is a hard problem. However, in Web Appendix A we discuss approaches to inference for the future study based on the previously developed estimated optimal surrogate, accounting for the estimation error. We stress that because the assumptions needed for bridging clinical efficacy based on a surrogate endpoint to a new setting (stated in Theorem 2) are generally difficult to verify, it is recommended that wherever possible (e.g., not prohibited by ethics) future efficacy trials assess efficacy directly based on the true clinical endpoint. Moreover, this manuscript is about searching for a promising surrogate and does not address surrogate validation, which is also of critical importance. In Section 6 we apply the proposed approach to two dengue vaccine efficacy trials. Web Appendix G studies the proposed approach in two simulations and Section 7 concludes with remarks.

Connection of the Optimal Surrogate Framework to Other Surrogate Frameworks
The newly proposed framework does not fit squarely into any of five existing frameworks for surrogate endpoints: the Prentice (1989) replacement endpoint framework, the controlled direct and indirect causal effects framework (Robins and Greenland, 1992; Joffe and Greene, 2009), the principal stratification framework (Frangakis and Rubin, 2002), the meta-analysis framework (Daniels and Hughes, 1997; Buyse et al., 2000), and the causal selection diagram framework (Pearl and Bareinboim, 2011). It is more similar to the Prentice, meta-analysis, and causal selection diagram frameworks, in being based purely on statistical parameters that are estimable under the basic assumptions typically made in randomized clinical trials. In particular, it aligns most closely with the Prentice framework by taking as its starting point the Prentice definition of a valid surrogate endpoint. In fact, the optimal surrogate is constructed to guarantee satisfaction of the Prentice definition, a unique advantage compared to previous approaches. Under standard assumptions of randomized trials, if the estimated optimal surrogate is consistent for the optimal surrogate as attained via nonparametric learning, then for large sample size trials it must approximately satisfy the Prentice definition. Web Appendix B elaborates the connections of the optimal surrogate framework with the other surrogate frameworks.
The optimal surrogate approach also breaks new ground by searching for promising surrogates based on supervised nonparametric statistical learning. While historically pre-selected univariable or low-dimensional vector candidate surrogates are considered, the proposed approach allows all collected baseline and intermediate response data to potentially contribute to the optimal surrogate, selected and combined through unbiased machine learning, and not requiring parametric modeling assumptions.

Statistical Formulation of Estimation of an Optimal Surrogate
We observe $n$ i.i.d. observations $O_1, \ldots, O_n$ of $O = (W, A, S, Y) \sim P_0$, where $W$ is a vector of baseline covariates, $A$ is a binary treatment assigned at baseline, $S$ is a vector of intermediate outcomes measured at (or before) some time point $\tau$, and $Y$ is the final univariate outcome of interest measured at a final time point after $\tau$. We assume $A$ is randomized conditional on $W$.
With $S_a$ and $Y_a$ potential outcomes under each treatment $a$, let $X = (W, S_0, S_1, Y_0, Y_1)$ denote the full-data structure, with probability distribution $P_{X,0}$. The observed data distribution $P_0$ of $O$ is determined by the full-data distribution $P_{X,0}$ and the conditional distribution $g_0$ of $A$, given $X$, where $g_0(a \mid X) = g_0(a \mid W)$. The statistical model for $P_0$ makes at most some assumptions about the conditional distribution $g_0$ of $A$ given $W$. For example, if it is a randomized trial, then $g_0$ is known. Thus the statistical model $\mathcal{M}$ for $P_0$ only (possibly) constrains $g_0$, but puts no assumptions on the marginal distribution of $W$ nor on the conditional distribution of $(S, Y)$, given $A, W$.
In future studies, one hopes to replace the final outcome $Y$ by a so-called surrogate outcome measured by the intermediate time point $\tau$. At first, we consider candidate surrogates as true unknown parameters, where we refer to any real-valued function $(W, A, S) \to \psi(W, A, S) \in \mathbb{R}$ as a candidate surrogate, representing a function of the true observed data generating distribution $P_0$ and of the random variables $(W, A, S)$ collected by time $\tau$. If one wants to consider surrogates that depend on $S$ only through a subset or summary of $S$, then the setting is simply applied with $S$ defined as this subset or summary. The key question is how to define a good surrogate in terms of $P_0$. To start with, we want the surrogate $S^\psi \equiv \psi(W, A, S)$ to be a valid surrogate in the actual study, according to the Prentice definition: $E_0(S^\psi_1 - S^\psi_0) = 0$ if and only if $E_0(Y_1 - Y_0) = 0$, where $S^\psi_a = \psi(W, a, S_a)$ for $a \in \{0, 1\}$. This guarantees that in this particular study involving sampling from $P_0$, a test for $H^\psi_0: E_0(S^\psi_1 - S^\psi_0) = 0$ that controls the type-I error at level $\alpha$ yields a test for $H_0: E_0(Y_1 - Y_0) = 0$ with type-I error control at level $\alpha$, where the latter test is simply defined by rejecting $H_0$ if and only if $H^\psi_0$ is rejected. Importantly, by estimating $E_0(Y_1)$ and $E_0(Y_0)$ separately, our approach applies for a general treatment effect contrast.
We also need a criterion depending on $P_0$ that can be used to rank valid surrogates based on the data $O_1, \ldots, O_n$, and to define a $P_0$-optimal surrogate with respect to that criterion. In this manner, we not only select a $P_0$-valid surrogate but a $P_0$-optimal one in the class of $P_0$-valid surrogates. We would like to select the criterion such that the $P_0$-optimal surrogate is not only optimal under $P_0$ with respect to this criterion, but that being $P_0$-optimal implies that the validity of the optimal surrogate is invariant to a variety of possible changes in the data generating experiment. Or, even better, we would like the $P_0$-optimal surrogate to also be a $P$-optimal surrogate (and thus valid) under a variety of $P$'s different from $P_0$.
For these purposes, our proposed criterion is the following full-data mean squared error:

$$\mathrm{MSE}_{P_{X,0}}(\psi) = E_{P_{X,0}} \sum_{a \in \{0,1\}} g_0(a \mid W)\,\{Y_a - \psi(W, a, S_a)\}^2. \qquad (1)$$

That is, our goal is to minimize the weighted mean squared prediction error for predicting the actual counterfactual outcome of interest, across the different treatment values, with the constraint that the solution must satisfy the Prentice definition as stated above. The idea is that if a participant is assigned treatment $A = a$ and one uses as surrogate outcome $S^\psi_a = \psi(W, a, S_a)$, then one wants that surrogate outcome to be a good approximation of the future outcome $Y_a$. Depending on the future use of the surrogate, this particular weighting scheme $g_0(a \mid W)$ could be replaced by another weighting scheme. Given a class of possible surrogate functions $\psi(\cdot)$, the $P_0$-optimal surrogate in this class is defined as the minimizer of (1) over the functions in the class that satisfy the Prentice definition:

$$\psi^F_0 = \arg\min_{\psi} \mathrm{MSE}_{P_{X,0}}(\psi).$$

We focus on the nonparametric class consisting of all functions of $(W, A, S)$. In this case, the choice of weight in $\mathrm{MSE}_{P_{X,0}}$ (i.e., $g_0(a \mid W)$) does not affect the optimal solution: the optimal surrogate is optimal for each choice of weight. The $P_0$-optimal surrogate $\psi^F_0$ is given by

$$\psi^F_0(W, a, S_a) = E_{P_{X,0}}(Y_a \mid W, S_a), \quad a \in \{0, 1\},$$

which is a standard solution to a minimization problem that is the same with and without the Prentice definition constraint. The conditional randomization assumption implies that the full-data MSE equals the observed data MSE:

$$\mathrm{MSE}_{P_{X,0}}(\psi) = E_{P_0}\{Y - \psi(W, A, S)\}^2.$$

As a consequence, $\psi^F_0$ is identifiable from $P_0$ and can also be defined as

$$\psi_0 = \arg\min_{\psi} E_{P_0}\{Y - \psi(W, A, S)\}^2, \quad \text{that is,} \quad \psi_0(W, A, S) = E_{P_0}(Y \mid W, A, S).$$

In other words, due to the randomization of $A$, we have $E_{P_{X,0}}(Y_a \mid W, S_a = s) = E_{P_0}(Y \mid W, A = a, S = s)$. It also follows that $E_{P_0}(\psi_0(W, a, S_a) \mid W) = E_{P_0}(Y_a \mid W)$, which demonstrates that the treatment-specific counterfactual mean of the $P_0$-optimal surrogate equals the treatment-specific counterfactual mean of the outcome. This shows that an average causal effect of treatment on the $P_0$-optimal surrogate equals the desired average causal effect of treatment on the outcome. We state this as a theorem.
Theorem 1. Assume positivity: $P_0(A = a \mid W) > 0$ a.e. for $a \in \{0, 1\}$. Then the minimizer of the counterfactual mean squared error $\psi \to \mathrm{MSE}_{P_{X,0}}(\psi)$ over all functions $(W, A, S) \to \psi(W, A, S)$ satisfying the Prentice definition of a valid surrogate endpoint is given by:

$$\psi_0(W, A, S) = E_{P_0}(Y \mid W, A, S).$$

We call this the $P_0$-optimal surrogate. We also note that the counterfactuals of this $P_0$-optimal surrogate are given by:

$$\psi_0(W, a, S_a) = E_{P_{X,0}}(Y_a \mid W, S_a), \quad a \in \{0, 1\}.$$

This shows that the $P_0$-optimal surrogate has the perfect properties of a valid surrogate in the actual $P_0$-study. Moreover, if each treatment is considered separately, then the minimizer of $\psi_a \to \mathrm{MSE}_{P_{X,a,0}}(\psi_a)$ over all functions $(W, S) \to \psi_a(W, S)$ is given by $\psi_{a,0}(W, S) = E_{P_0}(Y \mid W, A = a, S)$, where $\mathrm{MSE}_{P_{X,a,0}}$ is the $a$th term in the sum $\mathrm{MSE}_{P_{X,0}}(\psi)$ in (1). Therefore, the $P_0$-optimal surrogate is the same whether one minimizes the overall MSE in (1) or minimizes the treatment-specific MSEs separately (as we do in the application and simulations).
In practice, of course, the optimal surrogate cannot be used as a study endpoint; rather, it must be estimated and the fitted values used. The statistical estimation problem for the original trial is now defined: we observe $n$ i.i.d. $O \sim P_0 \in \mathcal{M}$, the target parameter mapping $\Psi$ on $\mathcal{M}$ is defined by $\Psi(P)(W, A, S) = E_P(Y \mid W, A, S)$, and $\psi_0 = \Psi(P_0)$ is the true value we aim to learn from the data.
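The equality between the average causal effect on the optimal surrogate and the average causal effect on the outcome can be checked numerically. The following is a minimal sketch under an invented toy data-generating model (the coefficients, the uniform covariate, and the function names `simulate`, `psi0`, and `arm_mean` are all illustrative assumptions, not part of the paper). Because the toy model is known, the true conditional mean $E(Y \mid W, A, S)$ is available in closed form and plays the role of $\psi_0$.

```python
import random

def simulate(n, seed=0):
    """Draw (W, A, S, Y) from a toy randomized trial (illustrative assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.random()                         # baseline covariate W
        a = 1 if rng.random() < 0.5 else 0       # randomized treatment, g_0(1 | W) = 0.5
        s = w + 0.8 * a + rng.gauss(0.0, 1.0)    # intermediate outcome S
        y = 0.5 * w + 0.3 * a + 0.6 * s + rng.gauss(0.0, 1.0)  # final outcome Y
        rows.append((w, a, s, y))
    return rows

def psi0(w, a, s):
    """True conditional mean E(Y | W, A, S) under the toy model above."""
    return 0.5 * w + 0.3 * a + 0.6 * s

def arm_mean(values, arms, a):
    """Sample mean of `values` among participants with treatment `a`."""
    selected = [v for v, arm in zip(values, arms) if arm == a]
    return sum(selected) / len(selected)

rows = simulate(100_000)
arms = [r[1] for r in rows]
outcome = [r[3] for r in rows]
surrogate = [psi0(r[0], r[1], r[2]) for r in rows]

# Difference in arm means estimates the average causal effect in a randomized trial.
ate_outcome = arm_mean(outcome, arms, 1) - arm_mean(outcome, arms, 0)
ate_surrogate = arm_mean(surrogate, arms, 1) - arm_mean(surrogate, arms, 0)
true_ate = 0.3 + 0.6 * 0.8   # direct effect plus the effect transmitted through S
```

In this sketch both `ate_outcome` and `ate_surrogate` estimate the same quantity `true_ate`, and the surrogate-based estimate has smaller sampling variability because the outcome noise has been averaged out.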

Conditions on the New Study $P$ under which the $P_0$-Optimal Surrogate is Also the $P$-Optimal Surrogate

Invariance of the $P_0$-Optimal Surrogate to Changes in the Distribution of $(W, A, S)$
The following theorem is a trivial consequence of the fact that the $P_0$-optimal surrogate $\psi_0(W, A, S) = E_0(Y \mid W, A, S)$ depends on $P_0$ only through the conditional mean of $Y$ given $(W, A, S)$. Nonetheless, it demonstrates that the $P_0$-optimal surrogate is also the $P$-optimal surrogate in any study $P$ that only differs in the joint distribution of $(W, A, S)$ and preserves the conditional randomization of treatment. We assume both the current and future studies are randomized studies for data structures $(W, A, S, Y)$ and $(W^*, A^*, S^*, Y^*)$ with probability distributions $P_0$ and $P$, respectively.
Theorem 2 gives sufficient conditions for the $P_0$-optimal surrogate to remain a valid surrogate in a new randomized study that differs in the marginal distribution of $W$, in the conditional distribution of $A$ given $W$, and in the conditional distribution of $S$ given $A, W$.

Generalizability when the Surrogate Completely Blocks the Effects of Both Treatments
If the new study considers an entirely different treatment than the current study, then its effect on the outcome will be different and one would thus expect that the conditional mean of $Y$, given $W, A, S$, will be modified as well. Therefore, the conditions on the new study $P$ in the previous theorem essentially exclude studies that evaluate a new treatment. However, there is an important exception where Equal Conditional Means may more easily hold. The following theorem is merely a special case of the previous theorem, but its implication is that if the outcome $Y$ only depends on the treatment through its effect on the surrogate vector $S$ (i.e., Prentice's "full mediation" criterion), then the new study can even consider a different treatment as long as it also only affects $Y$ through $S$. That is, if $S$ is rich enough that it blocks the effect of the future treatment on the outcome, then the $P_0$-optimal surrogate can also be used in future studies evaluating different treatments, under a simpler Equal Conditional Means assumption that conditions on $(W, S)$ but not on $A$.
Theorem 3. In addition to the conditions of Theorem 2, assume E 0 (Y |W, A, S) = E 0 (Y |W, S) [and thus also assume E P (Y * |W * , A * , S * ) = E P (Y * |W * , S * )]. Then, the P-optimal surrogate equals the P 0 -optimal surrogate and E P (Y * |

How to Define the Surrogate in a Future Study when the Transportability Assumptions Fail?
Typically it is not reasonable to assume that the intermediate variable $S$ completely blocks the effect of treatment (current and new) on the outcome, and even if it did, Equal Conditional Means may not hold. Web Appendix A discusses how $E_{P_0}(Y \mid W, A, S)$ may still often be a good candidate surrogate for such a future study, and discusses implications of differences between $E_P(Y^* \mid W^* = w, A^* = a, S^* = s)$ and $E_{P_0}(Y \mid W = w, A = a, S = s)$.

Super-Learning of the $P_0$-Optimal Surrogate
Estimation of the $P_0$-optimal surrogate is a standard prediction problem. That is, we estimate $E_0(Y \mid W, A, S)$ with a minimizer of the risk of a loss: $\psi_0 = \arg\min_\psi P_0 L(\psi)$, with $Pf \equiv \int f(o)\,dP(o)$. For example, one could use the squared error loss $L(\psi)(O) = \{Y - \psi(W, A, S)\}^2$. To construct an optimal estimator among any given class of candidate estimators, we use loss-based super-learning. The oracle inequality for the cross-validation selector guarantees that the estimator is asymptotically at least as good as any candidate in the set of candidate estimators (van der Laan, Polley, and Hubbard, 2007; van der Laan and Rose, 2011). We summarize how super-learner is used, with details provided in Web Appendix D. Super-learner operates by specifying a library of candidate estimators, and for each one computing the cross-validated risk (CV-RISK) [formula (1) in Web Appendix D] using squared error loss $L(\cdot)$ to be consistent with our proposed criterion (1) for the optimal surrogate. The discrete super-learner estimator is the candidate estimator with smallest CV-RISK and the super-learner is the convex combination of candidate estimators with smallest CV-RISK. Estimation of the CV-RISK of the super-learner itself involves re-running the whole super-learner on learning samples and averaging estimates of the conditional risk on test samples.
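The selection steps described above can be sketched in a few lines. This is a minimal illustration, not the SuperLearner R package: the two-learner library (`fit_mean`, `fit_linear`), the single predictor `x` standing in for $(W, A, S)$, the simulated data, and the grid search over convex weights are all assumptions made for brevity. A real library would be larger and the weights would come from a constrained optimization.

```python
import random

def fit_mean(train):
    """Candidate 1: predict the training mean of Y."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    """Candidate 2: simple least-squares line in one predictor."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def cv_predictions(data, learners, v=5):
    """Out-of-fold predictions: each point is predicted by fits that never saw it."""
    preds = [[0.0] * len(data) for _ in learners]
    for fold in range(v):
        train = [d for i, d in enumerate(data) if i % v != fold]
        fits = [fit(train) for fit in learners]
        for i, (x, _) in enumerate(data):
            if i % v == fold:
                for k, f in enumerate(fits):
                    preds[k][i] = f(x)
    return preds

def risk(pred, data):
    """Empirical squared-error risk of a vector of predictions."""
    return sum((y - p) ** 2 for p, (_, y) in zip(pred, data)) / len(data)

rng = random.Random(1)
data = []
for _ in range(500):
    x = rng.random()
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

learners = [fit_mean, fit_linear]
preds = cv_predictions(data, learners)
cv_risks = [risk(p, data) for p in preds]

# Discrete super-learner: the single candidate with the smallest CV-RISK.
discrete_choice = min(range(len(learners)), key=lambda k: cv_risks[k])

# Super-learner: convex combination with the smallest CV-RISK (grid search here).
def combo_risk(alpha):
    mix = [alpha * p0 + (1 - alpha) * p1 for p0, p1 in zip(preds[0], preds[1])]
    return risk(mix, data)

best_alpha = min([k / 100 for k in range(101)], key=combo_risk)
sl_cv_risk = combo_risk(best_alpha)
```

Note that `sl_cv_risk` is computed on the same folds used to choose the weights, so it is optimistic; an honest CV-RISK for the super-learner requires the outer layer of cross-validation mentioned in the text, which this sketch omits.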
One can also define a cross-validated $R^2$ (CV-$R^2$) taking values between 0 and 1 based on CV-RISK [formula (2) in Web Appendix D] that provides a universal measure of the strength of a given estimated surrogate, allowing us to compare different candidate surrogate estimators within and across studies. For example, one might construct a super-learner $\psi_{n,\delta}$ based on $\delta$-specific subsets $(W_\delta, A, S_\delta)$ of the complete $(W, A, S)$, where $\delta$ is a measure of the complexity of the resulting surrogate as a function of $(W, A, S)$. One could now plot the CV-$R^2$ of $\psi_{n,\delta}$ against $\delta$ for a sequence of $\delta$ values, and the user can decide on a choice of $\delta$ taking into account both complexity and strength of the surrogate. This analysis is practically important given that all of the variables $(W_\delta, S_\delta)$ used in the estimated optimal surrogate need to be collected in a future trial to use this surrogate in that trial; in practice some variable sets may be selected based on their high likelihood of being collected.
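As a rough illustration of the complexity versus strength trade-off, the sketch below computes a CV-$R^2$ and picks a complexity level. Formula (2) of Web Appendix D is not reproduced in this text, so the definition used here, one minus CV-RISK divided by the empirical variance of $Y$, is an assumption. The selection rule in `choose_delta` (smallest $\delta$ within a tolerance of the best CV-$R^2$) and the numbers in `r2_by_delta` are hypothetical; the text leaves this choice to the user.

```python
def cv_r2(cv_risk, y):
    """Assumed definition: 1 - CV-RISK / empirical variance of Y."""
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    return 1.0 - cv_risk / var_y

def choose_delta(r2_by_delta, tol):
    """Hypothetical rule: smallest delta whose CV-R^2 is within `tol` of the maximum."""
    best = max(r2_by_delta.values())
    return min(d for d, r2 in r2_by_delta.items() if r2 >= best - tol)

# Hypothetical CV-R^2 values for three nested variable sets of growing size.
r2_by_delta = {1: 0.30, 2: 0.41, 3: 0.42}
chosen = choose_delta(r2_by_delta, tol=0.02)
```

With these made-up values the rule prefers the middle variable set, since the largest set adds almost no predictive strength while requiring more measurements in a future trial.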

The Targeted Estimated Optimal Surrogate Captures All Information about Outcome for the Sake of Estimation of the Average Treatment Effect
One could estimate the optimal surrogate $E_0(Y \mid W, A, S)$ based on any model for the conditional mean. If $(W, S)$ is moderate-to-high dimensional, then it is typically infeasible to attain a consistent estimator of $E_0(Y \mid W, A, S)$ based on a particular parametric model, because of insufficient knowledge. Accordingly the super-learner estimator is advantageous for maximizing the chance of achieving consistent estimation and providing the most accurate finite-sample estimation. In this section, we provide a result that updating the initial super-learner estimator through TMLE yields a targeted estimate of the $P_0$-optimal surrogate that captures all information about the clinical outcome in the following sense. If one uses this targeted estimate as the actual outcome of interest in the current study, and one estimates the average treatment effect on this surrogate with an efficient TMLE based on the reduced data in the current study that ignores the clinical outcome, then this TMLE estimate is an efficient estimator of the average treatment effect on the actual clinical outcome.

The Targeted Estimate of the $P_0$-Optimal Surrogate Using TMLE
Suppose $Y$ is binary or continuous in $(0, 1)$. Let $\psi_n$ be the super-learner estimator of $\psi_0(W, A, S) = E_0(Y \mid W, A, S)$. Consider the submodel $\mathrm{Logit}\,\psi_n(\epsilon) = \mathrm{Logit}\,\psi_n + \epsilon H_{g_n}$, where $H_{g_n}(W, A, S) = (2A - 1)/g_n(A \mid W)$, and $g_n$ is an estimator of $g_0(A \mid W)$. In a randomized clinical trial (RCT), we might set $g_n = g_0$. Let $\epsilon_n = \arg\min_\epsilon P_n L(\psi_n(\epsilon))$ be the MLE, where $P_n$ is the empirical distribution of the $n$ observations and

$$L(\psi)(O) = -\{Y \log \psi(W, A, S) + (1 - Y) \log(1 - \psi(W, A, S))\} \qquad (2)$$

is the log-likelihood loss function. This $\epsilon_n$ is easily calculated with a standard univariate logistic regression of $Y$ on $H_{g_n}$, with $\mathrm{Logit}\,\psi_n$ as an offset. Let $\psi^\#_n = \psi_n(\epsilon_n)$ be the corresponding estimator of $\psi_0$, which is a TMLE (indicated by the superscript #) for reasons that we summarize below. This estimator $\psi^\#_n$ does not have a closed-form solution unless the super-learner library is very simple, but this does not matter for the purpose of achieving a most predictive surrogate, given that its values are easily calculated.
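The fluctuation step can be sketched as a one-parameter logistic regression with an offset, fit here by Newton-Raphson. This is an illustrative sketch only: the simulated trial, the deliberately crude initial fit `psi_init` (standing in for a super-learner fit), and the helper name `fit_epsilon` are assumptions, and the randomization probability is taken as known and equal to 0.5.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def fit_epsilon(y, initial, h, max_iter=100, tol=1e-12):
    """MLE of eps in logit p = logit(initial) + eps * h (no intercept), via Newton-Raphson."""
    offset = [logit(p) for p in initial]
    eps = 0.0
    for _ in range(max_iter):
        p = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, p))
        info = sum(hi * hi * pi * (1.0 - pi) for hi, pi in zip(h, p))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps

# Toy randomized trial with a binary outcome (illustrative assumption).
rng = random.Random(2024)
n, g = 2000, 0.5
data = []
for _ in range(n):
    w = rng.random()
    a = 1 if rng.random() < g else 0
    s = rng.gauss(0.5 * a, 1.0)
    y_i = 1 if rng.random() < expit(-1.0 + w + 0.5 * a + s) else 0
    data.append((w, a, s, y_i))

y = [d[3] for d in data]
psi_init = [expit(-1.0 + d[2]) for d in data]                  # crude initial fit, ignores W and A
h = [(1.0 / g) if d[1] == 1 else (-1.0 / (1.0 - g)) for d in data]   # (2A - 1) / g(A | W)

eps_n = fit_epsilon(y, psi_init, h)
psi_targeted = [expit(logit(p) + eps_n * hi) for p, hi in zip(psi_init, h)]

# Empirical mean of H * (Y - psi_targeted); the MLE drives this to zero.
score_after = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, psi_targeted)) / n
```

After the update, `score_after` is zero up to numerical error, which is the property the targeting step is designed to deliver regardless of how the initial fit was obtained.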
TMLE is a general approach that allows one to target an initial estimator of a data distribution or parameter thereof in such a way that this targeted version will solve a user-supplied estimating equation (van der Laan and Rose, 2011). In a typical application of TMLE, one targets the initial estimator to solve the efficient influence curve equation for the target parameter of interest so that the resulting substitution estimator is an asymptotically efficient estimator. In the above case, we depart from this objective, instead using the TMLE solely as a technical procedure to make the estimator solve the equation

$$0 = \frac{1}{n} \sum_{i=1}^n H_{g_n}(W_i, A_i)\,\{Y_i - \psi^\#_n(W_i, A_i, S_i)\}, \qquad (3)$$

which is the crucial equation that we will need later for a main result (Theorem 4): a TMLE of the average treatment effect (ATE) on the estimated optimal surrogate $\psi^\#_n$ is also a TMLE of the ATE on $Y$ and is thus asymptotically linear and efficient for the ATE on $Y$.

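The Targeted Estimate of the $P_0$-Optimal Surrogate is Optimal in the Current Study
Suppose we use this $\psi^\#_n(W, A, S)$ in place of the final outcome $Y$ and, based on the reduced data $(W_i, A_i, \psi^\#_n(W_i, A_i, S_i))$, $i = 1, \ldots, n$, compute a TMLE $\theta^{TMLE}_{\psi^\#_n}$ of the average treatment effect on this surrogate, $\theta_{\psi^\#_n} = E_0\{\psi^\#_n(W, 1, S_1) - \psi^\#_n(W, 0, S_0)\}$. Under conditions, this TMLE is an efficient estimator of this data-adaptive target parameter $\theta_{\psi^\#_n}$, so that its asymptotic properties follow from the well-known theory for TMLE.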
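First, we note that an efficient estimator of $EY_1 - EY_0$ can ignore $S$, so that it suffices to work with $(W, A, Y)$ (in our set-up with complete data on $Y$ the efficient influence curve is the same with or without $S$). Let $\bar{Q}^0_n$ be an initial estimator of $\bar{Q}_0 = E_0(Y \mid W, A)$ based on $(W, A, Y)$. Let $L(\bar{Q})$ be the log-likelihood loss (2), let $\mathrm{Logit}\,\bar{Q}^0_n(\epsilon) = \mathrm{Logit}\,\bar{Q}^0_n + \epsilon H_{g_n}$ be the least favorable submodel, and let $\epsilon_n = \arg\min_\epsilon P_n L(\bar{Q}^0_n(\epsilon))$ be the MLE of the fluctuation parameter $\epsilon$. The TMLE of $\bar{Q}_0$ is defined as $\bar{Q}^1_n = \bar{Q}^0_n(\epsilon_n)$ and the TMLE of the average treatment effect is the substitution estimator

$$\theta^{TMLE}_n = \frac{1}{n} \sum_{i=1}^n \{\bar{Q}^1_n(W_i, 1) - \bar{Q}^1_n(W_i, 0)\}.$$

Due to the TMLE-update step we have that $\bar{Q}^1_n$ solves the score equation

$$0 = \frac{1}{n} \sum_{i=1}^n H_{g_n}(W_i, A_i)\,\{Y_i - \bar{Q}^1_n(W_i, A_i)\}, \qquad (4)$$

and, as a result, the TMLE $Q^1_n = (Q_{W,n}, \bar{Q}^1_n)$ (with $Q_{W,n}$ the empirical distribution of $W$) solves the efficient influence curve equation

$$0 = P_n D_{eff}(Q^1_n, g_n), \qquad (5)$$

where $D_{eff} = D_{eff,1} - D_{eff,0}$,

$$D_{eff,a}(Q^1_n, g_n)(W_i, A_i, Y_i) = \frac{I(A_i = a)}{g_n(a \mid W_i)}\,(Y_i - \bar{Q}^1_n(W_i, a)) + \bar{Q}^1_n(W_i, a) - \theta^{TMLE,a}_n,$$

and $\theta^{TMLE,a}_n = \frac{1}{n} \sum_{i=1}^n \bar{Q}^1_n(W_i, a)$ depends on both $\bar{Q}^1_n$ and $Q_{W,n}$. If we replace $H_{g_n}$ by a two-dimensional $(H^0_{g_n}, H^1_{g_n})$ with $H^a_{g_n} = I(A = a)/g_n(a \mid W)$, then the updated $\bar{Q}^1_n = \bar{Q}^0_n(\epsilon_n)$ (where $\epsilon_n$ is now a two-dimensional parameter) also yields a TMLE for the bivariate parameter $(EY_0, EY_1)$ (the above TMLE targets the difference), which solves $0 = P_n D_{eff,a}(Q^1_n, g_n)$ for each $a = 0, 1$. We use such treatment-specific TMLEs because in the application we estimate non-additive treatment effects (i.e., the relative risk $EY_1/EY_0$). These equations are standard TMLE equations (e.g., defined in van der Laan and Rose, 2011, p. 527-529), and are the basis for the double robustness and asymptotic efficiency of the TMLE.

Now consider the reduced data $(W, A, \psi^\#_n(W, A, S))$. An initial estimator of the conditional mean of $\psi^\#_n$ given $(W, A)$ is obtained by regressing $\psi^\#_n$ on $(W, A)$, where one might again use super-learning. Let us denote this estimator by $\bar{Q}^{\#0}_n$. This is nothing else than an estimator of $\bar{Q}_0(W, A) = E_0(E_0(Y \mid W, A, S) \mid W, A)$, which estimates the inner expectation $E_0(Y \mid W, A, S)$ with $\psi^\#_n$ and then estimates the outer expectation with a regression of $\psi^\#_n$ on $(W, A)$. One now defines the submodel $\mathrm{Logit}\,\bar{Q}^{\#0}_n(\epsilon) = \mathrm{Logit}\,\bar{Q}^{\#0}_n + \epsilon H_{g_n}$, and defines $\epsilon_{n1} = \arg\min_\epsilon P_n L_{\psi^\#_n}(\bar{Q}^{\#0}_n(\epsilon))$, where $L_{\psi^\#_n}(\bar{Q}) = -\{\psi^\#_n \log \bar{Q}(W, A) + (1 - \psi^\#_n) \log(1 - \bar{Q}(W, A))\}$ is the log-likelihood loss (2) with $Y$ replaced by $\psi^\#_n(W, A, S)$.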
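This TMLE $\bar{Q}^{\#1}_n = \bar{Q}^{\#0}_n(\epsilon_{n1})$ solves the following score equation (analogous to (4)):

$$0 = \frac{1}{n} \sum_{i=1}^n H_{g_n}(W_i, A_i)\,\{\psi^\#_n(W_i, A_i, S_i) - \bar{Q}^{\#1}_n(W_i, A_i)\}. \qquad (6)$$

Now we utilize the fact that $\psi^\#_n$ was targeted so that it solves equation (3), that is, $0 = P_n H_{g_n}(Y - \psi^\#_n)$. Adding equation (3) to the score equation (6) implies that $\bar{Q}^{\#1}_n$ solves

$$0 = \frac{1}{n} \sum_{i=1}^n H_{g_n}(W_i, A_i)\,\{Y_i - \bar{Q}^{\#1}_n(W_i, A_i)\}.$$

Thus, this TMLE $Q^1_n = (Q_{W,n}, \bar{Q}^{\#1}_n)$ also solves the efficient influence curve equation for $\theta_0$, $0 = P_n D_{eff}(Q^1_n, g_n)$, based on the original data $(W, A, S, Y)$, with the only twist that it uses a special initial estimator $\bar{Q}^{\#0}_n$ of $\bar{Q}_0$ (as discussed above, involving first regressing $Y$ on $W, A, S$ and then regressing that fit on $W, A$). This proves that $\theta^{TMLE}_{\psi^\#_n}$, which we defined as a TMLE of the treatment effect on the estimated optimal surrogate, is also a double robust efficient substitution estimator of the clinical treatment effect $\theta_0$.

Theorem 4. Consider the estimator $\psi^\#_n$ of the optimal surrogate $\psi_0 = E_0(Y \mid W, A, S)$ and the TMLE $\theta^{TMLE}_{\psi^\#_n}$ of the average treatment effect on $\psi^\#_n$ based on the reduced data $(W, A, \psi^\#_n(W, A, S))$, and let $\|f\|_{P_0} = \sqrt{\int f(o)^2\,dP_0(o)}$. Assume 1) $D_{eff}(Q^1_n, g_n)$ falls in a $P_0$-Donsker class with probability tending to 1; 2) $\|\bar{Q}^{\#1}_n - \bar{Q}_0\|_{P_0}\,\|g_n - g_0\|_{P_0} = o_P(1/\sqrt{n})$ (so in an RCT, this only requires $\|\bar{Q}^{\#1}_n - \bar{Q}_0\|_{P_0} \to 0$ in probability); 3) for some $\delta > 0$, $\min_{a \in \{0,1\}} g_0(a \mid W) > \delta > 0$ with probability 1. Then $\theta^{TMLE}_{\psi^\#_n}$ is an asymptotically linear and efficient estimator of $\theta_0 = E_0(Y_1 - Y_0)$.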
Thus, even though $\theta^{TMLE}_{\psi^\#_n}$ is based on a reduced data structure, it is asymptotically linear with influence curve equal to that of the TMLE $\theta^{TMLE}_n$ of $\theta_0 = E_0(Y_1 - Y_0)$ based on the observed data $(W, A, S, Y)$. This is an important result since it establishes that in our original study the estimated optimal surrogate carries as much information as the outcome itself for the sake of estimation of the average clinical treatment effect (and for other contrasts of $EY_0$ and $EY_1$). This means that a $(1 - \alpha)$% confidence interval for $\theta_0$ based on $\theta^{TMLE}_{\psi^\#_n}$ is asymptotically as narrow as a $(1 - \alpha)$% confidence interval based on an efficient estimator of $\theta_0$ using $(W, A, S, Y)$.
This result may be surprising given that the estimated optimal surrogate is based on the reduced data. In fact, if a super-learner estimator were used as the estimated optimal surrogate, without targeting the estimator, then the TMLE of the ATE on that untargeted surrogate would not be efficient for $E_0(Y_1 - Y_0)$. Specifically, the bias of a super-learner fit is of larger order than $1/\sqrt{n}$, and this bias translates into the same order of bias for the ATE on $Y$. The key to achieving efficiency is therefore to use a targeted super-learner fit of the optimal surrogate, designed so that the TMLE of the ATE on this targeted estimate is in fact an asymptotically linear estimator of the ATE on $Y$. However, this targeting is only possible if we use the actual observed outcomes $Y$, and the targeting is specific to the current data generating experiment. Thus the TMLE of the ATE on our targeted surrogate in a new study would not result in an asymptotically efficient estimator of the ATE on $Y^*$. Nevertheless, it is an appealing property of the estimated optimal surrogate that in the current study it yields an asymptotically efficient estimator of the average clinical treatment effect.
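The mechanics behind this result can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the analysis code of the paper: the simulated trial, the crude initial fits (a regression ignoring $W$ and $A$ in the first step, a constant in the second), and the helper `fit_epsilon` are all invented for the demonstration. It targets the surrogate using $Y$, then runs a second fluctuation that sees only $(W, A, \psi^\#_n)$, and finally verifies that the resulting fit also solves the score equation written in terms of $Y$.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def fit_epsilon(target, initial, h, max_iter=100, tol=1e-12):
    """Newton-Raphson MLE of eps in logit q = logit(initial) + eps * h,
    with `target` in [0, 1] playing the role of the outcome."""
    offset = [logit(p) for p in initial]
    eps = 0.0
    for _ in range(max_iter):
        q = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (t - qi) for hi, t, qi in zip(h, target, q))
        info = sum(hi * hi * qi * (1.0 - qi) for hi, qi in zip(h, q))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps

rng = random.Random(7)
n, g = 2000, 0.5                      # known randomization probability (assumption)
rows = []
for _ in range(n):
    w = rng.random()
    a = 1 if rng.random() < g else 0
    s = rng.gauss(0.5 * a, 1.0)
    y_i = 1 if rng.random() < expit(-1.0 + w + 0.5 * a + s) else 0
    rows.append((w, a, s, y_i))
y = [r[3] for r in rows]
h = [(1.0 / g) if r[1] == 1 else (-1.0 / (1.0 - g)) for r in rows]

# Step 1: target the surrogate using Y, so that it solves equation (3).
psi_init = [expit(-1.0 + r[2]) for r in rows]
eps1 = fit_epsilon(y, psi_init, h)
psi_sharp = [expit(logit(p) + eps1 * hi) for p, hi in zip(psi_init, h)]

# Step 2: TMLE of the ATE on the surrogate; Y is not used anywhere in this step.
m = sum(psi_sharp) / n                # constant initial fit of E(psi_sharp | W, A)
eps2 = fit_epsilon(psi_sharp, [m] * n, h)
q_treated = expit(logit(m) + eps2 / g)
q_control = expit(logit(m) - eps2 / (1.0 - g))
theta_surrogate = q_treated - q_control

# Check: the step-2 fit also solves the score equation written with Y.
q_obs = [q_treated if r[1] == 1 else q_control for r in rows]
score_with_y = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q_obs)) / n
```

Here `score_with_y` vanishes up to numerical error even though step 2 never touches `y`, which is the algebraic fact that lets the surrogate-based TMLE inherit the efficient influence curve equation for the clinical effect.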

Application to Two Dengue Vaccine Efficacy Trials
Two randomized, double-blinded, placebo-controlled, multicenter, Phase 3 trials of the identical recombinant, live, attenuated, tetravalent dengue vaccine (CYD-TDV) versus placebo were conducted in Asia (Capeding et al., 2014) and Latin America (Villar et al., 2015), respectively. These trials, referred to as CYD14 and CYD15, randomized 10,275 2-14 year-old children and 20,869 9-16 year-old children, respectively, in 2:1 allocation to vaccine:placebo, with immunizations administered at months 0, 6, and 12. The primary analyses assessed vaccine efficacy (VE) against symptomatic, virologically confirmed dengue (VCD) occurring at least 28 days after the third immunization through to the Month 25 visit. Based on a proportional hazards model, estimated VE was 56.5% (95% CI 43.8-66.4) for CYD14 and 64.7% (95% CI 58.7-69.8) for CYD15.
Note to Table 1: All learners were fit separately for each treatment group $A = a$ for $a \in \{0, 1\}$ as described in Section 6.1. This is explicitly stated here for SL.mean.
The trials measured, from Month 13 blood samples, neutralizing antibody titers to each of the four dengue serotypes contained in the CYD-TDV vaccine using two different assays [PRNT 50 and Microneutralization Version 2 (MNv2)]. Our analysis restricts to participants with Month 13 titer data, which were measured in a random sample of study participants and in all participants with the study endpoint. We use simple inverse probability weighted complete-case analysis to account for this sampling design. Each trial data set consists of baseline covariates W (age, sex, estimated frequencies of the 4 serotypes causing dengue disease in placebo recipients in the participant's country of residence), treatment A (1 = vaccine, 0 = placebo), S (several variables based on the eight Month 13 titer measurements), and Y , the indicator of occurrence of the VCD endpoint between Month 13 and Month 25. The analyzed cohorts are participants observed to be free of the VCD endpoint through to the Month 13 visit with (W, A, S) measured. We treat CYD14 as the current trial and CYD15 as the future trial, where in CYD15 we only include data from 9-14 year-olds to increase the credibility of the contained support assumption of Theorem 2.
We first calculate the targeted estimated optimal surrogate $\psi^\#_n(W, A, S)$ for the CYD14 trial, thus obtaining TMLEs $\theta^{TMLE,a}_{\psi^\#_n}$ of each mean $\theta^a_{\psi^\#_n} = E_0(\psi^\#_n(W, a, S_a))$ and of the vaccine efficacy contrast $\mathrm{VE}_{\psi^\#_n} = 1 - \theta^1_{\psi^\#_n}/\theta^0_{\psi^\#_n}$. Wald 95% confidence intervals for each $\theta^a_{\psi^\#_n}$ are calculated by estimating the variance of each $\theta^{TMLE,a}_{\psi^\#_n}$ by the sample variance of the efficient influence curve values $D_{eff,a}(Q^1_n, g_n)(W_i, A_i, \psi^\#_n(W_i, A_i, S_i))$ defined above, divided by $n$. The delta method is then applied to obtain the variance of $\log(\theta^{TMLE,1}_{\psi^\#_n}/\theta^{TMLE,0}_{\psi^\#_n})$ and the resulting symmetric Wald 95% confidence limits are transformed to obtain the CI for $\mathrm{VE}_{\psi^\#_n}$. The same approach to obtain Wald CIs is used for $E_0(Y_0)$, $E_0(Y_1)$, and the clinical vaccine efficacy. Next, we compute the surrogate outcome values for the $n^*$ CYD15 participants (with $\psi^\#_n(\cdot)$ calculated from CYD14), and, based on the CYD15 data $(W^*_i, A^*_i, \psi^\#_n(W^*_i, A^*_i, S^*_i))$, estimate the treatment-specific surrogate means in CYD15, $\theta^a_{\psi^\#_n}(P) = E_P(\psi^\#_n(W^*, a, S^*_a))$, where the TMLE is the solution to $0 = P_{n^*} D_{eff,a}(Q^1_n, g_n)$. Lastly, to check how well the estimated optimal surrogate performs in its use to estimate the clinical parameters in the new trial, we compare the TMLEs of the surrogate parameters to the TMLEs of $E_P(Y^*_0)$, $E_P(Y^*_1)$, and $\theta^*_P$.

6.1. Targeted Super-Learner Estimate of $\psi_0 = E_0(Y \mid W, A, S)$ in the CYD14 Trial
We applied super-learner with 7-fold cross-validation, separately for the vaccine and placebo groups. Table 1 displays the input variables, learner types, and pre-screening approaches applied to each learner type for estimating $\psi_0 = E_0(Y \mid W, A = a, S)$. Figure 1 shows point and 95% CI estimates of the cross-validated MSEs (van der Laan, Hubbard, and Pajouh, 2013) for each individual statistical algorithm as well as for discrete super-learner and super-learner. A logistic regression model (glm) after variable screening that disallows PRNT50 titers performs best (with the lowest CV-MSE) for each treatment group (Table 2).
For both treatment groups, the super-learner has a similar but slightly higher CV-MSE. Classification accuracy is better for the vaccine group than for the placebo group, with super-learner CV-MSEs of 0.11 (95% CI 0.09-0.13) and 0.26 (95% CI 0.22-0.30), respectively.
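The comparison of learners by cross-validated MSE can be sketched as follows. This is a minimal illustration of 7-fold CV-MSE and of the discrete super-learner selection step (pick the candidate with the lowest CV-MSE) only; it does not implement the weighted-combination super-learner or any variable screening. The two toy learners, the single covariate, and the data are hypothetical stand-ins for the library in Table 1, and in the application such a comparison would be run separately within each treatment group.

```python
from typing import Callable, Dict, Sequence

Predictor = Callable[[float], float]
Learner = Callable[[Sequence[float], Sequence[float]], Predictor]


def fit_mean(x: Sequence[float], y: Sequence[float]) -> Predictor:
    """Baseline learner: always predict the training-set mean outcome."""
    mean_y = sum(y) / len(y)
    return lambda _x: mean_y


def fit_linear(x: Sequence[float], y: Sequence[float]) -> Predictor:
    """Least-squares line in one covariate, clipped to [0, 1] for a binary outcome."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        return lambda _x: mean_y
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x_new: min(1.0, max(0.0, intercept + slope * x_new))


def cv_mse(learner: Learner, x: Sequence[float], y: Sequence[float],
           n_folds: int = 7) -> float:
    """Cross-validated mean squared error with deterministic fold assignment."""
    n = len(x)
    total_squared_error = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        valid = [i for i in range(n) if i % n_folds == fold]
        predict = learner([x[i] for i in train], [y[i] for i in train])
        total_squared_error += sum((y[i] - predict(x[i])) ** 2 for i in valid)
    return total_squared_error / n


# Hypothetical toy data: one scaled titer-like covariate; low values have the event.
x = [i / 69 for i in range(70)]
y = [1.0 if xi < 0.3 else 0.0 for xi in x]

library: Dict[str, Learner] = {"mean": fit_mean, "linear": fit_linear}
cv_results = {name: cv_mse(learner, x, y) for name, learner in library.items()}
discrete_sl = min(cv_results, key=lambda name: cv_results[name])
print(cv_results, discrete_sl)
```

Because each prediction is made on data held out from fitting, the CV-MSE gives a comparison across learners that does not reward overfitting, which is the basis for ranking the algorithms in Figure 1.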
Next, the TMLE $\psi^{\#}_n(W, A, S)$ was obtained from CYD14 data as described in Section 5. Figure  shows empirical reverse cdf plots of $\psi^{\#}_n(W^*_i, A^*_i = a, S^*_i)$ for each treatment $a \in \{0, 1\}$ by case-control status $y \in \{0, 1\}$ in CYD15, showing diminution of classification accuracy of the estimated optimal surrogate built on CYD14 for the new study CYD15 (as expected). Table 6.2 compares estimates of $\theta^{a}_{\psi^{\#}_n}(P)$ and of $\theta_{\psi^{\#}_n}(P) = \mathrm{VE}_{\psi^{\#}_n}(P)$ to the estimates of $E_P(Y^*_0)$, $E_P(Y^*_1)$, and $\theta^*_P = \mathrm{VE}^*_P = 1 - E_P(Y^*_1)/E_P(Y^*_0)$. The results show similar vaccine efficacy estimates, with $\mathrm{VE}^{\mathrm{TMLE}}_{\psi^{\#}_n}(P) = 66\%$ (95% CI 58-72) and $\mathrm{VE}^*_P = \theta^{\mathrm{TMLE}}_{n^*}(P) = 61\%$ (95% CI 51-69). However, the estimates of the treatment-specific surrogate means overestimate the VCD disease rates in CYD15, especially for the placebo group. The discrepancy stems from imperfect adherence to the Theorem 2 assumptions. The diagnostic analysis in Web Appendices E-F supports that the assumptions were approximately satisfied, with only minor violations, which was made possible by the fact that CYD14 and CYD15 were essentially the same protocol implemented in two geographic regions.
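The Wald intervals for vaccine efficacy reported above rest on a delta-method calculation on the log ratio scale. A minimal sketch of that calculation is given below, assuming per-participant influence curve values for the two treatment-specific means are already available. The function name and the toy data are hypothetical; the toy influence curve form (inverse probability weighted residuals under 1:1 randomization with no covariates) is a simplifying assumption for illustration and is not the paper's efficient influence curve.

```python
import math
from statistics import NormalDist, variance
from typing import Sequence, Tuple


def ve_wald_ci(
    theta1: float,
    theta0: float,
    ic1: Sequence[float],
    ic0: Sequence[float],
    level: float = 0.95,
) -> Tuple[float, float, float]:
    """Return (VE, lower, upper) with VE = 1 - theta1/theta0.

    ic1[i] and ic0[i] are influence curve values of the two mean estimators
    for participant i. The delta method gives the influence curve of
    log(theta1/theta0) as ic1/theta1 - ic0/theta0.
    """
    if len(ic1) != len(ic0):
        raise ValueError("influence curve vectors must have equal length")
    n = len(ic1)
    ic_log_ratio = [d1 / theta1 - d0 / theta0 for d1, d0 in zip(ic1, ic0)]
    se = math.sqrt(variance(ic_log_ratio) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    log_ratio = math.log(theta1 / theta0)
    # Symmetric limits on the log scale, then transformed to the VE scale;
    # the transformation 1 - exp(.) is decreasing, so the limits swap.
    lower = 1.0 - math.exp(log_ratio + z * se)
    upper = 1.0 - math.exp(log_ratio - z * se)
    return 1.0 - theta1 / theta0, lower, upper


# Hypothetical toy trial: 100 vaccine recipients with 5 events,
# 100 placebo recipients with 15 events, randomization probability 0.5.
arm = [1] * 100 + [0] * 100
outcome = [1] * 5 + [0] * 95 + [1] * 15 + [0] * 85
theta1, theta0 = 5 / 100, 15 / 100
ic1 = [2.0 * (y - theta1) if a == 1 else 0.0 for a, y in zip(arm, outcome)]
ic0 = [2.0 * (y - theta0) if a == 0 else 0.0 for a, y in zip(arm, outcome)]

ve, ve_lower, ve_upper = ve_wald_ci(theta1, theta0, ic1, ic0)
print(ve, ve_lower, ve_upper)
```

Working on the log scale keeps the transformed upper limit for VE below 1, which a symmetric interval computed directly on the VE scale would not guarantee.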

Discussion
VanderWeele (2013) and discussants Joffe (2013) and Pearl (2013) suggest that a minimal requirement for an intermediate endpoint to be a useful surrogate endpoint is that it avoids the surrogate paradox, which can have disastrous consequences. Yet, VanderWeele (2013) shows that commonly used methods for surrogate endpoint evaluation generally do not guarantee avoiding this paradox. The first useful feature of the newly proposed approach is that it starts at this minimal requirement, defining the optimal surrogate in a way guaranteed to satisfy the Prentice definition of a valid surrogate within the original trial and thus to avoid the paradox (and then the estimated optimal surrogate (EOS), which can be used as a surrogate endpoint in practice, satisfies the Prentice definition in large samples). As such, the proposed approach responds to Pearl's (2013) question: "If we take the negation of the 'surrogate paradox' as a criterion for 'good' surrogate, why cannot we create a new, formal definition of 'surrogacy' that (1) will automatically avoid the paradox?..." A second useful feature of the approach is that the treatment effect on the EOS has the same interpretation as the treatment effect on the clinical endpoint of interest. A third useful feature of the proposed approach is that the EOS, in being built by super-learner followed by a TMLE update, contains all information about the average clinical treatment effect in the original trial. A fourth useful feature is the approach's use of super-learner with its principled cross-validation approach to build and compare best models for estimating the optimal surrogate. Super-learner is useful for applications where multiple baseline covariates and/or intermediate response endpoints are measured, yet there is considerable uncertainty about how to best predict the study outcome from these collected data.
Moreover, while we have focused on randomized studies, this framework also applies for generating promising candidate surrogates based on observational studies, with all of the results holding under the additional (challenging) assumption that all confounders W of treatment assignment are measured and included in the super-learner. A challenge posed to the framework is that through super-learner the EOS may be based on a complicated combination of models that is hard to interpret. This underscores the importance of building multiple EOSs from different input variable sets ranging from single-variable to all-variable models, where cross-validation criteria allow principled selection of a most parsimonious EOS with near-optimal predictive performance. A related challenge is that researchers in future trials may not have access to the code used by the previous researchers to calculate the EOS. This may require use of an open research paradigm where web calculators are made available that input (W, A, S) values and output EOS values.
Figure. Empirical reverse cdfs of $\psi^{\#}_n(W^*_i, A^*_i = a, S^*_i)$ for CYD15 participants by vaccine/placebo assignment $A^* = a \in \{0, 1\}$ and dengue outcome case/control status $Y^* = y \in \{0, 1\}$, where $\psi^{\#}_n(\cdot)$ was estimated from the CYD14 trial data. The results show that the surrogate better classifies dengue outcomes of participants in the original trial than in the new trial (not surprisingly).
This article considers an ideal setting with no missing data and where the clinical outcome is never observed before the intermediate response endpoints are measured. Moreover, we used a particular loss function for defining optimal prediction. Future work is of interest to accommodate these issues. Theorems 2 and 3 provide conditions for using the EOS from an original trial to confer correct estimation of the clinical treatment effect in a new setting/trial based on this surrogate endpoint without measuring the clinical endpoint. The inference part of these results holds for an infinite original trial, such that additional research is needed to provide confidence intervals about the clinical treatment effect in a new setting accounting for the error in estimating the optimal surrogate; valid inference is straightforward if the EOS is modeled parametrically but not if modeled nonparametrically. Importantly, because in many practical applications the critical assumption of our Theorems 2 and 3 for making valid inferences for a new setting, Equal Conditional Means, is implausible or dubious, a utility of the theorems is in clarifying why direct clinical endpoint studies are generally needed. Additional research is of interest to allow deviations from the theorem assumptions. Moreover, additional research may consider applications where a set of randomized clinical efficacy trials are available that provide direct clinical endpoint data for estimating how the conditional means vary over settings, which could allow new transportability results under weaker assumptions. Dummy versions of the dengue application data sets and R code producing all of the (dummy) data results are provided in Web Appendix H.
Table 3. Comparison of inferences on the surrogate parameters $\theta^{a}_{\psi^{\#}_n}(P) \equiv E_P E_P(\psi^{\#}_n(W^*, a, S^*) \mid W^*, A^* = a)$ for each $a \in \{0, 1\}$ and $\mathrm{VE}_{\psi^{\#}_n}(P) = 1 - \theta^{1}_{\psi^{\#}_n}(P)/\theta^{0}_{\psi^{\#}_n}(P)$ based on $(W^*, A^*, \psi^{\#}_n(W^*, A^*, S^*))$ versus direct inferences on the clinical dengue endpoint parameters $E_P(Y^*_a)$ and $\theta^*_P = \mathrm{VE}^*_P = 1 - E_P(Y^*_1)/E_P(Y^*_0)$ in CYD15. Included is a summary of enrollment numbers, incidence of VCD, and number of participants with measured titers for each study. Column groups: Surrogate parameters; Clinical parameters (estimated by TMLEs).

Supplementary Materials
Web Appendices and Figures referenced in Sections 1 and 3-6 are available with this article at the Biometrics website on Wiley Online Library: (A) Inference on the clinical treatment effect in a future study based on the previously estimated optimal surrogate, accounting for estimation error and failure of the transportability assumptions; (B) A review of how the optimal surrogate framework compares to other surrogate evaluation frameworks; (C) A proof of Theorem 3; (D) Details of estimating the optimal surrogate via super-learner; (E)-(F) Expanded details of the Example (including assumption diagnostics); (G) Two simulation studies of the proposed methodology; and (H) Dummy example data sets and R code producing all of the results for the dengue application (for the dummy data sets).