Measurement error in continuous endpoints in randomised trials: problems and solutions

In randomised trials, continuous endpoints are often measured with some degree of error. This study explores the impact of ignoring measurement error and proposes methods to improve statistical inference in the presence of measurement error. Three main types of measurement error in continuous endpoints are considered: classical, systematic and differential. For each measurement error type, corrected effect estimators and corresponding confidence intervals using existing and new methods are proposed and tested in a simulation study. These methods combine information about error-prone and error-free measurements of the endpoint in individuals not included in the trial (external calibration sample). We show that if measurement error in continuous endpoints is ignored, the treatment effect estimator is unbiased when measurement error is classical, although the Type-II error at a given sample size is increased. Conversely, the estimator can be substantially biased when measurement error is systematic or differential. In those cases, bias can largely be prevented and inferences improved by using information from an external calibration sample, whose required size increases as the association between the error-prone and error-free endpoint weakens, and also depends on the method used to construct confidence intervals. Measurement error correction using even a small external calibration sample is shown to improve inferences and should be considered in trials with error-prone endpoints. Implementation of the proposed correction methods is accommodated by a new software package for R.


Illustrative examples
We introduce here two additional example trials from the literature and hypothesize that these trials could also have used endpoints measured with error, to illustrate how the use of an endpoint contaminated with error would affect trial inference. We assume that the original endpoints used in our example trials are free of measurement error.

Example trial 2: energy expenditure
Poehlman and colleagues [1] studied the effects of endurance and resistance training on total daily energy expenditure in a randomised trial of young sedentary women. Participants were randomised to one of three six-month exercise programmes: endurance training, resistance training or the control arm. Some controversy regarding the effect of exercise training on total energy expenditure (TEE) existed at the time the trial started, partly because of the difficulty of assessing daily energy expenditure [1]. Starting 72 hours after completion of the training programme, TEE of the participants was measured by doubly labelled water over a ten-day period, which is considered the gold standard for measuring energy expenditure in humans [2]. In short, the study found no evidence for an effect of resistance or endurance training (compared to control) on total energy expenditure. Post-trial, measured TEE was higher in the control arm than in the two intervention arms. Table 1 shows the decrease in TEE of the women exposed to the endurance training programme versus the control arm.

Example trial 3: rheumatoid arthritis disease activity
The U-Act-Early trial tested the efficacy of a new treatment strategy for rheumatoid arthritis (RA) in patients with newly diagnosed RA [3] in a three-arm trial: tocilizumab plus methotrexate versus tocilizumab only versus methotrexate only, all as initial treatment. For endpoint assessment, this trial used a validated RA disease activity measure (the Disease Activity Score 28, DAS28) [4], which is commonly used and recommended for measuring endpoints in RA clinical trials [5,6]. In short, the trial showed that immediate initiation of tocilizumab, with or without methotrexate, is more effective than methotrexate alone in achieving sustained remission in newly diagnosed RA patients. The difference in mean DAS28 score between the tocilizumab plus methotrexate and methotrexate only groups after 24 weeks is shown in Table 1. The sample size of these groups reported in Table 1 is based on measurements available at 24 weeks of follow-up. A common alternative approach to measuring energy expenditure (example trial 2) is an accelerometer, which measures body movement via motion sensors to assess energy expenditure (e.g. [2]). Compared to doubly labelled water (example trial 2), the accelerometer is cheaper, but less accurate [2]. Lastly, instead of endpoint assessment by DAS28 (example trial 3), where assessment is done by trained medical staff [4], trials could alternatively use the patient-based RA disease activity score (PDAS), where endpoint assessment is done by the patient [7].
For the example trial in the paper and each of the aforementioned example trials, Table 1 shows to what extent the Type-II error of a test for treatment effect changes when a hypothetical lower standard of endpoint measurement, introducing classical measurement error, would have been used. The table clearly shows the anticipated increase in Type-II error with increasing error at the same sample size.
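This power loss under classical measurement error can be made concrete with a small simulation sketch; the effect size, variances and sample size below are invented for illustration and do not correspond to the example trials or the paper's simulation study.

```python
import numpy as np

def empirical_power(tau, beta=0.5, sigma=1.0, n=100, reps=2000, seed=1):
    """Empirical power of a two-arm comparison when classical measurement
    error with SD `tau` is added to the endpoint (illustrative values)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        y0 = rng.normal(0.0, sigma, n)                # control arm, error-free
        y1 = rng.normal(beta, sigma, n)               # treated arm, error-free
        y0s = y0 + rng.normal(0.0, tau, n)            # add classical error
        y1s = y1 + rng.normal(0.0, tau, n)
        se = np.sqrt(y0s.var(ddof=1) / n + y1s.var(ddof=1) / n)
        if abs(y1s.mean() - y0s.mean()) / se > 1.96:  # approximate Wald test
            rejections += 1
    return rejections / reps

# Power (1 minus Type-II error) decreases as the error SD tau grows:
print([round(empirical_power(t), 2) for t in (0.0, 0.5, 1.0)])
```

The same sample size thus buys less power as the endpoint becomes noisier, mirroring the pattern reported in Table 1.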

Measurement error structures
Consider a two-arm randomised controlled trial that compares the effects of two treatments (X ∈ {0, 1}), where 0 may represent a placebo treatment or an active comparator. Let Y denote the true (or preferred) trial endpoint and Y* an error-prone operationalisation of Y. We will assume that both Y and Y* are measured on a continuous scale. Throughout, we assume that Y* is measured for all i = 1, . . . , N randomly allocated patients in the trial. We assume that the effect of allocated treatment (X ∈ {0, 1}) on the preferred endpoint Y is defined by the linear model

Y = α_Y + β_Y X + ε,  (1)

where β_Y defines the treatment effect on the endpoint, and ε has mean 0 and variance σ². Throughout, we assume that X is fixed. Further, we assume that model (1) cannot be estimated from the observed data because the endpoint Y* was measured instead of Y. We will assume that the relation between Y and Y* is given by the linear model

Y* = θ_0 + θ_1 Y + e,  (2)

where e is a random variable whose distribution is independent of ε, Y and X. The parameters θ_0 and θ_1 define the relation between Y and Y*, where it is assumed that θ_1 ≠ 0. We assume that both parameters θ_0 and θ_1 are estimable only in the external calibration sample comprising individuals not included in the trial (j = 1, . . . , K).
Simple OLS regression estimators for β_Y, α_Y and σ² (the variance of the errors ε) in (1) are, respectively,

β̂_Y* = Σ_i (X_i − X̄)(Y*_i − Ȳ*) / Σ_i (X_i − X̄)²,  (3)

α̂_Y* = Ȳ* − β̂_Y* X̄,  (4)

and

s² = Σ_i ω_i² / (N − 2), with residuals ω_i = Y*_i − α̂_Y* − β̂_Y* X_i.  (6)

In a two-arm trial, the interest is in making inferences about β_Y, which cannot be directly estimated because in the trial the endpoint of interest Y was replaced by Y*. In the following we will show: a) that β̂_Y* may be a poor estimator for β_Y (sections 3.1–3.4), and b) how adjustments to β̂_Y* using information from the calibration model described by (2) can improve inference about the treatment effect (section 4). As a starting point, the following section defines relevant and known properties for the special case that Y* = Y, followed by the properties under different measurement error structures for Y* in subsequent sections.

No measurement error
Consider the hypothetical case that Y* is a perfect proxy for Y, i.e. Y* = Y. By using that Y = α_Y + β_Y X + ε, as defined in (1), it follows that:

Y* = α_Y + β_Y X + ε.

From standard regression theory (e.g. [9]), we know that if the errors ε satisfy the regular Gauss-Markov assumptions [9] and their variance is defined by σ², the OLS estimators β̂_Y*, α̂_Y*, and s² (defined by (3), (4), and (6), respectively) are Best Linear Unbiased Estimators (BLUE) for β_Y, α_Y, and σ², respectively.
Moreover, if the ε are independently and identically distributed (iid) normal, the OLS estimators β̂_Y* and α̂_Y* (defined in (3) and (4), respectively) are the Maximum Likelihood Estimators (MLE) of β_Y and α_Y, respectively. Note that the errors ε satisfy the Gauss-Markov assumptions if we assume that they are iid normally distributed with mean 0 and constant variance σ². Hypotheses for the treatment effect β_Y can be defined by

H_0: β_Y = β_0 versus H_A: β_Y ≠ β_0.

Since, under normality of the error terms ε, the OLS estimator β̂_Y* defined in (3) is the MLE for β_Y and s² is an unbiased estimator for σ², the following is known for the Wald test statistic

T = (β̂_Y* − β_0) / √V̂ar(β̂_Y*),  (7)

where

V̂ar(β̂_Y*) = s² / Σ_i (X_i − X̄)².  (8)

Assuming no measurement error in Y and X, under H_0, T follows a Student's t distribution with N − 2 degrees of freedom [9]. Under H_A, T follows a non-central Student's t distribution with N − 2 degrees of freedom and non-centrality parameter (β_Y − β_0)/√Var(β̂_Y*).

Classical measurement error
There is classical measurement error in Y* if Y* is an unbiased proxy for Y [10]:

Y* = Y + e,  (9)

where E[e] = 0 and Var(e) = τ², and e is mutually independent of Y, X and ε (in (1)). By substituting (1) in (9), it follows that:

Y* = α_Y + β_Y X + ε + e.

Given the aforementioned assumptions, the sum of e and ε, δ_1 = e + ε, has variance Var(δ_1) = σ² + τ². It follows that if the errors δ_1 satisfy the Gauss-Markov assumptions, β̂_Y* in (3) remains a BLUE estimator for β_Y. Also, α̂_Y* in (4) and s² in (6) remain BLUE estimators for α_Y and the variance of δ_1, respectively.
Further, if δ_1 is iid normally distributed with mean 0 and variance σ² + τ², then α̂_Y* is the MLE for α_Y and β̂_Y* is the MLE for β_Y. Obviously, given that σ² > 0 and τ² > 0, the variance of the OLS regression estimator β̂_Y* is larger when there is classical measurement error in the outcome than when there is no measurement error. Under the null hypothesis, the Wald test statistic T defined in (7) still follows a Student's t distribution with N − 2 degrees of freedom. However, under the alternative hypothesis, the non-centrality parameter of T, (β_Y − β_0)/√Var(β̂_Y*), will be smaller in the presence of classical measurement error.
To summarize, in the presence of only classical measurement error, the Type-II error for detecting any given treatment effect increases, the Type-I error is unaffected, and the treatment effect estimator is an unbiased MLE under standard regularity conditions.
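A brief simulation sketch (with invented parameter values, not the paper's simulation study) makes both points concrete: the naive OLS slope remains centred on β_Y under classical error, while its sampling variance grows.

```python
import numpy as np

def simulate_slopes(tau, beta=1.0, sigma=1.0, n=200, reps=4000, seed=7):
    """OLS treatment-effect estimates from a two-arm trial whose endpoint
    carries classical measurement error with SD `tau` (illustrative values)."""
    rng = np.random.default_rng(seed)
    x = np.repeat([0.0, 1.0], n // 2)
    slopes = np.empty(reps)
    for r in range(reps):
        y = beta * x + rng.normal(0.0, sigma, n)   # model (1) with alpha_Y = 0
        y_star = y + rng.normal(0.0, tau, n)       # classical error, Y* = Y + e
        slopes[r] = np.polyfit(x, y_star, 1)[0]    # naive OLS slope
    return slopes

clean = simulate_slopes(tau=0.0)
noisy = simulate_slopes(tau=1.0)
# Both means sit near the true beta = 1; the error-prone variance is larger.
print(clean.mean(), noisy.mean(), clean.var(), noisy.var())
```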

Heteroscedastic classical measurement error
In the preceding we assumed that the Gauss-Markov assumptions were met. Notably, however, if the variance of the errors e in (9) varies per treatment arm, the errors are no longer homoscedastic (as required by the Gauss-Markov assumptions) but heteroscedastic. In the case of this type of heteroscedastic classical measurement error, it can be shown that the variance of β̂_Y* will be underestimated by the default estimator of the variance of β̂_Y* defined by (8), affecting both Type-I and Type-II error.

Systematic measurement error
There is systematic measurement error in Y* if Y* systematically depends on Y. Assuming this dependence is linear, the relation between Y* and Y can be defined as:

Y* = θ_0 + θ_1 Y + e,  (10)

where E[e] = 0 and Var(e) = τ². Throughout, we speak of systematic measurement error if θ_0 ≠ 0 or θ_1 ≠ 1 (and, of course, θ_1 ≠ 0 in all cases). We assume mutual independence between e and Y, X, ε (in (1)). Naturally, if θ_0 = 0 and θ_1 = 1 the measurement error is of the classical form. By substituting (1) in (10), it follows that:

Y* = θ_0 + θ_1 α_Y + θ_1 β_Y X + θ_1 ε + e.

Given the aforementioned assumptions, δ_2 = θ_1 ε + e has variance θ_1² σ² + τ². It follows that under the Gauss-Markov assumptions, β̂_Y* defined in (3) is a BLUE estimator for θ_1 β_Y, and α̂_Y* defined in (4) is a BLUE estimator for θ_0 + θ_1 α_Y. Note that in this case s² is BLUE for θ_1² σ² + τ², which is, depending on θ_1, smaller or larger than σ² (the variance of the error terms if there is no measurement error).
If we further assume that δ_2 is iid normally distributed, we can conclude that α̂_Y* is the MLE for θ_0 + θ_1 α_Y and β̂_Y* is the MLE for θ_1 β_Y. In the presence of any given treatment effect, T follows a non-central Student's t distribution with N − 2 degrees of freedom and non-centrality parameter (θ_1 β_Y − β_0)/√Var(β̂_Y*). Depending on the value of θ_1, this non-centrality parameter will be smaller or larger than the non-centrality parameter in the absence of measurement error (see section 3.2).
In summary, if there is systematic measurement error in the endpoints, the Type-I error is unaffected under standard regularity conditions, and hence testing the null hypothesis of no effect is still valid [11]. The Type-II error, however, is affected (it may increase or decrease) and the treatment effect estimator is a biased MLE of β_Y.
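The bias of the naive estimator under systematic error can be sketched numerically; with θ_1 = 0.5 the naive OLS slope targets θ_1 β_Y rather than β_Y (all parameter values below are invented for illustration).

```python
import numpy as np

# Systematic error Y* = theta0 + theta1 * Y + e biases the naive OLS slope
# towards theta1 * beta_Y (illustrative values, not from the paper).
rng = np.random.default_rng(11)
n, alpha_y, beta_y = 100_000, 3.0, 1.0
theta0, theta1, sigma, tau = 2.0, 0.5, 1.0, 0.5
x = rng.integers(0, 2, n).astype(float)
y = alpha_y + beta_y * x + rng.normal(0.0, sigma, n)    # model (1)
y_star = theta0 + theta1 * y + rng.normal(0.0, tau, n)  # model (10)
naive_slope = np.polyfit(x, y_star, 1)[0]
print(naive_slope)   # close to theta1 * beta_y = 0.5, not beta_y = 1.0
```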

Differential measurement error

There is differential measurement error in Y* if the measurement error structure differs between the treatment arms, i.e. the parameters of the error model (10) depend on X:

Y* = θ_0X + θ_1X Y + e,  (11)

where E[e] = 0 and the error variance τ²_X may also depend on the treatment arm, so that the compound errors are possibly heteroscedastic. By using the residuals ω_i defined in (6), a heteroscedasticity-consistent estimator for the variance of β̂_Y* is:

V̂ar_HC(β̂_Y*) = Σ_i (X_i − X̄)² ω_i² / ( Σ_i (X_i − X̄)² )²,

which is known as the White estimator [12]. From standard regression theory, it is known that, using the above defined estimator, T defined in (7) is still a valid test statistic. Yet, under differential measurement error, θ_01 − θ_00 + (θ_11 − θ_10)α_Y + θ_11 β_Y no longer equals 0 when β_Y = 0. Thus, under the null hypothesis, T defined in (7) follows a non-central Student's t distribution with N − 2 degrees of freedom and non-centrality parameter (θ_01 − θ_00 + (θ_11 − θ_10)α_Y + θ_11 β_0 − β_0)/√Var(β̂_Y*). Consequently, the Type-I error changes if there is differential measurement error in Y*, and tests about contrasts under the null hypothesis are invalid [11]. Moreover, under the alternative hypothesis, T follows a non-central Student's t distribution with N − 2 degrees of freedom and non-centrality parameter (θ_01 − θ_00 + (θ_11 − θ_10)α_Y + θ_11 β_Y − β_0)/√Var(β̂_Y*). Depending on the values of the θ's and α_Y, these non-centrality parameters will be smaller or larger than 0 and than the non-centrality parameter without measurement error, respectively (see section 3.2). Hence, Type-I error and Type-II error could increase or decrease if there is differential measurement error in Y*.
To summarize, the Type-I error is not expected to be at its nominal level (α) if there is differential measurement error in Y* (see also [11]). Also, similar to systematic error in Y*, the Type-II error is affected (it may increase or decrease) and the treatment effect estimator is biased.
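The Type-I error inflation can be sketched with a small simulation in which only the error intercept differs between arms; all values (error SDs, arm-specific shift, sample size) are invented for illustration.

```python
import numpy as np

def type_i_error(differential, n=100, reps=3000, seed=3):
    """Empirical rejection rate under a true null (beta_Y = 0) when the
    error intercept differs between arms (illustrative values)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        y0 = rng.normal(0.0, 1.0, n)                 # control, no effect
        y1 = rng.normal(0.0, 1.0, n)                 # treated, no effect
        shift = 0.5 if differential else 0.0         # theta_01 - theta_00
        y0s = y0 + rng.normal(0.0, 0.5, n)
        y1s = shift + y1 + rng.normal(0.0, 0.5, n)   # arm-specific error
        se = np.sqrt(y0s.var(ddof=1) / n + y1s.var(ddof=1) / n)
        rejections += abs(y1s.mean() - y0s.mean()) / se > 1.96
    return rejections / reps

# Near-nominal rejection under classical error; badly inflated when differential:
print(type_i_error(False), type_i_error(True))
```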

Correction methods for measurement error in continuous endpoints
To accommodate measurement error correction, we assume that Y and Y* are both measured in a smaller set of different individuals not included in the trial (j = 1, . . . , K, K < N), hereinafter referred to as the external calibration sample. In all but one case, it is assumed that only Y* and Y are measured in the external calibration sample. In the case that the error in Y* differs between the two treatment groups, it is assumed that the external calibration sample takes the form of a small pilot study in which both treatments are allocated (i.e., Y* and Y are both measured after assignment of X).

Systematic measurement error
Using an external calibration set and assuming that the errors e in (10) are iid normal, the MLE of the measurement error parameters in (10) are:

θ̂_1 = Σ_j (Y_j^(c) − Ȳ^(c))(Y*_j^(c) − Ȳ*^(c)) / Σ_j (Y_j^(c) − Ȳ^(c))²,  θ̂_0 = Ȳ*^(c) − θ̂_1 Ȳ^(c).  (12)

The superscript (c) is used to indicate that the measurement is obtained in the calibration set. From section 3.4, under systematic measurement error and assuming that ε in (1) and e in (10) are iid normal and independent, the estimator β̂_Y* defined in (3) is the MLE of θ_1 β_Y and the estimator α̂_Y* defined in (4) is the MLE of θ_0 + θ_1 α_Y. Natural sample estimators for α_Y and β_Y are then

α̂_Y = (α̂_Y* − θ̂_0)/θ̂_1,  β̂_Y = β̂_Y*/θ̂_1,  (13)

where θ̂_0 and θ̂_1 are the estimated error parameters from the calibration data set. From equation (13), it becomes apparent that θ̂_1 needs to be assumed bounded away from zero for finite estimates of α̂_Y and β̂_Y [13]. The first moment of the estimators α̂_Y and β̂_Y can be approximated by using multivariate Taylor expansions and assuming that (α̂_Y*, β̂_Y*, θ̂_0, θ̂_1) are normally distributed [13]; for instance,

E[β̂_Y] ≈ β_Y (1 + τ²/(θ_1² S_yy^(c))),

where S_yy^(c) = Σ_j (Y_j^(c) − Ȳ^(c))², the total sum of squares of Y^(c). In conclusion, the estimators α̂_Y and β̂_Y are consistent. Formal derivations for the presented formulas are provided in the Appendix.
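The correction in (12)–(13) can be sketched as follows; the data-generating values (θ_0 = 2, θ_1 = 0.5, the variances and the sample sizes) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(21)
alpha_y, beta_y, sigma = 1.0, 1.0, 1.0
theta0, theta1, tau = 2.0, 0.5, 0.3

# Trial data: only the error-prone endpoint Y* is observed (N = 2000).
n = 2000
x = np.repeat([0.0, 1.0], n // 2)
y = alpha_y + beta_y * x + rng.normal(0.0, sigma, n)
y_star = theta0 + theta1 * y + rng.normal(0.0, tau, n)
beta_star, alpha_star = np.polyfit(x, y_star, 1)    # naive: targets theta1*beta_y

# External calibration sample: Y and Y* both observed (K = 200).
k = 200
yc = rng.normal(alpha_y, sigma, k)
yc_star = theta0 + theta1 * yc + rng.normal(0.0, tau, k)
theta1_hat, theta0_hat = np.polyfit(yc, yc_star, 1)  # equation (12)

alpha_hat = (alpha_star - theta0_hat) / theta1_hat   # equation (13)
beta_hat = beta_star / theta1_hat                    # equation (13)
print(beta_hat)   # recovers roughly beta_y = 1, unlike beta_star near 0.5
```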
In the following we will focus on specifying confidence limits for the treatment effect estimatorβ Y defined in (13). We make use of the fact that this estimator is a ratio, which motivates the use of the Delta method, Fieller method and Zero-variance method [14]. We also present a non-parametric bootstrap method for specifying confidence limits [15].

Delta method
Assuming that β̂_Y* and θ̂_1 are both normally distributed and applying the Delta method, the second moment of β̂_Y can be approximated [11]. Formal derivations of the presented formulas are provided in Appendix A. The Delta method variance of β̂_Y is given by:

Var(β̂_Y) ≈ (θ_1² σ² + τ²)/(θ_1² S_xx) + β_Y² τ²/(θ_1² S_yy^(c)),

where S_xx = Σ_i (X_i − X̄)², the total sum of squares of X, and S_yy^(c) = Σ_j (Y_j^(c) − Ȳ^(c))², the total sum of squares of Y^(c). An approximation of the above defined variance, denoted by V̂ar(β̂_Y), is obtained by approximating θ_1, θ_1²σ² + τ², τ² and β_Y respectively by θ̂_1, s², t² and β̂_Y [11], where t² denotes the estimated residual variance of the calibration model.
An approximate (1 − α) confidence interval for the estimator β̂_Y is then given by

β̂_Y ± t_{N−2, 1−α/2} √V̂ar(β̂_Y).  (14)
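A direct transcription of this interval is sketched below; the numerical inputs are invented, and `1.96` stands in for the appropriate t-quantile.

```python
import numpy as np

def delta_ci(beta_star, s2, sxx, theta1_hat, t2, syy_c, quantile=1.96):
    """Delta-method confidence interval for beta_hat_Y = beta_star / theta1_hat
    (a sketch of equation (14); `quantile` would be a t-quantile in practice)."""
    beta_y = beta_star / theta1_hat
    var_beta = (s2 / (theta1_hat**2 * sxx)
                + beta_y**2 * t2 / (theta1_hat**2 * syy_c))
    half = quantile * np.sqrt(var_beta)
    return beta_y - half, beta_y + half

lo, hi = delta_ci(beta_star=0.5, s2=0.34, sxx=500.0,
                  theta1_hat=0.5, t2=0.09, syy_c=100.0)
print(round(lo, 2), round(hi, 2))   # 0.84 1.16
```

Note how the second variance term, driven by the calibration sample size through S_yy^(c), widens the interval relative to the naive analysis.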

Fieller method
A second method to construct confidence intervals for the estimator β̂_Y in (13), described by Buonaccorsi, is the Fieller method [11,16]. In the case that θ̂_1 is significantly different from zero at significance level α (that is, θ̂_1² / V̂ar(θ̂_1) > t²_{K−2, 1−α/2}), the (1 − α) confidence limits of β̂_Y are defined by the Fieller method as the roots of a quadratic in the ratio:

( β̂_Y* θ̂_1 ± t_{1−α/2} √( V̂ar(β̂_Y*) θ̂_1² + V̂ar(θ̂_1) β̂_Y*² − t²_{1−α/2} V̂ar(β̂_Y*) V̂ar(θ̂_1) ) ) / ( θ̂_1² − t²_{1−α/2} V̂ar(θ̂_1) ).

A formal derivation can be found in Appendix A.
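The Fieller limits can be sketched as below, under the assumption that β̂_Y* and θ̂_1 are independent (they come from disjoint samples); the inputs are invented and `t=1.96` stands in for the relevant t-quantile.

```python
import numpy as np

def fieller_ci(beta_star, var_beta_star, theta1_hat, var_theta1, t=1.96):
    """Fieller confidence limits for the ratio beta_star / theta1_hat,
    assuming the two estimators are independent (illustrative sketch)."""
    denom = theta1_hat**2 - t**2 * var_theta1
    if denom <= 0:
        raise ValueError("theta1_hat is not significantly different from zero")
    disc = (var_beta_star * theta1_hat**2 + var_theta1 * beta_star**2
            - t**2 * var_beta_star * var_theta1)
    half = t * np.sqrt(disc)
    centre = beta_star * theta1_hat
    return (centre - half) / denom, (centre + half) / denom

lo, hi = fieller_ci(beta_star=0.5, var_beta_star=0.00068,
                    theta1_hat=0.5, var_theta1=0.0009)
print(lo, hi)
```

Unlike the Delta interval, the Fieller interval is generally asymmetric around the point estimate and is only bounded when the denominator condition holds.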
Zero-variance method
If the true value of θ_1 were known, the variance of the estimator β̂_Y would equal:

Var(β̂_Y) = Var(β̂_Y*)/θ_1² = (θ_1² σ² + τ²)/(θ_1² S_xx).

Using the standard OLS regression framework, the variance of β̂_Y can be estimated by:

V̂ar(β̂_Y) = s²/(θ̂_1² S_xx).  (16)

By replacing θ̂_1 by θ_1 in the above, the quantity in (16) is in expectation equal to Var(β̂_Y) (defined above). The quantity in (16) is used in the zero-variance method to construct confidence intervals for β̂_Y, by substituting it for V̂ar(β̂_Y) in equation (14). In conclusion, this zero-variance approach provides confidence intervals for the treatment effect estimator while assuming there is no variance in θ̂_1 (giving it its name, zero-variance method). Although the zero-variance approach wins in terms of simplicity, it may underestimate the variability of the ratio since the variance of θ̂_1 is assumed zero.

Bootstrap
An alternative for defining confidence intervals for the corrected treatment effect estimator β̂_Y is a non-parametric bootstrap [15]. We propose the following stepwise procedure: 1. Draw a random sample with replacement of size K from the calibration sample (Y*^(c), Y^(c)) to estimate θ̂_1^(b) as defined in (12).
2. Draw a random sample with replacement of size N from the trial data (Y*, X) to calculate the corrected treatment effect estimate β̂_Y^(b) = β̂_Y*^(b)/θ̂_1^(b), analogous to (13). 3. Repeat steps 1 and 2 B times. 4. Take the 100(α/2)th and 100(1 − α/2)th percentiles of the B estimates β̂_Y^(b) as the lower and upper confidence limits, respectively.
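The stepwise procedure above can be sketched as follows, on synthetic data with invented parameters (true β_Y = 1, θ_0 = 2, θ_1 = 0.5).

```python
import numpy as np

def bootstrap_ci(y_star, x, yc, yc_star, b=1000, seed=5):
    """Percentile bootstrap CI for the corrected effect beta_star / theta1
    (a sketch of steps 1-4; b resamples)."""
    rng = np.random.default_rng(seed)
    n, k = len(y_star), len(yc)
    est = np.empty(b)
    for i in range(b):
        cal = rng.integers(0, k, k)                 # step 1: resample calibration
        theta1_b = np.polyfit(yc[cal], yc_star[cal], 1)[0]
        tr = rng.integers(0, n, n)                  # step 2: resample trial pairs
        beta_star_b = np.polyfit(x[tr], y_star[tr], 1)[0]
        est[i] = beta_star_b / theta1_b             # corrected estimate, as in (13)
    return np.percentile(est, [2.5, 97.5])          # steps 3-4: percentile limits

rng = np.random.default_rng(0)
x = np.repeat([0.0, 1.0], 500)
y = 1.0 + 1.0 * x + rng.normal(0.0, 1.0, 1000)
y_star = 2.0 + 0.5 * y + rng.normal(0.0, 0.3, 1000)
yc = rng.normal(1.0, 1.0, 100)
yc_star = 2.0 + 0.5 * yc + rng.normal(0.0, 0.3, 100)
lo, hi = bootstrap_ci(y_star, x, yc, yc_star)
print(lo, hi)
```

Resampling the trial and calibration samples separately reflects that the two sources of sampling variation are independent.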

Differential measurement error
For correction of endpoints that suffer from differential measurement error, we here assume the existence of a pilot trial in which both treatments are allocated at random, which serves as an external calibration set to estimate the measurement error model in (11). For notational convenience we rewrite the linear model in equation (11) in matrix form as:

Y* = Zθ + e,  (17)

where Z = (1, X, Y, XY), E(e) = 0 and E(ee′) = Σ, a positive definite matrix with τ²_X on its diagonal. Further, θ = (θ_1, θ_2, θ_3, θ_4)′ = (θ_00, θ_01 − θ_00, θ_10, θ_11 − θ_10)′. In the external calibration set, the measurement error parameters θ can be estimated by

θ̂ = (Z′Z)⁻¹ Z′Y*,  (18)

with variance

Var(θ̂) = (Z′Z)⁻¹ Z′ΣZ (Z′Z)⁻¹.

See [12] for a discussion of different estimators for the above defined variance. From section 2.5 it follows that natural estimators for α_Y and β_Y are

α̂_Y = (α̂_Y* − θ̂_00)/θ̂_10,  β̂_Y = (α̂_Y* + β̂_Y* − θ̂_01)/θ̂_11 − (α̂_Y* − θ̂_00)/θ̂_10,  (19)

where θ̂_00, θ̂_10, θ̂_01 and θ̂_11 are estimated from the external calibration set. Here it is assumed that both θ̂_10 and θ̂_11 are bounded away from zero (for reasons similar to those mentioned in section 3.1).
From this, it is apparent that the estimators α̂_Y and β̂_Y defined in (19) are consistent (details are found in the Appendix). In the subsequent sections we review the Delta method and the zero-variance method, and propose a bootstrap, for specifying confidence limits for the estimator of the treatment effect under differential measurement error of the endpoints.
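The estimators in (19) can be sketched as below; for simplicity the calibration model is fitted per arm with simple OLS (numerically equivalent to fitting the interaction model jointly), and all data-generating values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha_y, beta_y = 3.0, 1.0
theta = {0.0: (0.0, 1.0), 1.0: (0.5, 0.8)}          # arm -> (theta_0x, theta_1x)

def observe(y, x, rng, tau=0.3):
    """Apply an arm-specific (differential) error model as in (11)."""
    t0 = np.array([theta[xi][0] for xi in x])
    t1 = np.array([theta[xi][1] for xi in x])
    return t0 + t1 * y + rng.normal(0.0, tau, len(y))

# Trial: only Y* observed (N = 4000).
x = np.repeat([0.0, 1.0], 2000)
y = alpha_y + beta_y * x + rng.normal(0.0, 1.0, 4000)
y_star = observe(y, x, rng)
a_star = y_star[x == 0].mean()                      # alpha_hat_{Y*}
ab_star = y_star[x == 1].mean()                     # alpha_hat_{Y*} + beta_hat_{Y*}

# Calibration pilot with both arms, Y and Y* observed (K = 1000).
xc = np.repeat([0.0, 1.0], 500)
yc = alpha_y + beta_y * xc + rng.normal(0.0, 1.0, 1000)
yc_star = observe(yc, xc, rng)
t10, t00 = np.polyfit(yc[xc == 0], yc_star[xc == 0], 1)   # arm-0 slope, intercept
t11, t01 = np.polyfit(yc[xc == 1], yc_star[xc == 1], 1)   # arm-1 slope, intercept

alpha_hat = (a_star - t00) / t10                    # first estimator in (19)
beta_hat = (ab_star - t01) / t11 - alpha_hat        # second estimator in (19)
print(alpha_hat, beta_hat)
```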

Zero-variance method
The zero-variance method adjusts the observed endpoints by Ŷ_i = (Y*_i − θ̂_0X_i)/θ̂_1X_i for X_i ∈ {0, 1}, with θ̂_0x and θ̂_1x derived from (18). In the zero-variance method, these adjusted values are regressed on the treatment variable X, yielding estimators α̂_Ŷ and β̂_Ŷ, which are, respectively, equal to the estimators α̂_Y and β̂_Y defined in (19). The variance of these estimators can be approximated with a heteroscedasticity-consistent covariance estimator (see [12] for an overview). Confidence intervals for β̂_Ŷ are subsequently constructed by using formula (20). Similar to the zero-variance method for systematic measurement error described in section 4.1.3, this way of constructing confidence intervals neglects the variance of the θ̂'s from the calibration data set, and will thus often yield confidence intervals that are too narrow.

Bootstrap
We here alternatively propose a non-parametric bootstrap procedure to specify confidence limits. This entails the following steps: 1. Draw a random sample with replacement of size K from the calibration sample and estimate θ̂^(b) as defined in (18). 2. Draw a random sample with replacement of size N from the trial data (Y*, X) and calculate the corrected treatment effect estimate β̂_Y^(b) using (19). 3. Repeat steps 1 and 2 B times. 4. Take the 100(α/2)th and 100(1 − α/2)th percentiles of the B estimates as the lower and upper confidence limits, respectively.

Measurement error depending on prognostic factors
Suppose that there is a prognostic factor S, and assume that Y* ⫫ X | (Y, S) (non-differential measurement error) and S ⫫ X (randomisation is well-performed). Suppose, for instance, that the measurement error depends linearly on S, i.e. Y* = Y + γS + e, with E[e] = 0 and e independent of Y, X, S and ε.
Suppose we want to estimate the effect of X on Y (i.e., β_Y), but instead of Y we have only measured the error-contaminated Y*. If one is aware that there is a prognostic factor that confounds the relation between Y* and Y (and this factor is measured), one could decide to regress Y* on X and S. The regression of Y* on X and S equals:

E[Y* | X, S] = α_Y + β_Y X + γS.

Thus, using the error-contaminated endpoint Y* instead of the preferred endpoint Y will provide an unbiased estimation of β_Y.
However, if one is not aware of the prognostic factor, one might naively regress Y* on X, which equals:

E[Y* | X] = α_Y + γE[S] + β_Y X,

since S ⫫ X. In conclusion, by ignoring the prognostic factor and using the error-contaminated endpoint Y* instead of the preferred endpoint Y, the regression of Y* on X still results in an unbiased estimation of β_Y.

A1.1 Systematic measurement error
Obvious estimators for α_Y and β_Y are:

α̂_Y = (α̂_Y* − θ̂_0)/θ̂_1,  β̂_Y = β̂_Y*/θ̂_1.

These estimators can be approximated with a second-order Taylor expansion. For α̂_Y this gives:

E[α̂_Y] ≈ α_Y (1 + Var(θ̂_1)/θ_1²) + Cov(θ̂_0, θ̂_1)/θ_1².

Congruently, an approximation of the expected value of the estimator β̂_Y is given by:

E[β̂_Y] ≈ β_Y (1 + Var(θ̂_1)/θ_1²).

Using only the first-order Taylor expansion of the estimators, approximations of the variance of α̂_Y and β̂_Y are, respectively:

Var(α̂_Y) ≈ ( Var(α̂_Y*) + Var(θ̂_0) + α_Y² Var(θ̂_1) + 2 α_Y Cov(θ̂_0, θ̂_1) ) / θ_1²,

Var(β̂_Y) ≈ Var(β̂_Y*)/θ_1² + β_Y² Var(θ̂_1)/θ_1².