• Logistic regression
• Measurement error
• Sampling
• Surrogate variable
• Two-stage design

Summary. We compared several validation study designs for estimating the odds ratio relating disease to a misclassified exposure. We assumed that the outcome and the misclassified binary covariate are available for all study subjects and that the error-free binary covariate is measured only in a subsample, the validation sample. We considered designs in which the total size of the validation sample is fixed and the probability of selection into the validation sample may depend on the outcome and misclassified covariate values. Designs were compared under rare and common disease scenarios, with the optimal design defined as the one that minimizes the variance of the maximum likelihood estimator of the true log odds ratio relating the outcome to the exposure of interest. Misclassification rates were assumed to be independent of the outcome (nondifferential misclassification). We used a sensitivity analysis to assess the effect of misspecifying the misclassification rates. Under the scenarios considered, our results suggested that a balanced design, which allocates equal numbers of validation subjects to each of the four outcome/mismeasured covariate categories, is preferable for its simplicity and good performance. A user-friendly Fortran program, available from the second author, calculates the optimal sampling fractions for all designs considered and the efficiencies of these designs relative to the optimal hybrid design for any scenario of interest.
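As an illustration of the balanced design described above, the following Python sketch splits a fixed validation sample equally across the four outcome/mismeasured covariate cells and reports the implied sampling fraction for each cell. It is not the Fortran program mentioned in the Summary. The function name `balanced_allocation`, the cell counts in the usage lines, and the rule for redistributing subjects when a cell is too small to supply its equal share are all assumptions made for this example.

```python
def balanced_allocation(cell_counts, n_validation):
    """Allocate a fixed validation sample equally across cells.

    cell_counts: dict mapping (outcome, surrogate) -> number of subjects
                 in the full study falling in that cell.
    n_validation: total validation sample size (fixed in advance).

    Returns (alloc, fractions): subjects validated per cell and the
    corresponding sampling fraction per cell.
    """
    cells = sorted(cell_counts)
    target = n_validation // len(cells)

    # Equal share per cell, capped by the number of subjects available.
    alloc = {c: min(target, cell_counts[c]) for c in cells}

    # Assumption of this sketch: any shortfall (or rounding remainder) is
    # spread over the cells that still have unvalidated subjects.
    remaining = n_validation - sum(alloc.values())
    while remaining > 0:
        open_cells = [c for c in cells if alloc[c] < cell_counts[c]]
        if not open_cells:
            break  # every subject is already validated
        share = max(1, remaining // len(open_cells))
        for c in open_cells:
            add = min(share, cell_counts[c] - alloc[c], remaining)
            alloc[c] += add
            remaining -= add
            if remaining == 0:
                break

    fractions = {
        c: (alloc[c] / cell_counts[c]) if cell_counts[c] else 0.0
        for c in cells
    }
    return alloc, fractions


# Hypothetical rare-disease study: keys are (outcome, misclassified exposure).
counts = {(0, 0): 700, (0, 1): 200, (1, 0): 60, (1, 1): 40}
alloc, fractions = balanced_allocation(counts, n_validation=200)
for cell in sorted(counts):
    print(cell, alloc[cell], round(fractions[cell], 3))
```

Because cases are scarce in a rare-disease study, equal cell sizes imply much larger sampling fractions among cases than among controls, which is how a selection probability can depend on outcome and misclassified covariate values.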