Optimal multiwave validation of secondary use data with outcome and exposure misclassification

Observational databases provide unprecedented opportunities for secondary use in biomedical research. However, these data can be error‐prone and must be validated before use. It is usually unrealistic to validate the whole database because of resource constraints. A cost‐effective alternative is a two‐phase design that validates a subset of records enriched for information about a particular research question. We consider odds ratio estimation under differential outcome and exposure misclassification and propose optimal designs that minimize the variance of the maximum likelihood estimator. Our adaptive grid search algorithm can locate the optimal design in a computationally feasible manner. Because the optimal design relies on unknown parameters, we introduce a multiwave strategy to approximate the optimal design. We demonstrate the proposed design's efficiency gains through simulations and two large observational studies.


INTRODUCTION
The ever-growing trove of patient information in observational databases, like electronic health records (EHR), provides unprecedented opportunities for biomedical researchers to investigate associations of scientific and clinical interest.
However, these data are usually error-prone since they are "secondary use data," i.e., they were not primarily created for research purposes (Safran et al., 2007). Ignoring the errors can yield biased results (Giganti et al., 2019), and the interpretation, dissemination, or implementation of such results can be detrimental to the very patients whom the analysis sought to help.
To assess the quality of secondary use data, validation studies have been carried out, wherein trained auditors compare clinical source documents (e.g., paper medical records) to database values and note any discrepancies between them (Duda et al., 2012). The Vanderbilt Comprehensive Care Clinic (VCCC) is an outpatient facility in Nashville, Tennessee that provides care for people living with HIV/AIDS (PLWH). Since investigators at the VCCC extract EHR data for research purposes, the VCCC validates all key study variables for all patients in the EHR. The VCCC data have demonstrated the importance of data validation, as estimates using the fully validated data often differ substantially from those using the original, unvalidated data extracted from the EHR (Giganti et al., 2020).
However, validating entire databases can be cost-prohibitive and unattainable; in the VCCC, full-database validation of around 4000 patients costs over US$60,000 annually. A cost-effective alternative is a two-phase design (White, 1982), or partial data audit, wherein one collects the original, error-prone data in Phase I and then uses this information to select a subset of records for validation in Phase II. This design greatly reduces the cost associated with data validation and has been implemented in cohorts using routinely collected data, like the Caribbean, Central, and South America network for HIV Epidemiology (CCASAnet) (McGowan et al., 2007).
CCASAnet is a large (∼50 000 patients), multi-national HIV clinical research collaboration. Clinical sites in CCASAnet routinely collect important variables, and these site-level data are subsequently compiled into a collaborative CCASAnet database that is used for research. One interesting question for CCASAnet investigators is whether patients treated for tuberculosis (TB) are more likely to have better treatment outcomes if their TB diagnosis was bacteriologically confirmed. TB is difficult to diagnose and treat among PLWH, and some studies suggest that those treated for TB without a definitive diagnosis are more likely to subsequently die (Crabtree-Ramirez et al., 2019). Key study variables can be obtained from the CCASAnet database, but the outcome and exposure, successful treatment completion and bacteriological confirmation, respectively, can be misclassified in the database. For more than a decade, partial data audits have been performed to ensure the integrity of the CCASAnet research database (Duda et al., 2012; Giganti et al., 2019; Lotspeich et al., 2020), and plans are currently underway to validate these TB study variables on a subset of records in the near future. Site-stratified random sampling has been the most common selection mechanism thus far, including a 2009-2010 audit of the TB variables. Now, we are interested in developing optimal designs that select the subjects who are most informative for our study of the association between bacteriological confirmation and treatment completion, in order to answer this question with the best possible precision.
Statistical methods have been proposed to analyze data from two-phase studies like this, correcting for binary outcome misclassification and exposure errors simultaneously. These methods can largely be grouped into likelihood- or design-based estimators. The former include the maximum likelihood estimator (MLE) (Tang et al., 2015) and semiparametric maximum likelihood estimator (SMLE) (Lotspeich et al., 2021), while the latter include the inverse probability weighted (IPW) estimator (Horvitz & Thompson, 1952) and generalized raking/augmented IPW estimator (Deville, Sarndal & Sautory, 1993; Robins, Rotnitzky & Zhao, 1994; Oh et al., 2021b). Likelihood-based estimators are fully efficient when all models (i.e., analysis and misclassification models) are correctly specified, while design-based estimators tend to be more robust since they make fewer distributional assumptions (i.e., do not require specification of a misclassification model). We focus on full-likelihood estimators because we have full-cohort information and they offer greater efficiency.
Theoretical properties and empirical comparisons of these estimators have been discussed in detail before (e.g., McIsaac & Cook, 2014; Lotspeich et al., 2021). Thus, in this paper, we focus on designs, rather than estimation, for two-phase studies.
Given the resource constraints imposed upon data audits, efficient designs that target the most informative patients are salient. Closed-form solutions exist for the optimal sampling proportions to minimize the variances for some design-based estimators under settings with outcome error (e.g., Pepe, Reilly & Fleming, 1994) or exposure error alone (e.g., Reilly & Pepe, 1995; McIsaac & Cook, 2014; Chen & Lumley, 2020). Optimal designs for likelihood-based estimators have also been considered in the setting of exposure errors alone, although the variances of these estimators do not lend themselves to closed-form solutions unless additional assumptions are made (Breslow & Cain, 1988; Holcroft & Spiegelman, 1999; McIsaac & Cook, 2014; Tao, Zeng & Lin, 2020).
Still, optimal designs have yet to be developed for two-phase studies with a misclassified binary outcome and exposure error, as needed for the CCASAnet TB study. Existing designs for our setting are limited to case-control (CC*) or balanced case-control (BCC*) designs based on the unvalidated outcome and exposure (Breslow & Cain, 1988) (we use the * here to differentiate these designs, which are based on error-prone data, from their traditional counterparts). In fact, many of the designs proposed for expensive variables (a setting similar to measurement error) are just CC* or BCC* sampling, or variations of them, since they target a particular prevalence for the variable of interest (Tan & Heagerty, 2022) or "balanced" numbers from each of the Phase I strata to sample in Phase II (White, 1982; Wang et al., 2017), respectively. While these designs are practical and can offer efficiency gains over simple random sampling, they were not derived to be optimal for any specific target parameter. Our goal is to compute optimal designs for likelihood-based estimators in the unaddressed setting of binary outcome and exposure misclassification.
Regardless of the estimator, optimal designs share common challenges; in particular, they require specification of unknown parameters. To overcome this, multi-wave strategies have been proposed that estimate the unknown parameters with an internal pilot study and use this information to approximate the optimal design (McIsaac & Cook, 2015; Chen & Lumley, 2020; Han et al., 2020; Shepherd et al., 2022). Instead of selecting one Phase II subsample, multi-wave designs allow iterative selection in two or more waves of Phase II. This way, each wave gains information from those that came before it. So far, multi-wave designs have only been used to adapt optimal designs for design-based estimators. We focus on designing multi-wave validation studies to improve the statistical efficiency of likelihood-based estimators under outcome and exposure misclassification.
Based on the asymptotic properties of the two-phase MLE for logistic regression, we derive the optimal validation study design to minimize the variance of the log odds ratio (OR) under differential outcome and exposure misclassification.
In the absence of a closed-form solution, we devise an adaptive grid search algorithm that can locate the optimal design in a computationally feasible and numerically accurate manner. Because the optimal design requires specification of unknown parameters at the outset and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it in practice. Through extensive simulations, the proposed optimal designs are compared to CC* and BCC* sampling. Notable gains in efficiency can be seen not only with the optimal design, but also with the multi-wave approximation to it. Using the VCCC data, we evaluate the efficiency of various designs validating different subsets of the EHR data and compare them to the fully validated, full-cohort analysis. Finally, we implement our approach to design the next round of CCASAnet audits.

Model and data
Consider a binary outcome, Y, binary exposure, X, and covariates Z that are assumed to be related through the logistic regression model

P(Y = 1 | X, Z) = [1 + exp{−(β_0 + βX + β_Z^T Z)}]^{−1},

where β, the conditional log odds ratio for X on Y, is the parameter of interest. Instead of Y and X, error-prone measures Y* and X*, respectively, are available in an observational database. Covariates Z are also available and error-free. An audit sample of size n of the N subjects in the database (n < N) will have their data validated; let V denote the validation indicator, with V = 1 for audited subjects and V = 0 otherwise. The joint density of a complete observation is

P(V, Y*, X*, Y, X, Z) = P(V | Y*, X*, Z) P(Y* | X*, Y, X, Z) P(X* | Y, X, Z) P(Y | X, Z) P(X | Z) P(Z),    (1)

where P(V | Y*, X*, Z) is the validation sampling probability; P(Y | X, Z) is the logistic regression model of primary interest; P(Y* | X*, Y, X, Z) and P(X* | Y, X, Z) are the outcome and exposure misclassification mechanisms, respectively; P(X | Z) is the conditional probability of X given Z; and P(Z) is the marginal density of Z. Sampling (i.e., V) is assumed to depend only on the Phase I variables (Y*, X*, Z), so (Y, X) are missing at random (MAR) for unaudited subjects.

Equation (1) captures the most general situation, with complex differential misclassification in the outcome and exposure and without any assumptions of independence between variables, but it addresses other common settings as special cases. For classical scenarios of outcome or exposure misclassification alone, set X* = X or Y* = Y, respectively. For nondifferential misclassification, let P(Y* | X*, Y, X, Z) = P(Y* | Y) and P(X* | Y, X, Z) = P(X* | X) (Keogh et al., 2020). If one has more specific knowledge, perhaps from a previous audit, scientific context, or the literature, the models can be further customized. For example, if the exposure and covariates are assumed to be independent, then P(X | Z) = P(X). Importantly, these customizations do not affect the derivations of the optimal design that follow.

The misclassification mechanisms and the exposure model follow additional logistic regression models. Model parameters are denoted together by θ; since we focus on estimating β, all other nuisance parameters are denoted by η such that θ = (β, η^T)^T. Given that (Y_i, X_i) are incompletely observed for unaudited subjects, the observed-data log-likelihood for θ is

l_N(θ) = Σ_{i=1}^{N} V_i log{P(Y*_i | X*_i, Y_i, X_i, Z_i) P(X*_i | Y_i, X_i, Z_i) P(Y_i | X_i, Z_i) P(X_i | Z_i)}
       + Σ_{i=1}^{N} (1 − V_i) log{Σ_{y=0,1} Σ_{x=0,1} P(Y*_i | X*_i, y, x, Z_i) P(X*_i | y, x, Z_i) P(y | x, Z_i) P(x | Z_i)}.    (2)

The distribution of V can be omitted from Equation (2) because the Phase II variables are MAR. In other words, because V is fully observed (in fact, fixed by design), its contribution would only rescale l_N(θ) by a constant, so omitting it from Equation (2) does not affect parameter estimation. The MLE θ̂ = (β̂, η̂^T)^T can be obtained by maximizing Equation (2) (Tang et al., 2015). Our optimal design will obtain the most efficient MLE for β, the conditional log OR for X on Y.
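To make Equation (2) concrete, the following sketch computes the observed-data log-likelihood for the binary outcome and exposure setting with a single error-free covariate, using main-effects logistic models for each component of Equation (1). It is an illustration only: the function and argument names are ours (not from auditDesignR), and the model specifications are assumptions that would be tailored to a given study.

```r
expit <- function(lp) 1 / (1 + exp(-lp))

# Observed-data log-likelihood of Equation (2); `theta` is a named list of
# coefficient vectors for the four logistic models, and `dat` holds Ystar,
# Xstar, Z, V for all N subjects plus (Y, X) wherever V == 1.
loglik_obs <- function(theta, dat) {
  joint <- function(y, x, d) {
    p_y  <- expit(cbind(1, x, d$Z) %*% theta$y)              # P(Y = 1 | X, Z)
    p_x  <- expit(cbind(1, d$Z) %*% theta$x)                 # P(X = 1 | Z)
    p_xs <- expit(cbind(1, y, x, d$Z) %*% theta$xs)          # P(X* = 1 | Y, X, Z)
    p_ys <- expit(cbind(1, d$Xstar, y, x, d$Z) %*% theta$ys) # P(Y* = 1 | X*, Y, X, Z)
    (p_ys^d$Ystar * (1 - p_ys)^(1 - d$Ystar)) *
      (p_xs^d$Xstar * (1 - p_xs)^(1 - d$Xstar)) *
      (p_y^y * (1 - p_y)^(1 - y)) * (p_x^x * (1 - p_x)^(1 - x))
  }
  val <- dat[dat$V == 1, ]  # audited: (Y, X) observed
  unv <- dat[dat$V == 0, ]  # unaudited: sum over the four possible (y, x)
  sum(log(joint(val$Y, val$X, val))) +
    sum(log(joint(0, 0, unv) + joint(0, 1, unv) +
            joint(1, 0, unv) + joint(1, 1, unv)))
}
```

Maximizing a function of this form (e.g., with optim() over the stacked coefficient vectors) yields the MLE described by Tang et al. (2015).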

Optimal design
Under standard asymptotic theory, with N → ∞ and n/N converging to a positive constant, √N(θ̂ − θ) converges in distribution to N(0, I(θ)^{−1}), a multivariate normal distribution with mean 0 and variance equal to the inverse of the Fisher information, I(θ). Partition I(θ) into the blocks I_ββ, I_βη, I_ηβ, and I_ηη corresponding to β and η. The optimal design aims to minimize Var(β̂), which can be expressed as

Var(β̂) ≈ N^{−1} (I_ββ − I_βη I_ηη^{−1} I_ηβ)^{−1}.    (3)

The elements of I(θ) are expectations taken with respect to the complete data, following from the joint density in Equation (1). Thus, they can be expressed as functions of the sampling probabilities, π_{y*x*z} ≡ P(V = 1 | Y* = y*, X* = x*, Z = z), and the model parameters, θ. That is, for elements θ_j and θ_j' of θ,

I(θ_j, θ_j') = Σ_{y*, x*, y, x, z} {π_{y*x*z} S_v(θ_j; y*, x*, y, x, z) S_v(θ_j'; y*, x*, y, x, z) + (1 − π_{y*x*z}) S_v̄(θ_j; y*, x*, z) S_v̄(θ_j'; y*, x*, z)} P(y*, x*, y, x, z),    (4)

where S_v(·) and S_v̄(·) are the score functions of validated and unvalidated subjects, respectively, and P(y*, x*, y, x, z) is the joint density of the error-prone and error-free variables (see Appendix for details). The sampling strata are defined by Y*, X*, and Z.
Note that Equation (4) implicitly assumes that the covariates Z are discrete. As will be seen in our applications, this is sometimes the case (e.g., Z is study site) but certainly not always (e.g., Z is a continuous lab value). If the covariates are continuous or multi-dimensional with many categories, one will need to simplify them to create sampling strata.
Specifically, the covariates may need to be discretized or have their dimensions reduced to make sampling feasible; such a strategy has been employed by others (e.g., Lawless, Kalbfleisch & Wild, 1999; Zhou et al., 2002; Tao, Zeng & Lin, 2020; Han et al., 2020). The resulting optimal design based on the simplified covariates may no longer be optimal for minimizing the variance based on the full, unsimplified covariates, but in most scenarios it should be a good approximation to the optimal design and better than classical designs like BCC* (Tao, Zeng & Lin, 2020). Moreover, the resulting discretized design should converge to the optimal one as the number of strata increases. In practice, there also exists a trade-off between the pursuit of optimality (with more strata) and ease of implementation (with fewer strata).
We see from Equations (3) and (4) that the optimal design corresponds to the {π_{y*x*z}} that minimize Var(β̂) under the constraint

Σ_{y*, x*, z} π_{y*x*z} N_{y*x*z} = Σ_{y*, x*, z} n_{y*x*z} = n,    (5)

where N_{y*x*z} and n_{y*x*z} are the sizes of the stratum (Y*_i = y*, X*_i = x*, Z_i = z) in Phase I and Phase II, respectively.
Because {N_{y*x*z}} are fixed, finding the optimal values of {π_{y*x*z}} is equivalent to finding the optimal values of {n_{y*x*z}}. Unfortunately, this constrained optimization problem does not have a closed-form solution, so we devise an adaptive grid search algorithm to find the optimal values of {n_{y*x*z}}.
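As a small illustration of Equation (3), the sketch below computes Var(β̂) for a candidate design from the Fisher information implied by the sampling probabilities. The helper info_matrix() is hypothetical: it stands in for an evaluation of Equation (4) at working parameter values, with β occupying the first row and column of I(θ).

```r
# Var(beta-hat) from Equation (3), given a per-subject information matrix.
# info_matrix(pi, theta) is a hypothetical helper implementing Equation (4).
var_beta <- function(pi, theta, N) {
  I <- info_matrix(pi, theta)
  I_bb <- I[1, 1]
  I_be <- I[1, -1, drop = FALSE]
  I_ee <- I[-1, -1, drop = FALSE]
  drop(1 / (I_bb - I_be %*% solve(I_ee) %*% t(I_be))) / N
}
```

The grid search described next simply evaluates this quantity at every candidate {π_{y*x*z}} (equivalently, {n_{y*x*z}}) and keeps the smallest value.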

Adaptive grid search algorithm
The challenge at hand is one of combinatorics: of all the candidate designs that satisfy the audit size constraint and are supported by the available Phase I data (i.e., the stratum sizes {N_{y*x*z}}), we need to find the one that minimizes Var(β̂). To locate the optimal design, we develop an adaptive grid search algorithm. Specifically, a series of grids are constructed at iteratively finer scales and over more focused candidate design spaces to locate the optimal design. The adaptive nature of our algorithm is necessitated by the dimensional explosion of the grid as the Phase I and Phase II sample sizes and number of sampling strata increase.
Let K denote the number of sampling strata and let m denote the minimum number that must be sampled in a stratum; m is needed to avoid degenerate optimal designs where one or more sampling strata are empty, rendering the estimation of the conditional distribution of Phase I given Phase II data impossible (Breslow & Cain, 1988). At the first iteration of the grid search, we form the grid G^{(1)} with candidate designs comprised of stratum sizes n^{(1)}_{y*x*z} that satisfy constraint (5) and

min(m, N_{y*x*z}) ≤ n^{(1)}_{y*x*z} ≤ min(n − Km + m, N_{y*x*z}),    (6)

i.e., candidate stratum sizes fall between the minimum, m, and the maximum after minimally allocating to all K strata, n − Km + m (or the full stratum size N_{y*x*z} if either of these is not possible). The number of subjects n^{(1)}_{y*x*z} in each stratum varies by a fixed quantity s^{(1)} between candidate designs. For example, with s^{(1)} = 15, m = 10, and N_{y*x*z} = 100, we might consider sampling n^{(1)}_{y*x*z} = 10, 25, 40, ... subjects from that stratum. We then calculate Var(β̂) under each candidate design to identify the best one, i.e., the one that yields the smallest Var(β̂). Given the large space of legitimate designs in this initial search, we want to choose a reasonably large s^{(1)} to keep the dimension of G^{(1)} manageable. Clearly, starting with a large s^{(1)} will lead to a rough estimate of the optimal design, but this will be refined in subsequent iterations.
At the tth iteration (t > 1), we form the grid G^{(t)} around the "s^{(t−1)}-neighborhood" of the best design from the (t − 1)th iteration, whose stratum sizes are denoted by {n^{(t−1)}_{y*x*z}}. We form G^{(t)} from candidate designs that satisfy constraint (5) and

max(n^{(t−1)}_{y*x*z} − s^{(t−1)}, m) ≤ n^{(t)}_{y*x*z} ≤ min(n^{(t−1)}_{y*x*z} + s^{(t−1)}, N_{y*x*z}).    (7)

This constraint is a refined version of (6), since we focus on a narrower space of designs surrounding the previous iteration's best design. Once again, the stratum sizes {n^{(t)}_{y*x*z}} vary by multiples of s^{(t)} between candidate designs.
We adaptively choose s^{(t)} < s^{(t−1)} such that the grids {G^{(t)}} become finer and finer during the iterative process. The choice of the step sizes s^{(1)}, ..., s^{(T)} will determine the computation time to complete the algorithm, but the grid search appears robust to these choices (Web Appendix A). We stop at s^{(T)} = 1, meaning that the final search is conducted at the 1-person level, and the best design at the last iteration (T) is the optimal design, which we call the optMLE.
Figure 1 depicts a schematic diagram of an adaptive grid search with T = 3 iterations. In this hypothetical example, the Phase I sample size is N = 10 000 and there are K = 4 strata defined by (Y*, X*), with Phase I stratum sizes (N_00, N_01, N_10, N_11) = (5297, 1130, 2655, 918). The aim is to select n = 400 subjects for data validation in Phase II. Based on our simulations (discussed in Section 3.2), we set m = 10. Assume that reliable parameter estimates are available from a previous data audit and can be used in the grid search. At the first iteration, we construct G^{(1)} with candidate designs that satisfy constraints (5) and (6), varying the stratum sizes {n^{(1)}_{y*x*}} by multiples of s^{(1)} = 15 between designs. Var(β̂) is minimized at 3.6283 × 10^{−4} under the candidate design with Phase II stratum sizes (n^{(1)}_00, n^{(1)}_01, n^{(1)}_10, n^{(1)}_11) = (10, 115, 85, 190). At the second iteration, we form the grid G^{(2)} from the 15-person neighborhood around {n^{(1)}_{y*x*}} such that the candidate designs satisfy constraints (5) and (7), with stratum sizes varied by multiples of s^{(2)} = 5 between designs. Var(β̂) is minimized under the same design as in the first iteration, i.e., {n^{(2)}_{y*x*}} = {n^{(1)}_{y*x*}}. At the third and last iteration, we form grid G^{(3)} in the 5-person neighborhood around {n^{(2)}_{y*x*}} such that the candidate designs satisfy constraints (5) and (7), with stratum sizes varied by multiples of s^{(3)} = 1 between designs. Var(β̂) is minimized at 3.6281 × 10^{−4} by Phase II stratum sizes (n^{(3)}_00, n^{(3)}_01, n^{(3)}_10, n^{(3)}_11) = (11, 114, 84, 191), which is the optMLE design. We note that the minimum variance barely changed between iterations in this toy example; the algorithm proceeds anyway because the stopping rule is defined as the most granular grid search (i.e., a 1-person scale with s^{(T)} = 1). In practice, one may use other sensible rules that permit early stops to make the algorithm more computationally efficient, e.g., stopping when the minimum variance from successive iterations changes by less than 1%.
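A compact sketch of the adaptive grid search follows. It assumes the strata are indexed in a fixed order, that the step sizes have already been chosen as in Web Appendix A, and that var_beta_design() maps a vector of candidate stratum sizes to Var(β̂) by evaluating Equations (3)-(4) at working parameter values (e.g., by converting the stratum sizes to sampling probabilities and calling var_beta() above). All names are illustrative, not the auditDesignR implementation.

```r
grid_search <- function(n, N_strata, m, steps, var_beta_design) {
  K <- length(N_strata)
  lower <- pmin(m, N_strata)                    # constraint (6): lower bounds
  upper <- pmin(n - (K - 1) * m, N_strata)      # constraint (6): upper bounds
  best <- NULL
  for (s in steps) {                            # e.g., steps = c(15, 5, 1)
    if (!is.null(best)) {                       # constraint (7): s-neighborhood
      lower <- pmax(best - prev_s, m)
      upper <- pmin(best + prev_s, N_strata)
    }
    # enumerate candidate designs on the current grid
    grid <- do.call(expand.grid,
                    lapply(seq_len(K), function(k) seq(lower[k], upper[k], by = s)))
    grid <- grid[rowSums(grid) == n, , drop = FALSE]   # constraint (5)
    vars <- apply(grid, 1, var_beta_design)
    best <- as.numeric(grid[which.min(vars), ])
    prev_s <- s
  }
  best   # Phase II allocation of the optMLE design
}
```

In the toy example of Figure 1, this routine would be called with n = 400, N_strata = c(5297, 1130, 2655, 918), m = 10, and steps = c(15, 5, 1).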

Multi-wave approximate optimal design
The optMLE design derived in Section 2.2 relies on the model parameters θ, which are unknown at the study outset. If available, historical data from a previous audit could be used to estimate θ. Otherwise, it would be difficult to implement the optMLE design in practice, so we propose a multi-wave sampling strategy to approximate it. Whereas traditional two-phase studies require all design-relevant information to be available in Phase I, multi-wave designs allow sampling to adapt as such information accumulates through multiple sampling waves in Phase II.
In one of the earliest multi-wave papers, McIsaac & Cook (2015) considered the most and least extreme multi-wave sampling strategies: (i) fully adaptive and (ii) two waves, respectively. Strategy (i) began with a small initial sample, and then the sampling strategy was re-computed after data were collected for each individual selected into Phase II (leading to nearly n waves), while strategy (ii) used just two waves and re-designed the study just once (after the initial sample).
McIsaac & Cook (2015) found that the fully adaptive designs offered great efficiency, but their implementation could be unrealistic in practice; meanwhile, the more practical two-wave strategy offered near-optimal efficiency. Therefore, in this manuscript we primarily consider two waves of sampling, labeled Phase II(a) and Phase II(b), with the corresponding sample sizes denoted by n^{(a)} and n^{(b)}, respectively (n^{(a)} + n^{(b)} = n).
McIsaac & Cook (2015) also examined different n^{(a)}:n^{(b)} ratios and found that a 1:1 ratio appeared to strike a good balance between (i) more accurate Phase II(a) estimation of θ with larger n^{(a)} and (ii) more flexible design optimization in Phase II(b) with larger n^{(b)}. Based on this result, we select n^{(a)} = n^{(b)} = n/2 subjects in each wave.
In Phase II(a), we typically select subjects through BCC* sampling; other existing designs could be used depending on the available information (e.g., Oh et al., 2021a). Then, we use the Phase I and Phase II(a) data to compute the preliminary MLE of the parameters, denoted θ̂^{(a)}, where the validation indicator equals 1 if a subject was sampled in Phase II(a) and 0 otherwise. Then, we use the grid search algorithm in Section 2.3 with θ̂^{(a)} to determine the optimal allocation of the remaining subjects in Phase II(b). We call our two-wave approximate optimal design the optMLE-2. Following completion of both waves of validation, the final MLE θ̂ is obtained by combining data from Phases I, II(a), and II(b), with the validation indicator redefined as V_i = 1 if subject i (i = 1, ..., N) was audited in either wave of the optMLE-2 design. Thus, ensuing inference is based on n audited and (N − n) unaudited subjects, as with a single wave of audits.
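The two-wave workflow can be sketched as follows, assuming dat is a data.frame of Phase I data with error-prone columns Ystar and Xstar for all N subjects and n is the total validation budget. The helper fit_mle(), standing in for maximization of Equation (2), is hypothetical, and all other names are illustrative.

```r
set.seed(918)
n_a <- floor(n / 2)                                      # Phase II(a) budget

# Phase II(a): BCC* -- equal-sized random samples from each (Y*, X*) stratum
strata <- interaction(dat$Ystar, dat$Xstar, drop = TRUE)
per_stratum <- floor(n_a / nlevels(strata))
wave_a <- unlist(lapply(split(seq_len(nrow(dat)), strata), function(idx)
  idx[sample.int(length(idx), min(per_stratum, length(idx)))]))
dat$V <- as.integer(seq_len(nrow(dat)) %in% wave_a)      # wave (a) indicator

# Phase II(b): fit the preliminary MLE on Phase I + II(a) data, rerun the
# adaptive grid search at those estimates (treating each stratum's wave (a)
# count as its minimum), sample the remaining n - n_a subjects to reach the
# chosen allocation, set V <- 1 for them, and refit on all N records.
theta_a <- fit_mle(dat)                                  # hypothetical helper
```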

SIMULATIONS
Our objective with these simulation studies is three-fold: (i) to describe the construction of the optimal designs, since there is no closed-form solution; (ii) to demonstrate the efficiency gains of the optimal designs over existing designs; and (iii) to investigate the robustness of the proposed designs to model misspecification. These objectives were explored in settings with varied rates of misclassification (Section 3.2), with additional error-free information to incorporate (Section 3.4), under model misspecification of the misclassification mechanisms at the design stage (Section 3.5), and in special cases with either outcome or exposure misclassification alone (Section 3.6).

Validation study designs
We compare the performance of five two-phase validation study designs under differential outcome and exposure misclassification:
Simple random sampling (SRS): All subjects in Phase I have equal probability of inclusion in Phase II.
CC*: Subjects are stratified on Y*, and separate random samples of equal size are drawn from each stratum.
BCC*: Subjects are jointly stratified on (Y*, X*) or (Y*, X*, Z), and separate random samples of equal size are drawn from each stratum.
optMLE: Subjects are jointly stratified on (Y*, X*) or (Y*, X*, Z), and stratum sizes are determined by the adaptive grid search algorithm. This design is included as a "gold standard" as it requires knowledge of θ.
optMLE-2: Subjects are jointly stratified on (Y*, X*) or (Y*, X*, Z). In Phase II(a), n/2 subjects are selected by BCC*. In Phase II(b), n/2 more subjects are selected by the adaptive grid search algorithm, with θ estimated using Phase I and II(a) data.
We compared the designs using two precision measures: (i) relative efficiency (RE), defined as the ratio of empirical variances of β̂ in the final analysis, and (ii) relative interquartile range (RI), defined as the ratio of the widths of the empirical interquartile ranges (IQRs) (McIsaac & Cook, 2015). The optMLE design based on true parameter values and observed Phase I stratum sizes {N_{y*x*}} or {N_{y*x*z}} was treated as the reference design when calculating the RE and RI (i.e., the variance and IQR, respectively, of the optMLE were used in the numerators of these measures). RE and RI values > 1 indicate better precision than the optMLE design, while values < 1 indicate worse precision. We also considered alternative versions of the reference optimal design (Supplemental Table S1), but results were similar to the optMLE and thus they were not included in subsequent simulations.

Outcome and exposure misclassification
We simulated data for a Phase I sample of N = 10 000 subjects according to Equation (1). We generated X and Y from Bernoulli distributions, with p_x = P(X = 1) and with P(Y = 1|X) following the logistic model in Section 2.1. We used the approximate outcome prevalence p_y0 = P(Y = 1|X = 0) to define β_0 = log{p_y0/(1 − p_y0)}. We then generated Y* and X* from Bernoulli distributions following logistic models, where (γ_0, γ_1) and (α_0, α_1) control the strength of the relationship between the error-prone and error-free variables. We define the "baseline" false positive and true positive rates for X*, denoted by FPR_0(X*) and TPR_0(X*), respectively, as the false positive and true positive rates of X* when Y = 0. Similarly, FPR_00(Y*) and TPR_00(Y*) are the baseline false positive and true positive rates for Y* when X = 0 and X* = 0. With these definitions, the generating parameters can be chosen to achieve the desired baseline rates. Both Y* and X* are misclassified, but the misclassification rates are varied separately; we fix FPR(·) = 0.1 and TPR(·) = 0.9 for one variable and vary FPR(·) = 0.1 or 0.5 and TPR(·) = 0.9 or 0.5 for the other. We consider cases where Y = 0 or 1 is more common by fixing p_x = 0.1 and varying p_y0 from 0.1 to 0.9. Similarly, cases where X = 0 or 1 is more common were considered by fixing p_y0 = 0.3 and varying p_x from 0.1 to 0.9. Using the designs in Section 3.1, n = 400 subjects were selected in Phase II. We considered minimum stratum sizes of m = 10-50 for the optMLE design; all yielded stable estimates (Supplemental Figure S1), so m = 10 was used hereafter. In choosing m, there is a trade-off between stability of the design and potential efficiency gain, driven by larger and smaller choices of m, respectively. While the grid search parameters used to locate the optMLE and optMLE-2 designs varied between replicates, three-iteration grid searches with step sizes s = {15, 5, 1} and s = {25, 5, 1}, respectively, were most common. Each setting was replicated 1000 times.
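For concreteness, a data-generation sketch in the spirit of this setting is given below. The specific coefficients (the log odds ratio of 0.3 and the differential shifts of 0.5) are illustrative choices rather than the exact values used in our simulations; the error models simply make X*'s error rates depend on Y and Y*'s error rates depend on (X, X*), as described above.

```r
set.seed(2021)
expit <- function(lp) 1 / (1 + exp(-lp))
N <- 10000; p_x <- 0.1; p_y0 <- 0.3; beta <- 0.3          # beta: illustrative

X <- rbinom(N, 1, p_x)
Y <- rbinom(N, 1, expit(log(p_y0 / (1 - p_y0)) + beta * X))

# X*: baseline FPR_0 = 0.1, TPR_0 = 0.9 when Y = 0; shifted when Y = 1
Xstar <- rbinom(N, 1, expit(qlogis(ifelse(X == 1, 0.9, 0.1)) + 0.5 * Y))

# Y*: baseline FPR_00 = 0.1, TPR_00 = 0.9 when X = X* = 0; shifted otherwise
Ystar <- rbinom(N, 1, expit(qlogis(ifelse(Y == 1, 0.9, 0.1)) + 0.5 * X + 0.5 * Xstar))

dat <- data.frame(Ystar, Xstar, Y, X)   # Phase I data plus the true values
```

In a two-phase study, only (Ystar, Xstar) would be available for everyone at Phase I; (Y, X) would be revealed only for the audited subset.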
Tables 1 and 2 show that the optMLE-2 design was highly efficient, with RE > 0.9 and RI > 0.95 in most settings. In some settings, the RE and RI for the optMLE-2 design were even slightly larger than one; this is because the optMLE design is asymptotically optimal but may not necessarily be optimal in finite samples. In most settings, the optMLE-2 design exhibited sizeable efficiency gains over the BCC*, CC*, and SRS designs, with gains as high as 43%, 74%, and 83%, respectively. The efficiency gains were higher when the misclassification rates were lower, particularly for Y*, or when the Phase I stratum sizes were less balanced. Even under the most severe misclassification (e.g., when FPR_00(Y*) and TPR_00(Y*) were both 0.5), the MLE should remain identifiable because of the validation sample.
However, the CC* and SRS designs incurred bias, as much as 22% and 27%, respectively, primarily when p_y0 was farther from 0.5; in these situations, imbalance in Y made these designs susceptible to empty cells in the validation data.
The optimal and BCC* designs were reasonably unbiased in all settings, although we saw the smallest efficiency gain for the optimal designs when the misclassification rates were highest, since the Phase I data were not very informative for Phase II. The grid search successfully located the optMLE and optMLE-2 designs in all replicates and in ≥ 95% of replicates per setting, respectively.
The grid search failed to locate the optMLE-2 design in a few replicates because empty cells in the cross-tabulation of unvalidated and validated data from the Phase II(a) sample (e.g., no "false negatives" for the exposure) led to unstable coefficients from logistic regression that rendered singular information matrices. This can happen when error rates are extreme in either direction. When error rates are extremely low, error-prone variables can become collinear with their error-free counterparts; in this case, we might treat the variable as error-free and use the Phase I version of this variable in all models. When error rates are extremely high, we might not observe any cases of agreement in Phase II(a) (e.g., no records with X* = X); in this case, more than two waves of Phase II might be needed to "fill in" the empty cells. Fortunately, we did not encounter this problem very often; out of 20 000 total replicates across these settings, we discarded 172 (0.9%) problematic ones.
Supplemental Figure S2 shows the average Phase II stratum sizes for all designs under the settings described. The makeup of the optMLE design depended on the Phase I stratum sizes and misclassification rates: it oversampled subjects from the less-frequent strata, and this oversampling was more extreme when misclassification rates for the variable were higher. The optMLE-2 design had a similar but less extreme allocation compared to the optMLE design because it contained a BCC* sample of 200 subjects in Phase II(a). When the Phase I variables were not very informative about the Phase II ones (e.g., FPR_0(X*) = TPR_0(X*) = 0.5), the optimal designs became less dependent on the Phase I stratum sizes, and the optMLE-2 design also became less similar to the optMLE design because estimating θ was harder (Supplemental Figure S3). With the larger minimum m = 50, the allocations of the optMLE and optMLE-2 designs were more similar than with m = 10, especially when the Phase I variables were informative (Figure S4 in Web Appendix B).

More than two waves of validation
We considered another approximately optimal design, the optMLE-3, which conducted validation in three waves. Phase II(a) was the same as in the optMLE-2, with n/2 subjects selected by BCC* based on (Y*, X*). Then, n/4 subjects were selected in each of Phases II(b) and II(c) by the adaptive grid search algorithm, with θ̂ estimated from Phases I and II(a) and from Phases I, II(a), and II(b), respectively. Data were generated following Section 3.2 with p_x = 0.1, p_y0 = 0.3, varied outcome misclassification rates, and fixed FPR_0(X*) = 0.1, TPR_0(X*) = 0.9 for the exposure. Efficiency gains with the optMLE-3 were similar to those of the optMLE-2, with both recovering ≥ 88% of the efficiency of the optMLE (Table 3), and the two designs chose almost identical stratum sizes for validation (Supplemental Figure S5). Other allocations of the Phase II sample across multiple validation waves are of course possible; we refer the reader to McIsaac & Cook (2015) for additional considerations.

Incorporating an additional error-free covariate
We performed an additional set of simulations that incorporated an error-free covariate into the designs and analyses.
Simulation details are in Web Appendix B.1 in the Supplementary Materials. In summary, the optMLE-2 design continued to be highly efficient, with gains as high as 43%, 56%, and 59% over the BCC*, CC*, and SRS designs, respectively (Supplemental Table S2). The optimal designs favored sampling subjects from strata with larger Var(X|Z = z), where the true value of X was harder to "guess" (Supplemental Figure S5).

Optimal designs' robustness to model misspecification
Next, we investigated the impact of model misspecification at the design stage on the efficiency of subsequent estimators.
We simulated data using Equation (1), generating the error-prone exposure and outcome from models that included the interaction terms δ_1 XZ and δ_2 XZ, respectively, and varying δ_1 and δ_2 between −1 and 1. We defined eight (Y*, X*, Z) sampling strata and selected n = 400 subjects in Phase II.
Additional optimal designs, denoted optMLE* and optMLE-2*, assumed only main effects for P(X* = 1|Y, X, Z) and P(Y* = 1|X*, Y, X, Z); clearly, these models will be misspecified at the design stage when δ_1 and/or δ_2 ≠ 0. The analysis models were correctly specified (i.e., included the interaction term), although Lotspeich et al. (2021) found the MLE to be fairly robust to model misspecification in this setting.
Simulation results for the proposed designs can be found in Table 4. Even though the optMLE* and optMLE-2* designs were computed based on incorrect model specifications, very little efficiency was lost relative to the correctly specified gold standard optMLE design. Moreover, the optMLE-2* design remained more efficient than existing designs, with efficiency gains as high as 47%, 43%, and 37% over the BCC*, CC*, and SRS designs, respectively (Supplemental Table S3). Thus, the proposed optimal designs appeared to maintain their advantages even when we were uncertain about the model specification at the design stage, including the problematic situation where, at the design stage, we incorrectly omitted an interaction term from the misclassification model. The average validation sample sizes in each stratum for all designs are displayed in Supplemental Figure S6. The differences between the optMLE and optMLE* designs, or between the optMLE-2 and optMLE-2* designs, were almost always small, although they were more visible when the model for Y* was misspecified.

Classical scenarios with outcome or exposure misclassification alone
Detailed results for settings with outcome or exposure misclassification only are presented in Web Appendices B.2 and B.3, respectively, in the Supplementary Materials. The optimal designs oversampled subjects from strata that corresponded to the less-frequent value of the error-prone variable (Supplemental Figures S7 and S8). The optMLE-2 design approximated the optMLE design well, continuing to offer sizable efficiency gains over existing designs (Supplemental Table S4).

COMPARING PARTIAL TO FULL AUDIT RESULTS IN THE VCCC
The VCCC EHR contains routinely collected patient data, including demographics, antiretroviral therapy (ART), labs (e.g., viral load and CD4 count), and clinical events. Since the VCCC data had been fully validated, the available pre-/post-validation datasets could be used to compare two-phase designs and analyses that only validate a subset of records to the gold standard analysis that uses the fully validated data. We used these data to assess the relative odds of having an AIDS-defining event (ADE) within one year of ART initiation (Y) between patients who were/were not ART naive at enrollment (X), adjusting for square-root-transformed CD4 at ART initiation (Z). We extracted N = 2012 records from the EHR for this study. In the unvalidated data (Y*, X*), 73% of patients were ART naive at enrollment and 8% of patients experienced an ADE within one year.
The misclassification rate of ADE was 6%, with a 63% false positive rate (FPR) and only a 1% false negative rate (FNR).
The misclassification rate of ART naive status at enrollment was 11%, with FPR = 13% and FNR = 3%. Only 19 subjects (1%) had both outcome and exposure misclassification. CD4 count was error-free. We assumed logistic misclassification models for ADE and for ART naive status at enrollment. We defined four strata according to the unvalidated Phase I ADE and ART naive status, with stratum sizes (N_00, N_01, N_10, N_11) = (504, 1350, 42, 116), where the first and second subscripts index error-prone ADE and ART naive status, respectively. We set n = 200 and considered the optMLE-2, BCC*, CC*, and SRS designs. When implementing the optMLE-2 design, we selected n^{(a)} = 100 subjects in Phase II(a) via BCC* sampling. All results were averaged over 1000 replicates, except that SRS encountered 118 replicates where the MLE was unstable or did not converge because of very small numbers of audited events or exposures, and the grid search algorithm failed to locate the optMLE-2 design in 40 of the replicates. On average, the SRS, CC*, BCC*, and optMLE-2 audits chose (n_00, n_01, n_10, n_11) = (56, 134, 4, 12), (27, 73, 26, 74), (53, 53, 42, 52), and (25, 39, 42, 95) subjects, respectively, from the four strata in Phase II.
Table 5 shows the results under the two-phase designs and those from the gold standard and naive analyses using the fully validated and unvalidated data, respectively, from the full cohort. The log OR estimates under all two-phase designs were reasonably close to the gold standard estimates and led to the same clinical interpretations, i.e., after controlling for √CD4 at ART initiation, ART naive status at enrollment was not associated with changes in the odds of an ADE within one year of ART initiation. The variance under the optMLE-2 design was 14%, 13%, and 86% smaller than those under the BCC*, CC*, and SRS designs, respectively.

PROSPECTIVE AUDIT PLANNING IN CCASANET
Researchers are interested in assessing the association between bacteriological confirmation of TB and successful treatment outcomes among PLWH who are treated for TB. The outcome of interest (Y) is successful completion of TB treatment within one year of diagnosis; among patients who did not complete treatment, this captures the unfavorable outcomes of death, TB recurrence, or loss to follow-up (with each of these outcomes also of interest in secondary analyses). The exposure of interest (X) is bacteriological confirmation of TB, defined as any positive diagnostic test result, e.g., culture, smear, or PCR. The Phase I sample comes from the current CCASAnet research database and includes all patients initiating TB treatment between January 1, 2010 and December 31, 2018; error-prone values (Y*, X*) of the study variables are available on N = 3478 TB cases across sites in five countries (anonymously labeled Countries A-E) during this period. Patients were stratified on (Y*, X*) within Countries A-E, creating 20 sampling strata. We are in the process of designing a multi-site audit of n = 500 patients to validate key variables and better estimate this association in the CCASAnet cohort. To implement the optMLE-2 design as in Sections 3-4, n^{(a)} = 250 patients would be chosen in Phase II(a) using BCC* sampling from the 20 (Y*, X*, Country) strata. Alternatively, we could incorporate prior data from on-site chart reviews conducted in the five CCASAnet sites between 2009 and 2010. The original data from this time period captured a total of 595 TB cases (Phase I). In this historical dataset, 70% of cases completed treatment within one year and 68% had bacteriological confirmation of TB. Validated TB treatment and diagnosis were available for 40 subjects who were chosen for validation via site-stratified SRS. We observed 13% and 20% misclassification in Y* (FPR = 7%, FNR = 23%) and X* (FPR = 39%, FNR = 5%), respectively. No subject had both their outcome and exposure misclassified.
We demonstrate two ways to use these historic audits to design a more efficient validation study for the next round of CCASAnet audits. Strategy (i) estimates the parameters with the historic data, denoted θ̂^{(h)}, and uses them to derive the optMLE design to allocate n = 500 subjects in one Phase II subsample. Strategy (ii) is a multi-wave strategy that uses θ̂^{(h)} to design Phase II(a) and then uses the Phase II(a) parameters, denoted θ̂^{(a)}, to design Phase II(b).
Given the small size of the historic audit (n = 40), it was not possible to obtain country-level estimates of all parameters needed to derive the optimal design. Instead, we created country groupings (Z) based on site-specific audit results (Supplemental Table S5), where Z = 0 for Countries A-B with errors in Y* or X*, Z = 1 for Countries C-D with errors in both Y* and X*, and Z = 2 for Country E, which had no errors. We assumed logistic misclassification models for TB treatment completion and bacteriological confirmation. These groupings were used to obtain the MLE for the historic data, θ̂^{(h)}. Since audits will be conducted at the site level, we applied these parameters to the 20 Phase I strata from the current data by assuming the same coefficients for countries in the same country group (Supplemental Table S6).
Then, we implemented the grid search to select n^{(a)} = 250 subjects as a more informed first wave for the two-wave design (Strategy (ii)). Based on the historic parameters, the stratum sizes to be sampled at Phase II(a) were (n^{(a)}_00, n^{(a)}_01, n^{(a)}_10, n^{(a)}_11) = (10, 10, 10, 10), (10, 10, 10, 45), (3, 7, 5, 17), (6, 9, 10, 13), and (12, 16, 11, 26) for Countries A-E, respectively. Validation appeared focused on the smaller countries (C-E). Country A was sampled minimally, proportional to its Phase I sample size, as were all strata in Country B except (Y* = 1, X* = 1). Validated data on these subjects will be used to re-estimate the model parameters and derive the optimal allocation for Phase II(b). Alternatively, the Phase II(a) and historic data could be pooled to re-estimate the parameters. In our situation, the historic audits were much smaller than the Phase II(a) study, so Phase II(a) would likely dominate the pooled analysis. However, if the Phase II(a) study were smaller, e.g., due to budget constraints, then the benefits of data pooling could be greater.
Ultimately, the choice between these strategies is determined by logistics and our confidence in the historic data. We plan to use Strategy (ii): the optMLE-2 design with the first wave informed by prior audits. Incorporating the prior audit information, even if it might be biased, will likely be better than starting off with a BCC* design (Chen & Lumley, 2020), but we do not want to trust the historic audits entirely. Also, conducting multiple validation waves is feasible because they can be performed by in-country investigators (Lotspeich et al., 2020).

DISCUSSION
Validation studies are integral to many statistical methods that correct for errors in secondary use data. However, they are resource-intensive undertakings: the number of records and variables that can be reviewed is limited by time, budget, and staffing constraints. Thus, selecting the most informative records is key to maximizing the benefits of validation. We introduced a new optimal design, and a multi-wave approximation to it, which maximizes the efficiency of the MLE under differential outcome and exposure misclassification, a setting for which optimal designs had not previously been proposed. We also devised a novel adaptive grid search algorithm to locate the optimal design, and the designs are implemented in the auditDesignR R package and Shiny app (Supporting Information, Figure 2).
As part of our audit protocol, the CCASAnet sites are notified in advance of the list of patient records to be validated.
This provides time for site investigators to locate the relevant patient charts before the audit, but there is still a chance that validation data may be missing. Our methods rely on the MAR assumption, which asserts that, conditional on the Phase I information, subjects who are unaudited are similar to those who are audited. When we select whom to audit, MAR holds because the validation data are missing by design; but when validation data are missing simply because we cannot find them, the MAR assumption is called into question. In our previous CCASAnet audits, there have been instances where validation data were missing for selected records, although this is not very common. In the past, we have implicitly assumed that these records are MAR in analyses. Methods to simultaneously address audit data that are missing by design (i.e., MAR) and by nonresponse (i.e., possibly not MAR) would be an interesting direction for future work.
Our analyses and designs are based on the parametric MLE approach of Tang et al. (2015). Recently, an SMLE approach was developed to analyze two-phase studies with error-prone outcome and exposure data; it models the exposure error mechanism nonparametrically, making it robust to different exposure error mechanisms (Lotspeich et al., 2021). Our designs are guaranteed to be optimal for the MLE but still offer efficiency gains for the SMLE; this avoids the complicated calculations that would be required to derive a design specifically for the SMLE. In an additional simulation, we found that the efficiency gains of the SMLE and MLE under the proposed optimal designs were essentially identical (Supplemental Table S7).
We focused on designs for full-likelihood estimators because they offer the greatest efficiency when full-cohort information (i.e., both Phase I and Phase II data) is available, which was the case with the VCCC and CCASAnet data examples.
However, if only audit data are available or one wants to avoid placing additional models on the misclassification mechanisms, conditional maximum likelihood estimation (CMLE) could be considered (Breslow & Cain, 1988).
Optimal designs for CMLE could be calculated in a similar manner, but we have not done so here.
Strictly speaking, the proposed optimal design is "optimal" only among designs with compatible strata definitions. In the VCCC example, we considered an optimal design that sampled on a categorical version of continuous CD4 count in addition to the error-prone outcome and exposure; we created two categories based on CD4 counts above or below the median. While we treated CD4 as discrete to compute the optimal design, the design still performed well with continuous CD4 in the analysis. There are other ways we might have discretized CD4, e.g., by creating more than two categories or choosing a different cutpoint than the median. How best to stratify continuous variables for design purposes is an interesting question (e.g., Amorim et al., 2021), and designs that maintain the continuous nature of continuous covariates warrant further investigation.
Other interesting topics of future research include developing optimal designs for two-phase validation studies with other types of outcomes and exposures, including count, continuous, or censored data. Developing these designs would involve replacing the models in Equations (1)-(4) with appropriate ones for the new data types and then deriving the successive steps in parallel to the current work. The special case with Phase I data made up of multiple error-prone surrogates for Y or X (e.g., X*_1, X*_2, ..., X*_p), as in a reliability study, would also be a natural extension of the methods herein. Also, the proposed designs could be modified to pursue optimal estimation of the interaction between the misclassified exposure and an additional covariate, similar to the focus of Breslow & Chatterjee (1999), or of multiple parameters simultaneously (e.g., the main effects and interaction). Either of these modifications would require adoption of an alternative criterion to summarize the variance matrix, such as D-optimality or A-optimality, which minimize the determinant or trace, respectively, of the variance matrix (Fedorov & Leonov, 2013).
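As a minimal illustration of these alternative criteria, applied to the joint variance matrix V of several target parameters (e.g., a main effect and an interaction):

```r
d_criterion <- function(V) det(V)         # D-optimality: minimize the determinant
a_criterion <- function(V) sum(diag(V))   # A-optimality: minimize the trace
```

Either quantity could replace Var(β̂) as the objective evaluated for each candidate design in the grid search.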

APPENDIX
Derivations of S_v(·) and S_v̄(·)

Recall that Equation (1) is the joint density of a complete observation, which includes the validation indicator along with the error-prone and error-free variables. The joint density of just the error-prone and error-free variables, needed to derive the score vectors, is

P(Y*, X*, Y, X, Z) = P(Y*|X*, Y, X, Z) P(X*|Y, X, Z) P(Y|X, Z) P(X|Z) P(Z).

Then, depending on whether V_i = 1 or 0 (i = 1, ..., N), the log-likelihood contribution of the ith subject is equivalent to the corresponding summand in the first or second term, respectively, of the observed-data log-likelihood from Equation (2). We denote the score vector for the ith subject based on their log-likelihood contribution by S_i(θ) = (S_i(β), S_i(η)^T)^T. We decompose η^T into (η_y*^T, η_x*^T, η_y^T, η_x^T)^T, where η_y*, η_x*, η_y, and η_x correspond to the nuisance parameters in P_{η_y*}(Y*|X*, Y, X, Z), P_{η_x*}(X*|Y, X, Z), P_{η_y}(Y|X, Z), and P_{η_x}(X|Z), respectively. The score vectors S_v(·) (for validated subjects) and S_v̄(·) (for unvalidated subjects) then follow by differentiating the corresponding log-likelihood contributions with respect to β and each component of η.

FIGURES AND TABLES
Figure 2: A screenshot of the auditDesignR Shiny application after a T = 2 iteration adaptive grid search to find the optimal design. The user selects options from the grey sidebar (e.g., Phase I stratum sizes and model parameters), followed by design-specific options like which error setting is assumed (e.g., errors in both "Outcome + exposure") and the minimum required stratum size m. There are also controls for the adaptive grid search routine, like the maximum allowable grid size (here assumed to be 25 000 candidate designs). After making their selections and clicking "Search," the user can view the top candidate designs in addition to the distribution of Var(β̂) from each iteration of the grid search.
Table 1: Simulation results under outcome and exposure misclassification with varied outcome misclassification rates. Exposure misclassification rates were fixed at FPR_0(X*) = 0.1, TPR_0(X*) = 0.9. % Bias and SE are, respectively, the empirical percent bias and standard error of the MLE. RE or RI < 1 indicates an efficiency loss compared to the optMLE. The grid search successfully located the optMLE and optMLE-2 designs in all replicates and in ≥ 95% of replicates per setting, respectively; across all settings, 162 (1.4%) problematic replicates of the optMLE-2 were discarded out of 12 000. Fewer than 1% and 5% of the replicates were discarded because of unstable estimates under the SRS, CC*, or BCC* designs when p_y0 = 0.1 and 0.9, respectively. All other entries are based on 1000 replicates.
Table 4: Simulation results when the misclassification models used in the optimal design may be misspecified. Misclassification rates were fixed at FPR(·) = 0.1, TPR(·) = 0.9. The error-prone exposure and outcome were generated from models including the interaction terms δ_1 XZ and δ_2 XZ, respectively, but the optMLE* and optMLE-2* designs assumed only main effects for these models. % Bias and SE are, respectively, the empirical bias and standard error of the MLE. RE or RI < 1 indicates an efficiency loss compared to the optMLE. The optMLE and optMLE* were located in all replicates, and the optMLE-2 and optMLE-2* designs were located in ≥ 98% and ≥ 99% of replicates per setting, respectively; 65 (0.5%) and 17 (0.1%) problematic replicates out of 13 000 were discarded for the optMLE-2 and optMLE-2*, respectively. All other entries are based on 1000 replicates.

Web Appendix A Choosing Step Sizes for the Adaptive Grid Search

As stated in the text, our adaptive grid search algorithm locates the optimal design by searching a series of grids, which are "adaptively" constructed at iteratively finer scales and over more focused candidate design spaces. The choice of the step sizes (or scales) of the grids appears inconsequential, but we detail the implementation used by auditDesignR to suggest them. Our software assumes a user-specified maximum allowable grid size, which would be dictated by the user's machine; we use 10 000 as the maximum for Sections 3-5.
We want to choose the first step size, s^{(1)}, to give the largest grid G^{(1)} whose dimension still falls below the allowed maximum. Calculating the dimension of a grid based on a potential step size, s, involves applying the "stars and bars" problem from combinatorics. Based on the audit size constraint (Equation (5) in the text), the number of "stars" to partition is equal to (n − Km)/s, the number of subjects to allocate (after the minimum m has been dispensed to each stratum) in increments of the step size s, and there are (K − 1) "bars" (i.e., partitions to form) between them. Thus, the number of rows in the first grid based on a step size of s is

rows{G^{(1)} | s} = C{(n − Km)/s + K − 1, K − 1},    (S.1)

where C{a, b} denotes the binomial coefficient "a choose b." For simplicity, we start by considering all possible values of s that share common factors. (This also ensures overlap between the neighborhoods in successive iterations such that, as a safety net, no candidate designs are "left out.") In the example from Section 2.3, where n = 400, K = 4, and m = 10, we consider possible step sizes s = {180, 90, 45, 15, 5, 1}, which lead to possible grids with rows{G^{(1)} | s} = {10, 35, 165, 2925, 67525, 7906261}, respectively, following Equation (S.1). Thus, we select s^{(1)} = 15, since it gives the largest grid among the step sizes considered that still keeps the grid smaller than 10 000 rows.
There is slightly more to consider in choosing step sizes s^{(t)} for successive iterations t > 1. We still want to cover the entire candidate design space in an efficient way. On top of that, the candidate design space has now narrowed to the s^{(t−1)}-person window around the last iteration's "best" design, so based on {n^{(t−1)}_{y*x*z}} we want to impose lower and upper bounds on the stratum sizes considered. Calculating the size of a grid based on a possible step size s, where n^{(t)}_{y*x*z} ≥ max{n^{(t−1)}_{y*x*z} − s^{(t−1)}, m} (i.e., all candidate designs are above the lower bound of the neighborhood), involves a modification of Equation (S.1):

rows{G^{(t)} | s, s^{(t−1)}, n^{(t−1)}_{y*x*z}} = C{[n − Σ_{y*,x*,z} max{n^{(t−1)}_{y*x*z} − s^{(t−1)}, m}]/s + K − 1, K − 1}.    (S.2)

Still, we need to subtract from Equation (S.2) the number of candidate designs where the stratum sizes are above the upper bound of the neighborhood. In auditDesignR we manually tabulate the number of such designs and subtract it from Equation (S.2) to calculate the expected grid size for step size s in iteration t. As before, we consider possible values of s that share common factors, but now we also focus on s < s^{(t−1)}. In the second iteration of the example from Section 2.3, we consider s = {5, 1}, which were expected to lead to grids with {134, 10296} rows, respectively. Thus we select s^{(2)} = 5, since it is the largest (and only) step size considered that keeps the grid smaller than 10 000 rows. This process is repeated until we can reasonably reach a step size of s^{(T)} = 1 while keeping the size of the grid below the maximum.
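The step-size rule above can be sketched in a few lines: compute the stars-and-bars grid size of Equation (S.1) for each candidate step and keep the smallest step (i.e., the largest grid) that fits within the user's budget. Function names are illustrative, not the auditDesignR interface.

```r
# Number of candidate designs in the first grid for step size s (Equation (S.1))
grid_rows <- function(s, n, K, m) choose((n - K * m) / s + K - 1, K - 1)

# Smallest step whose grid fits under max_rows (i.e., largest grid within budget)
pick_step <- function(steps, n, K, m, max_rows = 10000) {
  ok <- steps[(n - K * m) %% steps == 0 & grid_rows(steps, n, K, m) <= max_rows]
  min(ok)
}

grid_rows(c(180, 90, 45, 15, 5, 1), n = 400, K = 4, m = 10)
# 10  35  165  2925  67525  7906261  -- matches the first-iteration example above
pick_step(c(180, 90, 45, 15, 5, 1), n = 400, K = 4, m = 10)   # returns 15
```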
Web Appendix B.2 Outcome Misclassification Only

Here Y* was generated from a Bernoulli distribution with success probability given by a logistic model in α_0 and α_1, defined in the same way as in Section 3.2, with FPR_0(Y*) ∈ {0.1, 0.5} and TPR_0(Y*) ∈ {0.9, 0.5}. Note that without X*, the baseline false positive and true positive rates for Y* are defined with X = 0 in this setting (hence the single zero subscript). We set n = 400. Without exposure misclassification, the sampling strata for the BCC*, optMLE, and optMLE-2 designs were defined by (Y*, X). Each setting was replicated 1000 times.
Simulation results for the MLE are included in Table S4(a). The optMLE-2 design did not lose much efficiency relative to the optMLE design and typically surpassed the efficiencies of the BCC*, CC*, and SRS designs, with gains as high as 21%, 72%, and 74%, respectively. Figure S8 shows the average Phase II stratum sizes selected under each of the designs. The optimal designs favored strata with the less-frequent value of Y* (i.e., Y* = 1) in all settings where it was informative (i.e., FPR_0(Y*) ≠ 0.5 or TPR_0(Y*) ≠ 0.5). In the highest error setting, the optimal designs appeared to be similar to the BCC* design.


Figure 1: Matrix and graphical representations of a three-iteration adaptive grid search for a validation study of n = 400. In (a), the bold row indicates the design achieving the lowest Var(β̂); in (b), the triangle does. n_00 can be omitted because it is determined by constraint (5).


Figure S4: Distribution of Phase II stratum sizes n_{y*x*} under outcome and exposure misclassification. Two versions of the optMLE design were considered (requiring minimum stratum sizes of m = 10 or 50), alongside the two-wave approximate optMLE-2 design. Exposure misclassification rates were fixed at FPR_0(X*) = 0.1, TPR_0(X*) = 0.9.

Figure S6: Average Phase II stratum sizes n_{y*x*z} under outcome and exposure misclassification when an error-free binary covariate Z with A) 25% or B) 50% prevalence was used in sampling.

Figure S7: Average Phase II stratum sizes n_{y*x*z} when the optimal designs optMLE* and optMLE-2* may be derived based on misspecified misclassification mechanisms. In A), B), and C), the misclassification mechanisms for both Y* and X*, for Y* only, and for X* only, respectively, may be misspecified.

Figure S9: Average Phase II stratum sizes n_{yx*} under exposure misclassification.

Table 2: Simulation results under outcome and exposure misclassification with varied exposure misclassification rates. Outcome misclassification rates were fixed at FPR_00(Y*) = 0.1, TPR_00(Y*) = 0.9. % Bias and SE are, respectively, the empirical percent bias and standard error of the MLE. RE or RI < 1 indicates an efficiency loss compared to the optMLE. The grid search successfully located the optMLE and optMLE-2 designs in all replicates and in ≥ 99% of replicates per setting, respectively; across all settings, 10 (< 0.1%) problematic replicates of the optMLE-2 were discarded out of 8000. All other entries are based on 1000 replicates.

Table 3: Simulation results with two- or three-wave approximate optimal designs under outcome and exposure misclassification. Outcome misclassification rates were varied and exposure misclassification rates were fixed at FPR_0(X*) = 0.1, TPR_0(X*) = 0.9. % Bias and SE are, respectively, the empirical bias and standard error of the MLE. RE or RI < 1 indicates an efficiency loss compared to the optMLE. The grid search successfully located the optMLE, optMLE-2, and optMLE-3 designs in all, ≥ 99%, and ≥ 98% of replicates per setting, respectively; across all settings, 24 (0.6%) and 43 (1.1%) problematic replicates of the optMLE-2 and optMLE-3, respectively, were discarded out of 4000. All other entries are based on 1000 replicates.

Table 5: Estimates and standard errors from the analysis of the VCCC dataset under the optMLE-2, BCC*, CC*, and SRS validation designs. All results were averaged over 1000 replicates except for the SRS and optMLE-2 designs: SRS encountered 118 replicates where the MLE was unstable or did not converge because of very small numbers of audited events or exposures, while the grid search algorithm failed to locate the optMLE-2 design in 40 and 48 of the replicates, respectively, under the first and second definitions of sampling strata.

Table S2: Simulation results under outcome and exposure misclassification with an additional error-free covariate. % Bias and SE are, respectively, the empirical percent bias and standard error of the MLE. The grid search algorithm successfully located the optMLE and optMLE-2 designs in all and > 99% of replicates per setting, respectively; 16 (< 0.1%) problematic replicates out of 18 000 were discarded. All other entries are based on 1000 replicates.

Table S5: Historic TB audit results in CCASAnet. Note: No subject had both outcome and exposure misclassification.

Table S6: Parameter estimates for the TB analysis in CCASAnet using historic audits.

Table S7: Simulation results comparing the MLE and SMLE.