The Practical Significance of Measurement Error in Pulmonary Function Testing Conducted in Research Settings

Abstract Conventional spirometry produces measurement error by using repeatability criteria (RC) to discard acceptable data and terminating tests early when RC are met. These practices also implicitly assume that there is no variation across maneuvers within each test. This has implications for air pollution regulations that rely on pulmonary function tests to determine adverse effects or set standards. We perform a Monte Carlo simulation of 20,902 tests of forced expiratory volume in 1 second (FEV1), each with eight maneuvers, for an individual with empirically obtained, plausibly normal pulmonary function. Default coefficients of variation for inter‐ and intratest variability (3% and 6%, respectively) are employed. Measurement error is defined as the difference between results from the conventional protocol and an unconstrained, eight‐maneuver alternative. In the default model, average measurement error is shown to be ∼5%. The minimum difference necessary for statistical significance at p < 0.05 for a before/after comparison is shown to be 16%. Meanwhile, the U.S. Environmental Protection Agency has deemed single‐digit percentage decrements in FEV1 sufficient to justify more stringent national ambient air quality standards. Sensitivity analysis reveals that results are insensitive to intertest variability but highly sensitive to intratest variability. Halving the latter to 3% reduces measurement error by 55%. Increasing it to 9% or 12% increases measurement error by 65% or 125%, respectively. Within‐day FEV1 differences ≤5% among normal subjects are believed to be clinically insignificant. Therefore, many differences reported as statistically significant are likely to be artifactual. Reliable data are needed to estimate intratest variability for the general population, subpopulations of interest, and research samples. 
Sensitive subpopulations (e.g., chronic obstructive pulmonary disease or COPD patients, asthmatics, children) are likely to have higher intratest variability, making it more difficult to derive valid statistical inferences about differences observed after treatment or exposure.

on a chamber study of 30 healthy, exercising young adults in which a transient group mean decrement in forced expiratory volume in 1 second (FEV1) of 2.9% was observed after 6.6 hours of exposure to 60 ppb O3 (Adams, 2006a). USEPA was particularly concerned that two subjects experienced transient FEV1 decrements >10% (Brown, 2007). Transient group FEV1 decrements ß5% and transient individual FEV1 decrements >10% were deemed to be important for defining adverse effects and for setting national regulations.
Prior to the 2015 O3 NAAQS, a new chamber study of 31 healthy young adults was performed in which statistically significant transient group FEV1 decrements were observed for at least one exposure duration at 70 ppb, 80 ppb, and 87 ppb, but not at 60 ppb. In percentage terms, the largest group mean decrements were 5% (70 ppb), 7% (80 ppb), and 11% (87 ppb); some subjects experienced decrements >10% (Schelegle, Morales, Walby, Marion, & Allen, 2009). Based on this study, USEPA concluded that "the results of controlled human exposure studies strongly support setting the level of a revised O3 standard no higher than 70 ppb" (USEPA, 2015, p. 65353).
Thus, federal air pollution policy considers transient FEV1 decrements that exceed 10%, or are determined to be statistically significant regardless of magnitude, as convincing evidence of adverse health effects. It is therefore important to explore whether the test methods used to measure these decrements are appropriate and reliable for that purpose.
The ATS protocol requires that three to eight maneuvers be performed for each test. The average is unlikely to change appreciably after three maneuvers, but the maximum will increase until gains from practice (Enright, 2003) are outweighed by subject fatigue. For a specific patient in a clinical setting, this appears to be sufficient. For air pollution research, however, eight maneuvers may not be optimal; improvement has been shown after the eighth maneuver (Lehmann, Vollset, Nygaard, & Gulsvik, 2004), and up to 30 maneuvers may be required to obtain "best" performance in young children (Aurora et al., 2004).

Repeatability Criteria and Early Test Termination
Some maneuvers are discarded due to technical deficiencies; only acceptable maneuvers meeting repeatability criteria (RC) are retained and potentially reported. The stated purpose of RC is to "improve confidence in the diagnostic discrimination of the test and the confidence in which changes in lung function may be interpreted by the physician" (Enright et al., 2004, p. 236). Maneuvers are deemed "repeatable" if the differences between the highest and second-highest values of FVC and of FEV1 are within the RC. If differences exceed the RC, up to eight maneuvers are performed until the RC are met.
RC have been a feature of ATS/ERS protocols since 1979, though the choice of RC has changed. It was set at 0.10 L/sec in 1979 (ATS, 1979), retained there in 1987 after a review (ATS, 1987), widened to 0.20 L/sec in 1995 (ATS, 1995), and narrowed to 0.15 L/sec in 2005. There are no objective standards for choosing the "right" RC.
ATS/ERS guidance says within-day differences in FEV1 ≤5%, week-to-week differences ≤11%, and year-to-year differences ≤15% for normal subjects should not be interpreted as clinically meaningful; greater differences apply to chronic obstructive pulmonary disease (COPD) patients (within-day ≤13%; week-to-week ≤20%) (Pellegrino, Viegi et al., 2005). Similar interpretative guidance has been published for occupational spirometry (American College of Occupational and Environmental Physicians, 2016; Redlich et al., 2014). How these thresholds were determined is not reported, and while it is plausible that they implicitly account for intertest variability, there is no basis for inferring that they also account for intratest variability (Hnizdo et al., 2007).
Spirometric protocols also include provisions for early test termination once RC have been met (ATS, 1979, 1987, 1995). This practice results in the failure to collect readily available, acceptable data, and the retention of "maximum" values that often are not unconstrained test maxima. Even if three maneuvers are sufficient to produce a pair of values satisfying the RC, more maneuvers often produce additional such pairs, some of which may have maxima greater than the maximum of the initial qualifying pair that led to test termination. The failure to collect these data may have negligible effects in clinical practice, but it is likely to distort air pollution research studies.

Test Failure Due to Lack of Repeatability
Some research subjects cannot perform spirometric tests that are acceptable (i.e., technically valid). Clinical guidance acknowledges these problems and admonishes that the inability to perform spirometry may itself be evidence of lung impairment (Pellegrino, Viegi et al., 2005). Preexisting respiratory conditions (e.g., asthma, chronic obstructive pulmonary disease (COPD)) increase the proportion of subjects who fail (Pellegrino, Decramer et al., 2005).
Subjects may produce acceptable maneuvers but not be able to produce repeatable ones. A retrospective study of 18,000 spirometric tests conducted at the Mayo Clinic indicated that about 95% of patients could produce repeatable maneuvers with RC = 0.20 L/sec, the 1995 criterion. The authors wanted the RC tightened so that only 90% would pass. They acknowledged but dismissed reduced performance observed among subjects who were short, female, or had worse baseline lung function (Enright et al., 2004).
In a large sample of Norwegians, believed to be representative, 12.7% of females and 7.7% of males failed to meet the 1987 RC = 0.10 L/sec, and 6.8% of females and 7.1% of males failed under the less demanding 1995 RC = 0.20 L/sec (Langhammer, Johnsen, Gulsvik, Holmen, & Bjermer, 2001). Indeed, it was the disparate effect of the 1987 RC that led to its relaxation in 1995 (ATS, 1995).
A comparison across 14 sites worldwide, each with approximately 600 adult subjects aged ≥40, following the tighter 2005 RC and using identical spirometers with centralized examiner training, had approximately 10% failure rates, but higher failure rates for older subjects (Enright et al., 2011). Similar age-dependent failure rates in baseline performance have been observed elsewhere (Kainu, Lindqvist, Sarna, Lundbäck et al., 2008; Lehmann et al., 2004). Whether the test failure rate is 5% or 10% in a representative sample, however, the absence of data from these research subjects creates interpretative difficulties. It imparts a form of nonresponse bias for which no statistical adjustment offers a remedy.

Managing Missing Data Due to Failure to Satisfy RC
The same clinical guidance that calls for the application of RC with early test termination also recommends against discarding valid data (ATS, 1979, 1987, 1995). Other clinical guidance favors the collection of more rather than fewer data and the deletion of no data, and criticizes device manufacturers that do not store data from all maneuvers: "[I]t can be tempting to discard any apparently discordant results during data collection before having the chance to inspect them more carefully. This runs the risk of retaining data that are 'reproducibly wrong' while discarding physiologically valid results!" (Stocks et al., 2014, p. 173).
ATS/ERS guidance is unclear concerning what the examiner is to do if a repeatable pair cannot be obtained. Examiners are advised that testing should end if "[a] total of eight tests [sic; should be "maneuvers"] have been performed (optional) or [t]he patient/subject cannot or should not continue" (Miller, Hankinson et al., 2005, p. 325). However, ATS/ERS also advises that "[n]o spirogram or test result should be rejected solely on the basis of its poor repeatability. The repeatability of results should be considered at the time of interpretation. The use of data from manoeuvres with poor repeatability or failure to meet the [end of test] requirements is left to the discretion of the interpreter" (Miller, Hankinson et al., 2005, p. 326).
In air pollution research studies, the inherent ambiguity in this guideline may lead to inconsistent data collection and reporting. The examiner could (1) discard subjects who cannot produce a highest and second-highest FVC and FEV 1 within the bounds of the applicable repeatability criterion; (2) assign the highest values obtained irrespective of whether the repeatability criterion is met; or (3) exercise discretion in some other manner to choose which values to assign to the test. The effects of these alternative approaches to missing data are potentially very different, and they are generally not reproducible.

WITHIN-PERSON TEST VARIABILITY
Variation in spirometry is expected due to age, sex, height, and other factors (ATS, 1979, 1987, 1991, 1995; Hnizdo, Glindmeyer, & Petsonk, 2010). For healthy adult never-smokers, performance generally peaks in one's late 20s and declines at a rate of 20-30 mL/year (Hnizdo et al., 2010). Inter- and intratest variation can be represented by the coefficient of variation, CV t m , where t indexes tests and m indexes maneuvers within test t. This can be separated into the two components, CV t and CV m . CV t can be accounted for using default adjustments (Miller, Crapo et al., 2005) or statistical models (Redlich et al., 2014). It appears that CV m is implicitly acknowledged nearly everywhere (ATS, 1979, 1987, 1995; Miller, Crapo et al., 2005) but explicitly accounted for nowhere. In practice, the results of multiple tests and maneuvers are summarized by fixed test values, thus implicitly assuming that CV t m (and its components CV t and CV m ) equals zero.
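As a concrete reading of this notation, the following minimal sketch estimates the two components from one subject's repeated tests. It rests on assumptions that are not stated in the source: every test records the same number of maneuvers, CV t is taken as the standard deviation of the per-test means divided by the grand mean, and CV m is taken as the pooled within-test standard deviation divided by the grand mean. The function name `cv_components` and the sample values are hypothetical.

```python
import statistics


def cv_components(tests):
    """Estimate (cv_t, cv_m) as fractions for one subject.

    tests: list of tests; each test is a list of FEV1 values (L),
    one per maneuver, with the same number of maneuvers in every test.
    """
    test_means = [statistics.mean(maneuvers) for maneuvers in tests]
    grand_mean = statistics.mean(test_means)
    # Intertest component: spread of the per-test means.
    # Note: this naive estimate still contains a share of intratest
    # noise (sigma_m**2 / n per test) that is not removed here.
    sd_t = statistics.stdev(test_means)
    # Intratest component: pooled within-test variance (equal test sizes).
    pooled_var = statistics.mean([statistics.variance(m) for m in tests])
    sd_m = pooled_var ** 0.5
    return sd_t / grand_mean, sd_m / grand_mean


# Hypothetical data: two tests of three maneuvers each.
example = [[3.5, 3.6, 3.7], [3.4, 3.5, 3.6]]
cv_t, cv_m = cv_components(example)
print(f"CV_t = {cv_t:.1%}, CV_m = {cv_m:.1%}")
```

Reporting one fixed value per test amounts to keeping only the per-test summary and discarding the information needed to compute the second component.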

Within-Person Intertest Variability, CV t
Within-person differences observed across multiple, identically conducted tests are more likely to be meaningful than simple before/after comparisons. When only two tests are performed, "large variability necessitates relatively large changes to be confident that a significant change has in fact occurred" (Pellegrino, Viegi et al., 2005, p. 962). The 13% threshold below which within-day differences are believed not to be significant for COPD patients (Pellegrino, Viegi et al., 2005) has been estimated to imply CV t ≈ 6%, with lesser percentages (e.g., 3% and 4%) assumed but not verified to apply to normal subjects (Hnizdo et al., 2010).
Values for CV t > 6% have been obtained in population-representative samples. For example, CV t was estimated at 13% and 12% for men and women, respectively, in a large random sample of asymptomatic Norwegian never-smokers aged ≥20, using the 1987 RC (0.10 L/sec). A separate examination of the 19 nurses and technicians who performed the tests yielded a sample mean CV t = 4% (Langhammer et al., 2001). The reason for these differences is not explained, but might be due to greater homogeneity among nurses and technicians, their expertise in spirometry, or both.
In a project intended to identify the index of lung function with the highest signal-to-noise ratio (i.e., the highest ratio of between- to within-subject variance), researchers reported mean CV t of 2.7% for FVC and 3.3% for FEV1. These estimates were within the range of values reported in previous studies (FVC: 1.8−4.9%; FEV1: 2.3−4.7%), but all were unrepresentative small samples, making generalizations inappropriate (Künzli, Ackermann-Liebrich, Keller, Perruchoud, & Schindler, 1995). CV t values for children also have been reported (Beydon et al., 2007a), but their relevance to adults is unclear, the methods used to obtain them are different, and a larger fraction of children tends to fail spirometric testing (Loeb et al., 2008).

Within-Person Intratest Variability, CV m
In every spirometric study we have reviewed, there appears to be an implicit assumption that CV m = 0. Large-scale epidemiological studies such as NHANES (2008,2008,2011) also implicitly assume CV m = 0 because fixed values are reported for each test. CV m was calculated for each subject in a randomized sample of 648 Finns aged 25−75 (M: 248, F: 355), 603 (93%) of whom met the 2005 RC (0.15 L/sec). Across the sample, mean CV m for FEV1 was 1.4% (95% CI = 1.36−1.51). The distribution of subject-specific CV m values was not reported. The retrospective study of Mayo Clinic spirometry data reported mean CV m for FEV1 ranging from 2.65% to 3.35% among males and from 1.9% to 4.1% among females. In both cases, estimated CV m was downwardly biased by the apparent exclusion of subjects who could not meet the RC. Unacceptable and nonrepeatable maneuvers were more frequent among older subjects and those with diminished health status, and CV m increased with smaller physical stature (Enright et al., 2004).

Estimating Intratest Variability in the U.S. Population from NHANES Data Exclusion Rates
The proportion of acceptable maneuvers in the U.S. population excluded due to the RC can be inferred from NHANES (2011). Table I reports the number of maneuvers performed across the sample. If a minimum of three maneuvers is performed, the number of maneuvers should be the same for the first through third maneuvers. For unexplained reasons, NHANES reports more third than second maneuvers and more second than first maneuvers. The numbers of fourth through eighth maneuvers indicate how many subjects did not meet the RC in maneuvers three to seven, respectively. Three maneuvers of valid data were obtained from approximately 7,200 subjects. However, the RC necessitated that a fourth maneuver be performed for 5,035 subjects, 70% of the sample. From maneuver four to maneuver eight, the fraction of subjects requiring an additional maneuver ranged from 57% to 71%. Table II shows the data exclusion fractions obtained using the simulation tool described in Section 3 for RC = 0.15 L/sec and CV m values ranging from 1% to 10%. The CV m value that best fits the NHANES data exclusion fractions is about 6%. The NHANES collection was a probability sample, so 6% appears to approximate the average CV m for the U.S. population.

SIMULATION
To gain insight concerning the effects of RC, CV t , and CV m , a Monte Carlo analysis was performed for a single subject with defined age, height, and normal pulmonary function. The simulation model assumes that each test, and each maneuver within each test, is statistically independent and earned an A grade. Relaxing these assumptions would only increase inter- and intratest variability and strengthen the results reported in Section 3.3. Table III shows default simulation model parameters. Normal pulmonary function was obtained from a specific reference equation (Brändli, Schindler, Künzli, Keller, & Perruchoud, 1996). A CV t of 3% (σ t = 0.11 L/sec) was obtained from Künzli et al. (2000), and CV m is assumed to be 6% (σ m = 0.21 L/sec) based on the population value derived from NHANES (2008). To ensure high statistical power (90%) and a low nominal a priori Type I error rate (5%), 20,902 tests were performed (Robey & Barcikowski, 1992, Table I). For each independent test, eight independent maneuvers were performed using each test's simulated mean FEV1 and an intratest standard deviation of 0.21 (i.e., CV m = 6%).
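The two-level sampling described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' simulation tool. The mean FEV1 of 3.6 L is an assumption: Table III is not reproduced here, and 3.6 L is roughly the value implied by σ t = 0.11 at CV t = 3% and σ m = 0.21 at CV m = 6%. The function name, constants, and seed are hypothetical.

```python
import random

MEAN_FEV1 = 3.6    # L; ASSUMED, roughly implied by the stated CVs and SDs
SD_T = 0.11        # intertest SD from the text (CV_t = 3%)
SD_M = 0.21        # intratest SD from the text (CV_m = 6%)
N_TESTS = 20_902   # number of simulated tests from the text
N_MANEUVERS = 8    # maneuvers per test from the text


def simulate_tests(n_tests=N_TESTS, n_maneuvers=N_MANEUVERS,
                   mean=MEAN_FEV1, sd_t=SD_T, sd_m=SD_M, seed=0):
    """Return a list of tests, each a list of simulated FEV1 maneuvers.

    Each test mean is drawn from Normal(mean, sd_t); each maneuver is
    then drawn independently from Normal(test mean, sd_m).
    """
    rng = random.Random(seed)
    tests = []
    for _ in range(n_tests):
        test_mean = rng.gauss(mean, sd_t)
        tests.append([rng.gauss(test_mean, sd_m) for _ in range(n_maneuvers)])
    return tests


tests = simulate_tests()
print(len(tests), "tests of", len(tests[0]), "maneuvers each")
```

Independence and normality at both levels are the model's stated assumptions; learning and fatigue effects across maneuvers are not represented.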

Default Model
This procedure assures that intratest variability is accounted for without affecting the simulated value obtained for each test. Maneuvers are not in fact independent, though it is unclear how to model their dependence. Subjects' performance may improve during early maneuvers due to learning and decline in later maneuvers due to fatigue. Thus, if maximum performance is the desired goal, there is an optimal number of maneuvers. But this optimum is unknown, and it is likely to vary across subjects and over time within subjects. To replicate the ATS/ERS protocol, the first three simulated maneuvers from each test were examined to determine whether the highest and second-highest FEV1 differed by ≤0.15 L/sec. If such a pair was found, the highest value was deemed the maximum, it was recorded as the fixed representation for that test, and the test was presumed to have been terminated. If no qualifying pair was found, the fourth maneuver was compared to the maximum of the first three maneuvers. If the difference between the fourth maneuver and the maximum of the first three maneuvers was ≤0.15 L/sec, the greater of the two values was deemed to be the test maximum and the test was terminated. This procedure was conducted iteratively for up to eight maneuvers to obtain the ATS/ERS protocol maximum.
In the alternative model, tests were not terminated when the RC were met. The highest value across all eight maneuvers was deemed to be the maximum for each test. The difference between the unrestricted maximum for each test and the deemed ATS/ERS maximum equals the magnitude of measurement error for each test.
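The termination rule and the resulting measurement error can be sketched as below. This is an illustrative reading of the procedure, not the authors' code. Once the first k−1 maneuvers have failed to yield a qualifying pair, comparing maneuver k with the running maximum is equivalent to checking the gap between the two highest values among the first k maneuvers, which is what the loop does. The function names are hypothetical, and returning `None` for a test with no repeatable pair leaves the handling of such tests to the caller.

```python
def protocol_max(maneuvers, rc=0.15, min_maneuvers=3):
    """Return the maximum recorded under early termination, or None.

    maneuvers: FEV1 values (L) in the order performed.
    The test stops at the first k >= min_maneuvers for which the two
    highest values among the first k maneuvers differ by <= rc.
    """
    for k in range(min_maneuvers, len(maneuvers) + 1):
        highest, second = sorted(maneuvers[:k], reverse=True)[:2]
        if highest - second <= rc:
            return highest
    return None  # no repeatable pair within the available maneuvers


def measurement_error(maneuvers, rc=0.15):
    """Unrestricted maximum minus protocol maximum (None if no pair)."""
    kept = protocol_max(maneuvers, rc)
    return None if kept is None else max(maneuvers) - kept


# Hypothetical test: the first three maneuvers already contain a
# qualifying pair (3.6 and 3.5), so the later 3.9 is never collected.
example = [3.5, 3.6, 3.0, 3.9, 3.7, 3.4, 3.8, 3.6]
print(protocol_max(example), measurement_error(example))
```

In this example the protocol records 3.6 L while the unrestricted eight-maneuver maximum is 3.9 L, so the error for this single hypothetical test is about 0.3 L.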

Managing Reproducibility Failure
Discarding valid results that do not satisfy the RC reduces the apparent intratest variability by excising the tails of the distribution. The more stringent the RC, the larger will be the tails excised and the degree to which intratest variability is understated. How much understatement occurs depends on the maneuver at which the RC are met and testing terminates. This is shown in Fig. 1. If only three maneuvers are performed, the number of simulated tests that yield no repeatable pair is about 20% at CV m = 3%, 50% at CV m = 6%, and 65% at CV m = 9%. Unless CV m is very low, even the full complement of eight maneuvers may not be enough to produce sufficient data to ensure that the highest feasible maximum is obtained and intratest variability is not materially understated.
As noted in Section 2.3, a choice must be made with respect to the management of simulated tests that fail to produce repeatable pairs and thus cannot be modeled using a strict application of the ATS/ERS protocol. We interpreted the ATS/ERS protocol to require these tests be discarded (option 1 in Section 2.3). Expressed another way, ATS/ERS assumes that subjects who produce maneuvers satisfying the RC are no different from subjects who cannot, an assumption that is likely to be incorrect and to artifactually reduce the significance of CV m .
Our approach likely departs from typical practice in chamber studies and observational epidemiology. In no study in either genre that we have examined have we found subjects excluded for failure to satisfy RC. This means researchers employed option 2 or 3 from Section 2.3, and embedded measurement error may be impossible to estimate. At CV m = 6%, there is a 15% probability that no repeatable pair will be obtained even after eight maneuvers (see Fig. 1); much larger fractions will be excluded if researchers conduct only three to four maneuvers. This practice, which is clearly undesirable, nevertheless may be necessary in a research design requiring hourly tests (Adams, 2006a, 2006b; Schelegle et al., 2009). The burden of performing eight maneuvers during the last 10 minutes of each hour may be greater than even young, healthy, athletic subjects can tolerate.

Default Model Results
We compare results from the ATS/ERS protocol terminated after three maneuvers with an unrestricted eight-maneuver model. This comparison maximizes the magnitude of measurement error, but it appears to approximate actual practice in research settings most closely. Fig. 2 compares the cumulative distribution functions for FEV1 under the ATS/ERS protocol and the unrestricted eight-maneuver alternative for RC values ranging from 0.1 to 0.2 L/sec. The horizontal difference is measurement error at each point. It is visually apparent that the cause of measurement error is less the choice of RC than the practice of terminating testing once any RC are satisfied.
Measurement error can be characterized in L/sec or percentage of baseline. This is shown in Figs. 3(a) (L/sec) and (b) (percentage) for the range of RC values considered. Mean measurement error ranges from about 0.15 L/sec (at RC = 0.10 L/sec) to about 0.20 L/sec (at RC = 0.20 L/sec), with the difference rising across the simulated distribution. In percentage terms, however, measurement error can exceed 7% and is never less than 3%.
These magnitudes are neither small nor unimportant in air pollution research. They are the same as or greater than within-day differences for normal subjects that are interpreted by clinicians as not meaningful (Pellegrino, Viegi et al., 2005). They are also considerably greater than reported differences in spirometric performance attributable to test-subject posture (0.04−0.07 L/sec), which ATS occupational guidance deems a confounding factor large enough to warrant preventive control (Redlich et al., 2014).

Minimum Differences Required to Infer that FEV1 Pairs Come from Different Distributions
Conventional practice treats each test result as fixed (i.e., the intertest standard deviation, σ t , equals zero), so all differences across tests, no matter their magnitude, are treated as presumptively meaningful. Accounting for inter- and intratest variability requires differences to be examined statistically. For example, taking only intertest variability into account, the default model assumes σ t = 0.11 (derived from CV t = 3%). Fig. 4 shows that any pair of FEV1 values must differ by ≥0.4 L/sec (11%) to infer at p ≤ 0.10 that they come from different distributions (e.g., before and after exposure). When both inter- and intratest variability are accounted for, this difference must be ≥0.6 L/sec (16%). This is shown in Fig. 5, which juxtaposes on the same scale the pre- and postexposure distributions necessary for (1) the postexposure FEV1 to be below the 10th percentile of the preexposure distribution, and (2) preexposure FEV1 to be above the 90th percentile of the postexposure distribution. The gap between these FEV1 values, 0.6 L/sec, is the minimum difference between pre- and postexposure mean FEV1 for differences to be statistically significant at p < 0.10. (Stipulating that FEV1 is reasonably expected to decline after exposure, this is equivalent to a one-tailed test at p < 0.05.)

Sensitivity Analysis
The simulation model allows results to be calculated using alternative values for subject characteristics such as sex, height, and age; the ATS/ERS protocol attribute RC; and measures of within-person inter- and intratest variability, CV t and CV m . Subject characteristics matter because the RC, a constant, is a rising fraction of FEV1 for persons with lower FEV1 due to age or short stature. For these persons, a larger fraction of valid maneuvers will fail to satisfy the RC. However, the higher rejection rate is counteracted by the ATS/ERS requirement to conduct additional maneuvers, which, ceteris paribus, results in higher maximum test values. If researchers strictly follow the ATS/ERS guidelines and collect up to eight maneuvers, the disproportionate effect of the fixed RC on subjects whose normal pulmonary function is below average will be attenuated. However, they will still have a substantial fraction of subjects for whom there is no acceptable pair of maneuvers, as shown in Table II, and no objective path to resolution.
Sensitivity analysis of intertest variability shows that it has a minimal effect regardless of model. However, differences in intratest variability have substantial effects. These differences are summarized in Table IV across the range of ATS/ERS RC values for the default CV m (6%) and two alternatives on either side (0% and 3% below; 9% and 12% above). Halving the default CV m reduces measurement error by about 55%. Increasing the default CV m by half increases measurement error by about 65%, and doubling the default CV m increases measurement error by about 125%. CV m = 0% corresponds to the ATS/ERS protocol, which by assuming no intratest variability implies no measurement error.
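The arithmetic behind the fixed RC as a share of baseline can be illustrated briefly. The FEV1 values below are hypothetical and chosen only to span a plausible adult range; the function name is likewise illustrative.

```python
RC = 0.15  # the 2005 repeatability criterion cited in the text


def rc_fraction(fev1, rc=RC):
    """Return the fixed RC expressed as a fraction of a subject's FEV1 (L)."""
    return rc / fev1


# Hypothetical baseline FEV1 values (L), from larger to smaller.
for fev1 in (4.5, 3.6, 2.5, 1.5):
    print(f"FEV1 = {fev1:.1f} L: RC is {rc_fraction(fev1):.1%} of baseline")
```

The same absolute criterion therefore corresponds to a different relative tolerance for each subject, which is why subject characteristics enter the sensitivity analysis.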
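A sensitivity run over CV m can be sketched as follows. This is a simplified illustration, not the authors' tool, and its output should not be expected to match Table IV. It assumes a mean FEV1 of 3.6 L (not given in the text), RC = 0.15 L/sec, up to eight maneuvers per test, and the discarding of tests with no repeatable pair (option 1 in Section 2.3). All names and the seed are hypothetical.

```python
import random
import statistics

MEAN_FEV1 = 3.6  # L; ASSUMED baseline for the simulated subject
SD_T = 0.11      # intertest SD from the text
RC = 0.15        # repeatability criterion


def protocol_max(maneuvers, rc=RC):
    """Maximum recorded under early termination, or None if no pair."""
    for k in range(3, len(maneuvers) + 1):
        highest, second = sorted(maneuvers[:k], reverse=True)[:2]
        if highest - second <= rc:
            return highest
    return None


def run(cv_m, n_tests=5000, seed=0):
    """Return (mean measurement error in L, fraction of failed tests)."""
    rng = random.Random(seed)
    sd_m = cv_m * MEAN_FEV1
    errors, failures = [], 0
    for _ in range(n_tests):
        test_mean = rng.gauss(MEAN_FEV1, SD_T)
        maneuvers = [rng.gauss(test_mean, sd_m) for _ in range(8)]
        kept = protocol_max(maneuvers)
        if kept is None:
            failures += 1          # discarded, per option 1
        else:
            errors.append(max(maneuvers) - kept)
    return statistics.mean(errors), failures / n_tests


results = {cv: run(cv) for cv in (0.0, 0.03, 0.06, 0.09, 0.12)}
for cv, (err, fail) in results.items():
    print(f"CV_m = {cv:.0%}: mean error = {err:.3f} L, failed = {fail:.1%}")
```

With CV m = 0 every maneuver equals the test mean, so the sketch returns zero error and no failures, matching the point that assuming no intratest variability implies no measurement error.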

DISCUSSION
RC discard some signals as if they were noise, and early test termination prevents the collection of potentially important signals. When inter- and intratest variabilities are assumed not to exist, all calculated pulmonary function changes are implicitly assumed to be real, not test protocol artifacts. This leads to unsupportable inferences about the statistical significance of observed differences. Measurement error alone could easily be greater than the differences observed. Additional problems arise if tests are conducted only before and after exposure because intertest variability will not be accounted for. Guidelines recommend against drawing inferences from just two tests: "It is more likely that a real change has occurred when more than two measurements [i.e., tests] are performed over time" (Pellegrino, Viegi et al., 2005, p. 961, emphasis added). For the default comparison, any pair of test values must differ by more than 0.57 L/sec (16%) to be able to infer at p ≤ 0.05 that they are not drawn from the same distribution. A reasonable rule of thumb may be to refrain from interpreting as statistically meaningful any observed difference less than this amount unless and until inter- and intratest variability have been taken into account, both in data collection and statistical analysis.

Strengths and Limitations
Our analysis has several key strengths. First, we rely on widely accepted, peer-reviewed studies of normal pulmonary function for all model parameters except for CV m , for which the available literature is limited. Second, we infer a default value for intratest variability from NHANES, the "gold standard" for empirical data about the U.S. population. This inference is based on an examination of NHANES' data exclusion rates, recognizing the similarity between the NHANES and ATS/ERS protocols. Third, our Monte Carlo model imposes no additional assumptions besides normality across and within tests for a single person. These assumptions can be modified to conduct unlimited sensitivity analyses.
Our analysis has many of the same limitations that affect most research in this field. Spirometry has other sources of inter- and intratest variability and potential bias, few of which typically are adequately controlled. Inter- and intratest variability can arise from technician quality (all technicians cannot be above average, much less superior), differences in spirometric devices (precision and accuracy vary), data entry, subject-device interactions, test settings, seasonal and diurnal effects, time periods between tests, and confounding effects. Indeed, the ATS/ERS protocols are commendable for including numerous elements intended to reduce the influence of confounders (ATS, 1979, 1987, 1995; Beydon et al., 2007a; Stocks et al., 2014).
Our simulation model has a related limitation insofar as it does not account for improvements in subject performance across maneuvers due to learning or decrements in subject performance across maneuvers due to fatigue. We are unable to capture these effects because no data appear to be publicly available. They are affected by coaching, the quality of which is variable and difficult to measure. It is intuitively reasonable to expect there is an optimal number of maneuvers where the gains from practice equal the losses from fatigue. But the optimum would vary across subjects, coaches, and other factors that cannot be easily modeled.
Intratest variability poses additional challenges. It may vary across subjects due to a host of factors. The period between maneuvers (not just tests) may matter, and the optimal spacing of maneuvers is both unknown and likely to vary across subjects. Finally, biological instability may arise between maneuvers insofar as testing induces rapid changes in lung volume that affect airway properties (Beydon et al., 2007a).
Our results assume that within-person FEV1 is approximately normally distributed across both simulated independent tests and simulated independent maneuvers within each test. Results would differ with other distributional forms. We are aware of no empirical evidence supporting normality or any alternative distributional form. Normality across maneuvers might be refuted and could be informed by better intratest data collection, but we are aware of no way to theoretically inform the choice of the intratest distribution. At this stage of knowledge, it is more important to be transparent about the choice of distribution and cognizant of its potential significance. The effects of that choice cannot be quantified, however, as the number of alternative assumptions is infinitely large.

Practical Recommendations
The failure to account for intratest variability is a material limitation of conventional spirometry in research settings. There appears to have been no systematic effort to collect sufficient data to estimate intratest variability, whether for the population, research samples, or subpopulations of interest. All spirometric protocols recognize that intratest variability is important; hence, the universal guidance to conduct multiple maneuvers. But this recognition is abandoned in practice by terminating tests early, thus failing to collect needed data, and discarding all but a single fixed value to represent each test. The result is measurement error and bias.
Measurement error has pernicious effects on research intended to make causal inferences about small changes after treatment or exposure. A constructive path forward is to collect enough maneuver data to estimate CV m for the general population (e.g., NHANES), subpopulations presumed to be at greater risk (e.g., COPD patients, asthmatics, children), and any convenience sample (e.g., chamber study volunteers). Wherever possible, sample-specific CV m should be estimated and tested against these reference values to ensure that inferences about the significance of observed changes are statistically valid.