An analysis of 2‐day cardiopulmonary exercise testing to assess unexplained fatigue

Abstract Two consecutive maximal cardiopulmonary exercise tests (CPETs) performed 24 hr apart (2‐day CPET protocol) are increasingly used to evaluate post‐exertional malaise (PEM) and related disability among individuals with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS). This protocol may extend to other fatiguing illnesses with similar characteristics to ME/CFS; however, 2‐day CPET protocol reliability and minimum change required to be considered clinically meaningful (i.e., exceeding the standard error of the measure) are not well characterized. To address this gap, we evaluated the 2‐day CPET protocol in Gulf War Illness (GWI) by quantifying repeatability of seven CPET parameters, establishing their thresholds of clinically significant change, and determining whether changes differed between veterans with GWI and controls. Excluding those not attaining peak effort criteria (n = 12), we calculated intraclass correlation coefficients (ICCs), the smallest real difference (SRD%), and repeated measures analysis of variance (RM‐ANOVA) at the ventilatory anaerobic threshold (VAT) and peak exercise in 15 veterans with GWI and eight controls. ICC values at peak ranged from moderate to excellent for veterans with GWI (mean [range]; 0.84 [0.65 – 0.92]) and were reduced at the VAT (0.68 [0.37 – 0.78]). Across CPET variables, the SRD% at peak exercise for veterans with GWI (18.8 [8.8 – 28.8]) was generally lower than at the VAT (28.1 [9.5 – 34.8]). RM‐ANOVAs did not detect any significant group‐by‐time interactions (all p > .05). The methods and findings reported here provide a framework for evaluating 2‐day CPET reliability, and reinforce the importance of carefully considering measurement error in the population of interest when interpreting findings.


| INTRODUCTION
Post-exertional malaise (PEM) is defined as a worsening of symptoms following physical or mental activity and is a hallmark symptom of myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) (Clayton, 2015). In addition to symptom exacerbation (e.g., fatigue, pain, and sleep disturbance), PEM is conceptualized as a notable decrement in functional capacity beyond what is expected in a healthy individual (Clayton, 2015). As a possible objective way to identify PEM, several investigators have promoted using a repeated-bout cardiopulmonary exercise test (CPET) protocol whereby two maximal effort CPETs are performed 24 hr apart, referred to as a 2-day CPET protocol (Stevens, Snell, Stevens, Keller, & VanNess, 2018). Findings from these studies may generally be summarized as follows: (a) between-group differences (i.e., ME/CFS and controls) are only observed during the second CPET and not the first, and (b) within-group differences (i.e., first vs. second CPET performance) are only observed for patients with ME/CFS. Therefore, Stevens and colleagues (Stevens et al., 2018) suggest that the magnitude of CPET performance decrement between bouts can serve as an objective diagnostic marker for patients with ME/CFS, provide a better understanding of the underlying pathophysiology, quantify the degree of PEM-induced disability impairment rating, as well as inform therapy and illness progression.
Recent studies have similarly adopted the 2-day CPET protocol for the study of patients with multiple sclerosis (Hodges, Nielsen, & Baken, 2018) and sarcoidosis (Braam et al., 2013). The 2-day CPET protocol has thus been employed in several clinical populations characterized by fatigue and PEM (Bouquet et al., 2019; Braam et al., 2013; Campen, Rowe, & Visser, 2020; Hodges et al., 2018; Keller, Pryor, & Giloteaux, 2014; Lien et al., 2019; Nelson et al., 2019; Snell, Stevens, Davenport, & Van Ness, 2013; Vanness, Snell, & Stevens, 2007; Vermeulen, Kurk, Visser, Sluiter, & Scholte, 2010). A major assumption for the 2-day CPET protocol is that measurements are both stable over time and sensitive to change. High test-retest reliability has been reported in healthy and chronic disease populations, as reviewed elsewhere (Balady et al., 2010); however, sensitivity to change has been questioned, particularly among individuals with fatiguing illness (Heine, van den Akker, Verschuren, Visser-Meily, & Kwakkel, 2015).
TABLE 1 Percentage of participants meeting individual criteria for valid peak effort on both visits (valid peak) and for the entire sample (full sample). Note: HR, heart rate; RER, respiratory exchange ratio; RPE, rating of perceived exertion; V̇O2, oxygen consumption. Full sample: includes participants who did not meet valid peak effort criteria on one or both exercise tests. Valid peak: the restricted sample of participants who met valid peak effort criteria on both exercise tests. Rating of perceived exertion (RPE) was not used to judge whether participants provided a valid peak effort, but is included to provide further context to these results.
TABLE 2 Comparison of mean (SD) of participant characteristics between veterans with Gulf War Illness (GWI+) and controls (GWI-) who met criteria for valid peak effort for both exercise tests
Reliability of CPET and its responsiveness to change have not been thoroughly evaluated among individuals who may experience PEM (i.e., ME/CFS and multiple sclerosis), but would yield considerable insight into the utility of the 2-day CPET protocol and inform its interpretation. Approximately 25%-32% of military veterans of Operations Desert Storm and Shield (1990-1991) are afflicted with a chronic multisymptom illness referred to as Gulf War Illness (GWI) (White et al., 2016), with persistent fatigue being a cardinal symptom. GWI and ME/CFS have several overlapping characteristics, including the lack of objective indicators, similar symptom profiles, and exercise-induced symptom exacerbation (Lindheimer et al., 2020). Work from our laboratory and others has employed exercise testing in veterans with GWI as a stressor to elucidate underlying mechanisms of this illness (Broderick et al., 2011; Cook et al., 2003; Cook, Stegner, & Ellingson, 2010; Lindheimer et al., 2019; Rayhan, Stevens, et al., 2013; Smylie et al., 2013; Whistler et al., 2009). However, these studies have focused on a single exercise test and the 2-day CPET protocol has not previously been studied in GWI. Therefore, the present study aims to address this literature gap by quantifying the repeatability of directly measured CPET parameters and establishing their thresholds of change in a clinical population with fatiguing illness (i.e., veterans with GWI) and controls. Parts of these results have previously been reported (Chen et al., 2017; Lindheimer et al., 2019).

| Participants
A total of 35 individuals provided their written informed consent to participate in this study, including 13 controls (GWI-) and 22 veterans (GWI+) who met both the Centers for Disease Control and Prevention (Fukuda et al., 1994) and Kansas case definitions for Gulf War Illness (Steele, 2000). In brief, case status requires deployment to the Gulf War theater of operations between August 8, 1990 and July 31, 1991 and endorsement of moderate-to-severe chronic symptoms in three or more of the following domains: fatigue, pain, neurological/cognitive/mood, skin, gastrointestinal, and respiratory. Symptom onset must have occurred during or after deployment and independent of comorbid conditions (i.e., diabetes, heart disease, stroke, lupus, multiple sclerosis, cancer, etc.). Control participants consisted of both nondeployed, otherwise healthy Gulf War veterans and nonmilitary civilians. Participants from either group were excluded from the study if they had any of the following: 1) absolute contraindications to exercise (American College of Sports Medicine, 2013; Fletcher et al., 2013), 2) organ failure, 3) chronic infections (e.g., HIV/AIDS, hepatitis B or C), 4) major neurologic diseases, 5) diseases requiring systemic treatment (e.g., systemic chemotherapy, radiation of brain, thorax, abdomen, or pelvis), 6) major endocrine diseases, 7) history of myocardial infarction, heart failure, or heart disease, or 8) morbid obesity (body mass index >40). All exclusions were verified in the electronic health record in cases where self-report was uncertain. Physical activity was assessed using the short-form International Physical Activity Questionnaire (IPAQ) to derive metabolic equivalent minutes per week (Craig et al., 2003). Fatigue severity and its impact on quality of life were assessed via the 9-item Fatigue Severity Scale (FSS) with scores >36 constituting clinical fatigue (Krupp, LaRocca, Muir-Nash, & Steinberg, 1989).
Lastly, physical health-related functioning was quantified from the veterans version of the Short Form 36 Health Survey (VR-36), where a score of 50 is considered the U.S. average (Kazis, 2000; Kazis, Skinner, Ren, & Perlin, 1999; Ware et al., 2007). Study procedures were reviewed and approved by the VA New Jersey Health Care System Institutional Review Board (#01251). All participants provided informed consent according to the Declaration of Helsinki prior to testing.
FIGURE 1 Mean (SD) CPET parameters for GWI+ and GWI- at peak exercise restricted to participants who met valid peak effort criteria for both exercise tests. Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume; WR, work rate

| Cardiopulmonary exercise testing
Participants performed two consecutive maximal effort CPETs, approximately 24 hr apart (±2 hr), on a cycle ergometer (Ergoline, Ergoselect 200) using a ramp protocol (15 W·min⁻¹; 50-70 rpm) until volitional exhaustion. A clinical exercise physiologist supervised all CPETs and ensured participant safety. Heart rate and rhythm (Cosmed T12x; Rome, Italy) and oxygen saturation were monitored continuously. Blood pressure was manually auscultated approximately every 2 min during exercise and into recovery. Perceived exertion (Rating of Perceived Exertion; 6-20 scale) and breathlessness (Borg Breathlessness Scale; 0-10 scale) were measured each minute throughout exercise and at 2, 5, and 10 min of recovery. Pulmonary gas exchange and ventilation were measured breath-by-breath using an oronasal mask (V2 Series; Hans Rudolph, Shawnee, KS) connected to a metabolic cart (Cosmed Quark CPET; Rome, Italy). Testing was terminated when participants met maximal effort criteria or when they were no longer able to maintain pedaling frequency despite verbal encouragement. We defined valid effort as meeting two or more of the following criteria: 1) peak respiratory exchange ratio (RER) ≥ 1.1, 2) peak heart rate ≥ 85% of age-predicted maximum, and/or 3) a plateau in the rate of oxygen consumption (V̇O2), that is, a change of < 2.1 ml·kg⁻¹·min⁻¹ over the last minute (Taylor, Buskirk, & Henschel, 1955). Raw breath-by-breath CPET data were visually inspected and averaged (15 breaths) for offline analyses (Robergs, Dwyer, & Astorino, 2010). Our a priori variables of interest were those directly measured during CPET and included V̇O2, rate of carbon dioxide production (V̇CO2), tidal volume (VT), breathing frequency (fR), heart rate (HR), work rate (WR), and rating of perceived exertion (RPE). Variables were reported at peak exercise and the ventilatory anaerobic threshold (VAT), except for RPE, which was reported at peak exercise only.
The VAT was determined by a clinical exercise physiologist using the modified V-slope approach (Beaver, Wasserman, & Whipp, 1986).
FIGURE 2 Mean (SD) CPET parameters for GWI+ and GWI- at ventilatory anaerobic threshold restricted to participants who met valid peak effort criteria for both exercise tests. Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume; WR, work rate
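The valid-effort rule described above (at least two of three criteria) can be sketched as a small check. This is an illustration only: the function names, the `220 - age` estimate of age-predicted maximum heart rate, and the reading of the plateau criterion as a rise in V̇O2 of less than 2.1 ml·kg⁻¹·min⁻¹ over the final minute are assumptions, not details reported for this study.

```python
def age_predicted_max_hr(age_years):
    """Assumed estimate (220 - age); the study does not state which equation was used."""
    return 220 - age_years


def valid_peak_effort(peak_rer, peak_hr, age_years, vo2_rise_last_min):
    """Return True when at least two of the three effort criteria are met.

    vo2_rise_last_min: increase in VO2 (ml/kg/min) over the final minute of exercise.
    """
    criteria = [
        peak_rer >= 1.1,                                     # respiratory exchange ratio
        peak_hr >= 0.85 * age_predicted_max_hr(age_years),   # heart rate
        vo2_rise_last_min < 2.1,                             # VO2 plateau
    ]
    return sum(criteria) >= 2


# Hypothetical participant: RER and heart rate criteria met, no VO2 plateau.
print(valid_peak_effort(peak_rer=1.16, peak_hr=150, age_years=49, vo2_rise_last_min=3.0))
```

Counting the criteria rather than chaining conditions keeps the "two or more" rule explicit and makes it easy to tabulate which individual criteria each participant met.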

| Statistical analysis
Participant characteristics were compared with independent samples t-tests (α = 0.05), Hedges' d effect sizes, and 95% confidence intervals (95% CI) (Fritz, Morris, & Richler, 2012). To examine potential changes in CPET parameters from Day 1 to Day 2, we calculated a series of separate two-way repeated measures ANOVAs with group (GWI+ and GWI-) as the between-subjects factor and time (Day 1 and Day 2) as the within-subjects factor. A significant group-by-time interaction (p < .05) would indicate differential changes in those with (GWI+) relative to without (GWI-) GWI. Partial eta-squared (ηp²) effect sizes of 0.01, 0.06, and 0.14 suggest small, medium, and large effects, respectively (Cohen, 1988). Based on prior studies using the 2-day CPET protocol in ME/CFS patients, and restricting our selection to variables directly measured during CPET, we selected the following dependent variables: V̇O2, V̇CO2, VT, fR, HR, RPE, and WR. To aid in the interpretation of the results from the RM-ANOVA models, we followed the recommendations of Lexell and Downham to assess the reliability of the 2-day CPET protocol through a series of additional statistical analyses (Lexell & Downham, 2005). First, we examined test-retest reliability by calculating intraclass correlation coefficients (ICC) with a 95% CI using a two-way mixed effect model for absolute agreement from single measures. Values less than 0.5 indicate poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values greater than 0.90 indicate excellent reliability (Koo & Li, 2016).
FIGURE 3 Mean (SD) CPET parameters for the full sample of GWI+ and GWI- at peak exercise. Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume; WR, work rate
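To make the ICC step concrete, the following is a minimal sketch of a single-measures, absolute-agreement ICC computed from the mean squares of a participants × days layout, together with the Koo and Li (2016) interpretation bands. The function names and the example data are hypothetical, and the sketch omits the 95% CI; it is not the authors' analysis code.

```python
def icc_absolute_agreement_single(day1, day2):
    """Single-measures, absolute-agreement ICC for two test days.

    Uses mean squares from a two-way (participants x days) layout.
    """
    n, k = len(day1), 2
    rows = list(zip(day1, day2))
    grand = (sum(day1) + sum(day2)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(day1) / n, sum(day2) / n]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between days
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reliability_label(icc):
    """Interpretation bands following Koo & Li (2016)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical data: every participant scores exactly 1 unit higher on Day 2.
icc = icc_absolute_agreement_single([1, 2, 3, 4], [2, 3, 4, 5])
print(round(icc, 2), reliability_label(icc))
```

In this hypothetical example the rank order of participants is perfectly preserved, yet the uniform one-unit shift lowers the absolute-agreement ICC to about 0.77, which is why absolute agreement (rather than consistency) is the relevant form when a systematic Day 1 to Day 2 change is itself of interest.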
Second, to check for systematic bias and outliers, we used Bland-Altman plots to compare the difference between CPET parameters on Day 1 and Day 2 (y-axis) against the mean value of the parameter across both days (x-axis) for each participant. For all CPET variables at peak and VAT, Bland-Altman plots report the bias and the lower and upper limits of agreement with 95% CIs as recommended by Bland and Altman (Bland & Altman, 1999). Third, to evaluate the degree to which a given CPET parameter would need to change in order to be considered clinically significant [i.e., exceeding the standard error of the measure (Dvir, 2015)], we calculated the smallest real difference (SRD) with 95% CI (Lexell & Downham, 2005). The SRD is estimated using the following formula: SRD = 1.96 × SEM × √2, where SEM is the square root of the mean square error term from the RM-ANOVA. To provide a unit-independent index that could be more easily interpreted, we also calculated the SRD% by dividing the SRD by the mean of the Day 1 and Day 2 measures and multiplying by 100 (Lexell & Downham, 2005). For each of these steps, data analyses were split by group (GWI+ and GWI-).
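The Bland-Altman and SRD calculations can be sketched as follows, using SRD = 1.96 × SEM × √2 as in Lexell and Downham (2005). The sketch assumes the mean square error is taken from a participants × days layout within a single group; with two test days this equals half the sample variance of the paired differences. Function names and data are hypothetical, and confidence intervals are omitted.

```python
import math
import statistics


def bland_altman(day1, day2):
    """Return (bias, lower limit, upper limit) for Day 1 minus Day 2 differences."""
    diffs = [a - b for a, b in zip(day1, day2)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def smallest_real_difference(day1, day2):
    """Return (SRD, SRD%) for two test days."""
    diffs = [a - b for a, b in zip(day1, day2)]
    # Assumption: with two days, the residual mean square equals var(diffs) / 2.
    mse = statistics.variance(diffs) / 2.0
    sem = math.sqrt(mse)
    srd = 1.96 * sem * math.sqrt(2)
    grand_mean = statistics.mean(list(day1) + list(day2))
    return srd, 100.0 * srd / grand_mean


# Hypothetical values for one CPET parameter in four participants.
day1 = [10, 12, 14, 16]
day2 = [11, 11, 15, 15]
print(bland_altman(day1, day2))
print(smallest_real_difference(day1, day2))
```

Under this two-day assumption the SRD reduces to 1.96 times the standard deviation of the differences, that is, the half-width of the Bland-Altman limits of agreement, so the two analyses describe the same measurement error from complementary angles.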

| RESULTS
Of the 35 participants, 23 (GWI+ = 15 and GWI- = 8) met the criteria for a valid peak effort (i.e., RER, HR, and/or V̇O2) on both testing sessions. The distribution of effort criteria was similar between groups across visits and a breakdown is provided in Table 1. For excluded GWI+ participants, valid effort was not achieved at both visits (n = 2), visit 1 only (n = 2), and visit 2 only (n = 3). For the GWI- participants excluded, valid effort was not achieved at both visits (n = 4) and visit 2 only (n = 1). No significant differences were observed for demographic or other participant characteristics (age, sex, body fat percentage, tobacco use, FSS, and VR-36) between those providing valid effort on both days versus those that did not (all p > .05). The following results section focuses on those 23 participants meeting criteria for valid peak effort, but tables and figures for the full sample are also provided.
FIGURE 4 Mean (SD) CPET parameters for the full sample of GWI+ and GWI- at ventilatory anaerobic threshold. Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume; WR, work rate

| Participants
Participant characteristics including their age, sex, smoking history, body mass index, physical activity, fatigue severity, and physical health-related functioning are provided for participants meeting valid peak effort criteria in Table 2 and the full sample in Table 3.

| Repeated measures ANOVA
Results of RM-ANOVA are summarized in Table 4 for participants meeting valid peak effort criteria and Table 5 for the full sample. We did not observe significant group-by-time interactions at VAT or peak for any CPET parameters (all p > .05). Effect sizes for these tests ranged from small (ηp² = 0.001) to moderate (ηp² = 0.13). We did observe a significant and large group effect for peak fR (F = 8.75, p = .007, ηp² = 0.29) and time effect for peak WR (F = 5.46, p = .03, ηp² = 0.21). Mean (SD) changes across CPET 1 and 2 are illustrated in Figures 1-2 for participants meeting valid peak effort criteria and Figures 3-4 for the full sample.
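As a reading aid, partial eta-squared can be recovered from a reported F statistic and its degrees of freedom. The sketch below assumes 1 and 21 degrees of freedom (two groups, two days, 23 participants with complete data), which is an inference from the design rather than a value stated in the text; the function names and the "negligible" label are likewise illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def effect_size_label(eta_p2):
    """Cohen (1988) benchmarks; 'negligible' is an illustrative label below 0.01."""
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"


# Assumed degrees of freedom: 1 (effect) and 21 (error) for 23 participants in 2 groups.
for label, f_value in [("group effect, peak fR", 8.75), ("time effect, peak WR", 5.46)]:
    eta = partial_eta_squared(f_value, 1, 21)
    print(label, round(eta, 2), effect_size_label(eta))
```

Under these assumed degrees of freedom, the reported F values reproduce the reported effect sizes (0.29 and 0.21), both of which fall in the large range.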

| Test-retest reliability, bias, and clinically important change
Results for ICCs (95% CI) examining test-retest reliability across tests 1 and 2 for each group's CPET parameters are provided in Table 6 for participants meeting valid peak effort criteria and Table 7 for the full sample. With the exception of WR at VAT for the GWI+ group (0.37; 95% CI: −0.17, 0.74), ICCs for both groups ranged from moderate to excellent. Average ICCs across CPET parameters for each group indicated lower test-retest reliability in the GWI+ group (0.76) than the GWI- group (0.90). Using Bland-Altman plots to check for systematic bias and outliers, estimated values for bias (95% CI) as well as lower (95% CI) and upper (95% CI) limits of agreement for CPET parameters at VAT and peak are reported in Table 8 for participants meeting valid peak effort criteria and Table 9 for the full sample. Bland-Altman plots are shown in Figures 5-6 for participants meeting valid peak effort criteria and Figures 7-8 for the full sample. Absolute SRD (95% CI) and SRD% values for GWI+ and GWI- groups are presented in Table 10 for participants meeting valid peak effort criteria and Table 11 for the full sample.
TABLE 6 Test-retest reliability of CPET parameters across two maximal exercise tests restricted to veterans with Gulf War Illness (GWI+) and controls (GWI-) who met valid peak effort criteria for both exercise tests

| DISCUSSION
We evaluated the 2-day CPET protocol in veterans with GWI and otherwise healthy controls, with an emphasis on characterizing test-retest reliability, systematic bias, and thresholds for clinically meaningful changes in seven CPET parameters directly measured at VAT and peak exercise. Unlike data from previous 2-day CPET studies in ME/CFS and in other clinical populations endorsing fatigue as a primary symptom (Bouquet et al., 2019; Vermeulen et al., 2010), we did not observe differential changes in physiological function as indicated by the lack of group-by-time interactions in our RM-ANOVA models. However, we were able to substantiate the test-retest reliability of the 2-day CPET by demonstrating that the ICC values ranged from moderate to excellent for both groups (Table 6), apart from WR at VAT which displayed poor test-retest reliability in veterans with GWI. In addition, we did not detect any clear systematic biases across variables with few outliers (Figures 5-6). Finally, we used the SRD to estimate minimum values necessary to constitute clinically meaningful changes for indicating decrements in physiological function.

| Most 2-day CPET parameters were reliable in GWI, but generalizability to ME/CFS remains to be seen
Studies of physiological and perceptual responses to exercise are susceptible to myriad sources of variability both nonspecific (e.g., demographics, aerobic fitness, prescription medications, and diurnal variation) and specific (e.g., multiple case-definitions, illness duration and severity, and symptom profile) to people with GWI and ME/CFS. Presumably, some of these factors influence exercise performance, signifying a need for better characterization of test-retest reliability of CPET parameters in these patient groups. After limiting our patient sample to those providing a valid peak effort, only WR at VAT demonstrated an ICC below 0.5, suggesting that 2-day CPET produced adequate test-retest reliability for most parameters in veterans with GWI. Given the degree of overlapping symptom profiles between GWI and ME/CFS, data on 2-day CPET reliability would be valuable for contextualizing our findings and for improving comparability among ME/CFS studies. However, in nine prior studies in ME/CFS (Bouquet et al., 2019; Campen et al., 2020; Hodges et al., 2018; Keller et al., 2014; Lien et al., 2019; Nelson et al., 2019; Snell et al., 2013; Vanness et al., 2007; Vermeulen et al., 2010), CPET test-retest reliability was assumed but not measured. Provided patients meet criteria for valid effort on their first CPET, the 2-day protocol requires meeting these same criteria during follow-up testing 24 hr later. Importantly for people with ME/CFS, the timing of the second CPET coincides with debilitating exacerbation of pain and fatigue, potentially further impairing the ability to provide sufficient peak effort.
TABLE 8 Bland-Altman bias (95% CI) and limits of agreement (95% CI) for CPET parameters across two maximal exercise tests restricted to veterans with Gulf War Illness (GWI+) and controls (GWI-) who met valid peak effort criteria for both exercise tests
Conversely, our recent work with GWI patients found that the frequency of veterans experiencing symptom exacerbation and the magnitude of the change 24 hr after 30 min of steady-state cycling at 70% heart rate reserve were considerably lower than what has been reported in ME/CFS (Lindheimer et al., 2020). The high rate of submaximal performance during a maximal test among individuals with ME/CFS (De Becker, Roeykens, Reynders, McGregor, & De Meirleir, 2000), the paucity of data on CPET test-retest reliability in this population, and our observation of a less frequent and severe PEM response in GWI (Lindheimer et al., 2020) together raise doubt about whether the test-retest reliability observed here generalizes to ME/CFS. For these reasons, a separate study characterizing the test-retest reliability of CPET may be warranted in ME/CFS before it can be confidently assumed that the 2-day CPET protocol affords an objective measure of PEM in this population (Stevens et al., 2018).
FIGURE 5 Bland-Altman plots of CPET parameters at peak exercise across two maximal exercise tests restricted to GWI+ and GWI- who met valid peak effort criteria for both exercise tests: difference (CPET 1 - CPET 2) versus average values measured at CPET 1 and CPET 2. For each parameter (fR, VT, V̇O2, V̇CO2, HR, WR, and RPE), the bias (solid horizontal line) and 95% limits of agreement (dashed horizontal lines) are plotted with their corresponding 95% confidence intervals (shaded regions). Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume

| The SRD is useful for evaluating clinically meaningful changes in individual patients and facilitates between-study comparisons
A recent review suggests that the 2-day CPET quantifies changes in physiological function as a measure of PEM and the magnitude of impairment associated with a patient's compromised recovery (Stevens et al., 2018). When using the 2-day CPET in this capacity, traditional statistical tests, such as the RM-ANOVA, are useful because they indicate whether changes in the patient group were significantly different from a relatively healthy response (i.e., group-by-time interactions). However, findings from 2-day CPET studies should also be interpreted in the context of established reference values for absolute or percent changes, which provide perspective on whether changes in the patient group are clinically meaningful. Given the heterogeneous nature of GWI and ME/CFS, the SRD statistic is well suited to addressing the need for reference values because it is adjusted for the standard error of the measure, thus allowing researchers and clinicians to better distinguish changes in CPET parameters signifying real physiological alterations from false positives arising from measurement error. Estimating the SRD across seven different CPET parameters (i.e., V̇O2, V̇CO2, VT, fR, HR, WR, and RPE) at VAT and peak, we observed values ranging from 8.76% to 38.6% for GWI veterans (Table 10). These values can be used by providers as a tool to establish clinical relevance in changes in CPET variables for individual cases. For instance, consider the data from a 49-year-old male veteran with GWI who participated in this study. On his first CPET, the veteran met criteria for a valid peak effort (RER = 1.16 and HR of 87.5% predicted max) and achieved a peak V̇O2 of 2,151.08 ml·min⁻¹. Twenty-four hours later, he performed a second CPET, again meeting effort criteria (RER = 1.18 and HR of 88.7% predicted max) but achieved a peak V̇O2 of 1,775.82 ml·min⁻¹, that is, an absolute reduction of 375.26 ml·min⁻¹ and a relative reduction of 17.4%. Using data from Table 10, the change in peak V̇O2 from Day 1 to Day 2 would need to exceed the SRD of 454.53 ml·min⁻¹ or 21.9%. Therefore, this veteran did not demonstrate a clinically meaningful reduction in peak V̇O2, which might also be interpreted as no evidence of PEM.
FIGURE 6 Bland-Altman plots of CPET parameters at the ventilatory anaerobic threshold (VAT) across two maximal exercise tests restricted to GWI+ and GWI- who met valid peak effort criteria for both exercise tests: difference (CPET 1 - CPET 2) versus average values measured at CPET 1 and CPET 2. For each parameter (fR, VT, V̇O2, V̇CO2, HR, and WR), the bias (solid horizontal line) and 95% limits of agreement (dashed horizontal lines) are plotted with their corresponding 95% confidence intervals (shaded regions). Note. fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume
TABLE 9 Bland-Altman bias (95% CI) and limits of agreement (95% CI) for CPET parameters across two maximal exercise tests in veterans with Gulf War Illness (GWI+) and controls (GWI-)
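The individual-case comparison described above amounts to checking whether the Day 1 to Day 2 change exceeds the SRD. A minimal sketch using the values quoted in the text (peak V̇O2 of 2,151.08 and 1,775.82 ml·min⁻¹, absolute SRD of 454.53 ml·min⁻¹) follows; the function name and return structure are hypothetical.

```python
def change_versus_srd(day1_value, day2_value, srd_absolute):
    """Return (absolute change, percent change, exceeds_srd) for one participant."""
    change = day2_value - day1_value
    percent_change = 100.0 * change / day1_value
    return change, percent_change, abs(change) > srd_absolute


# Peak VO2 (ml/min) for the veteran described in the text, and the absolute SRD
# for peak VO2 quoted from Table 10.
change, percent_change, exceeds = change_versus_srd(2151.08, 1775.82, 454.53)
print(round(change, 2), round(percent_change, 1), exceeds)
```

The reduction of about 375 ml·min⁻¹ (17.4%) falls short of the SRD, so under this criterion the change is not distinguishable from measurement error.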
The SRD can also facilitate between-study comparisons and provide perspective about the replicability of 2-day CPET studies. For instance, separate studies by Snell et al. (2013), Nelson et al. (2019), and van Campen et al. (2020) have observed significant decreases in WR at VAT across test 1 and test 2 in ME/CFS patients. This decrement in WR at VAT, and V̇O2 at VAT in other studies (Keller et al., 2014), has been interpreted as a lowering of the threshold at which anaerobic metabolism accelerates in ME/CFS and failure of aerobic energy-producing processes in response to exercise stress (Keller et al., 2014).
FIGURE 7 Bland-Altman plots of CPET parameters at peak exercise across two maximal exercise tests for the full sample of GWI+ and GWI- veterans: difference (CPET 1 - CPET 2) versus average values measured at CPET 1 and CPET 2. For each parameter (fR, VT, V̇O2, V̇CO2, HR, WR, and RPE), the bias (solid horizontal line) and 95% limits of agreement (dashed horizontal lines) are plotted with their corresponding 95% confidence intervals (shaded regions)
FIGURE 8 Bland-Altman plots of CPET parameters at VAT across two maximal exercise tests for the full sample of GWI+ and GWI- veterans: difference (CPET 1 - CPET 2) versus average values measured at CPET 1 and CPET 2. For each parameter (fR, VT, V̇O2, V̇CO2, HR, WR, and RPE), the bias (solid horizontal line) and 95% limits of agreement (dashed horizontal lines) are plotted with their corresponding 95% confidence intervals (shaded regions)
TABLE 10 Smallest real difference (SRD) of CPET parameters across two maximal exercise tests restricted to veterans with Gulf War Illness (GWI+) and controls (GWI-) who met valid peak effort criteria for both exercise tests. Columns: GWI+ (n = 15), GWI- (n = 8)

| Limitations and future directions
The primary limitation of this study is our small sample size. However, our sample size is similar to, and in some cases larger than, previous 2-day CPET studies in clinical populations (Bouquet et al., 2019; Braam et al., 2013; Hodges et al., 2018; Keller et al., 2014; Lien et al., 2019; Nelson et al., 2019; Vanness et al., 2007; Vermeulen et al., 2010). Nevertheless, to obtain practically useful SRD values, a minimum sample size of 15-20 participants is recommended, with more recent recommendations advocating for 30-50 participants (Lexell & Downham, 2005). Present study included, we are aware of only one 2-day CPET study meeting this sample size recommendation (Snell et al., 2013). Thus, although this SRD approach offers promise for establishing reference values that are robust to measurement error, the findings reported here should be viewed as an illustrative example of how the SRD could be applied to 2-day CPET research in ME/CFS and GWI rather than an official report of reference values that can be used to guide the interpretation of prior and future research. Investigators seeking to use this approach to determine clinically meaningful changes in 2-day CPET studies will need to establish these SRD values in larger samples than what has been reported here. Because differences between CPETs 1 and 2 may be attributed to systematic changes caused by the illness itself rather than random error, future attempts to develop SRD reference values for the 2-day CPET protocol should also consider how best to distinguish typical measurement error in CPET from additive effects of PEM. The SRD is derived from the SEM, which is inversely related to reliability (e.g., lower reliability results in higher SEM and SRD values), and PEM is associated with increased variability in a given outcome measure, potentially resulting in lower reliability. In the present study, veterans with GWI did not exhibit changes from test 1 to test 2 that significantly differed from healthy controls (i.e., no group-by-time interaction), indicating nonsignificant effects of PEM on CPET parameters and by extension that potential additive effects of PEM on measurement error were minimal. This highlights the need for future work testing whether SRD values from repeated CPETs are indeed higher when patients are experiencing PEM compared to when they are not. In people experiencing more severe cases of PEM than observed here (e.g., ME/CFS patients), this idea could be empirically tested by comparing SRD values derived from CPETs separated by 24 hr to those derived from CPETs separated by a time interval long enough for PEM recovery but short enough to minimize other sources of variability in CPET measurement, such as deconditioning or training effects (e.g., CPETs separated by 2 weeks).
Additionally, there is a basic need to understand how changes in CPET parameters relate to other measures of PEM. Though it is assumed that greater decrements in physiological function are associated with worsening of symptoms, this has not been investigated in prior studies or in the present investigation.
Abbreviations: CI, confidence interval; fR, respiratory frequency; HR, heart rate; RPE, rating of perceived exertion; SEM, standard error of the measure; SRD, smallest real difference; VAT, ventilatory anaerobic threshold; V̇CO2, carbon dioxide production; V̇O2, oxygen consumption; VT, tidal volume. a Data missing for one participant.
TABLE 11 Smallest real difference (SRD) of peak CPET parameters across two maximal exercise tests in veterans with Gulf War Illness (GWI+) and controls (GWI-)

| CONCLUSION
CPET is a valuable tool for characterizing cardiorespiratory function in healthy and clinical populations. However, studies involving GWI or ME/CFS that use the 2-day CPET protocol to make inferences about PEM-related decrements in physiological function should take test-retest reliability and the standard error of the measure into account to avoid potential false positives when interpreting findings. To that end, a better characterization of test-retest reliability and clinically meaningful changes for these patient groups is needed.