Bias through selective inclusion and attrition: Representativeness when comparing provider performance with routine outcome monitoring data

Abstract

Background: Observational research based on routine outcome monitoring is prone to missing data, and outcomes can be biased due to selective inclusion at baseline or selective attrition at posttest. As patients with complete data may not be representative of all patients of a provider, missing data may bias results, especially when missingness is not random but systematic.

Methods: The present study establishes clinical and demographic patient variables relevant for representativeness of the outcome information. It applies strategies to estimate sample selection bias (weighting by inclusion propensity) and selective attrition bias (multiple imputation based on multilevel regression analysis) and estimates the extent of their impact on an index of provider performance. The association between estimated bias and response rate is also investigated.

Results: Provider-based analyses showed that in current practice, the effect of selective inclusion was minimal, but attrition had a more substantial effect, biasing results in both directions: overstating and understating performance. For 22% of the providers, attrition bias was estimated to be in excess of 0.05 ES. Bias was associated with overall response rate (r = .50). When selective inclusion and attrition bring a provider's response below 50%, it is more likely that selection bias increases beyond a critical level, and conclusions on the comparative performance of such providers may be misleading.

Conclusions: Estimates of provider performance were biased by selection, especially by missing data at posttest. Results on the extent and direction of bias and minimal requirements for response rates to arrive at unbiased performance indicators are discussed.


| INTRODUCTION
In the Netherlands, routine outcome monitoring (ROM) is implemented to support individual treatments in mental health services (MHS) by informing therapists and patients on the progress made (de Beurs et al., 2011; Lambert, 2007). In addition, aggregated data from ROM are used to evaluate and improve the quality of MHS (de Beurs, Barendregt, & Warmerdam, 2017), in line with international efforts to do the same (Kilbourne et al., 2018; Porter, Kaplan, & Frigo, 2017).
In 2006, performance appraisal became important when the Dutch government introduced a new health insurance act for regulated competition among providers and among health insurers (Enthoven & van de Ven, 2007). Quality assessment is of key importance, as providers are supposed to compete on quality and efficiency, insurers should purchase care based on price and performance, and patients are expected to make an informed choice for those providers with the best outcomes. The new legislation aimed to counteract ever-rising health care costs and simultaneously improve quality by increasing transparency about costs and outcomes.
The use of outcome data to monitor, evaluate, and learn from the performance of mental health care providers is called benchmarking (Bayney, 2005). A benchmarking institute, Stichting Benchmark Geestelijke GezondheidsZorg (SBG, Foundation for Benchmarking Mental Healthcare), was established as a trusted third party to inform patients, providers, and insurers on the quality of health care (www.akwaggz.nl). Treatment outcome was deemed the key performance indicator, as "systematic outcomes measurement is the sine qua non of value improvement" (Porter, Larsson, & Lee, 2016).
The nationwide implementation of routine assessment of treatment outcomes in MHS started in 2010, with ROM serving multiple purposes: as a clinical support tool, as a data source for performance appraisal of providers for patients and financiers, and as a data source for scientific research. Assessments at regular intervals (e.g., every 3 months or even session-by-session) are required to support treatment, but aggregated pretest and posttest ROM data of treatments are adequate for performance appraisal purposes (www.akwaggz.nl). Patients routinely complete self-report questionnaires on symptomatology in curative outpatient care, and professionals complete rating scales on their patients' functioning in care for severe mental illness. Providers send their outcome data monthly to the benchmark institute, where data are checked, aggregated, and transformed into various performance indicators (de Beurs et al., 2016). Performance of providers is evaluated by comparing the average pretest-to-posttest change in symptoms, functioning, and health-related quality of life for various patient groups (common mental disorders, severe mental illness, geriatric psychiatry, and substance use disorders) achieved after a year of treatment or after a completed treatment trajectory.
At the start in 2010, a 50% response rate was deemed achievable based on estimates of 70% pretest and 70% posttest inclusion rates (resulting in an overall 49% response rate). We also expected that at least 50% response was required for valid aggregated outcome information. This estimate was based on the literature (e.g., Livingston & Wislar, 2012) and on experiences with similar international endeavours, such as the Minnesota Health Scores initiative (www.mnhealthscores.org) and the pay-for-performance scheme of the English National Health Service, where a response rate >50% is one of the requirements for providers to qualify for a bonus payment. Dutch MHS providers were allowed 5 years to achieve this 50% response rate, with yearly increments of 10%, and their response rates were monitored by SBG. ROM response rates rose according to plan. By 2016, 95% of all providers in the Netherlands submitted data of concluded treatments to SBG monthly, and pretest and posttest data on symptomatology and/or functioning were available for 47% of all treatments. There was substantial variance among providers, with some achieving only 20% ROM response and others reaching response rates of 90% or more (www.akwaggz.nl).
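The planning arithmetic above treats the overall response rate as the product of the pretest and posttest rates. A minimal sketch, under the simplifying assumption that the two selection steps are independent:

```python
def overall_response_rate(pretest_rate: float, posttest_rate: float) -> float:
    """Overall response as the product of the pretest inclusion rate and
    the posttest reassessment rate, assuming the two selection steps are
    independent (the 2010 planning assumption described above)."""
    return pretest_rate * posttest_rate

# The planning estimate: 70% pretest x 70% posttest gives roughly 49% overall.
print(round(overall_response_rate(0.70, 0.70), 2))  # 0.49
```

In practice the two steps need not be independent (e.g., patients skipped at pretest may also be less compliant later), so the product is only a first approximation.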
We aimed to investigate which factors may bias performance indicators and which response rates yield sufficiently trustworthy information on provider performance. Any response rate below 100% creates the possibility of biased results, as patients with complete data may differ systematically from nonresponders and may not be representative of the entire population of the provider. A comprehensive body of literature has been published on missing data (Graham, 2009; Little & Rubin, 2014; Rubin, 2004; Seaman & White, 2011). Data can be Missing Completely at Random (missingness is unrelated to any observed or unobserved values in the data set).
More likely, data are missing systematically, and here two cases are distinguished: Missing at Random (missingness is related to other observed values in the data set) and Missing Not at Random (missingness is related to unobserved data). The latter is nonignorable, as it will bias results (Graham, 2009). An example is endogenous selection bias (i.e., conditioning on a collider; Elwert & Winship, 2014). For instance, some patients may be harder to assess and more difficult to treat too. Consequently, the measured outcomes from more easily assessed patients would overestimate the benefits of MHS treatment, would not be an accurate reflection of the true nationwide results, or would not reflect the true performance of a single MHS provider. Minimization of selection bias, or at least information on the extent of bias per provider, is essential for the validity and utility of the performance indicator. How much the results are biased depends on the extent of systematic differences between patients with complete and incomplete data. The extent of bias may also depend on the response rate of a provider (Hoenders et al., 2014; Young, Grusky, Jordan, & Belin, 2000).
The present study sets out to investigate these issues.
With longitudinal treatment outcome data, selection (accidentally or intentionally) can occur at two time points: pretest and posttest. Omission of pretest data may lead to selective inclusion bias; omission of posttest data from included subjects may cause selective attrition bias.

Key Practitioner Message

Unbiased performance indicators require sufficient ROM response rates.

Through selective inclusion at pretest, performance indicators of providers will become positively biased when predominantly patients are assessed for whom a good outcome is expected (e.g., well-treatable patients with less complex problems, little co-morbid psychopathology or co-morbid somatic problems, a first episode, employment, an extensive social network, and high socio-economic status [SES]). Conversely, performance indicators will become negatively biased when mostly difficult-to-treat patients are included. An investigation of sample selection bias in ROM data should therefore focus on patient variables with prognostic value for outcome. Selective attrition at posttest will bias outcome towards the positive when patients with unsuccessful therapies are not reassessed. Such bias can be introduced intentionally in the data collection phase but can also occur unintentionally, for instance because unsuccessfully treated patients (e.g., early dropouts) are less compliant with a posttest assessment.
Intentional or not, we cannot assume that inclusion or attrition occur at random: patients with complete data may differ from the noncomplete group, and countrywide results and findings on the performance of individual providers may become biased by selection.
Both inclusion and attrition bias threaten the external validity of the results (Cuddeback, Wilson, Orme, & Combs-Orme, 2004). Furthermore, the lower the response rate obtained by an MHS provider, the more room there is for biased results.
We investigated the association between patient characteristics (demographic and clinical) and outcome, their association with inclusion and attrition, and their potential to bias aggregated outcomes of providers. Hence, for each provider, we established (a) bias due to selective inclusion at pretest, (b) bias due to selective attrition at posttest, and (c) the combined biasing effect of inclusion and attrition. We also investigated the association between naturally occurring ROM response rates and extent of bias. Based on the findings, we will discuss minimal requirements for inclusion, attrition, and overall response rates to attain sufficiently unbiased performance indicators.

| METHOD

| Patients, treatments, and providers
The present study is limited to treatments concluded in 2016 of adult outpatients (aged 18-65) predominantly with common mental disorders, such as mood and anxiety disorders of mild to moderate severity.
Other groups of patients were included in this nationwide effort (severe mental disorders, elderly patients, and substance abuse), but different outcomes (and instruments) were used, so these groups are omitted from the present analysis. Mean age of the present sample was M = 38.30 (SD = 13.50); 61.1% was female; and 31.9% was treated for depression, 24.5% for an anxiety disorder, and 16.8% for a personality disorder (see Table 1).
Treatments were pharmacological, psychosocial (e.g., cognitive-behavioural therapy), or a combination of both, predominantly provided in weekly or biweekly individual sessions with a psychiatrist, clinical psychologist, or psychiatric nurse, but also in group format.
The present study is limited to the first year of treatment; the average duration of treatments was M = 42.2 weeks (SD = 13.4; range = 1-52).
Providers can be large nationwide-operating institutions, large institutions providing integrated MHS in a specific region of the country, smaller institutions working locally, or even private practitioners. SBG has contracts with 500+ institutional providers. For the present study, we only used data from institutional providers who submitted at least 25 treatments with complete pre-post data, to arrive at a reliable estimate of their performance. This resulted in n = 113,707 treatments and n = 135 providers. Data of small institutional providers were thus excluded, as were individual providers working in private practice.

| Treatment outcome
To assess severity of symptomatology with common mental disorders, generic self-report measures were used such as the Brief Symptom Inventory (Derogatis, 1975), the Outcome Questionnaire-45 (Lambert, Gregersen, & Burlingame, 2004), and the Symptoms Questionnaire-48 (Carlier et al., 2012). Scores on these self-report questionnaires were transformed into a common metric (normalized T-score) with a pretest mean of T = 50 (SD = 10). Treatment outcome was operationalized as the pretest-to-posttest difference in severity of symptoms expressed in T-scores (ΔT), achieved within the first year of treatment. The average outcome was ΔT = 7.29; SD = 10.17.
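The transformation and the outcome definition above can be sketched as follows. This is a minimal illustration using a linear T-transform and hypothetical raw scores; the benchmark itself uses normalized T-scores derived from instrument norm tables, which this sketch does not reproduce:

```python
from statistics import mean, stdev

def to_t_scores(raw_scores, ref_mean, ref_sd):
    """Linear transform of raw instrument scores to T-scores (mean 50,
    SD 10) relative to the pretest reference distribution. A simplified
    stand-in for the normalized T-scores used in the benchmark."""
    return [50 + 10 * (x - ref_mean) / ref_sd for x in raw_scores]

def delta_t(pre_t, post_t):
    """Outcome per patient: pretest minus posttest T-score, so positive
    values indicate symptom reduction."""
    return [p - q for p, q in zip(pre_t, post_t)]

# Hypothetical raw severity scores for four patients.
pre_raw = [30.0, 45.0, 50.0, 62.0]
post_raw = [20.0, 40.0, 35.0, 50.0]

m, s = mean(pre_raw), stdev(pre_raw)   # pretest reference distribution
pre_t = to_t_scores(pre_raw, m, s)     # mean 50, SD 10 by construction
post_t = to_t_scores(post_raw, m, s)   # same metric, so change is comparable
outcomes = delta_t(pre_t, post_t)      # all positive: every patient improved
```

Because both assessments are expressed on the same metric, ΔT values are comparable across the different instruments listed above.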

| Threshold values for selective inclusion bias and attrition bias
For selective inclusion or attrition bias, a critical cut-off value was set at 0.50 ΔT points. As ΔT is based on T-scores with a pretest SD of 10, a 0.50 ΔT-point shift implies a 0.05 change in the standardized pre-post difference, or an effect size of ES = 0.05 (Cohen, 1988; Seidel, Miller, & Chow, 2013). A critical cut-off value of 0.50 implies that we deem a difference between two providers larger than 1.0 points as not attributable to inclusion or attrition bias. Cases with pretest data were compared with cases without pretest data to assess selective inclusion, and 59,136 cases with posttest data were compared with 28,753 cases without posttest data to assess selective attrition.
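The cut-off rule described above amounts to a simple per-provider check. A minimal sketch (function names are illustrative, not from the study):

```python
SD_PRETEST = 10.0          # T-scores: the pretest SD is 10 by construction
CRITICAL_DELTA_T = 0.50    # 0.50 T-points, i.e., an effect size of 0.05

def effect_size(delta_t_points: float) -> float:
    """Express a ΔT difference in standardized effect-size units."""
    return delta_t_points / SD_PRETEST

def bias_exceeds_limit(observed_dt: float, corrected_dt: float) -> bool:
    """Flag a provider whose selection-corrected ΔT deviates from the
    observed ΔT by more than the critical cut-off of 0.50 T-points."""
    return abs(corrected_dt - observed_dt) > CRITICAL_DELTA_T
```

For example, a provider whose observed ΔT of 7.3 shifts to 8.0 after correction (a 0.7-point shift, ES 0.07) would be flagged, whereas a shift to 7.6 would not.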
To select relevant patient characteristics, we first analysed the bivariate association between demographic and clinical patient variables (predictors) and the posttest score. Most variables had a statistically significant association with outcome, but only five contributed substantially to the multiple regression model (partial η² > .001; see Table 2). More severe symptomatology (a higher pretest T-score) and worse functioning (a lower GAF score) had the most influence on the posttest score.
The five predictors of Table 2 with substantial prognostic value for outcome were used to assess selective inclusion and selective attrition in two multiple multilevel logistic regression analyses with binary dependent variables ("included at pretest or not" and "assessed at posttest or not"). Table 3 presents the results. The combination of three prognostic variables predicted inclusion significantly: χ²(5) = 1,053.88; p < .001. Better functioning (a higher GAF score) at pretest, higher SES, and lower age were associated with being included at pretest. These three variables were used to calculate the propensity score.
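Conceptually, the propensity score is the model-predicted probability of being included at pretest, and its inverse serves as the case weight. A minimal sketch; the intercept and coefficients below are illustrative placeholders, not the multilevel estimates of the study, and only the sign pattern (higher GAF and SES, lower age favour inclusion) follows the results above:

```python
import math

def inclusion_propensity(gaf: float, ses: float, age: float) -> float:
    """Probability of pretest inclusion from a logistic model.
    Coefficients are ILLUSTRATIVE placeholders, not study estimates;
    signs follow the reported direction of effects."""
    logit = -1.0 + 0.03 * gaf + 0.20 * ses - 0.01 * age
    return 1.0 / (1.0 + math.exp(-logit))

def inverse_propensity_weight(propensity: float) -> float:
    """Cases resembling the nonincluded group (low propensity)
    receive a larger weight."""
    return 1.0 / propensity
```

The study estimated these probabilities with a multilevel model (patients nested within providers), which this single-level sketch omits.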
Having a personality disorder, higher SES, better functioning (a higher GAF score), and higher age were associated with attrition at posttest; the pretest T-score (the most prominent predictor of the posttest score) was not associated with attrition. The combination of the four variables predicted attrition significantly: χ²(6) = 656.45; p < .001. These four variables and the pretest score (all variables that appeared predictive of outcome in Table 2) were used to estimate missing posttest scores for imputation. The multilevel estimation of propensity and imputation included one additional level: the provider.
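The imputation step can be illustrated with a stochastic regression draw: predict the missing posttest score from observed predictors, then add residual noise so that imputed cases retain realistic variability rather than regressing to the predicted mean. A minimal single-predictor sketch; the slope, intercept, and residual SD are illustrative placeholders, not the multilevel estimates of the study:

```python
import random

def impute_posttest(pre_t: float, slope: float = 0.6, intercept: float = 18.0,
                    resid_sd: float = 8.0, m: int = 5, seed: int = 0) -> list:
    """Draw m stochastic regression imputations of a missing posttest
    T-score from the pretest T-score, as in multiple imputation.
    All model parameters here are ILLUSTRATIVE placeholders; the study
    used a multilevel model with five predictors."""
    rng = random.Random(seed)
    return [intercept + slope * pre_t + rng.gauss(0.0, resid_sd)
            for _ in range(m)]

draws = impute_posttest(55.0)  # five plausible posttest scores, not one fixed value
```

In a full multiple-imputation analysis, each of the m completed data sets is analysed separately and the results are pooled, which propagates the uncertainty about the missing values.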

| Analysis of patient-based data
Next, the mean ΔT before and after imputation and before and after inverse propensity weighting was established; the Per patient columns of Table 4 present these results. We also established selective inclusion and attrition bias per provider and calculated the mean ΔT of all providers, based on their average performance.
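The weighted mean ΔT referred to above can be computed as follows. A minimal sketch with hypothetical outcomes and propensities for four patients (not study data):

```python
def weighted_mean(values, weights):
    """Weighted average; with inverse-propensity weights, patients with
    a low inclusion propensity count for more, re-balancing the sample
    toward the full patient population of the provider."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical ΔT outcomes and inclusion propensities for four patients.
delta_ts = [9.0, 6.0, 8.0, 3.0]
propensities = [0.9, 0.8, 0.6, 0.3]
ip_weights = [1.0 / p for p in propensities]

unweighted = sum(delta_ts) / len(delta_ts)      # 6.50
weighted = weighted_mean(delta_ts, ip_weights)  # ≈ 5.55
```

Here the hard-to-include patient (propensity 0.3) has the poorest outcome, so weighting lowers the mean: exactly the kind of shift the ΔT weighted column is designed to expose.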
Selection was substantial: the mean percentage of included patients at pretest was 77% (range among providers: 19-99%). However, selective inclusion did not bias the average outcome substantially (see Table 4, Per provider columns).
There was considerable attrition too. The mean percentage of included patients who were reassessed after 1 year was 66% (range among providers: 25-92%), but again, this did not bias the overall mean performance of all providers. The overall response rate after selection due to inclusion and attrition was 52% (range 11-90%).
All differences between the various mean ΔTs in the Per patient columns of Table 4 are small and not statistically significant (t < 0.95; p > .34), and the countrywide provider-based average is not affected by selection bias.

| Bias per provider
Bias due to selective inclusion ranged from −0.53 to 0.49 among providers: only one provider had a bias of ±0.50 or more, and 10 providers (7.4%) had a bias of ±0.25 or more (see Table 5 and Figure 2).
There was no relation between the percentage of included patients and the extent of inclusion bias.

| DISCUSSION
In this paper, we presented a strategy to estimate both inclusion and attrition bias in ROM data to assess provider performance. Included and nonincluded patients at pretest did not differ on most demographic and clinical variables. Although some statistically significant differences were found, the size of their effect (bivariate and multiple) on outcome was small and close to zero. Nevertheless, in the patient-based analyses, selection bias was found, predominantly due to selective inclusion. Also, selective attrition at posttest biased the results, here most visibly in the results of the provider-based analyses. These findings stress the value of distinguishing both sources of bias: selective inclusion and selective attrition. Finally, the results reveal that an overall response rate of 50% adequately safeguards against biased provider-based results.
When analysing patient-based data, the nationwide average was more biased by selective inclusion than by attrition. The elevated ΔT weighted is most likely due to giving more weight to cases with a lower level of functioning (GAF scores are negatively associated with pre-post gain). However, the biasing effect of selective inclusion was less pronounced when inspecting the results per provider.
Provider-based data also demonstrated an effect of selection, but bias went in both directions, with half of the providers increasing and half decreasing their outcome after correction; for all but one provider, inclusion bias remained within critical limits.
Finally, there is no association between inclusion rate and bias; a low inclusion rate does not appear to lead to biased results. (Abbreviations used in Table 4: ΔT, difference between pretest and posttest T-score; ΔT combined, difference between pretest and posttest after imputation of missing posttest scores and inverse propensity weighting; ΔT imputed, difference between pretest and posttest after imputation of missing posttest scores; ΔT weighted, difference between pretest and posttest adjusted by inverse propensity weighting.) Moreover, a strong association is found between the extent of posttest attrition and absolute bias. As previously mentioned, providers are incentivized for response rates and not for outcomes. Consequently, attrition is most likely due to unwillingness of patients to comply with reassessment when their treatment was unsuccessful, rather than a result of intentionally not assessing unsuccessful treatments at posttest. For instance, a substantial proportion of patients terminate their treatment at an early stage, usually after one or two sessions (Swift & Greenberg, 2012), and it is difficult to obtain posttest data from these patients.
Statistical correction for both selective sampling mechanisms, by weighting cases based on their propensity for inclusion and by imputing posttest scores, exposed 29 providers with biased results. Interestingly, bias went both ways, with some providers having better outcomes after statistical correction (n = 16, 11.9%) and some having worse outcomes (n = 13, 9.6%). Some providers may selectively assess patients in a way that flatters their results; with the present data-analytic approach, we can reveal the extent of this bias and flag overstated performance.
Close inspection of Figure 2 reveals that the extent of inclusion bias was limited; for most providers, bias due to selective inclusion fell within the critical limits of ±0.50. Attrition had a more profound biasing effect, and a minimum 70% posttest response is required to keep bias within critical limits. When both effects are combined, the graph suggests that at least 50% overall response is required to keep bias sufficiently at bay. When a 50% response rate was achieved, only four providers (3.0%) remained biased beyond the stringent limit of 0.5 ΔT points due to the combined effect of selective inclusion and attrition. However, it should be noted that currently only 62 (45.9%) of the providers achieve this level of completeness of data.
The present findings suggest that 100% implementation of ROM is not needed to obtain valid information on providers' performance, as weighting for noninclusion and imputation of missing posttest scores yields similar results for most providers on the performance indicator.
A posttest inclusion rate of 70% and an overall response rate of at least 50% seem sufficient for a representative estimate of a provider's performance. This 50% response rate coincides with the required levels of the pay-for-performance scheme of the British National Health Service. For providers with a response rate under 50%, results on their performance became untrustworthy, as chances are that their results were more than 0.5 ΔT points off the mark. The lower the overall response rate, the larger the bias, and the more bias went both ways: some providers scored higher and some lower after correcting for selection bias.
The findings support the initial decision from 2010 to recommend 50% response as a minimum requirement for providers, in order to deem their results sufficiently unbiased by selection. There are other reasons to strive for optimal implementation of ROM beyond the 50% mark and to encourage providers to improve data collection. First of all, ROM was not primarily implemented for accountability but predominantly intended as a beneficial adjunct to treatment (de Beurs, Barendregt, & Warmerdam, 2017). Research has shown that ROM by itself can lead to better results (Boswell, Kraus, Miller, & Lambert, 2013), especially for patients at risk of treatment failure (Lambert, 2010). Treatment failures tend to be overlooked by therapists (Hannan et al., 2005). ROM can also lead to more efficient treatment delivery, as outcome feedback may reduce treatment length (Delgadillo et al., 2017). Furthermore, ROM grants patients a more active role in their own recovery (Patel, Bakken, & Ruland, 2008). All patients deserve this adjunct to treatment, not merely a fraction to ensure a nonbiased performance indicator. Hence, striving for high ROM response rates is good clinical practice, but it is also vital to counteract providers who game the system (Bevan & Hood, 2006; Killaspy, 2018) or game it by other means (Jacobs, 2014). Others warn of adverse effects on the clinical application of ROM if data are also used for benchmarking (Delespaul, 2015). Indeed, some Dutch providers have struggled to balance these two aims, and the system will be revised to increase its usefulness to plan, monitor, and modify individual treatment.

FIGURE 2 Scatterplots of bias due to selective inclusion, selective attrition, and both combined, by response percentage. Points represent the response rate of providers (x-axis) by the extent of the bias (y-axis) in ΔT units; positive bias implies overstated results, negative bias understated results.
The selection of allowed outcome measures will be broadened to include disorder-specific measures, and guidelines for the frequency of ROM assessments will be offered. Patients will be asked to explicitly consent to the use of their ROM data for quality monitoring. Publication of providers' aggregated outcomes will become voluntary and depend on their contract with insurers.

| Strengths
There are various approaches to imputation of missing posttest data to choose from (Díaz-Ordaz et al., 2014), depending on whether differential provider effects are taken into account (Taljaard, Donner, & Klar, 2008). A single-level approach estimates missing posttest scores from the associations in the entire sample (of patients with complete pretest-to-posttest data), whereas a multilevel approach also models provider membership. Consequently, the single-level regression-based approach diminishes differences between providers, whereas the multilevel approach takes these differences into account.
As there is considerable variation in outcomes among providers, the multilevel approach was deemed the most appropriate, even though the more stringent approach yielded a verdict of biased results more readily. It is nonetheless wise to set the bar high for validity of performance indicators and better to err on the side of deeming them biased rather than too readily deeming them valid.
It is not uncommon to assess selection bias by comparing subjects with complete data to all other subjects with incomplete data, irrespective of when the data loss occurred. However, in a longitudinal study design, data loss can occur at two distinct time points: the pretest and the posttest. In the present study, selection bias was therefore divided into two potential sources: selective inclusion and selective attrition, and the effects of both were investigated separately. Different variables were associated with loss of data at pretest and at posttest. According to the patient-based analyses, selective inclusion had a greater biasing effect than selective attrition. This is a fortunate finding, because it is easier for a provider to counteract data loss at pretest than at posttest. At pretest, most patients are willing to meet demands for information on the severity of their illness or other variables. At posttest, it may be more difficult to obtain this information, especially from patients who discontinue treatment prematurely, which limits the window of opportunity to collect such data. Consequently, bias due to attrition is harder to prevent than bias due to selective inclusion.
This paper offers a strategy for providers who want to know the trustworthiness of their aggregated ROM data and want to establish whether their outcomes are biased by selective inclusion or attrition.
Multiple imputation and weighting will give a methodologically sound estimate of the extent and direction of potential bias, even though multilevel analysis is not feasible with data from a single provider.
However, the underlying analyses may be too complicated or cumbersome for many providers. For them, the present results do offer a simple message: ensure that you have a high response rate (>50%) and the risk of biased results will be limited. When the response rate falls below 50%, ΔT might be biased and misleading as performance indicator.

| Limitations
A limitation of this study is that we restricted our sample to adult patients with common mental disorders. Findings may be different for other age groups or disorder groups, and this still needs to be investigated. For instance, with severe mental disorders, we assess treatment outcome with the HoNOS (Wing et al., 1998), a rating scale usually completed by a professional. Then, the professional, not the patient, is the source of the data, and consequently nonresponse is due to lack of compliance of professionals with administrative processes, like completion of the HoNOS. Under such conditions, nonrandom missingness may be even more likely. Furthermore, in the Netherlands, treatment outcome is evaluated at least once a year, and for the present study, we used outcome data from the first year of treatment. This means that longer treatment trajectories were only partially evaluated.
Weighting and imputation of posttest data was based on a selective set of demographic and clinical variables. Various other factors such as work status or ethnic background may be relevant, but these data were not available to us. Further research should include additional factors that are potentially associated with outcome and missingness of outcome data. In addition, interactions between observed demographic and clinical variables may be relevant to outcome. For instance, the effect of age or gender may differ between diagnostic groups. Interactions among predictors were not included in the present study, as they would complicate a subject that is already difficult to understand.
To assess selective inclusion, an alternative approach to inverse probability weighting would have been multiple imputation of missing data. Based on patient characteristics, a pretest score could have been estimated and imputed as well. Based on this imputed pretest score and other patient characteristics, a posttest score can be estimated and imputed. We decided against this double imputation, as it would have multiplied the uncertainty about the resulting score and ΔT.
We do not know of studies where this double imputation process has been applied and evaluated.

| CONCLUSION AND IMPLICATION
If response rates fall below 50%, there is a substantial chance of aggregated outcomes of providers being biased by selective inclusion or attrition. We propose using two extra indicators per provider for the trustworthiness of their performance indicator: the overall response rate (>50% or not) and whether the results are unbiased by the combination of selective inclusion and attrition (<0.5 ΔT difference or <0.05 ES between weighted posttest imputed performance and observed performance). Accordingly, the representativeness of the outcome data will be adequately revealed.
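The two proposed indicators combine into a simple per-provider trustworthiness check. A minimal sketch; the function name and its shape are illustrative, while the threshold defaults follow the text above:

```python
def performance_trustworthy(response_rate: float, observed_dt: float,
                            corrected_dt: float, min_response: float = 0.50,
                            max_bias: float = 0.50) -> bool:
    """The two extra per-provider indicators proposed in the conclusion:
    (a) an overall response rate above 50%, and (b) a selection-corrected
    ΔT within 0.5 T-points (ES 0.05) of the observed ΔT. The corrected ΔT
    is the weighted, posttest-imputed performance estimate."""
    sufficient_response = response_rate > min_response
    within_bias_limit = abs(corrected_dt - observed_dt) < max_bias
    return sufficient_response and within_bias_limit
```

A provider failing either condition would be flagged as having a performance indicator of questionable representativeness.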