The profile inter‐unit reliability

Abstract To assess the quality of health care, patient outcomes associated with medical providers (eg, dialysis facilities) are routinely monitored in order to identify poor (or excellent) provider performance. Given the high stakes of such evaluations for payment as well as public reporting of quality, it is important to assess the reliability of quality measures. A commonly used metric is the inter‐unit reliability (IUR), which is the proportion of variation in the measure that comes from inter‐provider differences. Despite its wide use, however, the size of the IUR has little to do with the usefulness of the measure for profiling extreme outcomes. A large IUR can signal the need for further risk adjustment to account for differences between patients treated by different providers, while even measures with an IUR close to zero can be useful for identifying extreme providers. To address these limitations, we propose an alternative measure of reliability, which assesses more directly the value of a quality measure in identifying (or profiling) providers with extreme outcomes. The resulting metric reflects the extent to which the profiling status is consistent over repeated measurements. We use national dialysis data to examine this approach on various measures of dialysis facilities.


1 | INTRODUCTION
Monitoring outcomes of health care providers is an important activity that has received much attention in the literature (eg, Normand et al., 1997; Normand and Shahian, 2007; Jones and Spiegelhalter, 2011; He et al., 2013; Estes et al., 2018). In order to identify extreme (poor or excellent) performance and to intervene as necessary, outcomes of patients associated with health care providers are routinely monitored, most often by government and private payers. This monitoring can help patients make more informed decisions and can also aid consumers, stakeholders, and payers in identifying providers where improvement may be needed, and even in closing or fining those with extremely poor outcomes. Therefore, it is important that the quality measures used for profiling providers are appropriate, and one aspect of this is the measure's reliability.
To assess the reliability of a quality measure, the inter-unit reliability (IUR) is commonly used. The IUR specifies the proportion of the total variation in the quality measure that can be attributed to the between-provider variation. The variation in a specific measure across health care providers can be viewed as comprising two parts: the between-provider variation and the within-provider variation. The IUR is then defined as the ratio of the between-provider variance to the total variance. Scholle et al. (2008) and Adams (2009) suggested that a quality measure should attain an IUR of at least 0.7. This work has recently been discussed by the National Quality Forum as suggesting a possible guideline for assessing measure reliability. If the IUR is large, it is argued that most of the variation observed between health care providers is driven by systematic differences between the providers and not by the variation in the outcomes of the patients being treated. Thus, it is argued that the measure gives a reliable assessment of between-provider differences and could be used for the purpose of assessing extreme providers. Note, however, that several drawbacks of using the IUR to characterize the suitability of a measure for profiling providers have been discussed. In particular:
1. The variation between providers may be due to various factors in addition to differences in the quality of the health care provided. Differences between providers can also arise because there are important unmeasured characteristics, such as patient comorbidities or patient and provider demographics, that are not within the control of the provider and that differ across providers. Thus, a large IUR can be a signal of incomplete risk adjustment and may have little to do with the quality of care.
2. The IUR may not determine the suitability of a measure for identifying outliers. Even measures with an IUR close to zero can be very useful for identifying extreme providers, whose outcomes do not conform to an assumed statistical model.
In this paper, we propose an additional metric of reliability that assesses more directly the value of a quality measure in identifying providers with extreme outcomes. The underlying idea is that we should consider a measure to be reliable if, on repeated applications, it profiles the same providers as being extreme with relatively high probability. We proceed in two steps: first, we evaluate the ability of a measure to consistently profile providers with extreme outcomes; second, we use the IUR to calibrate this new metric, which we call the profile IUR (termed PIUR throughout this paper).
Our paper continues as follows: Section 2 first reviews the IUR for a simple linear model and then defines the proposed PIUR for normally distributed patient outcomes. Section 3 exemplifies the PIUR for several commonly used profiling methods. In Sections 4 and 5, we examine the proposed PIUR with simulations and national data on dialysis patients. We conclude with a discussion in Section 6.

2 | PROFILE IUR

2.1 | A simple linear model and review of the IUR
Let $Y^*_{ij}$ represent a continuous outcome for subject $j$ in provider $i$, where $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. Here $m$ is the total number of providers and $n_i$ is the sample size for provider $i$. Consider an underlying linear regression model
$$Y^*_{ij} = \mu + \alpha_i + \beta^T X_{ij} + \epsilon_{ij}, \qquad (1)$$
where $\mu$ is the intercept, $\alpha_i \sim N(0, \sigma_b^2)$ is the provider effect, $\epsilon_{ij} \sim N(0, \sigma_w^2)$ is the random noise, and $X_{ij}$ is a vector of patient characteristics. The regression coefficients, $\beta$, measure the within-provider relationship between the covariates and the response. Here we assume that large values of $Y^*_{ij}$ correspond to poor outcomes. In model (1), it is common to assume (at least implicitly) that $X_{ij}$ is independent of $\alpha_i$. However, in practice, patient characteristics can be correlated with provider attributes (eg, patients with less favorable health status may be referred to providers with poorer treatment strategies). In this case, the estimated regression coefficients based on the usual likelihood analysis of the model (1) are biased. Alternatively, $\beta$ can be estimated in a model with fixed effects for provider-specific parameters, which avoids this bias (Kalbfleisch and Wolfe, 2013). The resulting estimate can then be used as an offset to estimate the remaining parameters and the $\alpha_i$'s.
We note that, in many profiling applications, the number of providers and the number of patients are large so that $\mu$, $\sigma_b$, $\sigma_w$, and $\beta$ can be precisely estimated. To simplify the notation, we proceed below as though their values are known. Let
$$Y_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( Y^*_{ij} - \mu - \beta^T X_{ij} \right)$$
be the risk-adjusted response, so that the model (1) becomes
$$Y_i = \alpha_i + \epsilon_i, \qquad (2)$$
where $\alpha_i \sim N(0, \sigma_b^2)$ and $\epsilon_i \sim N(0, \sigma_w^2 / n_i)$. Here $\sigma_b^2$ is the between-provider variance, and $\sigma_w^2 / n_i$ is the within-provider variance. The IUR for a provider with $n_i$ patients is the proportion of the total variation in $Y_i$ that can be attributed to the between-provider variation:
$$\mathrm{IUR}_i = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2 / n_i},$$
which is also the square of the correlation between $Y_i$ and the true provider effect $\alpha_i$, that is, $\mathrm{IUR}_i = \mathrm{Corr}(Y_i, \alpha_i)^2$.
Figure 1 plots the density of an example with two distributions of interest: the distribution of the provider effects, $\alpha_i$, and the distribution of the estimated provider effects, $Y_i$. The IUR is the ratio of the variances of these two distributions.
To assess the average ability of a quality measure to distinguish between providers, the overall IUR can be obtained from the decomposition of total variation in a one-way analysis of variance. Recall that the between sum of squares (SSB) is defined as
$$\mathrm{SSB} = \sum_{i=1}^{m} n_i (Y_i - \bar{Y})^2, \qquad \bar{Y} = \frac{\sum_{i=1}^{m} n_i Y_i}{\sum_{i=1}^{m} n_i}.$$
The overall IUR across providers is then defined with respect to the average provider size $n'$:
$$\mathrm{IUR} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2 / n'}.$$
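As a minimal sketch, the provider-specific and overall IUR can be computed from the variance components as below. The function names are illustrative only, and the formula used for $n'$ is the usual coefficient from the unbalanced one-way analysis of variance, which is an assumption here rather than a definition taken from the text.

```python
def provider_iur(sigma_b2, sigma_w2, n_i):
    """Provider-specific IUR: sigma_b^2 / (sigma_b^2 + sigma_w^2 / n_i)."""
    return sigma_b2 / (sigma_b2 + sigma_w2 / n_i)


def anova_average_size(sizes):
    """Assumed ANOVA 'average' size: n' = (N - sum(n_i^2) / N) / (m - 1)."""
    m = len(sizes)
    total = sum(sizes)
    return (total - sum(n * n for n in sizes) / total) / (m - 1)


def overall_iur(sigma_b2, sigma_w2, sizes):
    """Overall IUR, evaluated at the average provider size n'."""
    n_prime = anova_average_size(sizes)
    return sigma_b2 / (sigma_b2 + sigma_w2 / n_prime)


if __name__ == "__main__":
    # With equal provider sizes, n' reduces to the common size.
    print(overall_iur(0.01, 1.0, [100] * 10))
```

When all providers have the same size, $n'$ equals that size and the overall IUR coincides with the provider-specific IUR.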

2.2 | Limitations of the IUR
The interpretation of the IUR depends on the sources of variation in the provider effects. The argument that a relatively large value of the IUR is required to justify the use of a measure for profiling is based on the assumption that the variation in the provider effects is entirely due to the quality of care (Adams, 2009). That is, there are no unobserved confounders that are correlated with the outcome of interest and whose values vary across providers. This assumption, however, is often invalid. For example, unobserved socio-economic factors, comorbidities, and genetic differences of patients may differ substantially across providers and so contribute to the between-provider variation. Thus, a large IUR can be a signal of incomplete risk adjustment. Moreover, the IUR indicates the average ability of the measure to distinguish between providers. In identifying providers that are outliers, however, we are not concerned with this average, but rather with the measure's ability to identify providers where outcomes are extreme. Thus, a measure with a small IUR may still be useful in identifying a few providers whose outcomes are extreme. These concerns motivate us to propose an alternative measure of reliability, which emphasizes the identification of outliers.

2.3 | A model with outliers and the PIUR
It is convenient to extend the model (1) to include contamination with outliers. Such a model is considered by Efron (2013) for large-scale hypothesis testing. Suppose that the provider effects are either from the null normal distribution, $N(0, \sigma_b^2)$, with probability $\pi_0$ or from a distribution of outliers with probability $\pi_1 = 1 - \pi_0$. Thus, we generalize model (1) by taking
$$\alpha_i \sim \pi_0 N(0, \sigma_b^2) + \pi_1 f_1(\alpha), \qquad (3)$$
where $f_1(\alpha)$ is a density for outliers with support on the region $\alpha > C$ for some constant $C$.
For provider profiling, it is natural to consider a quality measure as reliable if it is able to reliably identify the same providers as extreme. Thus, we might assess a measure by its propensity to identify the same providers as extreme when the data are replicated. Replication is not possible in practice, so we consider a sample-splitting approach as follows. Randomly divide each provider's patients into two nearly equal-sized subgroups. For a given threshold, determine whether each provider is identified as extreme in the first and the second subgroups. Repeat this process a large number of times to estimate the empirical probability that a provider is profiled in the second subgroup given that it is profiled in the first. This empirical reflagging rate is then put on the IUR scale by determining the IUR that would yield this reflagging rate in the absence of outliers. The difference between the PIUR and the IUR indicates the extent to which the measure identifies outliers.
FIGURE 1 Distributions of the provider effects, $\alpha_i$, and of the estimated provider effects, $Y_i$, with $n_i = 100$. The IUR compares the variance of the former distribution to that of the latter [This figure appears in color in the electronic version of this article, and any mention of color refers to that version]
More specifically, given a data set, a quality measure, and a profiling method, the PIUR is determined as follows:
Algorithm 1 Definition of the PIUR.
1. Randomly divide each provider's patients into two nearly equal-sized subgroups, for example, groups A and B.
2. For a given threshold $p$ and profiling method, determine whether the provider is flagged based on data in groups A and B.
3. Repeat this process a large number of times, and estimate the probability that a provider is flagged in group B given that it is flagged in group A. This estimate is the empirical reflagging rate $\hat{\theta}_{B|A}$.
4. Under the assumption that the data arise from the model (2), that is, in the absence of outliers, determine the IUR that would yield this reflagging rate. This value is the PIUR.
As defined, the PIUR is on the same scale as the IUR, but with emphasis on the ability of quality measures to consistently identify outliers. The values of the PIUR, compared with the IUR, are influenced by the proportion of outliers and their magnitude. That is, a higher PIUR compared to the IUR indicates the presence of outlier providers, which is not captured in the IUR itself.
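The sample-splitting steps of Algorithm 1 (steps 1 to 3) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: `outcomes` is a hypothetical list holding each provider's risk-adjusted patient outcomes, `is_flagged` is a user-supplied profiling rule applied to a half-sample mean and its size, and the reflagging rate is pooled over providers and splits.

```python
import random


def reflag_rate(outcomes, is_flagged, n_splits=200, seed=0):
    """Empirical probability of being flagged in half B given flagged in half A.

    outcomes   : list of lists, one list of patient outcomes per provider
                 (each provider needs at least two patients)
    is_flagged : function (half_sample_mean, half_sample_size) -> bool
    """
    rng = random.Random(seed)
    flagged_a = 0
    flagged_both = 0
    for _ in range(n_splits):
        for y in outcomes:
            # Step 1: random split into two nearly equal-sized halves.
            idx = list(range(len(y)))
            rng.shuffle(idx)
            half = len(y) // 2
            a = [y[k] for k in idx[:half]]
            b = [y[k] for k in idx[half:]]
            # Step 2: apply the profiling rule to each half.
            in_a = is_flagged(sum(a) / len(a), len(a))
            in_b = is_flagged(sum(b) / len(b), len(b))
            flagged_a += int(in_a)
            flagged_both += int(in_a and in_b)
    # Step 3: pooled conditional frequency; undefined if nothing was flagged in A.
    return flagged_both / flagged_a if flagged_a else float("nan")
```

Step 4 then maps this rate to the IUR scale through the model-based reflagging probability of the chosen profiling method.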

3 | PROFILING METHODS
In this section, we briefly review several commonly used profiling methods for flagging extreme providers. We show that considering the provider-specific IUR provides a simple theoretical justification for estimating the proposed PIUR for various profiling methods. That is, assuming that the data arise from the model (2), the conditional probability for the $i$th provider, $G(R_i)$, depends only on $\mathrm{IUR}_i$.
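
3.1 | Provider effects due entirely to variation in the quality of care
In these cases, it is natural to consider tests of sharp null hypotheses about the provider effects. Generally one of two methods is used: fixed effects (FE) or random effects (RE). Under the linear model, the fixed effects Z-score for a test of $\alpha_i = 0$ is
$$Z_{FE,i} = \frac{\sqrt{n_i}\, Y_i}{\sigma_w}.$$
Based on fixed effects, the $i$th provider is flagged as worse than expected if $Z_{FE,i} > z_p$, where $z_p$ is the upper $p$th quantile of the standard normal distribution. The corresponding conditional probability of being flagged again (Proposition 1) is expressed through the cumulative distribution functions of the standard normal distribution and of a bivariate normal distribution with unit variances and correlation $\rho$.
Alternatively, the RE approach is based on the best linear unbiased predictor (BLUP) or empirical Bayes estimate arising from the "posterior" distribution of $\alpha_i$ given the data. The estimate of $\alpha_i$ is then $\hat{\alpha}_i = \mathrm{IUR}_i\, Y_i$, which has posterior variance $\mathrm{IUR}_i\, \sigma_w^2 / n_i$. Thus, the corresponding RE-based Z-score is
$$Z_{RE,i} = \frac{\mathrm{IUR}_i\, Y_i}{\sqrt{\mathrm{IUR}_i\, \sigma_w^2 / n_i}} = \sqrt{\mathrm{IUR}_i}\; Z_{FE,i}.$$
Here $\mathrm{IUR}_i$ plays the role of a shrinkage factor. In the corresponding conditional probability (Proposition 2), the threshold $s_2$ is a rescaling of the threshold $s_1$ of Proposition 1 by a factor that depends on $\rho$, and $\rho$ is the same as in Proposition 1.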
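The FE and RE Z-scores under the linear model can be illustrated with a short sketch. The function and argument names are hypothetical, the variance components are treated as known (as in the text), and the RE score is computed from the BLUP and its posterior standard deviation so that the shrinkage relation $Z_{RE,i} = \sqrt{\mathrm{IUR}_i}\, Z_{FE,i}$ can be checked numerically.

```python
from math import sqrt


def z_fe(y_bar, n, sigma_w):
    """Fixed effects Z-score: provider mean over its within-provider SE."""
    return y_bar * sqrt(n) / sigma_w


def z_re(y_bar, n, sigma_w, sigma_b):
    """Random effects Z-score: BLUP divided by its posterior SD."""
    iur = sigma_b ** 2 / (sigma_b ** 2 + sigma_w ** 2 / n)
    if iur == 0.0:
        # No between-provider variance: the BLUP shrinks completely to zero.
        return 0.0
    blup = iur * y_bar                         # shrunken estimate of alpha_i
    posterior_sd = sqrt(iur * sigma_w ** 2 / n)
    return blup / posterior_sd


if __name__ == "__main__":
    fe = z_fe(0.3, 100, 1.0)
    re = z_re(0.3, 100, 1.0, 0.1)
    print(fe, re, re / fe)  # the ratio equals sqrt(IUR_i)
```

Because the RE score is the FE score multiplied by $\sqrt{\mathrm{IUR}_i} \le 1$, the RE rule flags a provider only at a larger FE Z-score than the FE rule does.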

3.2 | Provider effects are due to incomplete risk adjustment
If quality of care is not the main source of variation in the provider effects, the random variation accounted for in $\sigma_b$ should be incorporated in the profiling method. The approach based on fixed effects with random intercept (FERE) (Jones and Spiegelhalter, 2011) utilizes fixed effects estimates but judges their values with reference to the marginal distribution, including the between-provider variation. The Z-score can be constructed as
$$Z_{FERE,i} = \frac{Y_i}{\sqrt{\sigma_b^2 + \sigma_w^2 / n_i}}.$$
This approach is based on the assumption that most of the between-provider variation is due to unobserved characteristics that are outside the control of the provider. Thus, the FERE approach only flags a provider if its outcome is extreme with reference to the total variation. This approach is useful for identifying providers that are outliers or do not follow the assumed model.
In the corresponding conditional probability (Proposition 3), the threshold is $s_3 = -z_p$. Note that Propositions 1 to 3 are based on the provider-specific IUR. To assess the average ability of a quality measure to consistently identify outliers across providers, we extend the PIUR based on the overall IUR. Numerical evaluations for the proposed methods are provided in Section 4.
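To make the mapping from a reflagging rate to the IUR scale concrete, the sketch below evaluates a model-based reflagging probability for the FERE rule and inverts it numerically. It is a derivation under explicit assumptions, not a transcription of the propositions: the two half-sample FERE Z-scores are taken to be standard bivariate normal under model (2) with correlation $\rho = \mathrm{IUR}/(2 - \mathrm{IUR})$, which follows from splitting $n_i$ patients into two equal halves, and the reflagging probability is taken to be $P(Z_A > z_p, Z_B > z_p)/p$. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_upper_tail(z, rho, steps=4000, upper=10.0):
    """P(X > z, Y > z) for a standard bivariate normal pair with correlation rho.

    Uses Simpson's rule on the conditional form
    integral_z^inf phi(x) * P(Y > z | X = x) dx (steps must be even).
    """
    if rho >= 1.0:
        return 1.0 - STD_NORMAL.cdf(z)
    scale = sqrt(1.0 - rho * rho)
    h = (upper - z) / steps
    total = 0.0
    for k in range(steps + 1):
        x = z + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        tail = 1.0 - STD_NORMAL.cdf((z - rho * x) / scale)
        total += weight * STD_NORMAL.pdf(x) * tail
    return total * h / 3.0


def g_fere(iur, p=0.025):
    """Assumed reflagging probability for FERE under the outlier-free model (2)."""
    z_p = STD_NORMAL.inv_cdf(1.0 - p)
    rho = iur / (2.0 - iur)  # correlation between the two half-sample means
    return joint_upper_tail(z_p, rho) / p


def piur_from_reflag_rate(theta, p=0.025, tol=1e-6):
    """IUR value whose model-based reflagging probability equals theta (bisection)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g_fere(mid, p) < theta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Under these assumptions the reflagging probability equals $p$ when the IUR is zero and increases to 1 as the IUR approaches 1, so an empirical reflagging rate well above $p$ translates into a PIUR well above zero.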

3.3 | Empirical null approach
The empirical null approach is based on work of Efron (2004; 2013), who defined the empirical null and used it in problems of assessing false discovery rates. Kalbfleisch and Wolfe (2013) proposed the use of the empirical null in profiling health care providers. We suppose first that all providers are approximately of the same size so that $n_i \approx n$ for all $i$. In the empirical null approach, a normal distribution is fitted to the central part of the distribution of the fixed effects Z-scores, $Z_{FE,i}$, $i = 1, \ldots, m$. This can be done using robust methods that are not influenced by values in the tail of the distribution. For example, one might use M-estimation or maximum likelihood approaches based on a truncated normal model (eg, Efron, 2013). The resulting estimates of the mean and variance are $\hat{\mu}_M$ and $\hat{\sigma}_M^2$, and the empirical null distribution is $N(\hat{\mu}_M, \hat{\sigma}_M^2)$. This distribution, instead of $N(0, 1)$, is used as the null distribution with which to assess extreme values of the FE Z-scores. More specifically, the $i$th provider is flagged as worse than expected if $(Z_{FE,i} - \hat{\mu}_M)/\hat{\sigma}_M > z_p$, where $z_p$ is the upper $p$th quantile of the standard normal distribution.
If the model (2) is exactly true for all providers, the empirical null approach and the FERE approach give essentially the same solution. This follows from the results of Andrews et al. (1972) and Huber (1964; 1973), which can be used to show that the two approaches agree asymptotically as the number of providers $m \to \infty$. More generally, however, the empirical null approach also applies to the model (3), where it gives asymptotically correct results, whereas FERE will result in potentially biased estimates of the intercept and $\sigma_b^2$. As a consequence of the asymptotic equivalence of the empirical null approach and the FERE approach when the model (2) is exactly true, the PIUR of the empirical null approach can be computed by referring the empirical reflagging rate to $G_{FERE}(R)$. The dependence of the empirical null on sample size can be handled by stratifying the facilities into relatively homogeneous strata as in Kalbfleisch and Wolfe (2013) and He et al. (2013). In addition, we have been developing smoothed estimates of the mean and variance of the Z-scores as a function of sample size so that each provider has an individualized empirical null distribution.
One major advantage of the empirical null approach over FERE or RE is that it generalizes relatively easily to other nonlinear examples where the FE-based Z-scores are approximately normal for relatively large n i . Thus, this approach can be used, for example, in situations where the response is binary as in He et al. (2013) and Estes et al. (2018) or a failure time as in Kalbfleisch and Wolfe (2013) and in our example in Section 5.
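A simple version of empirical null flagging can be sketched as follows. Note the substitution: instead of M-estimation or a truncated-normal likelihood, this sketch uses the median and the scaled median absolute deviation (MAD) as stand-in robust estimates of the center and spread of the FE Z-scores. The function names are hypothetical, and provider sizes are assumed roughly equal.

```python
from statistics import NormalDist, median


def empirical_null(z_scores):
    """Robust center and scale of the Z-scores (median and scaled MAD)."""
    center = median(z_scores)
    mad = median([abs(z - center) for z in z_scores])
    # Dividing by the 0.75 normal quantile makes the MAD consistent
    # for the standard deviation under normality.
    scale = mad / NormalDist().inv_cdf(0.75)
    return center, scale


def flag_empirical_null(z_scores, p=0.025):
    """Flag providers whose Z-score is extreme relative to the empirical null."""
    mu_hat, sigma_hat = empirical_null(z_scores)
    z_p = NormalDist().inv_cdf(1.0 - p)
    return [(z - mu_hat) / sigma_hat > z_p for z in z_scores]
```

When the Z-scores are overdispersed relative to $N(0, 1)$, the estimated scale exceeds 1 and fewer providers are flagged than under the theoretical null, while a clear outlier in the tail is still identified.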

4 | NUMERICAL EVALUATION
In this section, we examine the properties of the proposed PIUR through numerical evaluation. We consider the FERE-based Z-scores for a one-sided test with a significance level P = .025. We consider the linear model (2). We vary the magnitude of the between-provider variance, $\sigma_b^2$, such that the ordinary IUR takes values 0.00, 0.25, and 0.50. We assume that the provider effects are either from the null normal distribution $N(0, \sigma_b^2)$ with probability $\pi_0$, or from a distribution of outliers with probability $\pi_1 = 1 - \pi_0$. We vary the value of $\pi_1$ over 0.00, 0.01, 0.02, and 0.05. The magnitudes of these outlier provider effects are fixed at $\gamma \sigma_T$, where $\gamma = 2$, 3, or 4 and $\sigma_T = 1$. Table 1 shows the theoretical values of the PIUR for various values of the IUR, where the theoretical values are calculated based on the assumed distribution; for a given value of the IUR, they follow from the corresponding conditional probabilities for the FERE-based Z-scores. The results shown in Table 1 suggest that, even when the IUR is small, a relatively high PIUR can occur in settings that include contamination with outliers. For example, even when the IUR = 0.00, if the proportion of outliers is set at 5% with the magnitude of these outlier provider effects equal to 2, 3, or 4 times $\sigma_T$, the corresponding FERE-based PIURs are 0.56, 0.81, and 0.93, respectively.
We next consider a linear model framework with 1000 providers and $n_i = 100$ patients per provider. The continuous outcome, $Y_{ij}$, is generated from the linear model (2) with $\sigma_w^2 = 1$. The magnitude of the outlier provider effects is fixed at $\gamma \sigma_T$, where $\gamma = 4$. The remaining settings are the same as those in Table 1. Table 2 shows that the empirical values of the profile IUR are close to the corresponding theoretical values.
To assess the proposed methods in settings with varying sample sizes across providers, we consider a linear model framework with 1000 providers and $n_i$ generated from a normal distribution with mean 100 and standard deviation 50 and then rounded to the nearest integer. To avoid extremely small provider sizes, we set $n_i$ to 10 if the rounded value is less than 10. Tables 3 and 4 show that the empirical values of the profile IUR are close to the corresponding theoretical values.
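The data-generating steps described in this section can be sketched as below. This is an illustrative setup only: the function names and seeds are hypothetical, and outlier provider effects are placed at a single fixed value (for example $\gamma \sigma_T$), as described in the text.

```python
import random


def simulate_sizes(m=1000, mean=100.0, sd=50.0, floor=10, seed=1):
    """Provider sizes: normal draws rounded to integers, with a floor of 10."""
    rng = random.Random(seed)
    return [max(floor, round(rng.gauss(mean, sd))) for _ in range(m)]


def simulate_effects(m, sigma_b, pi_1, outlier_value, seed=2):
    """Provider effects from the contaminated model.

    With probability pi_1 the effect is the fixed outlier value;
    otherwise it is drawn from the null distribution N(0, sigma_b^2).
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(m):
        if rng.random() < pi_1:
            effects.append(outlier_value)
        else:
            effects.append(rng.gauss(0.0, sigma_b))
    return effects


if __name__ == "__main__":
    sizes = simulate_sizes()
    effects = simulate_effects(len(sizes), sigma_b=0.1, pi_1=0.05, outlier_value=4.0)
    print(min(sizes), max(sizes), sum(e == 4.0 for e in effects))
```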

5 | APPLICATION
In 2016, more than 120 000 patients were diagnosed with end-stage renal disease (ESRD) (Saran et al., 2018), with kidney dialysis as the most common treatment option. In order to monitor the performance of dialysis facilities, several risk-adjusted quality measures have been implemented by the Centers for Medicare and Medicaid Services (CMS) on the Dialysis Facility Compare (DFC) site and in the ESRD Quality Incentive Program, which is a CMS value-based purchasing program. In this section, we apply the PIUR to two quality measures that are reported on the DFC site.

5.1 | Standardized mortality ratio
The standardized mortality ratio (SMR) is a risk-adjusted measure that is used to evaluate whether facility-specific mortality rates are in line with the national average across all Medicare certified US dialysis facilities. The SMR for facility $i$ is defined as
$$\mathrm{SMR}_i = \frac{O_i}{E_i},$$
where $O_i$ is the observed number of deaths in facility $i$, and $E_i$ is the corresponding expected number of deaths for patients in this facility computed under a population norm. An SMR less (greater) than 1 indicates that the facility's observed death rate is less (greater) than expected based on overall national rates with adjustment for the measured characteristics of patients in this facility.
For practical implementation, the SMR is computed from a two-stage model. In the first stage, a Cox model stratified by facilities is used to estimate regression parameters associated with patient characteristics. This model assumes that the hazard function is
$$\lambda_{ij}(t) = \lambda_{0i}(t) \exp(\beta^T X_{ij}),$$
where $\lambda_{0i}$ is the facility-specific baseline hazard. This stratified approach avoids any problems that might arise when patient characteristics are correlated with facility effects. In the second stage, the regression parameters are used as an offset in an unstratified Cox model to estimate the baseline failure rate at the "average" facility. The expected number of events for the $j$th patient in the $i$th facility, denoted by $E_{ij}$, is calculated as
$$E_{ij} = \int_0^{\tau} R_{ij}(t) \exp(\hat{\beta}^T X_{ij}) \, d\hat{\Lambda}_0(t),$$
where $\tau$ is the maximal follow-up time, $\hat{\beta}^T X_{ij}$ is treated as an offset with $\hat{\beta}$ estimated from stage 1, $\Lambda_0(t)$ is the "population-average" cumulative baseline hazard, and $R_{ij}(t)$ is the at-risk process. The expected number for facility $i$ is then computed as $E_i = \sum_j E_{ij}$. The corresponding P value can be computed using a Poisson approximation under which $O_i$, under the null hypothesis that the facility's death rate is the same as the population average, follows a Poisson distribution with mean $E_i$.
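Given the observed and expected counts for a facility, the SMR and a Poisson-based P value can be sketched as below. The function names are hypothetical, and the one-sided upper-tail form $P(O \ge O_i)$ is an assumption, since the sidedness of the test is not specified above.

```python
from math import exp


def smr(observed, expected):
    """Standardized mortality ratio: observed over expected deaths."""
    return observed / expected


def poisson_upper_pvalue(observed, expected):
    """P(O >= observed) when O follows a Poisson distribution with mean `expected`."""
    term = exp(-expected)      # P(O = 0)
    prob_below = 0.0           # accumulates P(O <= observed - 1)
    for k in range(observed):
        prob_below += term
        term *= expected / (k + 1)   # recurrence: P(O = k + 1) from P(O = k)
    return 1.0 - prob_below


if __name__ == "__main__":
    # A facility with 12 observed and 8 expected deaths.
    print(smr(12, 8.0), poisson_upper_pvalue(12, 8.0))
```

The recurrence avoids computing large factorials directly and is adequate for expected counts of the size reported here.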
We use SMR data collected from the 4-year period (2013-2016). Methods of profiling are based on the empirical null approach. A total of 5965 facilities are included in the analysis, after excluding facilities with fewer than three expected deaths. The number of observed deaths per facility ranges from 0 to 418, and the number of expected deaths ranges from 3 to 309. Figure 2A shows a histogram of the SMR.
The IURs for SMR were computed based on the approach proposed by He et al. (2019). Based on 1-year data, the IURs ranged from 0.22 to 0.25, which indicates that about one-fourth of the variation in the SMR can be attributed to the between-facility differences and about three-fourths to within-facility variation. Based on 4 years of data (2013 to 2016), the IUR for SMR increased to 0.53, which indicates that about half of the variation in the measure can be attributed to the between-facility differences and about half to within-facility variation. In comparison, with a P value of .025 and using the empirical null approach, the estimated conditional probability of being flagged again corresponds to a PIUR of 0.62. For example, in Table 1, when the IUR = 0.50, if the proportion of outliers is set at 2% with the magnitude of these outlier provider effects equal to two times $\sigma_T$, the corresponding PIUR is also 0.62. Thus, when there are outlier facilities, the PIUR tends to be larger than the IUR.

5.2 | Standardized readmission ratio
An unplanned hospital readmission is defined as any unplanned hospital admission that occurs within 30 days of discharge from a previous admission. Readmissions are expensive. High readmission rates are indicators of poor care, leading to patient morbidity and poor quality of life, and can be prevented through effective post-discharge early intervention and care coordination (Chan et al., 2009). The standardized readmission ratio (SRR) is a measure of dialysis facility-level hospital readmission among ESRD dialysis patients. It is computed as
$$\mathrm{SRR}_i = \frac{O_i}{E_i},$$
where $O_i = \sum_j Y_{ij}$ is the number of observed readmissions, and $E_i = \sum_j E_{ij}$ is the expected number in facility $i$. Here $Y_{ij}$ is the observed outcome for the $j$th discharge in facility $i$, and $E_{ij}$ denotes the corresponding model-based expected outcome, accounting for patient-level characteristics and assuming the facility-specific event rate equals the population rate. Specifically, readmission rates are modeled using a logistic model:
$$\mathrm{logit}\{P(Y_{ij} = 1)\} = \alpha_i + \beta^T X_{ij},$$
where the parameters $\alpha_i$ correspond to the fixed facility effects and $\beta$ is a vector of regression parameters. The expected number is computed as
$$E_i = \sum_j \frac{\exp(\hat{\alpha}_M + \hat{\beta}^T X_{ij})}{1 + \exp(\hat{\alpha}_M + \hat{\beta}^T X_{ij})},$$
where $\hat{\beta}$ is the estimate of $\beta$ and $\hat{\alpha}_M$ denotes the median of all estimated facility effects. Similar to the interpretation of the SMR, an SRR less (greater) than 1 indicates that the facility's observed readmission rate is less (greater) than expected based on national rates. Figure 2B shows the histogram of SRRs for 5740 dialysis facilities in 2016. The IUR had a value of 0.49, which indicates that about half of the variation in the SRR can be attributed to the between-facility differences and about half to within-facility variation. In contrast, with a P value of .025 and using the empirical null approach, the profile IUR is 0.74, which indicates the existence of outlier facilities. For example, in Table 1, when the IUR = 0.50, if the proportion of outliers is set at 5% with the magnitude of these outlier provider effects equal to two times $\sigma_T$, the corresponding PIUR is 0.70. Thus, the difference between the PIUR and the IUR indicates the presence of outlier providers and hence assesses more directly the ability of a quality measure to identify outlier providers.
FIGURE 2 Histograms of SMR and SRR. A, The SMR figure is based on 5965 dialysis facilities with expected deaths greater than or equal to 3. B, The SRR figure is based on 5740 facilities with numbers of index discharges greater than 10. SMR, standardized mortality ratio; SRR, standardized readmission ratio
Note that although the IUR (0.53) for the SMR based on the 4 years of data is larger than the IUR (0.49) for the SRR based on 1 year of data, the difference between the PIUR and the IUR for SRR is larger than that for SMR. This indicates a larger proportion and/or higher magnitudes of outlier providers for SRR, which is not captured in the IUR itself.
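The SRR calculation for a single facility can be sketched as follows, assuming the logistic model has already been fitted. The names are hypothetical: `linear_predictors` stands for the fitted values $\hat{\beta}^T X_{ij}$ of the facility's discharges, and `alpha_median` stands for the median $\hat{\alpha}_M$ of the estimated facility effects.

```python
from math import exp


def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-x))


def srr(outcomes, linear_predictors, alpha_median):
    """Standardized readmission ratio O_i / E_i for one facility.

    outcomes          : 0/1 readmission indicators for the facility's discharges
    linear_predictors : fitted beta^T X_ij for the same discharges
    alpha_median      : median of the estimated facility effects
    """
    observed = sum(outcomes)
    # Expected count: each discharge evaluated at the median facility effect.
    expected = sum(expit(alpha_median + lp) for lp in linear_predictors)
    return observed / expected


if __name__ == "__main__":
    print(srr([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0))
```

Evaluating every discharge at the median facility effect is what makes the expected count reflect the population rate rather than the facility's own rate.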

6 | DISCUSSION
The IUR is a metric that specifies the proportion of variation in the quality measure that is due to the between-provider variation. If all between-provider variation is due to quality of care and all patients and providers follow the assumed linear model, the IUR may be a reasonable signal-to-noise metric for a quality measure. However, in settings where the main source of between-provider variation is incomplete risk adjustment or the main focus of the provider profiling is to identify outliers, the role of the IUR is limited. In fact, the IUR is based on the whole distribution of provider effects and can be a poor indicator of the ability of a measure to identify outliers (Staggs and Cramer, 2016; Staggs, 2017). Given that, the IUR may not be an appropriate metric for assessing the performance of a quality measure. In other words, the value of the IUR may not determine the suitability of a measure for identifying outliers; even measures with a small IUR can be very effective for identifying extreme providers, while a large IUR can be a signal of incomplete risk adjustment.
In this paper we assume that one main purpose of reporting quality measures is to reliably identify very good and very poor providers. To address the problems associated with the IUR, we propose an additional measure of reliability. The proposed PIUR is not designed to address the problem of "no unobserved confounders." Instead, it is designed to assess the ability of quality measures to consistently identify outliers. Whether there are unobserved confounders or not, the values of the PIUR, compared with the IUR, are influenced by the proportion of outliers and their magnitude.
As proposed, the scale of the IUR is used to quantify the proposed PIUR, while the PIUR takes into account the providers with extreme outcomes. We have shown that the PIUR can be effective at assessing whether a quality measure reliably profiles outlier providers, even under low values of the IUR. If there are no outliers, one should expect the PIUR to be the same as the IUR. In cases where there are outlier providers, even measures with an IUR close to 0 can have relatively high PIUR and can be very useful for identifying extreme providers. Therefore, when the emphasis is on identifying "extreme" providers, the difference between the proposed PIUR and the IUR indicates the usefulness of the measure for profiling. In contrast, the size of the IUR or the PIUR value alone may not be a reasonable indicator of the use of a quality measure for identifying very good or poor providers. Therefore, we do not recommend a threshold for IUR or PIUR.
The values of the PIUR depend on the profiling method for flagging extreme providers. Both the commonly used FE and RE approaches assume that the provider effects are the consequence of variation in the quality of treatment and are under the full control of the providers. As a result, they tend to flag large providers as worse than expected, even when their true effect is not extreme. In contrast, the FERE approach is based on the assumption that most of the between-provider variation is due to unobserved characteristics that are outside the control of the provider. Both of these extreme assumptions are typically invalid, and the provider effects correspond to a combination of quality of care and incomplete risk adjustment. Unfortunately, the source of the provider effect cannot be identified on the basis of the data alone and can only be estimated based on expert opinion. How the various methods depend on the source of the variation has been discussed elsewhere. The empirical null approach has the advantage of leading to robust estimates of the between-provider variance by limiting the influence of extreme observations. Additional discussion on this point can be found in a working manuscript by Xia et al. (2019).
The covariates in our motivating settings are patient-level, not provider-level. However, a covariate may have both a between-provider component, which we might summarize in terms of $\bar{X}_i$, the provider-specific mean for provider $i$, and a within-provider component $X_{ij} - \bar{X}_i$. This leads to differing within-provider and between-provider covariate effects, as is discussed in Neuhaus and Kalbfleisch (1998). When provider effects are correlated with covariates, it should also be noted that use of a random effects model can yield a biased estimate of $\beta$ (Pan, 2002; Kalbfleisch and Wolfe, 2013). The $\beta$ that we are adjusting for is the within-provider effect; one way to estimate it is to use a fixed effects model.
As illustrated in Section 5, the proposed method can also be applied to complex quality measures used in nonlinear models. Specifically, He et al. (2019) have developed methods to extend the IUR to such models. The empirical null approach described in Section 3.3 has been generalized to the binary logistic model (He et al., 2013) and the Cox proportional hazards model (Kalbfleisch and Wolfe, 2013). Thus, the empirical null-based PIUR can be easily computed for such nonlinear examples.