Prospective payment systems and discretionary coding—Evidence from English mental health providers

Abstract Reimbursement of English mental health hospitals is moving away from block contracts and towards activity and outcome‐based payments. Under the new model, patients are categorised into 20 groups with similar levels of need, called clusters, to which prices may be assigned prospectively. Clinicians, who make clustering decisions, have substantial discretion and can, in principle, directly influence the level of reimbursement the hospital receives. This may create incentives for upcoding. Clinicians are supported in their allocation decision by a clinical clustering algorithm, the Mental Health Clustering Tool, which provides an external reference against which clustering behaviour can be benchmarked. The aims of this study are to investigate the degree of mismatch between predicted and actual clustering and to test whether there are systematic differences amongst providers in their clustering behaviour. We use administrative data for all mental health patients in England who were clustered for the first time during the financial year 2014/15 and estimate multinomial multilevel models of over, under, or matching clustering. Results suggest that hospitals vary systematically in their probability of mismatch but this variation is not consistently associated with observed hospital characteristics.

: (a) a capitation payment model, which is a per-person risk-adjusted sum to cover a range of care for the population across a number of different settings, or (b) an episodic payment model, which rewards providers according to the number and type of patients they treat, and sometimes the quality of care they provide, similar to the prospective payment system (PPS) used to fund acute hospital care in England and other countries (Khan, Nowak, & NHS England, 2014; O'Reilly et al., 2012; Sood, Buntin, & Escarce, 2008). In both payment approaches, prices for mental health care (either per patient or per treatment) are set locally.
In this article, we focus on the episodic payment system. In this system, patients are categorised into one of 20 clusters according to need, and these clusters are grouped into one of three superclasses (nonpsychotic, psychotic, organic) depending on the prevalent profile and MH disorder of the patient (see Table 1). 1 Under the episodic payment approach, each cluster will attract a fixed daily price, which is different for inpatient (admitted) and outpatient (nonadmitted) care. Table 1 provides the average cost for an episode of care by cluster across all hospitals. Clusters also define the relevant period of care, and the system requires patients to be reviewed and assigned to clusters according to those periods. The clusters are mutually exclusive meaning that a patient should only be assigned to one cluster at any given time.
Patients are assigned to a cluster by a clinician or clinical team, who can be assisted in their assignment process (known as "clustering") by an algorithm, called the Mental Health Clustering Tool (MHCT). The paper-based MHCT, which has been recommended for use since 2013 (NHS England, 2013a, 2013b), consists of 18 items and combines information from the 13 items of the Health of the Nation Outcomes Scales (HoNOS; Wing, Curtis, & Beevor, 1994), a routine outcome measure used in mental health services, and the five items of the Summary Assessment of Risk and Need (SARN) instrument (Self, Painter, & Davis, 2008; Self, Rigby, Leggett, & Paxton, 2008), which assesses need and risk on both a current and historical basis (see Online Appendix Table A.I). A computerised version of the MHCT algorithm has been developed to support clinicians and provides a probability of a patient being assigned to a particular cluster. The MHCT has been designed "to ensure consistency of clustering and to improve the overall accuracy of cluster allocation" (McKenna, 2012). 2 A clinician is, however, able to override the algorithm allocation and the ultimate classification is based on clinical judgement.
The reimbursement of mental health care will be based on the patients' categorisation into clusters. In particular, the proposed episodic payment approach links a provider's payment to the volume and type of mental health care activity, independently of how much treatment any individual patient receives or how that treatment is delivered and is thus a form of PPS. The potential advantages and risks of PPS have been discussed extensively in the literature (Charlesworth, Davies, & Dixon, 2012;Jacobs, 2014). One key risk is the potential of upcoding in which providers assign patients to categories that maximise payment but do not appropriately reflect patients' needs (Dafny, 2005;O'Reilly et al., 2012). In mental health services, upcoding is possible because clustering is performed by members of the clinical team rather than by clinical coders. With only 20 clusters, clinical teams may, to varying degrees, be aware of the relative monetary value attached to each. The use of the computerised version of the MHCT is not mandatory, 3 and its suggested cluster allocation can be manually overridden.
Although the allocation of patients to clusters other than that recommended by the MHCT algorithm could represent an appropriate clinical decision, it could also, intentionally or unintentionally, benefit hospitals financially (Jacobs, 2014). Some random variation in clustering is expected because care needs vary across patients and not all risk factors are observable or recorded. This has little effect on providers' reimbursement because the expected payment for a given (latent) patient type is unaffected. Conversely, any systematic coding differences across providers of care would raise concerns because they potentially result in an inappropriate allocation of financial resources. Systematic differences may arise because of differences in unmeasured case mix across providers 4 or because providers engage in discretionary coding to their advantage. Either mechanism calls into question the appropriateness of reimbursing MH providers based on clusters.
This work is the first to assess whether providers differ systematically in their coding behaviour and whether this is associated with their observable characteristics, such as the average cost of care. In doing so, we provide the first comprehensive assessment of the coding behaviour of all NHS MH providers in England by exploiting a large, national patient-level data set. We test whether the clustering process is subject to upcoding by MH providers, defined as positive discrepancies between patients' assignment to cluster by clinicians and the cluster allocation suggested by the MHCT algorithm. Importantly, having an external standard, the MHCT, against which observed coding can be compared, is a unique feature of our study and sets us apart from the existing literature on discretionary coding in PPS.

| RELATED LITERATURE
Concern that a payment mechanism that relies on the classification of patients by clinicians might be subject to distortion or manipulation first arose with the adoption of PPS by U.S. Medicare in 1983, although the potential for distortion had been previously recognised (Simborg, 1981). Under Medicare PPS, hospitals are reimbursed according to the diagnosis-related group (DRG) to which each patient is allocated, and hospitals are perceived to have discretion over that allocation (Ellis & McGuire, 1986).
The nature and manifestation of that discretion has been subject to considerable debate and research. At one extreme, the falsification of records or deliberate distortion of evidence constitutes fraud (Jesilow, 2005), and such a possibility has given rise to an active debate on how hospital payment systems might need to be policed and audited (Kuhn & Siciliani, 2008). Less extreme is the possibility that treatment decisions and care pathways might be influenced by the desire to allocate a patient to a better paid (or better resourced) DRG (Rosenberg & Browne, 2011).
The empirical investigation of these phenomena was driven by the observation of increasing costs arising from more complex and costly bundles of DRGs being observed over time, a phenomenon referred to as DRG creep (Simborg, 1981). If DRG creep is not a consequence of patients getting sicker, or of more sophisticated but appropriate treatments being used, it is conjectured to be a manifestation of hospitals upcoding (deliberately increasing the complexity of the procedures that they undertake), and there is now credible evidence that this exists in practice, both for the U.S. and for other health care systems that have adopted DRG mechanisms (Silverman & Skinner, 2004; Steinbusch, Oostenbrink, Zuurbier, & Schaepkens, 2007). Besides indicating that the risks of upcoding and other forms of manipulation 5 are real, the literature points to some potentially important determinants. First, because manipulation may be motivated by financial returns, one hypothesis is that for-profit health care providers may be more inclined to upcode. Both theoretical and empirical support for this is, however, mixed. In regard to theory, not-for-profit providers may still desire to produce a financial surplus in order to further their own goals. In practice, the relationship between managers and clinicians, rather than the overarching goals of the providing organisation, appears to be a more important driver for upcoding (Silverman & Skinner, 2004). Second, the design of any DRG system would seem to be important in limiting or facilitating manipulation. Systems that rely on objective, medically meaningful criteria are inherently more resistant to manipulation, whereas the more complex a system becomes and the greater the proliferation of DRGs, the greater the risk (Steinbusch et al., 2007).
The phenomena of DRG creep and upcoding have predominantly been considered in relation to acute hospital physical health services, following the broad adoption of PPS for hospital services, which started in the United States and spread extensively to Europe (Ellis & McGuire, 1986; Steinbusch et al., 2007). Translating the insights into the mental health care context considered in this paper poses challenges. Relative to most acute care payment systems, which have hundreds to thousands of DRGs, 6 the MH clustering system we consider is simple and limited. However, the criteria upon which clustering is undertaken seem a priori to be subject to clinician discretion, which in turn may be an inherent characteristic of care for mental illness (Goldman & Grob, 2006; Bellows & Halpin, 2008). Hence, our analysis of the extent of provider discretion within this emerging system is of importance to health policy makers in framing the development of mental health care payment systems.

| DATA
The analysis uses administrative data from the Mental Health Services Data Set, which covers 53 English NHS MH hospital trusts (the providers). For each patient, we obtain the observed cluster allocation as well as a rich set of individual-level variables, including patients' gender, age (coded in age bands), marital status (single, married, separated, divorced, undisclosed), ethnicity (White, Black, Asian, other), and approximate level of deprivation at small area level (in quintiles). Patients' residence is reported at small area level (the Lower Layer Super Output Area). Each small area includes approximately 1,500 inhabitants and is designed to be homogeneous with respect to tenure and accommodation type. We use Lower Layer Super Output Areas defined according to 2001 Census boundaries by the English Office for National Statistics. 7 We link this geographic identifier to the 2010 Index of Multiple Deprivation (IMD) to approximate deprivation levels at small-area level (McLennan, Barnes, Noble, & Dibben, 2011; Noble, Wright, Smith, & Dibben, 2006). Information on patient severity is provided by the ratings on the HoNOS and SARN instruments.
We restrict our analysis to patients who had not been clustered between April 1, 2011, and March 31, 2014, and who received treatment between April 1, 2014, and March 31, 2015. These dates are determined by the financial years that are used in recording data. Patients who have been clustered during a previous care episode may be at risk of having their cluster allocation carried forward without detailed review; that is, any subsequent clustering may not be independent of previous decisions. We therefore exclude all patients who have received treatment as recorded in Mental Health Services Data Set data in the previous 3 years (266,100 patients), and any subsequent clustering after the first episode between April 1, 2014, and March 31, 2015 (18,526 patients). 8 To ascertain the persistence of clustering assignment for the same patients over time, we estimate the polychoric correlations (Kolenikov & Angeles, 2004; Olsson, 1979) between current and past clusters, separately by superclass.
For each patient, we observe the actual cluster allocation given for their first episode and further calculate the most likely cluster using the computerised MHCT algorithm (http://www.cppconsortium.nhs.uk/algorithm/). The algorithm requires the user to choose a superclass (nonpsychotic [clusters 1-8], psychotic [clusters 10-17], or organic [clusters 18-21]) 9 and then calculates probabilities associated with each cluster in this superclass based on the ratings on the HoNOS and SARN instruments. Clusters with higher probabilities are more likely to be those intended to be used in accordance with the episodic payment coding guidelines (NHS England, 2013a, 2013b). The best fit cluster is defined as the cluster with the highest probability (measured in percentage points) according to the MHCT algorithm.
We examine several hospital-level characteristics that we expect to determine hospitals' clustering behaviour. First, for a given level of reimbursement, providers with higher cost structures face a stronger incentive to engage in discretionary coding that could inflate payment. We use detailed costing data provided by all public hospital providers in England 10 to compute the average cost per episode by hospital $h \in \{1, \ldots, H\}$ and superclass $j \in \{1, 2, 3\}$ for the year 2013/14. These data provide information on the daily costs for an admitted and nonadmitted patient day, as well as the total number of days per cluster. 11 The average cost per episode in hospital $h$ and superclass $j$ in year $t$ is then calculated according to Equation (1), where $E_{rht}$ is the number of patients' episodes in cluster $r$ treated in hospital $h$, $C^{A}_{rht}$ and $C^{NA}_{rht}$ are, respectively, the daily costs for admitted and nonadmitted days in cluster $r = 1, \ldots, R_j$ and hospital $h$, and $D^{A}_{rht}$ and $D^{NA}_{rht}$ are, respectively, the average number of admitted and nonadmitted days in cluster $r$ and hospital $h$.
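Read together, these definitions point to an episode-weighted average of per-episode costs. The block below is a reconstruction along those lines and not a verbatim copy of Equation (1): the symbol $\bar{C}_{jht}$ and the reading of $D^{A}_{rht}$ and $D^{NA}_{rht}$ as average days per episode are assumptions.

```latex
% Assumed form: episode-weighted mean of per-episode cost across the R_j clusters
\bar{C}_{jht} =
  \frac{\sum_{r=1}^{R_j} E_{rht}\left( C^{A}_{rht} D^{A}_{rht} + C^{NA}_{rht} D^{NA}_{rht} \right)}
       {\sum_{r=1}^{R_j} E_{rht}}
```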
To account for the possible non-linear effects of costs on clustering behaviour, we split the average cost per episode variable into terciles of its distribution.
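As a rough illustration of these two steps, the sketch below computes an episode-weighted average cost for one hospital and superclass and then places a value into a tercile of the distribution across hospitals. It assumes that the per-episode cost of a cluster is the daily cost multiplied by average days, summed over admitted and nonadmitted care; the function names, the record layout, and any figures passed in are hypothetical. The cut-points come from the default method of Python's `statistics.quantiles`, which may differ from the rule used in the original analysis.

```python
from statistics import quantiles


def average_cost_per_episode(clusters):
    """Episode-weighted mean cost across the clusters of one superclass.

    Each record is a tuple (episodes, daily_cost_admitted, days_admitted,
    daily_cost_nonadmitted, days_nonadmitted). This layout is hypothetical.
    """
    total_cost = sum(e * (ca * da + cna * dna) for e, ca, da, cna, dna in clusters)
    total_episodes = sum(e for e, *_ in clusters)
    return total_cost / total_episodes


def tercile(value, all_values):
    """Return 1, 2 or 3 for the tercile of all_values in which value falls."""
    lower, upper = quantiles(all_values, n=3)  # the two tercile cut-points
    if value <= lower:
        return 1
    if value <= upper:
        return 2
    return 3
```

Entering tercile indicators rather than the continuous cost lets the association with mismatch differ between low-, medium- and high-cost providers without imposing linearity.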
Second, many providers will contract with a number of purchasers (Clinical Commissioning Groups [CCGs]), each of which can negotiate their own prices for clusters. We hypothesise that providers with more concentrated contracting relationships find it easier to tailor their coding behaviour to maximise payments, an argument similar to that made by Fernandez, McGuire, and Pradou (2017). Conversely, providers with more, and smaller, contracts may engage more in discretionary coding if they believe monitoring to be less intense. We approximate concentration of contractual arrangements by the percentage of a provider's patients that are covered by the main CCG, that is, the one accounting for the largest share of patients in the previous financial year (t-1 = 2013/14).
Third, coding behaviour might be related to experience, 12 which we capture using both the number of patients treated by the provider in year t-1 and a measure of staff engagement. In the absence of direct measures of staff engagement, we use as a proxy information collected through the 2013/14 NHS Staff Survey on staff training, learning, and development. 13 Specifically, we compute the staff engagement proxy as the proportion of respondents in each MH hospital who agreed or strongly agreed with the following statements: "My training, learning and development has helped me to.... a)...do my job more effectively; b)...stay up-to-date with professional requirements; c)...deliver a better patient / service user experience." We believe this provides a reasonable proxy for the dimension of staff engagement that is correlated with clinical coding and good operational practice.
Finally, patients living in more deprived neighbourhoods are expected to have higher levels of need, thus requiring more resources (Epstein, Stern, & Weissman, 1990). This may translate into a higher probability of deviation from the benchmark MHCT clusters. For this reason, we include the percentage of the provider's patient population belonging to the most deprived quintile of the 2010 IMD score distribution as an additional measure of need.
All hospital-level variables other than average deprivation are measured in the year prior to our analysis period (i.e., in t-1 = 2013/14, with t = 2014/15 denoting the analysis period) to mitigate the risk of endogeneity bias due to reverse causality.

| METHODS
To establish the extent to which the observed clustering of patients by providers deviates from the best fit cluster recommended by the MHCT, and whether this is associated with observed or unobserved hospital characteristics, we estimate two types of multilevel regression models (Rice & Jones, 1997; Snijders & Bosker, 2012).
First, we perform mixed-effects logistic regression with patients $i = 1, \ldots, n_h$ treated by providers $h = 1, \ldots, H$ (Equation (2)), where $Y_{ih}$ equals 1 if the MHCT best fit cluster and observed cluster differ and 0 otherwise, $X_{ih}$ is a vector of observed patient characteristics including gender, age, marital status, ethnicity, and local area deprivation, $P_{ih}$ is the probability of the best fit cluster assigned by the MHCT, $Z_h$ is a vector of observed provider characteristics such as volume of activity and production cost in 2013/14, and $\mu_h$ is a normally distributed random provider effect with mean 0 and variance $\sigma^2$. The provider effect $\mu_h$ captures systematic variation across providers, conditional on observed patient and provider characteristics. 14
Second, and as an extension of the above, we utilise the fact that clusters are ordered according to the level of care required (and therefore resource use and likely reimbursement) to further differentiate mismatches into over and under clustering. Over clustering arises when the observed cluster number is higher than the cluster number suggested by the MHCT, and under clustering when it is lower. We estimate multilevel multinomial logit models (Hedeker, 2003), given by Equation (3), where $Y_{ih}$ equals 1 in the case of under clustering, 2 in the case of over clustering, and 3 if MHCT and the observed clustering agree, which forms the base category.
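The two specifications can be written out as follows. This is a sketch consistent with the variable definitions above rather than a verbatim restatement of Equations (2) and (3): the coefficient symbols $\alpha$, $\beta$, $\gamma$, $\delta$ and the outcome-specific random effects $\mu_h^{(k)}$ are assumed notation.

```latex
% Binary mixed-effects logit of mismatch (cf. Equation (2))
\operatorname{logit} \Pr(Y_{ih} = 1)
  = \alpha + X_{ih}'\beta + \gamma P_{ih} + Z_h'\delta + \mu_h,
  \qquad \mu_h \sim N(0, \sigma^2)

% Multilevel multinomial logit with agreement (category 3) as base (cf. Equation (3))
\log \frac{\Pr(Y_{ih} = k)}{\Pr(Y_{ih} = 3)}
  = \alpha_k + X_{ih}'\beta_k + \gamma_k P_{ih} + Z_h'\delta_k + \mu_h^{(k)},
  \qquad k \in \{1, 2\}
```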
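To quantify the unobserved provider heterogeneity, we follow Larsen and Merlo (2005) and compute the median odds ratios (MORs) as follows:

$$\mathrm{MOR} = \exp\left(\sqrt{2\sigma^{2}}\;\Phi^{-1}(0.75)\right),$$

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function and $\sigma^2$ is the variance of the provider effect. The MOR is the median ratio of the odds of mismatch between two randomly chosen providers for patients with identical observed characteristics, with the higher odds forming the numerator. This can be compared with the odds ratios of other explanatory variables and thus helps to put the relative importance of unobserved heterogeneity into context.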
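The Larsen and Merlo (2005) formula, MOR = exp(sqrt(2 × variance) × z), where z is the 75th percentile of the standard normal distribution, can be evaluated with the Python standard library alone. The sketch below is illustrative: the function names are hypothetical, and it treats a single random-intercept variance on the logit scale.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745)
Z75 = NormalDist().inv_cdf(0.75)


def median_odds_ratio(variance):
    """MOR implied by a provider-level random-intercept variance (logit scale)."""
    return exp(sqrt(2.0 * variance) * Z75)


def variance_from_mor(mor):
    """Invert the formula: random-intercept variance implied by a given MOR."""
    return (log(mor) / Z75) ** 2 / 2.0
```

A variance of zero gives an MOR of exactly 1, meaning no between-provider heterogeneity; larger variances push the MOR above 1. The inverse function is handy for translating a reported MOR back into the implied provider-level variance.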
We run separate analyses for each superclass. To check the robustness of our findings, we conducted three further analyses. First, we checked the impact of reordering the clusters on the basis of average episode cost (Section 5.3); second, we tested for a systematic provider effect on assignment (Section 5.4); lastly, we tested the impact of including patients who had previously been clustered (Section 5.5).
All models are estimated using Markov Chain Monte Carlo techniques (Browne, 2012) with maximum likelihood estimates as starting values obtained via iterative generalised least squares (Goldstein, 1986). To achieve stationarity, we run the Markov Chain Monte Carlo chain for 55,000 iterations and discard the first 5,000 iterations as burn-in period (Draper, 2011; Geyer, 2011). To reduce autocorrelation and heteroscedasticity, we utilise the estimates of every 50th replication to compute point estimates and 95% credible intervals (CrIs). All computations are performed in MLwin 3.00 operated through the runMLwin 64-bit routine (Leckie & Charlton, 2013) in Stata 13. 15
14 We choose a random effects approach over a fixed effects approach to avoid incidental parameter bias in non-linear models (Lancaster, 2000) and because we wish to explore the impact of observed hospital level characteristics on coding behaviour.

| RESULTS

| Descriptive statistics
Our analysis sample consists of 148,472 patients (Table 2). The distribution of patients across clusters is highly concentrated, with at least 30% of patients in each superclass being categorised in a single cluster. Table 3 presents descriptive statistics by superclass. Each hospital treated on average over 32,000 patients in the year prior to our analysis period, whereas the average percentage of patients in the CCG with the largest commissioning agreement with each provider is around 40% of their case-load (range 14.4-97.6%). Of the 53 providers, only six had 75% or more of their total activity commissioned by a single CCG. The average cost per episode was highest for psychotic patients (around £6,060) and lower for nonpsychotic and organic patients (around £2,530). On average, the percentage of patients residing in the most deprived quintile of the IMD 2010 distribution was over 29%.
We estimate the polychoric correlations between current and past cluster assignments for each episode with multiple clusters, using the full sample of MH patients treated in year 2014/15. 16 The correlations (standard errors in parentheses) are 0.5826 (0.001615), 0.5936 (0.001599), and 0.5956 (0.001596) for clusters in the nonpsychotic, psychotic, and organic superclasses, respectively, which establishes that cluster assignment is persistent within each superclass.
Table 4 compares the Best Fit cluster allocation (rows) suggested by the MHCT algorithm with the observed cluster allocation (columns). The diagonal indicates the proportion of patients where suggested and observed allocations coincide. The average agreement across the 20 clusters is 35.9%, with only four cells showing agreement in excess of 50%. The weighted kappa statistic, a measure of agreement that penalises according to the degree of mismatch (Cohen, 1968), is equal to 0.3759 (0.0020) in the nonpsychotic superclass subsample, and to 0.2022 (0.0042) and 0.4675 (0.0034) in the psychotic and organic superclass subsamples, respectively, suggesting slight to moderate agreement (Landis & Koch, 1977). 17 The match between observed and suggested cluster allocation varies across the superclasses. The highest average match is observed for the organic superclass with 49.6% (range 30.3-74.9%), possibly because there are fewer groups, whereas the lowest average match is observed for the psychotic superclass (29.5%; range 13.8-48.3%). In those instances where observed and suggested cluster allocation are in disagreement, the observed cluster is usually adjacent to that suggested, though with no clear direction, and the probability of observed assignment decreases with the distance between observed and suggested cluster. A noteworthy exception is the psychotic superclass in which patients are likely to be assigned to cluster 10 ("First episode in psychosis") independent of the severity of the suggested cluster, that is, the distance between classes. 18
Table 5 shows the main regression results. Columns 1-3 report results for Equation (2), which models the probability of mismatch between the clinician and the MHCT algorithm. 19 Columns 4-9 report results for Equation (3), which further distinguishes between over clustering (patients are assigned to a higher cluster than suggested by the MHCT) and under clustering. The reference category in each analysis is agreement between clinician and MHCT algorithm. Focussing on the first analysis (first three columns of Table 5), in each superclass, the probability of a mismatch is negatively correlated with the probability of the Best Fit cluster suggested by the MHCT (significant at p < 0.01). This suggests that clinicians and the algorithm respond to similar signals of severity so that the degree of discretionary coding reduces as the uncertainty about cluster allocation reduces. However, the relationship is not perfect: the average marginal effect 20 of a one percentage point increase in the probability of the Best Fit cluster is a 0.64% decrease (SE: 0.0002) in the probability of mismatch in the nonpsychotic subsample, a 1.13% decrease (SE: 0.0014) for psychotic patients, and a 0.43% decrease (SE: 0.0003) for organic patients.
15 In the Online Appendix, we also show results of mixed-effect multinomial regressions in which we model the probability of a patient being assigned to a given MH cluster within a certain MH superclass.
16 These correlations are estimated in Stata by using the polychoric user-written function (Kolenikov & Angeles, 2004), which provides a more accurate estimate than the Stata built-in function (Uebersax, 2015).
17 The weights are defined in the standard way as 1 − |r − d|/(m − 1), where r and d index the rows and columns of the clusters by the different assignment mechanisms (MHCT algorithm or clinician) within each superclass, and m is the maximum number of possible clusters within each superclass.
18 Cluster 10 is a specific clinical presentation, which is usually treated in early intervention in psychosis teams and not all patients who develop psychosis will necessarily start in cluster 10.
19 In Online Appendix Table B.I, we show the results of the same binary logit mixed effect regression model, when only hospital random effects are included, and then additional covariates are sequentially added.
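The weighted kappa reported for each superclass, with agreement weights 1 − |r − d|/(m − 1), can be sketched in a few lines. This is a generic implementation of Cohen's (1968) statistic with linear weights and not the routine used in the original analysis; the function name and the table layout (rows for the MHCT best-fit cluster, columns for the clinician's cluster, both indexed in cluster order within one superclass) are assumptions.

```python
def weighted_kappa(table):
    """Cohen's weighted kappa with linear agreement weights 1 - |r - d| / (m - 1).

    table[r][d] is the number of patients whose MHCT best-fit cluster is r
    (rows) and whose clinician-assigned cluster is d (columns). The table
    must be square (m x m) with m >= 2.
    """
    m = len(table)
    n = sum(sum(row) for row in table)
    row_share = [sum(row) / n for row in table]
    col_share = [sum(table[r][d] for r in range(m)) / n for d in range(m)]
    observed = 0.0  # weighted observed agreement
    expected = 0.0  # weighted agreement expected under independence
    for r in range(m):
        for d in range(m):
            w = 1.0 - abs(r - d) / (m - 1)
            observed += w * table[r][d] / n
            expected += w * row_share[r] * col_share[d]
    return (observed - expected) / (1.0 - expected)
```

Because the weights shrink linearly with the distance between cells, a patient placed in an adjacent cluster lowers the statistic less than one placed several clusters away, which matches the interpretation of kappa as penalising by degree of mismatch.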

| Determinants of mismatch between observed and suggested cluster allocation
In the nonpsychotic superclass, higher cumulative HoNOS scores are associated with lower probability of a mismatch, and this holds true for over and under clustering alike. Conversely, for patients in the psychotic and organic superclasses, the models estimate a positive association between cumulative HoNOS score and mismatch. For the first patient group, the association is driven by an increased probability of under clustering but not over clustering. Higher SARN scores are associated with a higher probability of mismatch for patients in the nonpsychotic and organic superclasses.
Only a few provider characteristics are statistically significantly associated with the probability of mismatch. Providers with higher volumes of activity in the past year are less likely to diverge from the MHCT suggestion, although the effect is only statistically significant for the organic superclass. The proportion of a provider's patients residing in the most deprived areas of the country is associated with a lower probability of mismatch for the nonpsychotic superclass, and a lower probability of over clustering (but not under clustering) for the organic superclass. Average cost and contractual homogeneity are not associated with divergent coding behaviour for any superclass at the 5% level.
The estimated MORs reveal the existence of substantive unexplained between-hospital variability in coding behaviour. Using results from Equation (2) (columns 1-3 in Table 5), in two randomly selected hospitals, the probability of a given patient being clustered differently by the clinician and the MHCT algorithm is approximately 45-64% higher in one hospital than the other. These effects are large in comparison to the effects of observable patient and provider characteristics. For example, in the organic superclass (column 3) increasing a hospital's activity by 10,000 patients or increasing a patient's HoNOS score by 10 points leads to an increased risk of mismatch of 14% and 8.5%, respectively. Using results from Equation (3) (columns 4-9 in Table 5), we find that provider heterogeneity is more pronounced in the probability of under clustering than over clustering but that this difference is not statistically significantly different from zero as indicated by overlapping 95% CrIs.

| Robustness check: Mismatch when clusters are ordered by average cost per episode
One potential drawback in our empirical analysis is the assumption that clusters are ordered according to the level of care required, and therefore also by their expected reimbursement. 21 However, the order based on reference unit costs of the clusters is not the same as the clusters' nominal order. To test whether our results on the quantification of the coding discretion are robust to the cluster ordering, we reorder the clusters within each superclass based on the average cost per episode across all hospitals in year 2013/14 (see Equation (1), Section 3), and we reestimate the models in Table 5 using the new cluster order based on such average costs. 22 The last two columns of Table 1 report the average cost per episode of each cluster and the clusters' order based on such average costs. The original ordering is almost unchanged for nonpsychotic clusters; it is exactly the same in the organic superclass, and shows several changes in the clusters of the psychotic superclass. The results of the new estimation are presented in Table 6 and show that although the significance levels of some of the odds ratios (HoNOS and SARN scores) for mismatching in the psychotic superclass change (but not their magnitude; compare columns 6 and 7 of Tables 5 and 6), both the point estimates of the MORs and their 95% CrIs remain largely unchanged. The other two superclasses are not affected by the change. Overall, these findings are reassuring about the robustness of our results on discretionary hospital coding with respect to the ordering of the clusters.
20 The average marginal effect is calculated as the change in the probability of mismatch for a 1-percentage point increase in the independent variable of interest holding all other covariates at their observed level; averaged across all observations in the sample (Cameron & Trivedi, 2010).
21 The binary mixed effect logit regression model shown in Table 5,
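The reordering step can be illustrated with a small sketch that ranks clusters by average cost and then codes each patient as under clustered (1), over clustered (2), or matched (3), following the outcome coding of the multinomial model. The function names are hypothetical and any cost figures passed in are placeholders, not values from Table 1.

```python
def cost_rank(avg_cost):
    """Map each cluster label to its position when clusters are sorted by average cost."""
    ordered = sorted(avg_cost, key=avg_cost.get)
    return {cluster: position for position, cluster in enumerate(ordered)}


def classify(observed, best_fit, rank=None):
    """Return 1 (under clustering), 2 (over clustering) or 3 (agreement).

    With rank=None the nominal cluster numbers are compared; otherwise the
    positions stored in `rank` are compared, for example a cost-based ordering.
    """
    if rank is None:
        obs, fit = observed, best_fit
    else:
        obs, fit = rank[observed], rank[best_fit]
    if obs < fit:
        return 1
    if obs > fit:
        return 2
    return 3
```

Under this coding the same patient can switch from under to over clustering when the nominal and cost-based orders disagree, whereas matches and the binary mismatch indicator are unaffected by the ordering.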

| Robustness check: Provider effect on patients' assignment to individual clusters
We also investigate the presence of a systematic provider effect on patients' assignment to individual clusters to test whether providers disagree in their assignment to specific clusters (see Online Appendix C). Provider heterogeneity is somewhat more pronounced in this analysis than when assessing mismatch, as evidenced by the larger variation in MORs from 1.46 to 3.59 across clusters in different superclasses. However, MORs are broadly similar across clusters in the same superclass, suggesting that there is less heterogeneity across providers in the allocation of patients to some clusters rather than others once the patient's prevalent MH disorder, identified by the assigned superclass, has been determined.

| Robustness check: Determinants of mismatch using the full sample
In Online Appendix Tables D.I and D.II, we present the estimation results of the regression models investigating the mismatch in the patient assignment, without imposing the restriction of patients not having been previously clustered. We use a 50% clustered random sample (with clustering by MH hospitals and MH clusters) of the original sample of patients treated in year 2014/15. 23 The results for the MORs are either very similar to the ones provided in Tables 5 and 6 (with clusters ordered by average costs), or the 95% CrI of the two sets of estimates (with and without the "newly clustered patient" restriction) overlap at least partially, suggesting once again the robustness of our findings.

| DISCUSSION AND CONCLUSIONS
The English NHS is moving to a new reimbursement model for mental health care that links payment to activity, thus aligning the payment system in MH with those common in many physical health care systems. Although this change may help create a fair and sustainable funding system, there are also well-known risks of unintended consequences, such as incentives to inappropriately allocate patients to higher payment groups. We have examined the extent to which the categorisation of NHS patients by MH providers is subject to discretion. For this purpose, we investigated differences between patients' first cluster assignments by clinicians and those assignments suggested by an external standard, the MHCT algorithm. We find MORs ranging from 1.46 to 1.88, which reflect significant unexplained variation between providers in how they allocate patients to clusters over and above observed need factors. Unobserved provider effects are at least as important as observed hospital characteristics such as volume or deprivation in determining cluster allocation. Variations between providers may be driven by differences in severity, treatment thresholds, or subjective perceptions in the recording of HoNOS scores. Some of the predictors in our model may be suggestive of discretionary behaviour, for example, where they reflect resource pressures. Others may be more indicative of broader aspects affecting service delivery, for example, levels of deprivation, though these may also indirectly affect decisions around levels of care intensity. However, the observed discretionary behaviour does not appear to be driven by financial considerations aimed at attracting higher payments, because average costs were not associated with upcoding. Furthermore, we do not find evidence that provider differences are more likely to result in upcoding than downcoding.
Our study has a number of limitations. First, there may be legitimate unobservable differences between providers that determine their allocation behaviour and that we have not been able to account for. In this case, the observed MORs capture discretionary behaviour as well as case-mix differences and external constraints, although it is a priori unclear whether this leads to inflated or deflated estimates of the MOR. Second, although the MHCT has been recommended as a guide for clustering, its use may differ across providers and this would also be captured by the MOR. Unfortunately, use of the MHCT is not recorded, so we cannot explore this further. Finally, throughout the study period providers were required to cluster patients but reimbursement was not linked to cluster allocation at this time. Hence, our analysis should be understood as exploring the potential for discretionary coding, rather than as evidence that providers respond to incentives by exploiting the flexibility granted by the classification system.
23 The full sample is too large to be analysed in MLwiN. Nevertheless, the 50% clustered random sample should provide unbiased, albeit slightly less precise, results.
The considerable degree of discretion in the English MH clustering system has important implications for policymakers in the design and operation of the payment system. Clinical judgement may play a larger role in allocation within the MH context than in acute care where diagnostic information and procedures may be more clear-cut and, hence, auditable. Nevertheless, those responsible for the design of the MH payment system will need to find ways to put checks in place to ensure the integrity and fairness of the reimbursement model. The MHCT may offer a starting point, and providers could be required to justify deviations from the proposed cluster allocation if the level of mismatch breaches certain thresholds. This would require continued development and validation of the MHCT algorithm to ensure that it generates consistent groupings of patients with similar needs (Jacobs, 2014).