Business indicators of healthcare quality: Outlier detection in small samples

Errata

This article is corrected by:

  1. Errata: Correction of ‘Business indicators of healthcare quality: outlier detection in small samples’, ASMBI, vol. 28, issue 3, May/June 2012, pp. 282–295. The correction appeared in Volume 30, Issue 3, p. 372; article first published online: 20 May 2013.

Gaj Vidmar, University Rehabilitation Institute, Republic of Slovenia, Linhartova 51, SI-1000 Ljubljana, Slovenia.

E-mail: gaj.vidmar@ir-rs.si

Abstract

Healthcare quality monitoring by the Ministry of Health in Slovenia includes over 100 business indicators of economy, efficiency and funding allocation, which are analysed annually for over 20 hospitals. Most of these indicators are random-denominator same-quantity ratios with a strongly correlated numerator and denominator, and the goal is the identification of outliers. A large simulation study was performed to assess the performance of three types of methods: common outlier detection tests for small samples (the Grubbs, Dean and Dixon, and Nalimov tests), applied unconditionally as well as conditionally upon the results of the Shapiro–Wilk normality test; the boxplot rule; and the double-square-root control chart, for which we introduced regression-through-origin-based control limits. The Pert, Burr and three-parameter loglogistic distributions, which fitted the real data best, were used with no, one or two outliers in the simulated samples of sizes 5 to 30. Small (below 0.2, right skewed) and large (above 0.5, more symmetrical) ratios were simulated. The performance of the methods varied greatly across the conditions. The formal small-sample tests proved virtually useless if applied conditionally upon a passed normality pre-test in the presence of outliers. The boxplot rule performed the most variably but was the only useful one for tiny samples. Our variant of the double-square-root control chart proved too conservative in tiny samples and too liberal for samples of size 20 or more without outliers, but it appeared the most useful for detecting actual outliers in samples of the latter size. As a possibility for future improvement and research, we propose pre-testing normality by using a class of robustified Jarque–Bera tests. Copyright © 2011 John Wiley & Sons, Ltd.

1 Introduction

Healthcare quality monitoring by the Ministry of Health in Slovenia includes numerous business indicators of economy, efficiency and funding allocation, which are annually analysed for all the hospitals in the country [1]. Examples of these indicators are the area of a hospital used for a certain service (e.g., dialysis or computed tomography) per total area of the hospital, and the expenses for a certain purpose (e.g., energy consumption or staff education) per total expenses of the hospital.

The essence of associated statistical analyses is the identification of outliers, where a compromise between state-of-the-art and wide understandability is desired. Hence, we adopted an exploratory approach that combines three types of methods:

  • three common outlier detection tests useful in small samples, namely the Grubbs test [2], the Dean and Dixon test [3] and the Nalimov test [4], which are all based upon assumption of normality and were hence tried unconditionally as well as conditionally upon results of normality tests;
  • the Tukey [5] boxplot rule, that is, identifying as an outlier any value lying more than 1.5 times the inter-quartile range above the third quartile or below the first quartile;
  • control charts.
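
To make the first two approaches concrete, the following sketch (in Python with NumPy/SciPy rather than the R used in the study; the sample data are hypothetical) implements the boxplot rule and a two-sided Grubbs test applied conditionally on a Shapiro–Wilk pre-test, using the usual t-based critical value for the Grubbs statistic:

```python
import numpy as np
from scipy import stats

def boxplot_outliers(x):
    """Tukey boxplot rule: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in x if v < lo or v > hi]

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs test for one outlier; returns the flagged value or None."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = np.abs(x - x.mean())
    g = dev.max() / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return float(x[dev.argmax()]) if g > g_crit else None

def conditional_grubbs(x, alpha=0.05):
    """Grubbs test applied only if Shapiro-Wilk does not reject normality."""
    if stats.shapiro(x).pvalue < alpha:
        return None  # normality pre-test failed; formal test not applied
    return grubbs_test(x, alpha)

sample = [0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.45]  # hypothetical small-ratio sample
print(boxplot_outliers(sample))  # [0.45]
print(grubbs_test(sample))       # 0.45
```

Note that the conditional variant can return None for two very different reasons (no outlier found, or normality rejected), which is precisely the interplay examined in the simulations below.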

The present study investigates these three approaches through extensive simulations. The paper first introduces a novel proposal regarding the application of appropriate control charts. The simulation set-up and the simulation results are then presented, followed by the empirical results on robust normality pre-testing and outlier detection. Finally, a summary is given, together with a discussion of related work and directions for further research.

2 Control charts

2.1 General considerations

The indicators addressed in this study are same-quantity ratios (thus bounded between 0 and 1), which are appropriately treated neither as proportions nor as fixed-denominator ratios. They are random-denominator ratios with strongly correlated numerator and denominator. Examples of such indicators are presented in Figure 1. For the two types of indicators, that is, small and large ratios (as detailed in Section 3), different scales are used in the histograms. The distributions of small ratios are right skewed, whereas the distributions of large ratios are roughly symmetric. It is evident that there is a very strong correlation between the numerator and the denominator, whereas the ratios tend to be independent of their denominators. To avoid unauthorised disclosure and hospital identification without losing the information relevant to our study, the actual quantities defining the numerator and the denominator are masked.

Figure 1.

Examples of financial indicators of healthcare quality. In the left column, the distributions are shown as histograms; in the central column, the numerator is plotted against the denominator, and the correlation is listed for each indicator; in the right column, the value of the indicator is plotted against the denominator.

It is essential to note that funnel plots, which have rightfully been promoted for monitoring cross-sectional performance indicators in health care [6-9] and also in education [10], may not be the appropriate choice for such data. The reason is that virtually all points would get labelled as outliers in such plots, because the huge denominators (thousands of square metres, millions of Euros) yield excessively narrow confidence intervals for the average proportion (even at the 99% confidence level). Even if the whole problem is considered as one of over-dispersion [11, 12], which has been recognised in the healthcare setting and for which different strategies have been suggested, and given that abandoning the indicators is not an option, neither random-effects models [11] nor Laney's approach [12] is universally feasible. Therefore, a different choice of the control chart and its control limits is warranted.
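
The narrowness of binomial funnel-plot limits at such denominators is easy to verify. The following sketch (Python; the target proportion of 0.1 and the denominators are hypothetical illustrations) computes normal-approximation binomial control limits at the 99% level:

```python
import math

def binomial_limits(p, n, z=2.5758):
    """Normal-approximation binomial control limits (z = 2.5758 for 99%)."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# hypothetical target proportion 0.1 at denominators of increasing magnitude
for n in (100, 10_000, 1_000_000):
    lo, hi = binomial_limits(0.1, n)
    print(f"denominator {n:>9}: limits ({lo:.5f}, {hi:.5f})")
```

At a denominator of one million, the limits span less than 0.002 in total, so essentially every hospital's observed ratio would fall outside them.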

2.2 Proposed modification of the double-square-root chart

We opted for the double-square-root (Shewhart) chart [13], in which the square root of the numerator is plotted against the square root of the difference between the denominator and the numerator. Like the funnel plot, this is an increasingly popular method in statistical healthcare quality control [14, 15]. However, it was essential to replace the traditional control limits, which are based on the underlying assumption of a binomial distribution (as in funnel plots) and are therefore much too narrow for our data, with newly defined ones. The new control limits were obtained by fitting linear regression through the origin (the rationale being that, e.g., no costs can be incurred without income, and no space can be used for a given purpose without any space) and taking the 95% prediction interval as the control limits.
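
A minimal sketch of these control limits (Python; the example data are synthetic, with one gross outlier planted at a ratio of 0.5 against a baseline of about 0.1) regresses √numerator through the origin on √(denominator − numerator) and flags points outside the 95% prediction interval, using n − 1 residual degrees of freedom since only the slope is estimated:

```python
import numpy as np
from scipy import stats

def dsr_chart_outliers(numerator, denominator, level=0.95):
    """Double-square-root chart with regression-through-origin control limits.

    y = sqrt(numerator) is regressed through the origin on
    x = sqrt(denominator - numerator); indices of points outside the
    prediction interval are returned.
    """
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    x, y = np.sqrt(den - num), np.sqrt(num)
    n = len(x)
    b = np.sum(x * y) / np.sum(x**2)           # slope of regression through origin
    resid = y - b * x
    s2 = np.sum(resid**2) / (n - 1)            # n - 1 df: only the slope is fitted
    se_pred = np.sqrt(s2 * (1 + x**2 / np.sum(x**2)))  # prediction s.e. at each x
    t = stats.t.ppf(1 - (1 - level) / 2, n - 1)
    return np.where(np.abs(resid) > t * se_pred)[0]

den = np.arange(1000, 2101, 100).astype(float)   # synthetic denominators
num = 0.1 * den                                  # baseline ratio of 0.1
num[5] = 750.0                                   # planted outlier (ratio 0.5)
print(dsr_chart_outliers(num, den))              # [5]
```

The flagged indices identify the outlying hospitals; plotting y against x with the two prediction-interval curves reproduces charts of the kind shown in Figure 2.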

Four examples of such charts are presented in Figure 2. They show indicators with no outliers (upper left), one outlier above the control limits (upper right), one outlier below the control limits (lower left) and two outliers (one above and one below the control limits; lower right).

Figure 2.

Examples of proposed double-square-root charts. The square root of the numerator is plotted against the square root of the difference between the denominator and the numerator; the control limits are obtained by fitting linear regression through the origin and taking the 95% prediction interval. Outliers are depicted as large filled circles.

3 Simulation set-up

We studied performance and agreement of the chosen methods through a large simulation study on realistic data. In accordance with the real data (Figure 1), two types of ratios were generated:

  • the small ones belonging to the [0,0.2] interval;
  • the large ones belonging to the [0.5,1] interval.

Samples of sizes 5, 10, 20, 25 and 30 were drawn from the three distributions that were found to best fit the empirical data. After automated fitting using EasyFit Professional 5.1 software (MathWave Technologies, Dnipropetrovsk, Ukraine), we chose the following distributions: Pert, three-parameter Burr (2) (referred to henceforth simply as Burr) and three-parameter loglogistic (3) (referred to henceforth as 3Ploglog). To further improve the resemblance to real data, we used the modified (four-parameter) Pert distribution (1) with the additional shape parameter γ [16] (referred to henceforth simply as Pert). Although Pert is a bounded distribution, Burr and 3Ploglog are only non-negative; they are nonetheless useful practical models for same-quantity ratios in rejection-based simulations because their parameters can be chosen so that large values are extremely rare (i.e., their right tail can be made extremely thin).

\[
f_{\text{Pert}}(x)=\frac{(x-\min)^{\alpha_1-1}(\max-x)^{\alpha_2-1}}{B(\alpha_1,\alpha_2)\,(\max-\min)^{\alpha_1+\alpha_2-1}},\qquad
\alpha_1=1+\gamma\,\frac{\text{mode}-\min}{\max-\min},\quad
\alpha_2=1+\gamma\,\frac{\max-\text{mode}}{\max-\min}
\tag{1}
\]
\[
f_{\text{Burr}}(x)=\frac{\alpha k}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}\left[1+\left(\frac{x}{\beta}\right)^{\alpha}\right]^{-(k+1)},\qquad x>0
\tag{2}
\]
\[
f_{\text{3Ploglog}}(x)=\frac{e^{z}}{\sigma\,(x-\xi)\,(1+e^{z})^{2}},\qquad z=\frac{\ln(x-\xi)-\mu}{\sigma},\qquad x>\xi
\tag{3}
\]
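
Assuming the parameterisations above, the three families correspond to standard scipy distributions: the modified Pert to a rescaled Beta with shapes α1 and α2, Burr to scipy's burr12, and the shifted loglogistic to scipy's fisk with shape 1/σ, location ξ and scale e^μ. A sketch using the small-ratio parameters from Section 3:

```python
import numpy as np
from scipy import stats

# small-ratio base populations from Section 3 (values are later divided by 1000)
mn, mx, mode, gamma = 0.0, 30.0, 10.0, 5.0
a1 = 1 + gamma * (mode - mn) / (mx - mn)   # Beta shape alpha_1 of the modified Pert
a2 = 1 + gamma * (mx - mode) / (mx - mn)   # Beta shape alpha_2
pert = stats.beta(a1, a2, loc=mn, scale=mx - mn)

burr = stats.burr12(c=2, d=3, scale=30)    # alpha = 2, k = 3, beta = 30
loglog = stats.fisk(c=1 / 1.28, loc=0, scale=10.0)  # xi = 0, exp(mu) = 10, sigma = 1.28

for name, dist in [("Pert", pert), ("Burr", burr), ("3Ploglog", loglog)]:
    print(f"{name}: median ratio = {dist.median() / 1000:.4f}")
```

All three medians land well inside the small-ratio interval [0, 0.2], consistent with the real data in Figure 1.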

Zero, one or two simulated outliers were included in the samples. The outliers were generated by increasing the relevant parameter (mode, scale parameter and mean for Pert, Burr and 3Ploglog, respectively) by 50%, 100%, 150% and 500% while holding other parameters (related to dispersion and shape) fixed. The simulation was performed with R [17] by using rejection sampling. The outliers, mc2d, lmom and actuar R packages were used.

First, data for the ratios were generated (drawn from the given distribution until all data were between 0 and 1) for the base sample and then for the outlier(s). The following parameters were used for the base population, whereby all the drawn values were divided by 1000:

  • Small ratios
    • Pert: min = 0, max = 30, mode = 10, γ = 5
    • Burr: α = 2, k = 3, β = 30
    • 3Ploglog: ξ = 0, μ = log(10), σ = 1.28
  • Large ratios
    • Pert: min = 400, max = 1000, mode = 700, γ = 5
    • Burr: α = 3, k = 10, β = 750
    • 3Ploglog: ξ = 0, μ = log(700), σ = 1.28

The outlier was always drawn from the same distribution as the base sample, except that the central tendency parameter of the distribution from which the outlier was drawn was larger. As mentioned earlier, its value was 150%, 200%, 250% and 600% of the value for the base population, that is, larger by a factor of 0.5, 1, 1.5 and 5, respectively. For the outlier population, the following parameters were increased:

  • mode for Pert;
  • β for Burr;
  • μ for 3Ploglog.
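
The sampling scheme can be sketched as follows (Python with scipy's fisk standing in for the 3Ploglog case; the helper name draw_ratios and the composition of nine base values plus one outlier are illustrative assumptions):

```python
import numpy as np
from scipy import stats

def draw_ratios(dist, size, rng):
    """Rejection sampling: keep draws from dist/1000 that land strictly in (0, 1)."""
    kept = []
    while len(kept) < size:
        v = dist.rvs(size=size, random_state=rng) / 1000.0
        kept.extend(v[(v > 0) & (v < 1)])
    return np.array(kept[:size])

rng = np.random.default_rng(1)

# 3Ploglog base population for small ratios: xi = 0, mu = log(10), sigma = 1.28
base = stats.fisk(c=1 / 1.28, loc=0, scale=10.0)
# outlier population: mu increased by 100%, i.e. scale exp(2 log 10) = 100
shifted = stats.fisk(c=1 / 1.28, loc=0, scale=100.0)

sample = np.concatenate([draw_ratios(base, 9, rng), draw_ratios(shifted, 1, rng)])
print(sample.round(4))
```

The same pattern applies to the Pert and Burr cases, with the mode and β increased instead of μ.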

Because of complexity and time constraints, only three situations were simulated:

  • no outliers;
  • one outlier at the right-hand side of the sample distribution;
  • two outliers at the right-hand side of the sample distribution.

Because correlation between the numerator and the denominator is required for the double-square-root control chart, once the ratios had been obtained, the numerators were drawn from the uniform distribution with the lower bound set to 0 and the upper bound adjusted so that the desired correlation was obtained. The desired correlation range was set between 0.2 and 0.6, which is a relatively wide span, in order to avoid convergence problems.

Under each condition, 1000 samples were generated (or slightly fewer if 1000 samples were not obtained within the 350-hour run-time, after which the simulation was stopped).

4 Simulation results

The results of the simulations are summarised in Table 1 (no-outlier condition), Table 2 (one simulated outlier) and Table 3 (two-outlier condition).

Table 1. Results of the simulations without outliers.

Note: MDSRCC denotes the modified double-square-root control chart. Valid denotes the percentage of simulations that passed the Shapiro–Wilk normality test. Each cell gives the mean / median / maximum number of outliers found and the percentage of simulations in which the test correctly identified no outliers.

Small ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 86             69             55             51             49
  Grubbs (conditional)      0.09/0/2/91    0.11/0/3/89    0.10/0/2/90    0.08/0/2/92    0.09/0/3/91
  Dixon (conditional)       0.02/0/1/98    0.04/0/1/96    0.04/0/2/96    0.04/0/2/96    0.04/0/2/96
  Nalimov (conditional)     0.33/0/3/73    0.70/0/5/52    1.27/1/6/28    1.50/1/6/23    1.69/2/7/16
  Grubbs (unconditional)    0.23/0/2/80    0.44/0/5/69    0.78/0/10/60   0.96/0/8/57    1.17/0/10/54
  Dixon (unconditional)     0.15/0/2/86    0.22/0/4/82    0.45/0/6/71    0.52/0/6/68    0.58/0/6/66
  Nalimov (unconditional)   0.46/0/3/63    1.13/1/6/38    2.34/2/9/17    2.91/2/10/13   3.35/3/11/9
  Boxplot                   0.47/0/2/61    0.60/0/4/55    0.97/1/6/47    1.22/1/6/43    1.42/1/6/40
  MDSRCC                    0.00/0/0/100   0.11/0/1/89    0.71/1/3/35    0.99/1/3/21    1.26/1/4/12

Large ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 87             74             63             63             63
  Grubbs (conditional)      0.06/0/2/94    0.06/0/2/94    0.07/0/2/93    0.07/0/3/93    0.07/0/3/94
  Dixon (conditional)       0.04/0/2/96    0.07/0/2/93    0.04/0/3/96    0.04/0/3/96    0.03/0/2/97
  Nalimov (conditional)     0.23/0/3/81    0.60/0/7/62    1.89/1/12/29   2.38/2/15/21   3.07/3/21/14
  Grubbs (unconditional)    0.10/0/2/91    0.07/0/3/93    0.07/0/4/94    0.08/0/3/93    0.07/0/4/94
  Dixon (unconditional)     0.09/0/2/92    0.11/0/2/90    0.13/0/5/89    0.13/0/6/90    0.11/0/5/92
  Nalimov (unconditional)   0.27/0/3/79    0.63/0/7/65    1.62/1/12/47   1.97/1/15/42   2.43/1/21/39
  Boxplot                   0.35/0/2/73    0.29/0/10/80   0.32/0/8/81    0.31/0/9/82    0.34/0/9/80
  MDSRCC                    0.00/0/0/100   0.10/0/1/90    0.71/1/3/37    0.96/1/4/24    1.30/1/4/12
Table 2. Results of the simulations with one outlier.

Note: MDSRCC denotes the modified double-square-root control chart. Valid denotes the percentage of simulations that passed the Shapiro–Wilk normality test. Each cell gives the mean / median / maximum number of outliers found and the percentage of simulations in which the test correctly identified exactly one outlier.

Small ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 47             21             10             8              7
  Grubbs (conditional)      0.17/0/2/15    0.22/0/3/20    0.29/0/2/27    0.27/0/2/26    0.28/0/2/27
  Dixon (conditional)       0.05/0/2/4     0.07/0/2/6     0.10/0/2/10    0.09/0/2/9     0.09/0/1/9
  Nalimov (conditional)     0.49/0/3/29    1.06/1/6/37    1.93/2/7/32    2.29/2/7/23    2.47/2/6/21
  Grubbs (unconditional)    0.66/1/2/51    1.00/1/6/53    1.57/1/8/51    1.78/1/10/49   2.00/1/12/47
  Dixon (unconditional)     0.56/1/2/48    0.67/1/5/52    1.16/1/7/53    1.20/1/7/53    1.26/1/8/52
  Nalimov (unconditional)   0.89/1/3/52    1.82/2/7/38    3.19/3/11/18   3.75/3/12/12   4.17/4/14/9
  Boxplot                   0.76/1/2/60    1.09/1/4/60    1.62/1/7/49    1.88/2/9/45    2.09/2/8/42
  MDSRCC                    0.00/0/0/0     0.48/0/1/47    1.00/1/3/70    1.18/1/3/64    1.35/1/4/57

Large ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 78             62             52             52             52
  Grubbs (conditional)      0.12/0/2/10    0.24/0/3/20    0.32/0/3/26    0.33/0/3/27    0.30/0/3/24
  Dixon (conditional)       0.06/0/2/5     0.16/0/4/13    0.18/0/3/15    0.20/0/3/16    0.18/0/3/15
  Nalimov (conditional)     0.40/0/3/24    1.02/1/7/29    2.39/2/14/20   3.01/3/14/16   3.62/3/22/13
  Grubbs (unconditional)    0.18/0/2/14    0.29/0/4/22    0.34/0/4/25    0.35/0/5/24    0.33/0/5/22
  Dixon (unconditional)     0.14/0/2/12    0.24/0/4/18    0.32/0/6/20    0.31/0/5/20    0.29/0/7/19
  Nalimov (unconditional)   0.45/0/3/24    1.05/1/7/24    2.09/1/14/13   2.63/2/17/10   3.07/2/22/8
  Boxplot                   0.49/0/2/26    0.57/0/10/25   0.62/0/20/25   0.66/0/25/24   0.63/0/30/22
  MDSRCC                    0.00/0/0/0     0.48/0/1/45    0.89/1/3/57    1.07/1/3/53    1.28/1/4/45
Table 3. Results of the simulations with two outliers.

Note: MDSRCC denotes the modified double-square-root control chart. Valid denotes the percentage of simulations that passed the Shapiro–Wilk normality test. Each cell gives the mean / median / maximum number of outliers found and the percentage of simulations in which the test correctly identified exactly two outliers.

Small ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 67             18             6              4              3
  Grubbs (conditional)      0.10/0/2/0     0.15/0/3/1     0.24/0/2/1     0.20/0/2/0     0.24/0/2/0
  Dixon (conditional)       0.03/0/2/0     0.03/0/2/0     0.05/0/2/0     0.06/0/2/0     0.05/0/1/0
  Nalimov (conditional)     0.42/0/3/8     1.00/1/6/13    2.07/2/7/21    2.44/2/7/16    2.63/2/7/18
  Grubbs (unconditional)    0.41/0/2/10    1.13/1/6/33    2.11/2/9/43    2.46/2/11/43   2.78/2/14/43
  Dixon (unconditional)     0.28/0/2/5     0.57/0/4/17    1.45/2/8/42    1.54/2/9/43    1.62/2/8/43
  Nalimov (unconditional)   0.71/0/3/19    2.19/2/7/34    3.95/4/12/18   4.48/4/12/13   4.98/4/14/9
  Boxplot                   0.43/0/2/0     1.26/1/4/46    2.21/2/6/49    2.52/2/9/45    2.75/2/8/43
  MDSRCC                    0.00/0/0/0     0.32/0/1/0     1.14/1/3/18    1.40/1/4/29    1.62/2/4/36

Large ratios
  Test (application)        n = 5          n = 10         n = 20         n = 25         n = 30
  Valid (%)                 72             58             45             43             45
  Grubbs (conditional)      0.02/0/2/0     0.02/0/4/0     0.04/0/4/0     0.04/0/2/0     0.04/0/3/0
  Dixon (conditional)       0.15/0/2/0     0.27/0/4/1     0.05/0/3/0     0.04/0/3/0     0.04/0/3/0
  Nalimov (conditional)     0.11/0/3/0     0.39/0/6/4     2.21/2/15/14   3.04/3/15/13   3.59/3/16/11
  Grubbs (unconditional)    0.03/0/2/0     0.04/0/4/0     0.06/0/4/1     0.12/0/5/2     0.14/0/4/2
  Dixon (unconditional)     0.16/0/2/0     0.30/0/4/1     0.21/0/6/2     0.20/0/7/2     0.19/0/5/2
  Nalimov (unconditional)   0.16/0/3/0     0.58/0/6/4     2.16/1/15/11   2.82/2/17/9    3.26/3/20/8
  Boxplot                   0.11/0/2/0     0.61/0/10/15   0.84/0/7/21    0.89/0/9/20    0.83/0/8/18
  MDSRCC                    0.00/0/0/0     0.37/0/1/0     1.11/1/3/25    1.37/1/3/34    1.59/2/4/37

When there were no outliers, the methods performed very well, with estimated accuracy above 90%, except for the boxplot rule and especially the Nalimov test, whose accuracy was 73% at n = 5 and dropped to merely 16% at n = 30. Naturally, the results of the formal outlier detection tests were better when we considered only those simulations in which normality was not rejected (i.e., in the conditional case), because otherwise ‘outliers’ were occasionally found simply because of the skewness of the distributions from which the samples were drawn. Overall, the modified double-square-root control chart performed best under this condition.

Under the one-outlier condition, the formal outlier detection tests performed worse in the conditional case than in the unconditional cases. This highlights the problem of normality testing with outliers, which is addressed in the next section. However, accuracy was also low in the unconditional cases and did not depend markedly on the sample size. The boxplot rule proved the most accurate for small ratios, whereas the modified double-square-root control chart gave the best results for large ratios. It is also noteworthy that the performance of the boxplot rule worsened as sample size increased, whereas the performance of the control chart improved.

When two outliers were simulated, all methods performed rather poorly. Similar to the one-outlier situation, the formal tests assuming normality performed much better when all the simulated samples were used (i.e., in the unconditional case). With small ratios, the boxplot rule achieved accuracy comparable with the formal tests applied unconditionally, whereas the modified double-square-root control chart was the least accurate. However, with large ratios, the control chart was the most accurate, whereas all other methods proved highly inaccurate.

To summarise, the performance of the methods varied greatly across the conditions. The formal small-sample tests became virtually useless when applied conditionally upon a passed normality pre-test in the presence of (especially two) outliers from a sample size of 10 onwards. Among the formal tests, the Dean and Dixon test performed worst overall. The simple boxplot method performed the most variably, but it was the only useful one for tiny samples. Our variant of the double-square-root control chart proved too conservative in tiny samples and too liberal under the no-outlier condition with n ≥ 20 (both conclusions holding also for the Nalimov test and for the boxplot rule for small ratios), but it appeared by far the most useful (although still far from perfect) for detecting actual outliers with larger n, especially with large ratios.

Regarding the chosen sample sizes, it should be noted that with sample sizes above 30, the simulation procedure failed to converge. However, this did not pose a serious limitation to our study, because we focused on small samples; moderate or large samples are rarely encountered in statistical comparisons of such quality indicators between hospitals and similar organisations. Samples are bound to be of small or moderate size because, particularly with financial indicators, the comparisons are meaningful only if truly comparable organisations are compared within a specific sector (e.g., hospital type) and/or a very homogeneous area (e.g., Slovenia, which is relatively uniformly urbanised).

5 Possibilities for robust normality pre-testing

As the starting point for this section, we take the normality test that is commonly attributed to Jarque and Bera [18]. In fact, as pointed out in [19], Bowman and Shenton [20] were the first to observe that under normality, the asymptotic means of the sample skewness and kurtosis statistics are 0 and 3, respectively; the asymptotic variances of the two statistics are 6/n and 24/n, respectively; and their asymptotic covariance is 0. Another version of the skewness–kurtosis test for normality was suggested by D'Agostino and Pearson [21].
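
For reference, the classical JB statistic follows directly from these asymptotics: with sample skewness b1 and kurtosis b2, JB = n(b1²/6 + (b2 − 3)²/24) is compared against a χ² distribution with 2 degrees of freedom. A minimal Python sketch:

```python
import numpy as np
from scipy import stats

def jarque_bera(x):
    """Classical Jarque-Bera statistic with asymptotic chi-square(2) p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = x - x.mean()
    s2 = np.mean(m**2)               # biased variance, as used in the JB statistic
    b1 = np.mean(m**3) / s2**1.5     # sample skewness
    b2 = np.mean(m**4) / s2**2       # sample kurtosis
    jb = n * (b1**2 / 6 + (b2 - 3)**2 / 24)
    return jb, stats.chi2.sf(jb, df=2)

rng = np.random.default_rng(0)
jb, p = jarque_bera(rng.normal(size=200))
print(f"JB = {jb:.3f}, p = {p:.3f}")
```

The robustifications discussed below replace the mean-based moments in this formula with location functionals and trimmed moments.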

A class of robust normality tests for small samples possibly containing outliers against Pareto tails has recently been proposed [22]. This class also contains tests that accommodate the kinds of alternative distributions that are known to be problematic for the Jarque–Bera (JB) test (e.g., bimodal, Weibull and uniform). The proposal can be seen as an extension of the robust modification of the JB test [23]. The basis for the proposed class of tests is a location functional, denoted by T(F) [24], whereby the relevant location functionals (T(i) for i = 0…3) are the arithmetic mean (T(0)), the median (T(1)), the trimmed mean (T(2)(s)) and the pseudo-median (T(3)). Relaxing the form of the jth theoretical moment estimator μ_j = E[(X − E(X))^j] by using the trimmed analogue \(\hat{\mu}_j = \frac{1}{n-2r}\sum_{i=r+1}^{n-r}(X_{i:n}-T(F_n))^j\), where X_{1:n} < X_{2:n} < … < X_{n:n} are the order statistics, the new class of test statistics (denoted by RTJB) can be defined as

[Display equation (4), defining the RTJB test statistic, is not reproduced here.]

Whereas k1(n) and k2(n) are theoretical values of the proportions for the first and the second term of the statistic that depend on the sample size, the constants C1 and C2 can be obtained from Monte Carlo simulations, whereby their values for small samples under trimming (r > 0) differ from those without trimming (r = 0). The constants K1 and K2 are small-sample mean corrections ensuring that asymptotic normality is obtained and, thus, that the asymptotic χ2 distribution of the test statistic is valid.

Illustrative special cases of this class include the ‘median robustified JB test’ (5), the ‘trimmed-mean robustified JB test’ with trimming parameter s = 5 (6), the ‘pseudo-median robustified JB test’ (7) and the ‘trim–trim robustified JB test’ with trimming parameters s = r = 1 (8):

[Display equations (5)–(8), defining these four special cases, are not reproduced here.]

Some theoretical results on the consistency and asymptotic χ2 distribution of these and other RTJB class tests can be found in [22], where they were introduced. Here, we briefly summarise some preliminary results of power and size comparisons through simulations with various distributions and sample sizes:

  • In samples of size 25, the JB test and its successors [23] had nearly zero power against the Beta(0.5, 0.5) alternative, as did the robust directed test of normality against heavy-tailed alternatives [25], even for sample sizes up to 200. The Shapiro–Wilk and Anderson–Darling tests were the most powerful, whereas some RTJB class tests were almost as powerful.
  • Against the Burr (2, 1, 1) alternative in small samples, the power of the JB test was comparable with that of the Anderson–Darling test, whereas the Shapiro–Wilk test and some tests from the RTJB class were more powerful. With a sample size of 100, the power of all tests reached 1. Against the logistic alternative, all normality tests had very low power in small samples (because of the resemblance of the logistic to the normal distribution).
  • The most powerful tests for normality against a mixture of two equally probable normal distributions with means 0 and 5 and unit variance were the D'Agostino, the Anderson–Darling and the Shapiro–Wilk tests, whereas the power of the RTJB class tests and of the (original and robust) JB test was very low for n = 25; unlike for the JB tests, however, it improved quickly for the RTJB class tests at n = 50.
  • In simulated samples from the standard normal distribution containing one extreme outlier (from the normal distribution with a mean of 3 and unit variance), which can be viewed as assessing a particular definition of the size of the tests (i.e., taking the proper decision to be retaining the null hypothesis of normality despite the outlier), the simpler RTJB class tests did slightly better than the D'Agostino and the JB tests, although they were still very much on the liberal side. The Shapiro–Wilk test performed a little less liberally, whereas the estimated ‘size’ was even closer to nominal for the Anderson–Darling test. The robust medcouple test [26] retained the proper size irrespective of the sample size, but it proved the least powerful in the power comparisons mentioned previously. Encouragingly, the RTJB class tests with trimming applied to both the location and the generalised moment had almost as good power as the simpler RTJB class tests, whereas their estimated ‘size’ was very close to the nominal 5%.

Much work remains to be done regarding the RTJB class tests, including simulations with other alternatives and under different assumptions. It is inherently challenging to construct tests that are both powerful and robust; there is always a trade-off between power and robustness, and this is where the RTJB class tests of normality might offer a useful compromise.

6 Summary and future directions

A large simulation study of outlier detection in small samples of random-denominator same-quantity ratios with a strongly correlated numerator and denominator was performed. The performance of the following three types of methods was assessed: the common formal outlier-detection tests (the Grubbs, Dean and Dixon, and Nalimov tests), applied unconditionally and conditionally upon the results of the (Shapiro–Wilk) normality test; the boxplot rule; and the double-square-root control chart (for which we introduced regression-through-origin-based control limits). The Pert, Burr and 3Ploglog distributions (which fitted the real data best) were used with zero, one or two outliers in the simulated samples of sizes 5 to 30. Small (below 0.2, right skewed) and large (above 0.5, more symmetrical) ratios were simulated. The performance of the methods varied greatly across the conditions. The formal small-sample tests became useless when applied conditionally upon a passed normality pre-test in the presence of (especially two) outliers from a sample size of 10 onwards. The boxplot rule performed the most variably, but it was the only useful one for tiny samples. Our variant of the double-square-root control chart proved too conservative in tiny samples and too liberal for samples of size 20 or more without outliers, but it appeared the most useful for detecting actual outliers in samples of the latter size (especially with large ratios).

Following further research on robust normality testing, it might be useful to repeat the first part of our outlier-detection simulations (i.e., the conditionally applied formal outlier tests) with different normality pre-tests. Improved normality pre-testing should improve the feasibility and usability of outlier tests in small samples because the naïve approach of abandoning normality pre-testing resulted in (too) many false alarms in the no-outlier simulations.

Putting our work in a broader context, we should first recognise that the extensive simulation approach to the problems of outliers and robustness owes its main origin to the work of Andrews, Hampel, Huber, Tukey and associates in the context of the Princeton Robustness study [27]. The statistical process control literature has also dealt with outliers, although primarily within the context of time-dependent process data [28]. It should also be noted that the entire outlier detection approach, which we assessed through simulations, should be viewed as heuristics rather than as statistical testing or strictly probabilistic decision-making. Constructing a general statistical test for outlier detection is, in fact, an unsolvable problem without specific substantive assumptions answering the question ‘what is an outlier?’ This problem has some similarity with the problem of choosing the correct number of clusters posed nearly four decades ago [29]. Although the problem of clustering is challenging enough, even in one dimension [30], a clustering approach, which has been mentioned as an option for dealing with over-dispersion [11], might be worth trying. Another alternative for analysing the kind of data that we addressed, which is better explored and established, is bootstrap tolerance intervals [31].

In conclusion, it may not be surprising that seemingly simplistic methods, exemplified by boxplots and control charts, which combine a robust ‘eye-balling’ approach with a ‘touch’ of implicit inference and vast experience from practical data analysis [5, 32], have yet again proven their value in statistics applied to a real-life industrial and organisational setting.