The performance of $\bar{X}$ control charts for large non-normally distributed datasets

Because of digitalization, many organizations possess large datasets. Furthermore, measurement data are often not normally distributed. However, when samples are sufficiently large, the central limit theorem may be invoked for the sample means. In this article, we evaluate the use of the central limit theorem for various distributions and sample sizes, as well as its effects on the performance of a Shewhart control chart for these large non-normally distributed datasets. To this end, we treat the sample means as individual observations and monitor them with a Shewhart control chart for individual observations. We study the unconditional performance, expressed as the expectation of the in-control average run length (ARL), as well as the conditional performance, expressed as the probability that the control chart based on estimated parameters will have a lower in-control ARL than a specified desired in-control ARL. We use recently developed factors to correct the control limits to obtain a specified conditional or unconditional in-control performance. The results in this paper indicate that the $\bar{X}$ control chart should be applied with caution, even with large sample sizes.


INTRODUCTION
Shewhart control charts are commonly used to monitor process data. Typically, the performance of such control charts is heavily dependent on the assumption of normally distributed data. In practice, this assumption is often violated. For example, Alwan 1 analyzed 235 real datasets and concluded that most of these datasets do not meet the assumptions underlying the traditional control charts.
Since recent advances have led to an increase in the amount of available information, one way to work around the violation of the normality assumption is to gather larger datasets and use subgroup averages instead of individual observations. Because averages are approximately normally distributed under certain conditions, according to the central limit theorem (CLT), this should largely resolve the issue of non-normally distributed data (cf Billingsley 2 ).
While using averages instead of individual observations is suitable for many statistical techniques, statistical process monitoring (SPM) differs in that we are interested in the far tail behavior of the distribution. This means that, even when the statistic is almost normally distributed, small deviations in the far tails can lead to poor control chart performance in terms of the false alarm rate and the average run length (ARL). In this paper, we therefore investigate the performance of Shewhart-type $\bar{X}$ control charts for large non-normally distributed datasets using the convolutions of the distributions. To the best of our knowledge, the performance of Shewhart $\bar{X}$ control charts in this setting has not been investigated thus far.
The paper is structured as follows. In the next section, we briefly describe the model and control charts considered in this paper. Subsequently, in Section 3, the CLT is summarized followed by the convolutions of various probability distributions. In Section 4, we investigate the differences between the normal and non-normal convolutions. Next, Section 5 describes the performance of the Shewhart control chart based on large non-normally distributed datasets. Finally, Section 6 provides some concluding remarks.

THE CLASSICAL SHEWHART CONTROL CHART
Because of the increase in data supply and storage, organizations nowadays often possess large datasets. As the CLT states that, under certain conditions, the sample means are approximately normally distributed when the samples are sufficiently large, we could treat the sample means as individual observations and use a Shewhart control chart for individual observations under normal theory. To construct such a chart, m samples of size n are collected when the process is assumed to be in control. On the basis of these data, the process mean is estimated by

$$\bar{\bar{X}} = \frac{1}{m}\sum_{i=1}^{m}\bar{X}_i, \qquad \bar{X}_i = \frac{1}{n}\sum_{j=1}^{n}X_{ij}, \tag{1}$$

where $X_{ij}$ is the j-th observation in the i-th subgroup (i = 1, 2, … , m and j = 1, 2, … , n), and the process standard deviation is estimated from the standard deviation of the sample means $\bar{X}_i$:

$$S = \left(\frac{1}{m-1}\sum_{i=1}^{m}\left(\bar{X}_i - \bar{\bar{X}}\right)^2\right)^{1/2}.$$

An unbiased estimator of the standard deviation of the sample means ($\sigma/\sqrt{n}$) is

$$\frac{S}{c_4(m)}, \qquad c_4(m) = \sqrt{\frac{2}{m-1}}\,\frac{\Gamma(m/2)}{\Gamma((m-1)/2)}. \tag{2}$$

The choice of the estimator of the standard deviation of the sample means is based on Cryer and Ryan. 3 We have also evaluated the alternative and more traditional estimator based on moving ranges (which was also used by Roes et al 4 ). However, the use of this estimator has not improved the performance of the Shewhart $\bar{X}$ control chart, which confirms the result of Cryer and Ryan. 3

[Correction added on 19 June 2018, after first online publication: the running header has been corrected.]

The control limits based on estimated parameters are given by

$$\widehat{UCL} = \bar{\bar{X}} + k\,\frac{S}{c_4(m)}, \qquad \widehat{LCL} = \bar{\bar{X}} - k\,\frac{S}{c_4(m)}, \tag{3}$$

with $\widehat{UCL}$ and $\widehat{LCL}$ the respective upper and lower control limits and k the factor used to achieve the desired in-control performance. When the process parameters are known, k is commonly set equal to 3, which yields a false alarm rate of 0.0027 or, equivalently, an ARL of 370.4. However, when the process parameters are unknown, other values can be chosen to match a certain desired performance. A design that attains the desired performance in expectation, that is, averaged over the charts that different practitioners would estimate, targets the unconditional performance of the control chart.
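The estimators and limits described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`c4`, `estimated_limits`, `known_parameter_arl`) are hypothetical, the constant $c_4(m)$ is taken in its usual normal-theory form, and the factor k is simply passed in (the published correction factors are not reproduced here).

```python
import math
from statistics import NormalDist

def c4(m: int) -> float:
    """Normal-theory unbiasing constant c4(m) for a standard deviation of m values."""
    return math.sqrt(2.0 / (m - 1)) * math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2))

def estimated_limits(samples, k=3.0):
    """Return (LCL, UCL) from in-control data given as m lists of n observations."""
    m = len(samples)
    xbars = [sum(s) / len(s) for s in samples]        # sample means, used as individual observations
    grand_mean = sum(xbars) / m                       # estimate of the process mean
    s = math.sqrt(sum((x - grand_mean) ** 2 for x in xbars) / (m - 1))
    sigma_hat = s / c4(m)                             # estimate of sigma / sqrt(n)
    return grand_mean - k * sigma_hat, grand_mean + k * sigma_hat

def known_parameter_arl(k: float) -> float:
    """In-control ARL of a two-sided k-sigma chart for normal data with known parameters."""
    false_alarm_rate = 2.0 * (1.0 - NormalDist().cdf(k))
    return 1.0 / false_alarm_rate

if __name__ == "__main__":
    print(estimated_limits([[1, 3], [3, 5], [5, 7]]))
    print(known_parameter_arl(3.0))   # roughly 370.4 for k = 3
```

With k = 3 and known parameters, `known_parameter_arl` reproduces the familiar false alarm rate of about 0.0027; with estimated limits the realized rate depends on the particular dataset.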
Recently, factors $k_u$ have been derived to ensure that the expectation of the in-control ARL (EARL) is equal to a specified value ($\mathrm{EARL}_0$) (see Goedhart et al 5 ).
Another recent development is to evaluate control chart designs by the variation in the in-control ARLs of the individually estimated control charts, also called conditional control charts. Saleh et al 6 investigated the conditional performance of the traditional control charts based on estimated parameters. They show that, for estimated control limits with k = 3, the probability of ending up with an estimated chart whose in-control conditional ARL (CARL) is lower than 370.4 is considerable. Goedhart et al 7 developed new correction factors $k_c$ for control charts in order to ensure that the probability ($P_E$) that a design delivers an estimated control chart with an in-control CARL lower than a specified value ($\mathrm{CARL}_0$) is at most a specified probability (p).
In this article, we study both the unconditional and conditional performance of the control chart constructed with (3), including the newly developed factors, for non-normally distributed data and various sample sizes (n = 5, 30, 50, 100, 250, 1000). With this model, we can investigate whether the CLT works well and whether the newly developed correction factors are applicable to large non-normal datasets as well. We consider the normal distribution, the standard uniform distribution, heavy-tailed symmetrical distributions (Student's $t_4$ and $t_{10}$ and the logistic distribution), and skewed distributions (the lognormal, Gamma(5, 1), Gamma(5/2, 2) ∼ $\chi^2_5$, and $\chi^2_{20}$ distributions). The distribution of the sample means for any one of these non-normal distributions can be found using the convolution of that non-normal distribution, ie,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i = \frac{C_n}{n},$$

where $C_n$ is the convolution of n i.i.d. random variables with distribution F. In the next section, we give the distribution of $C_n$ for the considered non-normal distributions.

THE DISTRIBUTION OF THE SAMPLE MEAN
Let $X_1, X_2, \ldots, X_n$ be n i.i.d. observations drawn from F, with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2 < \infty$. Then, as n tends to infinity, the random variables $\sqrt{n}(\bar{X} - \mu)$ converge in distribution to a normal $N(0, \sigma^2)$ (cf Billingsley 2 ), ie,

$$\sqrt{n}\,(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2).$$

Hence the asymptotic distribution of the sample means is normal under the above restrictions. The exact distribution for finite values of n can be obtained by evaluating the convolution. To assess the performance of the Shewhart control chart for sample means of non-normally distributed samples, we need the distributional properties of the convolution of these samples: $C_n = \sum_{i=1}^{n} X_i$. The convolutions allow us to investigate the distribution of the sample means of non-normal distributions and to compare it with the asymptotic normal distribution given by the CLT.
The convolutions are given below; further details on the derivations and approximations are given in the appendix.

The convolutions

The normal distribution
The convolution of i.i.d. normal random variables is again a normal distribution, with mean $n\mu$ and variance $n\sigma^2$: $C_n \sim N(n\mu, n\sigma^2)$.

The uniform distribution
The convolution of i.i.d. standard uniform random variables has an Irwin-Hall (IH) distribution, which has a piecewise polynomial probability density function with parameter n (see Hall 8 ): $C_n \sim \mathrm{IH}(n)$.
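As an illustration of how the Irwin-Hall distribution can be used to inspect the tails, the following sketch evaluates its distribution function from the standard closed-form alternating sum and compares the mass below the normal-theory lower 3-sigma point with the normal value. The function names are hypothetical, and the floating-point sum is only meant for moderate n (roughly n up to 50), since cancellation grows with n.

```python
import math
from statistics import NormalDist

def irwin_hall_cdf(x: float, n: int) -> float:
    """P(C_n <= x) for the sum C_n of n i.i.d. standard uniform variables."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    if x > n / 2:
        # The distribution is symmetric about n/2; this keeps the alternating sum small.
        return 1.0 - irwin_hall_cdf(n - x, n)
    total = 0.0
    for j in range(int(math.floor(x)) + 1):
        total += (-1) ** j * math.comb(n, j) * (x - j) ** n
    return total / math.factorial(n)

def lower_tail_beyond_k_sigma(n: int, k: float = 3.0) -> float:
    """Mass of C_n below its mean minus k standard deviations (equal to the upper tail by symmetry)."""
    mean_sum, sd_sum = n / 2, math.sqrt(n / 12)
    return irwin_hall_cdf(mean_sum - k * sd_sum, n)

if __name__ == "__main__":
    normal_tail = NormalDist().cdf(-3.0)
    for n in (5, 30):
        print(n, lower_tail_beyond_k_sigma(n), normal_tail)
```

For small n the uniform tail mass falls below the normal value, which is the thin-tail behavior that produces very long run lengths when normal-theory limits are used.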

The Student's $t_\nu$ distribution with $\nu$ degrees of freedom
For $\nu = 1$, $t_1$ is equal to a standard Cauchy distribution and its convolution $C_n$ has a Cauchy distribution as well (see Blyth 9 ): $C_n \sim \mathrm{Cauchy}(0, n)$, where 0 and n denote the location and scale parameters of the Cauchy distribution, respectively. Note that the conditions needed to apply the CLT do not hold in this case, as the Cauchy distribution has no finite mean and variance. For $\nu > 1$, we use an approximation based on the numerical inversion of the characteristic function.
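As a brief check of the Cauchy case, the standard characteristic-function argument can be written out as follows. The symbol $\varphi$ for the characteristic function is notation assumed for this illustration.

```latex
\varphi_{X}(t) = e^{-|t|}
\;\Longrightarrow\;
\varphi_{C_n}(t) = \bigl(e^{-|t|}\bigr)^{n} = e^{-n|t|}
\quad\text{(Cauchy with location } 0 \text{ and scale } n\text{)},
\qquad
\varphi_{C_n/n}(t) = \varphi_{C_n}(t/n) = e^{-|t|}.
```

The sample mean of standard Cauchy observations is therefore again standard Cauchy for every n: averaging does not narrow the distribution, in line with the CLT conditions failing here.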

The logistic distribution
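The standardized version of the sum of i.i.d. logistically distributed random variables with $\mu = 0$ and $s = 1$ can be approximated by a standardized Student's t distributed random variable with $\nu = 5n + 4$ degrees of freedom (George and Mudholkar 10 ):

$$\frac{C_n}{\pi\sqrt{n/3}} \;\overset{\text{approx}}{\sim}\; \sqrt{\frac{\nu-2}{\nu}}\;t_{\nu}, \qquad \nu = 5n + 4.$$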
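One way to see where the value 5n + 4 comes from is to match the kurtosis of the two standardized variables. This moment-matching reading is offered as a sketch and is not a derivation taken from the text above; it uses the known excess kurtosis of the logistic distribution (6/5) and of the Student's t distribution with $\nu > 4$ degrees of freedom.

```latex
\text{excess kurtosis of logistic}(0,1) = \tfrac{6}{5}
\;\Longrightarrow\;
\text{excess kurtosis of } C_n = \frac{6}{5n},
\qquad
\text{excess kurtosis of } t_\nu = \frac{6}{\nu-4},
\qquad
\frac{6}{\nu-4} = \frac{6}{5n} \iff \nu = 5n + 4 .
```

Both distributions are symmetric, so once the variances are standardized the fourth moment is the first one that can differ.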

The lognormal distribution
The distribution of the convolution $C_n$ of the lognormal distribution can be approximated using 2 methods: the Fenton-Wilkinson approximation by Fenton 11 or the Pearson IV approximation by Nie and Chen. 12 The Pearson IV approximation turns out to be more accurate than the Fenton-Wilkinson approximation, as it matches 2 more moments (see Section 3.2). In the sequel, we will use the Pearson IV approximation, which has 4 parameters: a location parameter, a positive scale parameter, and 2 shape parameters (the first, m, larger than 1/2 and the second nonzero).

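The gamma $\Gamma(\alpha, \beta)$ distribution with parameters $\alpha$ and $\beta$
If $X_i$ is gamma distributed with shape parameter $\alpha$ and scale parameter $\beta$, then its convolution is gamma distributed with parameters $n\alpha$ and $\beta$: $C_n \sim \Gamma(n\alpha, \beta)$.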

The chi-squared $\chi^2_\nu$ distribution with $\nu$ degrees of freedom
The convolution of n i.i.d. chi-squared random variables with $\nu$ degrees of freedom is again a chi-squared distribution, with $n\nu$ degrees of freedom: $C_n \sim \chi^2_{n\nu}$.
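The closure of the gamma and chi-squared families under convolution follows from a short moment generating function argument, sketched below. The symbols $\alpha$ (shape) and $\beta$ (scale) are notation assumed for this illustration; the scale reading is the one under which Gamma(5/2, 2) coincides with $\chi^2_5$.

```latex
M_X(t) = (1 - \beta t)^{-\alpha}, \quad t < 1/\beta
\;\Longrightarrow\;
M_{C_n}(t) = \bigl(M_X(t)\bigr)^{n} = (1 - \beta t)^{-n\alpha}
\;\Longrightarrow\;
C_n \sim \Gamma(n\alpha, \beta),
\qquad
\chi^2_\nu = \Gamma\!\left(\tfrac{\nu}{2}, 2\right)
\;\Longrightarrow\;
C_n \sim \Gamma\!\left(\tfrac{n\nu}{2}, 2\right) = \chi^2_{n\nu}.
```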

Accuracy of the approximated distributions
As reported in the previous section, the convolutions of the Student's t (with $\nu > 1$), logistic, and lognormal distributions have to be approximated. In the graphs in the left column of Figure 1, the approximated densities of the convolutions for the $t_{10}$, $t_4$, logistic, and lognormal distributions are plotted and compared with the empirical distribution based on 6 million samples. The graphs in the middle and right columns of Figure 1 zoom in on the 0.135th and 99.865th percentiles of the distributions. The graphs show that the approximated $t_{10}$, $t_4$, and logistic convolutions are accurate. For the lognormal approximations, we find that the Pearson IV approximation is closer to the empirical distribution than the Fenton-Wilkinson approximation. Thus, we will use the Pearson IV approximation in the sequel.

EVALUATION OF THE CENTRAL LIMIT THEOREM
To investigate the differences between the actual distribution of the sample mean and the corresponding normal distribution, we have plotted both distributions and their tail behaviors. In Figures 2 to 4, we have used n = 5, 30, 250 and $\alpha = 0.0027$ to investigate the tail behaviors. The graphs on the left give the densities, while the graphs in the middle and on the right zoom in on the 0.135th and 99.865th percentiles of the distributions. The graphs show that, for a sample size of n = 30 or larger, the convolutions of the uniform, $t_{10}$, and logistic distributions do not deviate much from the normal distribution. The distribution of the $t_4$ convolution, however, clearly has wider tails than the normal distribution.
The overall distribution of the gamma convolution is quite close to normal, with Gamma(5/2, 2) ∼ $\chi^2_5$ closer to normal than Gamma(5, 1). When we zoom in on the tail behavior, the gamma distributions show skewed tail behavior, with narrower tails on the left and wider tails on the right than the normal distribution.
The $\chi^2_{20}$ convolution deviates a little from the normal distribution, but less so than the $\chi^2_5$ convolution. The lognormal convolution shows the largest difference with the normal distribution. The distribution of the lognormal convolution is still strongly skewed for large values of n (n = 250).
Note that when we consider a relatively small sample size (n = 5), there are large differences for all distributions. This indicates that the normal approximation is not good enough for small sample sizes.

Simulation procedure
To evaluate the control chart performance, we conduct 10 000 simulation runs for each parameter combination. For each simulation run:
1. A dataset consisting of m samples of size n is generated. On the basis of these data, $\mu$ is estimated by $\bar{\bar{X}}$ and $\sigma/\sqrt{n}$ is estimated by $S/c_4(m)$, using (1) and (2). Next, $\widehat{UCL}$ and $\widehat{LCL}$ are determined using (3). Factor $k_u$ is based on Goedhart et al 5 and factor $k_c$ on Goedhart et al. 7
2. For each dataset, the conditional false alarm rate (CFAR) is calculated as $\mathrm{CFAR} = 1 - P(\widehat{LCL} < \bar{X} < \widehat{UCL}) = 1 - P(n\widehat{LCL} < C_n < n\widehat{UCL})$ using the convolutions of Section 3.1. The CARL is given by 1/CFAR.
When we perform the above procedure, we end up with 10 000 CARLs of individually estimated control charts. When $k_u$ is used, the EARL is estimated by averaging the 10 000 CARLs of the simulated control charts. When $k_c$ is used, the exceedance probability ($P_E$) is obtained by determining the percentage of CARLs lower than a specified value ($\mathrm{CARL}_0$). Both the unconditional and conditional results were verified using the empirical distribution of the non-normal distributions.
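The procedure above can be sketched for one of the considered distributions. The example below assumes Gamma(5, 1) data, for which the convolution is Gamma(5n, 1), so the CFAR can be computed exactly from the Erlang survival function. All function names are hypothetical, the factor k is passed in as a plain number (the published factors $k_u$ and $k_c$ are not reproduced), and the number of runs is kept far below 10 000 for brevity.

```python
import math
import random

def c4(m):
    """Normal-theory unbiasing constant for a standard deviation of m values."""
    return math.sqrt(2.0 / (m - 1)) * math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2))

def erlang_sf(x, shape):
    """P(C > x) for C ~ Gamma(shape, 1) with integer shape, summed term by term in log space."""
    if x <= 0:
        return 1.0
    log_x = math.log(x)
    return min(1.0, sum(math.exp(-x + j * log_x - math.lgamma(j + 1)) for j in range(shape)))

def simulate_carls(m, n, k, runs, seed=1):
    """CARLs of `runs` estimated charts for Gamma(5, 1) data (steps 1 and 2)."""
    rng = random.Random(seed)
    carls = []
    for _ in range(runs):
        # Step 1: generate m samples of size n and estimate the limits.
        xbars = [sum(rng.gammavariate(5.0, 1.0) for _ in range(n)) / n for _ in range(m)]
        centre = sum(xbars) / m
        s = math.sqrt(sum((x - centre) ** 2 for x in xbars) / (m - 1))
        lcl, ucl = centre - k * s / c4(m), centre + k * s / c4(m)
        # Step 2: CFAR = P(C_n <= n*LCL) + P(C_n >= n*UCL) with C_n ~ Gamma(5n, 1).
        cfar = (1.0 - erlang_sf(n * lcl, 5 * n)) + erlang_sf(n * ucl, 5 * n)
        carls.append(1.0 / cfar if cfar > 0 else math.inf)
    return carls

def summarize(carls, carl0):
    """Estimated EARL (mean CARL) and exceedance probability P(CARL < carl0)."""
    earl = sum(carls) / len(carls)
    p_exceed = sum(c < carl0 for c in carls) / len(carls)
    return earl, p_exceed

if __name__ == "__main__":
    carls = simulate_carls(m=30, n=5, k=3.0, runs=200)
    print(summarize(carls, carl0=370.4))
```

Other distributions would follow the same pattern, with `erlang_sf` replaced by the distribution function of the corresponding convolution.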
We expect that the higher $\mathrm{EARL}_0$ or $\mathrm{CARL}_0$, the larger the sample size should be to ensure that the performance of the control charts is as desired. This is because the higher these values are, the more our interest moves towards the far tail of the distribution of the sample means, where minor deviations from the normal approximation have more impact on the performance. For this reason, we consider various values for $\mathrm{EARL}_0$ and $\mathrm{CARL}_0$, namely, 1000, 370.4, and 100.
Finally, as we expect that the correction factors are more accurate when the sample size (n) is larger, we consider a broad range of values, namely, n = 5, 30, 50, 100, 250, 1000. For the number of samples m, we take the values m = 30, 50, 100, 200.

Unconditional performance
In this section, we present the simulation results of the control charts based on (3) and $k_u$ as defined in Goedhart et al. 5 Tables 1 to 3 present the results for an $\mathrm{EARL}_0$ equal to 1000, 370.4, and 100, respectively. Each table presents the EARL and the 5th, 50th, and 95th percentiles of the CARL distribution.
Each table shows that the larger the sample size (n), the closer the EARL is to its desired value $\mathrm{EARL}_0$, and so the more applicable the correction factor is. Increasing the number of samples (m) also reduces the deviation in performance with respect to the case of normally distributed data, but the impact of m is weaker than the impact of n, as was to be expected. The value of $\mathrm{EARL}_0$ also matters: the higher $\mathrm{EARL}_0$, the larger the sample size should be to obtain a performance that resembles the performance under normality. This can be explained by the fact that the relative difference between the distributions of the means based on the non-normal and normal distributions is largest in the tails of the distributions. To give an example, for the case $\mathrm{EARL}_0 = 1000$, the $t_{10}$ and logistic distributions require a sample size of 100 or larger in order to obtain a reasonable in-control performance with the given correction factors, while, for the case $\mathrm{EARL}_0 = 100$, a sample size of 30 is sufficient to obtain the desired EARL values.
As discussed in Section 4, the uniform distribution is the only distribution whose convolution has thinner tails than the normal distribution on both sides. This produces extremely large EARL values for small n. Furthermore, as the uniform distribution is bounded, some of the estimated (conditional) control limits for small values of n produce a CFAR of zero and hence an infinite CARL. Tables 1 to 3 report, in the second pair of parentheses, the number of infinite values found for the uniform distribution.
In Section 4, we already indicated large differences between the normal distribution and the distributions of the lognormal and $t_4$ convolutions, and small deviations for the uniform, $t_{10}$, logistic, Gamma(5, 1), Gamma(5/2, 2) ∼ $\chi^2_5$, and $\chi^2_{20}$ convolutions. The EARL results confirm these observations: for all values of n and m, the lognormal EARL values are consistently far below the desired $\mathrm{EARL}_0$, reflecting the strong skewness observed in the analysis of the convolutions.

Conditional performance
In this section, we present the results of the control charts based on (3) with $k_c$ such that the probability of having an in-control CARL lower than a specified value ($\mathrm{CARL}_0$) is equal to p (cf Goedhart et al 7 ). We set p = 10%. Tables 4 to 6 present the realized exceedance probabilities $P_E$ for a specified $\mathrm{CARL}_0$ of 1000, 370.4, and 100, respectively. Each table presents the results for various sample sizes (n = 5, 30, 50, 100, 250, 1000), various numbers of samples (m = 30, 50, 100, 200), and various distributions (normal, uniform, $t_{10}$, $t_4$, logistic with $\mu = 0$ and $s = 1$, lognormal with $\mu = 0$ and $\sigma = 1$, Gamma(5, 1), Gamma(5/2, 2) ∼ $\chi^2_5$, and $\chi^2_{20}$).
As for the unconditional case, the tables show that the larger the sample size (n), the closer $P_E$ is to its desired value p (10%), and so the better the applicability of the control charts. Also, the value of $\mathrm{CARL}_0$ has an impact: the lower $\mathrm{CARL}_0$, the closer the control chart performance is to the desired performance. This can be explained by the increase in relative difference further into the tails of the distributions.
The normal approximation is worst in the case of the lognormal distribution, for which the deviation of $P_E$ from p = 10% is the largest. A very large sample size (n) is needed to guarantee a desired conditional performance. In the case of $\mathrm{CARL}_0 = 100$, a sample size of 1000 gives reasonable $P_E$ values, also for the lognormal distribution, while for $\mathrm{CARL}_0 = 1000$ and 370.4 even a sample size of 1000 is not large enough to ensure the right exceedance probabilities.
Interestingly, increasing m actually increases $P_E$ for the non-normal distributions in most situations. For example, the $t_4$ distribution for $\mathrm{CARL}_0 = 370.4$ and n = 50 has a $P_E$ of 17.2% for m = 30. With m increased to 200, 40.3% of the CARLs for $t_4$ are below the desired $\mathrm{CARL}_0 = 370.4$. This can be explained by a decrease in parameter estimate variation and thus a decrease in the constant $k_c$, causing tighter control limits.

SUMMARY AND CONCLUDING REMARKS
In this paper, we have studied the applicability of the CLT to large non-normal datasets. According to the CLT, sufficiently large samples should lead to normally distributed sample averages. However, since SPM is concerned with the far tail of the distribution, it was unclear whether the convergence to normality would be sufficient.
In this research, we have thus investigated whether the charting constants that are designed for normally distributed data can also be applied to large non-normal datasets. In particular, we have applied the Shewhart control chart for individual observations to monitor the sample means of non-normally distributed datasets.
The study demonstrates that the appropriateness of the control charting constants, also for non-normally distributed data, depends on various factors. These factors include the sample size (n), the number of samples (m), the specified desired performance of the control chart, and the degree of the deviation from normality. When the deviation from normality is moderate (as is the case for the uniform, $t_{10}$, logistic, Gamma(5, 1), Gamma(5/2, 2) ∼ $\chi^2_5$, and $\chi^2_{20}$ distributions), a sample size of 100 is large enough to ensure appropriate use of the correction factors.
However, when the deviation from normality is substantial due to heavy tails ($t_4$) or substantial skewness (lognormal), the correction factors are not applicable even when

A.1 The normal distribution
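The convolution of i.i.d. normal random variables can be found using the moment generating function approach. The moment generating function of a convolution of n normally distributed variables $X \sim N(\mu, \sigma^2)$ is

$$M_{C_n}(t) = \left(M_X(t)\right)^n = \left(\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)\right)^n = \exp\!\left(n\mu t + \tfrac{1}{2}n\sigma^2 t^2\right),$$

which is just the moment generating function of a normal distribution with mean $n\mu$ and variance $n\sigma^2$, and hence $C_n \sim N(n\mu, n\sigma^2)$.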

A.2 The uniform distribution
As shown by Hall, 8 the convolution of i.i.d. standard uniform random variables has a piecewise polynomial probability density function of degree n − 1 which we denote as the IH(n) distribution.

A.3 The Student's t distribution with $\nu$ degrees of freedom
There is no closed form of the convolution of Student's t distributed random variables $X \sim t_\nu$ for $\nu > 1$ (see Walker and Saw 13 ), but approximations do exist. We use an approximation based on the numerical inversion of the characteristic function given by Witkovsky. 14 The characteristic function of the sum of Student's t distributed random variables, $C_n$, equals $\varphi_{C_n}(t) = \varphi_X^n(t)$, where the characteristic function of a single Student's t distributed random variable equals

$$\varphi_X(t) = \frac{K_{\nu/2}\!\left(\sqrt{\nu}\,|t|\right)\left(\sqrt{\nu}\,|t|\right)^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}},$$

in which $K_{\nu}\{z\}$ denotes the modified Bessel function of the second kind. The distribution function $F_{C_n}(x) = \Pr\{C_n \le x\}$ of $C_n$ is found using the inversion formula of Gil-Pelaez 15 :

$$F_{C_n}(x) = \frac{1}{2} - \frac{1}{\pi}\int_0^{\infty} \frac{\mathrm{Im}\!\left[e^{-itx}\,\varphi_{C_n}(t)\right]}{t}\,dt.$$

A.4 The logistic distribution
Now assume a logistic distribution for the random variable: $X \sim \mathrm{logistic}(\mu = 0, s = 1)$, which has variance $\pi^2/3$. The standardized version of the sum of the $X_i$ can be written as

$$Z_n = \frac{C_n}{\pi\sqrt{n/3}},$$

whose distribution can be approximated by that of

$$\sqrt{\frac{\nu-2}{\nu}}\;t_\nu$$

with $\nu = 5n + 4$ degrees of freedom. For more details on this approximation see George and Mudholkar. 10

A.5 The lognormal distribution
The moment generating function of the lognormal distribution does not exist, and its characteristic function has no closed form. The distribution of the convolution $C_n$ is therefore approximated, using 2 methods. First, the Fenton-Wilkinson approximation is used, as it is said to perform well in the tails of a lognormal distribution (see Mehta et al 16 ). Second, an approximation based on the type IV Pearson distribution is used.

A.6 The Fenton-Wilkinson approximation
Consider the sum of lognormal (LN) random variables $X_i$, where each $X_i \sim LN(\mu, \sigma^2)$ with expectation $E(X_i) = \exp(\mu + 0.5\sigma^2)$ and variance $\mathrm{Var}(X_i) = (\exp(\sigma^2) - 1)\exp(2\mu + \sigma^2)$. The expectation and variance of $C_n$ are $E(C_n) = nE(X_i)$ and $\mathrm{Var}(C_n) = n\mathrm{Var}(X_i)$. The Fenton-Wilkinson approximation is a lognormal PDF with parameters $\mu_{C_n}$ and $\sigma^2_{C_n}$ such that $\exp(\mu_{C_n} + 0.5\sigma^2_{C_n}) = nE(X_i)$ and $(\exp(\sigma^2_{C_n}) - 1)\exp(2\mu_{C_n} + \sigma^2_{C_n}) = n\mathrm{Var}(X_i)$. Solving for $\mu_{C_n}$ and $\sigma^2_{C_n}$ results in a lognormal distribution for the sum: $C_n \sim LN(\mu_{C_n}, \sigma^2_{C_n})$.
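Solving the two moment equations above gives closed-form parameters, which the following sketch computes. The function name `fenton_wilkinson` is hypothetical; the algebra uses only the lognormal mean and variance stated in the paragraph: dividing the variance equation by the squared mean equation gives $\exp(\sigma^2_{C_n}) - 1 = (\exp(\sigma^2) - 1)/n$, and the mean equation then yields $\mu_{C_n}$.

```python
import math

def fenton_wilkinson(mu: float, sigma: float, n: int):
    """Return (mu_Cn, sigma_Cn) of the lognormal matching the mean and variance
    of the sum of n i.i.d. LN(mu, sigma^2) variables."""
    # Variance equation divided by the squared mean equation.
    sigma2_sum = math.log1p((math.exp(sigma ** 2) - 1.0) / n)
    # Mean equation: exp(mu_Cn + sigma2_sum / 2) = n * exp(mu + sigma^2 / 2).
    mu_sum = math.log(n) + mu + 0.5 * (sigma ** 2 - sigma2_sum)
    return mu_sum, math.sqrt(sigma2_sum)

if __name__ == "__main__":
    for n in (5, 30, 250):
        print(n, fenton_wilkinson(0.0, 1.0, n))
```

Because only 2 moments are matched, the skewness and kurtosis of the fitted lognormal are not controlled, which is where a 4-moment fit can do better.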

A.7 The type IV Pearson approximation
The type IV Pearson approximation was developed by Nie and Chen 12 and equates the first 4 central moments ($\mu_1, \mu_2, \mu_3, \mu_4$) of the sum of lognormal random variables to the 4 parameters of the Pearson IV distribution. Denote the sum of lognormal random variables by $C_n$, where each $X_i \sim LN(\mu, \sigma^2)$.
Where the Fenton-Wilkinson approximation only uses the first 2 moments as parameters for a lognormal distribution to represent the sum of lognormal random variables $C_n$, the Pearson IV method uses 4 moments to approximate