A bootstrap method for the measurement error estimation in Gauge R&R studies

A measurement system (MS) fit for use is an essential resource for quality control activities: its assessment must be carried out periodically to quantify its bias (location) and precision (width) errors and qualify it for the purpose it is used for. In particular, the measurement system capability must be determined to evaluate how much of the observed variability originates from the gauge's precision error. Gauge Repeatability and Reproducibility (R&R) studies are aimed at getting a reliable estimate of the precision error $\sigma_M$ of the measurement system. The outcome of a Gauge R&R study is a set of point estimates and confidence intervals of the precision error components and related measurement capability metrics. Here, a general bootstrap-based procedure is proposed to get confidence interval estimates of the measurement error and its components from a Gauge R&R study for continuous observations carried out with either the Average and Range (ARGG) control chart method or the experimental design method. An application of the proposed bootstrap method is presented for a dataset of observations.

Furthermore, building on the ISO 9001:2015 standard, IATF 16949:2016 "Quality management system requirements for automotive production and relevant service parts organizations", issued by the International Automotive Task Force (IATF), 6 sets more restrictive requirements for the MS qualification process in the automotive industry by specifying at point 7.1.5.1.1 that: "Statistical studies shall be conducted to analyse the variation present in the results of each type of inspection, measurement, and test equipment system identified in the control plan. The analytical methods and acceptance criteria used shall conform to those in reference manuals on measurement system analysis." The reference manual on Measurement Systems Analysis (MSA) has been published by the Automotive Industry Action Group (AIAG). 1 This manual provides practitioners with definitions, statistical techniques and threshold criteria to qualify a MS. The performance of a MS should be established in terms of its resolution, bias and precision errors. Here, we are interested in the estimation of the precision error through a capability study of the MS: Gauge R&R studies are aimed at getting an estimate of the precision error and its components. For normally distributed observations, the AIAG Measurement Systems Analysis Manual describes two widely adopted methods to carry out a Gauge R&R analysis: the Average and Range (ARGG) control chart method and the experimental design method. The former is implemented by using $\bar{X}$ and $R$ control charts to estimate the precision error components. More rarely, the $s$ control chart is used when the study is designed to have a large number of repeated measures on the same part, that is, when the sample size of the control charts is too large for the sample range to be an efficient estimator of the variability. The experimental design method is based on (balanced or unbalanced) ANOVA models with Random (R) or Mixed (M) factors. Both methods assume normality of the observations.
They are widely adopted by practitioners and implemented in many commercial software packages for industrial statistics. From a practical point of view, the Average and Range method is easier for practitioners to implement because it only requires basic statistical knowledge about $\bar{X}$ and $R$ Shewhart control charts; conversely, the method based on random and mixed ANOVA models requires practitioners to have a good background in Design of Experiments (DoE) techniques. A comprehensive discussion of this second approach to performing a Gauge R&R study is provided by Burdick et al. 2 It is worth noting that the experimental method provides more information about the MS capability than the ARGG method, see the AIAG Manual 1 and Park and Ha.
The outcome of a Gauge R&R study is a set of point estimates of precision error components and performance metrics to qualify the measurement system, as defined in the next Section. These estimates are affected by an uncertainty originating from practitioner-to-practitioner variability in the selection and measurement of the parts used to carry out the study. For this reason, in the outcome of a Gauge R&R study confidence intervals (CIs) are calculated for each estimated precision error component (source of variability). However, exact CIs cannot be defined for some components and performance metrics: for this reason, many approaches have been investigated to get reliable approximations. Approximate CIs have been derived for Gauge R&R experiments based on ANOVA Random (R) and Mixed (M) models. The precision error components are derived from the sources of variability identified in the ANOVA study: their approximate CIs are calculated starting from these sources of variability. In particular, the Modified Large Sample (MLS) and Generalized (GCI) confidence intervals are suggested and widely discussed in Burdick et al., 2 who also conclude that the two achieve comparable performance in terms of coverage probability. For a Gauge R&R study, Burdick et al. 2 argue that these methods of calculating CIs for each variance component should be preferred to other asymptotic large-sample approaches based on Maximum Likelihood (ML) and Residual Maximum Likelihood (REML) estimation of the variance components. The MLS method was first introduced by Graybill and Wang 4 : it has the advantage of providing practitioners with closed-form formulas that can be easily implemented in a spreadsheet, and it is widely used in commercial software, like Minitab. However, its application cannot be easily generalized to any investigated ANOVA model. The Generalized Confidence Intervals (GCIs) were originally introduced by Weerahandi. 13 They are based on the notion of a generalized pivotal quantity (GPQ) for a scalar parameter representing a variance component. The GCIs are obtained by means of Monte Carlo simulation of linear combinations of GPQs related to the variance components of interest. Next, a quantile-based approach lets practitioners determine the GCI bounds. Burdick et al. 2 provide extensive discussion and formulas to get the GCIs of interest for the ANOVA models they investigate.
An alternative approach to GCIs is implementing bootstrap techniques to calculate the confidence intervals. The bootstrap is a computer-based simulation method based on resampling for assigning measures of accuracy to statistical estimates, see Efron and Tibshirani. 4 For a nonparametric bootstrap, resampling is performed with replacement from the original sample of observations without any assumption about their distribution. The bootstrap samples allow new outcomes of the estimated parameter to be obtained and, thus, a standard error to be estimated. With the parametric bootstrap, the observations are assumed to have a known distribution with unknown parameters, which are estimated from the original sample of data. In this case, the bootstrap samples are generated from that distribution with the estimated parameters.
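The distinction between the two resampling schemes can be sketched in a few lines. This is an illustrative example, not the paper's procedure: the function names, the toy data and the choice of the sample mean as statistic are ours, and the parametric variant assumes a normal parent distribution, as the paper does.

```python
import random
import statistics

def nonparametric_boot_se(sample, stat, n_boot=2000, seed=1):
    """Standard error of `stat`: resample the observed data with replacement."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(sample) for _ in sample]) for _ in range(n_boot)]
    return statistics.stdev(reps)

def parametric_boot_se(sample, stat, n_boot=2000, seed=1):
    """Standard error of `stat`: fit a normal model, then simulate from it."""
    rng = random.Random(seed)
    mu, sd = statistics.mean(sample), statistics.stdev(sample)
    reps = [stat([rng.gauss(mu, sd) for _ in sample]) for _ in range(n_boot)]
    return statistics.stdev(reps)

# Toy sample of eight gauge readings (illustrative values).
data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
se_np = nonparametric_boot_se(data, statistics.mean)
se_p = parametric_boot_se(data, statistics.mean)
```

Both estimates should be close to the classical $s/\sqrt{n}$ for this statistic; the two schemes differ only in how the bootstrap samples are generated.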
A nonparametric bootstrap approach has been considered by Wang and Li 10 to get confidence intervals for a Gauge R&R study performed with the Average and Range (ARGG) control chart method. The original observations from the study are resampled to get bootstrap distributions of the parameters calculated from the $\bar{X}$ and $R$ control charts. A comparison is carried out with the ML, REML and MLS approaches for the interval estimation of the repeatability and reproducibility errors made with a random ANOVA model. In another paper, Yeh and Sun 14 implement a Monte Carlo simulation to replicate a Gauge R&R study based on the Average and Range (ARGG) method and to get the empirical distribution of some performance metrics of the MS. It is worth highlighting that both these studies only consider the generation of CIs from a Gauge R&R study with the Average and Range (ARGG) method. Furthermore, both of them focus on a small set of performance measures for the measurement system.
In this paper, a normal-theory parametric bootstrap approach is proposed to calculate the CIs for the precision error of the MS, its components and several performance measures. In particular, in addition to the available literature procedures:
• it can be applied to any selected method for performing a Gauge R&R study; in particular, with the ARGG method it gives results comparable to the available nonparametric procedure proposed in Wang and Li 10 ;
• with the experimental method, it computes approximate CIs for the precision error components and MS performance measures that are generally narrower than the Generalized Confidence Intervals proposed in Burdick et al. 2 for both the Random (R) and Mixed (M) ANOVA models;
• it provides practitioners with significant insights into the outcome of a Gauge R&R study by looking at the bootstrap distribution of each investigated MS performance measure.
The remainder of the paper is organized as follows: in the next Section, the adopted notation and a list of the most important MS performance metrics are provided, together with some considerations about the linear models considered for the observed measures. Section 3 presents the proposed bootstrap method to get the interval estimates. A performance study is shown in Section 4. An illustrative example is presented in Section 5. Conclusions and future research directions complete the paper. An Appendix details the formulas to perform a Gauge R&R study based on the Average and Range (ARGG) and ANOVA methods.

MS CAPABILITY PERFORMANCE
The capability performance quantifies the fitness for use of a MS. Several acceptance criteria based on different metrics can be considered. The same notation as in Burdick et al. 2 is adopted here. However, many practitioners usually refer to the formulas reported in the Automotive Industry Action Group (AIAG) Manual 1 : for this reason, we also recall the AIAG Manual notation. Table 1 shows the sources of variability calculated in a Gauge R&R study, together with their symbols, respective formulas and the notations adopted in the related literature, see Burdick et al., 2 and in the AIAG Manual. 1 It is worth noting that the process variability $\sigma_P$ is defined either as the "Process Variation" by Burdick et al. 2 or as the "Part Variation" by the AIAG Manual. If the MS is used to get measures for on-line process monitoring, for example to implement a control chart, then the definition "Process Variation" is more appropriate and the parts involved in the Gauge R&R study have to be collected from a stable (in-control, IC) process. Otherwise, when conformity to specifications is checked with the measurement process, the parts involved in the Gauge R&R study should be collected so as to be representative of the entire specification interval: in this case, "Part Variation" is an acceptable definition. The "Measurement System Variation" $\sigma_M$ is also known as the "precision error" or "measurement error" of the MS. The estimation of each source of variability can be made by either the Average and Range (ARGG) method or the experimental method, as briefly summarized in Appendix A.

Models for measurement error
We refer to additive linear models to account for the measurement error effect when collecting a measure of a parameter. In particular, let us consider a quality characteristic of a part code to be produced and monitored in a manufacturing process. The true, but unobservable, value of the quality characteristic is denoted by $X$ and is considered a normal random variable, i.e., $X \sim N(\mu_P, \sigma_P^2)$; here, the subscript $P$ denotes "Process". At every time $t = 1, 2, \ldots$, the quality characteristic $X_t$ is measured by a MS to be tested for capability. The measurement process is performed by an appraiser $j$, for $j = 1, \ldots, a$, who is an operator or a gauge or a combination of both. The additive linear model from Linna and Woodall 8 has been widely adopted in the Statistical Process Monitoring literature to account for the measurement error. It assumes that the measurement process is randomly performed by any available appraiser and pools all combined observations from the MS to quantify the measurement error, which is independent of $X$. Therefore, the additive linear model from Linna and Woodall 8 links the observed value $Y_t$ to $X_t$ as follows:
$$Y_t = B X_t + \varepsilon_t,$$
where $t = 1, 2, \ldots$ and $\varepsilon_t \sim N(\mu_M, \sigma_M^2)$ is the measurement error due to the MS. Its mean $\mu_M$ and its standard deviation $\sigma_M$ represent the bias and the precision error of the MS, respectively. The constant $B$ is related to the linearity error of the MS, see the AIAG Manual. 1 Assuming a perfect calibration, no bias or linearity errors affect the MS and we get $(\mu_M, B) = (0, 1)$. As a consequence, $\mu_Y = \mu_P$ and the observed values are $Y_t \sim N(\mu_P, \sigma_T^2)$, where $\sigma_T^2 = \sigma_P^2 + \sigma_M^2$.
The precision error is the sum of the repeatability and reproducibility variabilities, calculated by:
$$\sigma_M^2 = \sigma_E^2 + \sigma_A^2.$$
According to the AIAG Manual:
• the repeatability error $\sigma_E$ is referred to as the "within-appraisers" variability: it is defined as the variation in measurements obtained with one measurement tool when used several times by the same appraiser while measuring the same quality characteristic on the same part; it is considered the inherent variation of the MS;
• the reproducibility error $\sigma_A$ is referred to as the "between-appraisers" variability: it is defined as the variation in the averages of measurements collected by different appraisers while measuring the same quality characteristic on the same part; therefore, it can be considered the between-appraisers bias.
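The variance decomposition of the pooled model above can be checked numerically. This is a minimal sketch under assumed illustrative parameters (process $N(0, 1)$, measurement error $N(0, 0.3^2)$, perfect calibration), not the paper's study design.

```python
import random
import statistics

# Assumed illustrative parameters for the pooled additive model.
SIGMA_P, SIGMA_M = 1.0, 0.3
rng = random.Random(11)

# Y = X + error, with perfect calibration (mu_M = 0, B = 1).
y = [rng.gauss(0.0, SIGMA_P) + rng.gauss(0.0, SIGMA_M) for _ in range(100_000)]

sigma_t_hat = statistics.stdev(y)                    # empirical total variation
sigma_t_theory = (SIGMA_P**2 + SIGMA_M**2) ** 0.5    # sigma_T from the decomposition
```

With a large simulated sample, the empirical standard deviation of the observed values agrees closely with $\sigma_T = \sqrt{\sigma_P^2 + \sigma_M^2}$.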
When the number $A$ of available appraisers is too large for all of them to be included in the Gauge R&R study, the study provides information about the distribution of pooled observations from all appraisers by investigating a random subset of $a \leq A$ appraisers. The Gauge R&R study can be performed with either the ARGG method or the experimental method. The latter is performed by considering an ANOVA Random (R) model with both parts and appraisers assumed as random factors.
Conversely, if all available appraisers can be included in the Gauge R&R study, then the MS capability study can provide information about the within-appraiser distributions of observations. Denoting by $Y_{t,j}$ the measured value of the quality characteristic at time $t$ by appraiser $j$, the additive linear model linking $X_t$ to $Y_{t,j}$ can be adapted as follows:
$$Y_{t,j} = B_j X_t + \varepsilon_{t,j},$$
where $t = 1, 2, \ldots$, $j = 1, \ldots, a$ and $\varepsilon_{t,j} \sim N(\mu_{M,j}, \sigma_{M,j}^2)$ is the random error for each appraiser involved in the measurement process.
$B_j$ is a constant accounting for the linearity error of each appraiser: for the Gauge R&R study, we assume $B_j = 1$, $\forall j$; that is, no linearity error is present. Therefore, the within-appraiser observed values of the quality characteristic are random variables $Y_{t,j} \sim N(\mu_{Y,j}, \sigma_{Y,j}^2)$, where $\mu_{Y,j} = \mu_P + \mu_{M,j}$ and $\sigma_{Y,j} = \sqrt{\sigma_P^2 + \sigma_{M,j}^2}$. The $j$-th appraiser's bias is $\delta_j = \mu_{Y,j} - \mu_Y$, for $j = 1, \ldots, a$. In a Gauge R&R study no overall bias error is assumed for the MS, therefore $\sum_{j=1}^{a} \delta_j = 0$. The variability originated by the appraisers' biases is calculated as:
$$\sigma_A^2 = \frac{\sum_{j=1}^{a} \delta_j^2}{a - 1}$$
and contributes to the reproducibility error. $\sigma_{M,j}$ is the $j$-th appraiser's precision error, for $j = 1, \ldots, a$: it includes the repeatability error and the interaction variability between parts and appraisers, if statistically significant. In this case, the Gauge R&R study may only be performed with the experimental method by considering an ANOVA Mixed (M) model with parts (appraisers) as the random (fixed) factor, which can also detect an interaction between appraisers and parts (unrestricted model). In Gauge R&R studies with the Mixed (M) model, a common precision error is calculated for the appraisers, i.e., $\sigma_{M,j} = \sigma_M$ and $\sigma_{Y,j} = \sigma_Y$, for $j = 1, \ldots, a$. As a consequence, the estimated within-appraiser distributions of the observed values only differ with respect to location, i.e., $Y_{t,j} \sim N(\mu_{Y,j}, \sigma_Y^2)$.

MS capability metrics
The following capability metrics can be calculated after a Gauge R&R study and are intended for qualifying a MS used to discriminate between good and bad parts:
• Precision-to-tolerance ratio ($PTR$): this metric is useful when the MS is used for checking the conformance of a part to specifications. It is defined with reference to the specification interval $(LSL, USL)$ of the quality characteristic to be measured:
$$PTR = \frac{k \, \sigma_M}{USL - LSL},$$
where $k$ is either 5.15 or 6. The AIAG Manual considers the MS capable when $PTR \leq 0.1$. If $0.1 < PTR \leq 0.3$, then the MS can be considered acceptable when the process capability is high and the misclassification costs are low. When $PTR > 0.3$ the MS is not capable. If the MS is intended for use in process monitoring, then the $PTR$ is not a good indicator of its intrinsic capability because it does not compare the MS variability to the natural process variation.
• Misclassification rates: they quantify how well the MS discriminates between good and bad parts. A false failure occurs when $LSL \leq X \leq USL$ and the MS classifies the part as nonconforming. The joint probability of a false failure (producer's risk) is equal to:
$$p_{FF} = \Pr\left(LSL \leq X \leq USL, \; Y < LSL \text{ or } Y > USL\right),$$
where the probabilities are computed from $f(x, y)$, the joint probability density function of $X$ and $Y$, which is bivariate normal with mean $[\mu_P, \mu_Y]^\intercal$ and covariance matrix:
$$\Sigma = \begin{bmatrix} \sigma_P^2 & \sigma_P^2 \\ \sigma_P^2 & \sigma_P^2 + \sigma_M^2 \end{bmatrix}.$$
Similarly, a missed fault occurs for $X \leq LSL$ or $X \geq USL$ and $LSL \leq Y \leq USL$, that is, when the MS wrongly classifies the part as conforming to specifications. In this case, the joint probability of a missed fault (consumer's risk) is equal to:
$$p_{MF} = \Pr\left(X < LSL \text{ or } X > USL, \; LSL \leq Y \leq USL\right).$$
For a Gauge R&R study with the ANOVA Mixed (M) model, it is possible to define the $j$-th appraiser's misclassification rates $[p_{FF,j}, p_{MF,j}]$, for $j = 1, \ldots, a$, by assuming the appraiser's population mean $\mu_{Y,j}$ instead of the pooled mean $\mu_Y$.
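Since the two risks are bivariate normal probabilities, they can be approximated by Monte Carlo without any special library: simulate the true value and add a measurement error, then count the two kinds of disagreement between part and measurement against the specification limits. A minimal sketch (function name and parameter values are ours, and an unbiased MS is assumed):

```python
import random

def misclassification_rates(mu_p, sd_p, sd_m, lsl, usl, n=200_000, seed=7):
    """Monte Carlo estimates of the joint false-failure (producer's risk)
    and missed-fault (consumer's risk) probabilities for Y = X + error."""
    rng = random.Random(seed)
    ff = mf = 0
    for _ in range(n):
        x = rng.gauss(mu_p, sd_p)          # true, unobservable part value
        y = x + rng.gauss(0.0, sd_m)       # observed measurement
        x_in = lsl <= x <= usl
        y_in = lsl <= y <= usl
        ff += x_in and not y_in            # conforming part rejected
        mf += (not x_in) and y_in          # nonconforming part accepted
    return ff / n, mf / n

# Illustrative values: process N(0, 1), precision error 0.3, limits (-3, 3).
p_ff, p_mf = misclassification_rates(0.0, 1.0, 0.3, -3.0, 3.0)
```

With a perfect gauge (zero precision error) both rates vanish; as $\sigma_M$ grows, both risks increase.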
Other MS capability metrics are suitable for qualifying a MS used to run a Statistical Process Monitoring tool, for example a control chart. In this case, the measurement precision error $\sigma_M$ should be small compared to the process variation $\sigma_P$. Therefore, these metrics depend on the ratio $\sigma_P / \sigma_M$:
• Number of Distinct Categories ($NDC$): it represents the number of categories that can be reliably distinguished by the MS. It is also known as the Signal-to-Noise Ratio ($SNR$). It is calculated as:
$$SNR = \sqrt{2} \, \frac{\sigma_P}{\sigma_M},$$
see Woodall and Borror 13 for an explanation of the formula. A value of five or greater is recommended. Another metric related to distinct categories is the discrimination ratio ($DR$), not discussed here; see Woodall and Borror 13 for further details.

• Proportion of total variation ($PTV$): it represents the proportion of total variation in the measurements due to the MS:
$$PTV = \frac{\sigma_M}{\sigma_T}.$$
For convenience of notation, the symbol $PTV$ is introduced here. In many statistical packages it is usually known as the Total Gauge R&R %Study Var, even though it is a ratio between two standard deviations. In the AIAG Manual it is defined as the %GRR metric. Similarly to the $PTR$, the threshold limits 0.1 (0.3) have been set by the AIAG to decide whether the MS is capable (not capable) for quality monitoring. The $PTV$ has the advantage of directly weighing the measurement error against the total observed variability, thus being a good criterion to decide whether the MS is fit for use in process monitoring. It can be easily shown that there is a close relationship between $PTV$ and $SNR$:
$$SNR = \frac{\sqrt{2\,(1 - PTV^2)}}{PTV},$$
thus $PTV \leq 0.1$ leads to $SNR \approx 14$.
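The metrics above are simple functions of the variance components, so they are easy to compute together. A small sketch (the function name and the example values are ours, and the $SNR$–$PTV$ identity from the text is verified at the end):

```python
import math

def capability_metrics(sigma_m, sigma_p, lsl, usl, k=6.0):
    """PTR, PTV and SNR for given precision/process standard deviations."""
    sigma_t = math.sqrt(sigma_p**2 + sigma_m**2)   # total variation
    ptr = k * sigma_m / (usl - lsl)                # precision-to-tolerance ratio
    ptv = sigma_m / sigma_t                        # proportion of total variation
    snr = math.sqrt(2.0) * sigma_p / sigma_m       # signal-to-noise ratio (NDC)
    return ptr, ptv, snr

# Illustrative values chosen so that PTV = 0.1 (sigma_t = 1).
metrics = capability_metrics(0.1, math.sqrt(0.99), -3.0, 3.0)
```

For these inputs the returned $SNR$ is about 14, matching the borderline case $PTV = 0.1$ discussed above.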

THE BOOTSTRAP CIs PROCEDURE
In this Section, we discuss the integration of the proposed bootstrap procedure with a Gauge R&R study based either on the ARGG method or on the experimental method with an ANOVA Random (R) or Mixed (M) model. Recall from the Appendix that $p$ denotes the number of parts, $a$ the number of appraisers and $r$ the number of measurement readings for every part from each appraiser.
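The normal-theory parametric bootstrap idea can be illustrated with a minimal sketch: refit the normal model from the study's point estimates, re-simulate the whole experiment many times, re-apply the estimator, and take empirical quantiles. Everything below is ours (function names, the $(p, a, r) = (10, 3, 3)$ design, the fitted values), and the repeatability-only model used for the demonstration is a deliberate simplification of the full Gauge R&R model.

```python
import random
import statistics

def parametric_boot_ci(estimate_fn, simulate_fn, n_boot=1000, alpha=0.05, seed=3):
    """Percentile bootstrap CI: simulate data from the fitted model,
    re-apply the estimator, and take empirical quantiles."""
    rng = random.Random(seed)
    reps = sorted(estimate_fn(simulate_fn(rng)) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

P, A, R = 10, 3, 3                # parts, appraisers, readings per cell
mu_hat, sig_e_hat = 0.0, 0.25     # values assumed fitted from an actual study

def simulate(rng):
    # One synthetic Gauge R&R dataset under the fitted normal model.
    return [[[rng.gauss(mu_hat, sig_e_hat) for _ in range(R)]
             for _ in range(A)] for _ in range(P)]

def sigma_e(data):
    # Pooled within-cell standard deviation (repeatability estimator).
    cells = [cell for part in data for cell in part]
    pooled_var = statistics.fmean(statistics.variance(c) for c in cells)
    return pooled_var ** 0.5

ci = parametric_boot_ci(sigma_e, simulate)
```

The same skeleton accommodates any estimator: for the ARGG method, `sigma_e` would be replaced by the range-based formulas of Appendix A; for the experimental method, by the ANOVA variance-component estimators.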

PERFORMANCE STUDY
The performance study starts by comparing the proposed bootstrap approach to the nonparametric bootstrap procedure developed for the Gauge R&R study with the ARGG method in Wang and Li. 10 The Gauge R&R study datasets shown in Table 1 (adapted from Hoguet 6 ) and in Table 4 of Wang and Li 10 are considered. For the sake of brevity, in Tables 2 and 3 we only consider the CIs obtained with the two bootstrap procedures for the measurement error and its reproducibility/repeatability components ($\sigma_A$, $\sigma_E$). To give a comprehensive overview of the proposed bootstrap procedure, the CIs are shown for the ARGG method and for the experimental method with both the Random (R) and Mixed (M) ANOVA models. The generalized CIs are shown as well. We note that the point estimates of $\sigma_A$ are larger in Wang and Li 10 than those found in this paper with the ARGG method because they did not account for the repeatability variability contamination discussed in the AIAG Manual, as shown in the Appendix, Equation (A.2). Furthermore, in Table 3 the estimated values of the measurement error components are different because Wang and Li 10 proposed an $\bar{X}$ and $s$ control chart to implement the ARGG method, although setting $r = 2$ would suggest using the range ($R$) control chart. From the two tables, it is evident that the proposed parametric bootstrap CIs are the narrowest for the repeatability error $\sigma_E$ and the measurement error $\sigma_M$. Only for the reproducibility component $\sigma_A$ are the Wang and Li 10 bootstrap CIs with the ARGG method narrower than those from the other two procedures.
For the proposed bootstrap CIs, it is important to verify the coverage probability against the nominal $(1-\alpha)\%$ confidence level. For $B = \{50, 100, 150, 300, 500\}$ bootstrap replications, we have simulated the coverage probability for the CIs of the measurement error $\sigma_M$, its variability components $\sigma_E$, $\sigma_A$ and the process error $\sigma_P$. We have assumed $(p, a, r) = (10, 3, 3), (20, 6, 6)$, $\alpha = 0.05$ and, for the pooled observations' distributions, $\mu_Y = \mu_P = 0$ and $\sigma_P = 1$; furthermore, $PTV = 0.2$ and $\sigma_A^2 = 0.2 \times \sigma_M^2$ have been assumed as input values. With the experimental method, both the Random (R) and Mixed (M) models have been considered. For $B \geq 100$, the calculated coverage probability always meets the nominal 95% confidence level, whichever method is selected to perform the Gauge R&R study.
The performance study continued by investigating the effect of the Gauge R&R study design parameters and of the measurement error components on the calculation of the bootstrap CIs. In particular, we have considered the design parameters $(p, a, r)$. With respect to the process variability $\sigma_P$, the pooled measurements variability $\sigma_T$ and the related ratios, Tables 4-7 show that:
• the uncertainty associated with $\sigma_P$ is higher than that associated with $\sigma_M$. Depending on the design parameters $(p, a, r)$, the CI widths for $\sigma_P$ are between three and seven times those for $\sigma_M$. This depends on the higher simulated value of $\sigma_P = 0.995$ compared to $\sigma_M = 0.1$;
• the uncertainty associated with $\sigma_T = \sqrt{\sigma_P^2 + \sigma_M^2}$ is always larger than that associated with $\sigma_P$. This finding depends on the fact that $\sigma_T$ accumulates the uncertainty in the estimation of both the process and MS variability. As a consequence, the estimation of the $PTV$ metric is more uncertain than that of the $SNR$.

ILLUSTRATIVE EXAMPLE
We show the implementation of the bootstrap procedure with the Gauge R&R study dataset presented in the AIAG Manual, 1 see Table 8. The Gauge R&R study is arranged by considering $(p, a, r) = (10, 3, 3)$. It is performed with the experimental method: for the sake of comparison, it is implemented with both the ANOVA Random (R) and Mixed (M) models. To provide more details, we assume a specification interval $(LSL, USL) = (-4.5, 4.5)$. Table 8 shows the AIAG MSA Manual dataset. 1 A summary of the results obtained from the Gauge R&R study is shown in Table 9. This Table also shows the bootstrap and generalized CIs for the sources of variation and the related MS performance measures. The CIs have been obtained from $B = 10{,}000$ sampling replications. In Table 9 the estimates and CIs are provided for both ANOVA models: the column "estimate" shows the results obtained with the Mixed (M) model and, between round parentheses, with the Random (R) model. No significant interaction between appraisers and parts is present for the MSA Manual dataset, that is, $\hat{\sigma}_{PA} = 0$: therefore, the restricted version of the ANOVA Mixed (M) model is used in the study. As expected, the Random (R) model gets a larger estimate of the reproducibility error $\sigma_A$ than the Mixed (M) model because the appraisers are considered a random factor. This influences the estimates of the MS performance parameters derived from $\sigma_A$. Furthermore, with the Random (R) model the CIs are wider than with the Mixed (M) model: the width increase is larger for the generalized CIs than for the bootstrap CIs, the latter being more robust to the ANOVA model selection. In any case, the bootstrap CIs are narrower than the generalized CIs, whichever ANOVA model is selected. For the Random (R) model, they are also narrower than the MLS CIs generated by Minitab.
As an outcome of the Gauge R&R study, the estimated precision error can be used to compute the MS capability metrics: we assume $k = 6$ for the calculation in this example (see Figure 1). All contributions investigating the effect of measurement error on the performance of control charts have implicitly assumed the pooled distribution of observations for the MS, overlooking the within-appraiser performance.
That is, these studies implicitly assume Gauge R&R studies performed with the ARGG method or with the ANOVA Random (R) model. However, when the estimation of the within-appraiser distributions is possible with the ANOVA Mixed (M) model, the pooled measurement error model proposed in Linna and Woodall 8 for testing control chart performance is no longer effective. Distributions of observations should be considered for each appraiser, and the control chart's performance should be investigated by accounting for each appraiser's schedule in the measurement process.

ACKNOWLEDGMENTS
Open Access Funding provided by Universita degli Studi di Napoli Federico II within the CRUI-CARE Agreement.
[Correction added on 7 July 2022, after first online publication: CRUI-CARE funding statement has been added.]

DATA AVAILABILITY STATEMENT
All datasets used in the manuscript are referenced throughout the text and are available in the papers where they were originally published. The dataset used in the illustrative example is fully reported in Table 8.

APPENDIX A

To design a Gauge R&R study, the number of parts $p$, of appraisers $a \leq A$ (e.g., crews, lab technicians, operators or instruments), and of trial measurements $r$ on each part should be fixed. In general, to conduct a Gauge R&R study $N = p \times a \times r$ observations are needed. Measurements should be collected randomly, without any part identification, by each appraiser. We denote by $Y_{ijm}$ the observed value of the quality characteristic of interest on the $i$-th part, $i = 1, \ldots, p$, measured by the $j$-th appraiser, $j = 1, \ldots, a$, at the $m$-th trial, $m = 1, \ldots, r$. Then, we use the following notation:
• $Y_{ij.}$ and $\bar{Y}_{ij.} = Y_{ij.}/r$ are the total and average of the measurements collected on the $i$-th part by the $j$-th appraiser, respectively;
• $Y_{.j.}$ and $\bar{Y}_{.j.} = Y_{.j.}/(p \times r)$ are the total and average of the measurements collected on the $p$ parts by the $j$-th appraiser, respectively;
• $Y_{i..}$ and $\bar{Y}_{i..} = Y_{i..}/(a \times r)$ are the total and average of the measurements collected on the $i$-th part by the $a$ appraisers, respectively;
• $Y_{...}$ and $\bar{Y}_{...} = Y_{...}/N$ are the total and average of all collected measurements, respectively.
Now we briefly present the formulas for the calculation of the sources of variability with the Average and Range (ARGG) and the experimental methods.

A.1
Average and Range (ARGG) method
With this method, the Gauge R&R study is performed by running $\bar{X}$ and $R$ control charts for each appraiser involved in the study. An $s$ control chart can be used when $r$ is sufficiently large and the range is no longer an efficient estimator of the variability. With the ARGG method, the repeatability and reproducibility errors are estimated from the $\bar{X}$ and $R$ control charts plotted for each appraiser. This method overlooks a potential source of variability due to the interaction between parts and appraisers.
To plot the $R$ control chart for each appraiser $j$, for $j = 1, \ldots, a$, the ranges $R_{ij} = \max_m(Y_{ijm}) - \min_m(Y_{ijm})$ are calculated for each part $i$, for $i = 1, \ldots, p$. Then, the average range $\bar{R}_{.j}$ for each appraiser is calculated as $\bar{R}_{.j} = \frac{1}{p} \sum_{i=1}^{p} R_{ij}$. The grand average range $\bar{R}_{..} = \frac{1}{a} \sum_{j=1}^{a} \bar{R}_{.j}$ is immediately derived. The grand average range $\bar{R}_{..}$ is used to calculate the control limits of each $R$ control chart, whose width is proportional to the estimated value of the repeatability variability (Equipment Variation, EV, in the AIAG Manual):
$$\hat{\sigma}_E = \frac{\bar{R}_{..}}{d_2^*(p \times a, r)},$$
where $d_2^*(p \times a, r)$ is an unbiasing constant associated with the distribution of the average range. Its values can be found in the AIAG Manual, Appendix C. 1 The average $\bar{Y}_{.j.}$ of the measurements collected on the $p$ parts by each appraiser is used to estimate $\mu_{Y,j}$ and to plot the central line of each $\bar{X}$ control chart, for $j = 1, \ldots, a$. The range $\bar{R}_Y = \max_j(\bar{Y}_{.j.}) - \min_j(\bar{Y}_{.j.})$ among appraisers is calculated to get the estimate $\hat{\sigma}_A$ of the reproducibility variability (Appraiser Variation, AV, in the AIAG Manual), which measures the bias among appraisers:
$$\hat{\sigma}_A = \sqrt{\left(\frac{\bar{R}_Y}{d_2^*(1, a)}\right)^2 - \frac{\hat{\sigma}_E^2}{p \times r}},$$
where a fraction of the repeatability variability is subtracted since the $\hat{\sigma}_A$ estimate is contaminated by $\hat{\sigma}_E$, see the AIAG Manual. 1 The estimated process variability $\hat{\sigma}_P$ is obtained by considering the averages $\bar{Y}_{i..}$ of the measurements for each part $i$, for $i = 1, \ldots, p$. Finally, the estimated mean of the combined observations from the Gauge R&R study is $\hat{\mu}_Y = \bar{Y}_{...}$.
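The range-based estimators can be coded directly from the formulas above. This is an illustrative sketch (the function name is ours): the $d_2^*$ constants are not hard-coded, since their correct values depend on the design and must be taken from the AIAG Manual, Appendix C; they are passed in as parameters.

```python
import math

def argg_estimates(data, d2_ev, d2_av):
    """ARGG point estimates from a p x a nested list of reading lists.
    data[i][j] holds the r readings on part i by appraiser j.
    d2_ev, d2_av are d2* constants (AIAG Appendix C) for the range of the
    r readings and for the range of the a appraiser averages, respectively."""
    p, a = len(data), len(data[0])
    r = len(data[0][0])
    # Grand average range of the r readings over all p*a part/appraiser cells.
    rbar = sum(max(cell) - min(cell) for part in data for cell in part) / (p * a)
    ev = rbar / d2_ev                          # repeatability (EV)
    # Range of the appraiser averages, corrected for EV contamination.
    appr_avg = [sum(data[i][j][m] for i in range(p) for m in range(r)) / (p * r)
                for j in range(a)]
    r_ybar = max(appr_avg) - min(appr_avg)
    av2 = (r_ybar / d2_av) ** 2 - ev ** 2 / (p * r)
    av = math.sqrt(max(av2, 0.0))              # reproducibility (AV), floored at 0
    return ev, av
```

Flooring the reproducibility term at zero mirrors the AIAG convention of setting $\hat{\sigma}_A = 0$ when the contamination correction would make the radicand negative.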

A.2
Experimental method with ANOVA models
With this method, the Gauge R&R study is designed as an experimental plan. Here, we briefly recall the formulas for the most common scenarios, that is, the balanced two-factor Random (R) and Mixed (M) models with and without interaction. Tables showing the ANOVA sources of variation and the formulas for the point estimators are reported for each model, with and without interaction. Details about these formulas can be found in Burdick et al. 2
Two-factor Random (R) model
Both "parts" and "appraisers" are considered as random factors. This model should be assumed by the practitioner to carry out the Gauge R&R study when the appraisers are randomly selected from a set of cardinality $A$: for example, the crew of operators who randomly perform the measurement process during quality inspections. The reference model with interaction is:
$$Y_{ijm} = \mu + P_i + A_j + (PA)_{ij} + E_{ijm},$$
where $\mu$ is a constant, and $P_i$, $A_j$, $(PA)_{ij}$ and the error $E_{ijm}$ are jointly independent normal random variables with means of zero and variances $\sigma_P^2$, $\sigma_A^2$, $\sigma_{PA}^2$ and $\sigma_E^2$, respectively. When there is no interaction, the term $(PA)_{ij}$ is pooled into the error. Tables A.1-A.4 show the ANOVA sources of variation and the point estimators, respectively, for the Random (R) model with and without interaction.
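For the balanced case, the method-of-moments variance components behind the ANOVA tables can be computed directly from the sums of squares. A self-contained sketch (the function name is ours, and the standard balanced two-factor expected-mean-square relations are assumed; negative component estimates are left untruncated for clarity):

```python
def anova_components(y, p, a, r):
    """Method-of-moments variance components for the balanced two-factor
    random model with interaction; y[i][j] holds the r readings."""
    n = p * a * r
    grand = sum(y[i][j][m] for i in range(p) for j in range(a) for m in range(r)) / n
    part_m = [sum(y[i][j][m] for j in range(a) for m in range(r)) / (a * r)
              for i in range(p)]
    appr_m = [sum(y[i][j][m] for i in range(p) for m in range(r)) / (p * r)
              for j in range(a)]
    cell_m = [[sum(y[i][j]) / r for j in range(a)] for i in range(p)]
    # Sums of squares for parts, appraisers, interaction and error.
    ssp = a * r * sum((m - grand) ** 2 for m in part_m)
    ssa = p * r * sum((m - grand) ** 2 for m in appr_m)
    sspa = r * sum((cell_m[i][j] - part_m[i] - appr_m[j] + grand) ** 2
                   for i in range(p) for j in range(a))
    sse = sum((y[i][j][m] - cell_m[i][j]) ** 2
              for i in range(p) for j in range(a) for m in range(r))
    # Mean squares and moment estimators of the variance components.
    msp, msa = ssp / (p - 1), ssa / (a - 1)
    mspa, mse = sspa / ((p - 1) * (a - 1)), sse / (p * a * (r - 1))
    return {"sigma2_P": (msp - mspa) / (a * r), "sigma2_A": (msa - mspa) / (p * r),
            "sigma2_PA": (mspa - mse) / r, "sigma2_E": mse}
```

In this framework the precision error estimate follows as $\hat{\sigma}_M^2 = \hat{\sigma}_A^2 + \hat{\sigma}_{PA}^2 + \hat{\sigma}_E^2$, which is exactly the quantity to be re-estimated on each bootstrap replication.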

Two-factor mixed (M) model
The practitioner considers this model when all the appraisers performing the measurement process are included in the Gauge R&R study: thus, "parts" should be assumed as a random factor and "appraisers" as a fixed factor, with $a = A$. This is a very frequent situation in small companies, where a few appraisers are engaged in the measurement process activities. The reference model with interaction is:
$$Y_{ijm} = \mu_j + P_i + (PA)_{ij} + E_{ijm},$$
where $\mu_j$ is a constant related to the $j$-th appraiser, and $P_i$, $(PA)_{ij}$ and the error $E_{ijm}$ are jointly independent normal random variables with means equal to zero and variances $\sigma_P^2$, $\sigma_{PA}^2$ and $\sigma_E^2$: therefore, we consider an unrestricted mixed model, as explained in Burdick et al. 2 When there is no interaction, the term $(PA)_{ij}$ is included within the error and a restricted model is studied. Tables A.5-A.8 show the ANOVA sources of variation and the point estimators, respectively, for the Mixed (M) model with and without interaction.
For the sake of brevity, the formulas for the estimators of $\sigma_P^2$, $\sigma_A^2$, $\sigma_{PA}^2$ and $\sigma_E^2$ are not reported here: see Burdick et al. 2 or any textbook on experimental design for more details.