Assessing Quality of Selection Procedures: Lower Bound of False Positive Rate as a Function of Inter-Rater Reliability

Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., false positive and false negative rates) or their lower bounds. We draw connections between inter-rater reliability and the binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement, and implement the computations in the IRR2FPR R package.


Introduction
High-quality applicant, grant, or manuscript (jointly referred to as applicant hereafter) selection procedures are necessary for fair and efficient allocation of resources. Selection procedures usually aim to assess applicants' quality, which can be thought of as a latent variable, based on observable (and potentially imperfect) indicators of their quality (Bartholomew et al., 2011).
Selection procedures coarsen ratings into a binary outcome: an applicant may be rated on a ten-point scale but in the end is either hired or refused, a grant proposal is either funded or dismissed, and an article is either accepted or rejected (e.g., Cicchetti, 1991). However, the most commonly used metric for evaluating the reliability of selection procedures, the inter-rater reliability (IRR), focuses on the ratings rather than the outcomes of selection procedures. This is, of course, not a novel observation, and the disconnection between evaluating raters' reliability and whether the best applicants were selected was noted earlier (e.g., Kraemer, 1991; Mayo et al., 2006; Nelson, 1991). In cases where applicant selection is based on a fixed threshold (i.e., pass/fail tests), the expected classification accuracy can be estimated by methods outlined by Rudner (2000, 2005) and Guo (2006) (also see Hanson & Brennan, 1990; W.-C. Lee, 2010; Livingston & Lewis, 1995). We extend the aforementioned approach to settings where a proportion of the best candidates is selected and show that, under the assumption of a normally distributed latent variable, the expected classification accuracy can be directly obtained from IRR and the proportion of selected candidates. Subsequently, the selection procedures can be characterized as binary classification and evaluated via well-known and interpretable metrics such as sensitivity or false positive/negative rates. Furthermore, the binary classification framework allows researchers and stakeholders to evaluate and improve selection procedures while incorporating the costs of incorrect decisions, increasing the number of raters, or modifying the rating procedure. In fact, connecting reliability to binary classification recalls classical models that evaluate selection procedures based on validity (e.g., Cronbach & Gleser, 1957; Taylor & Russell, 1939).
While our approach is related to univariate classification under measurement error, this use case differs. Compared to typical classification tasks, which aim to separate subjects into different categories (Duda, Hart, et al., 2006), we assume the existence of a single continuous latent trait (or a composite score based on a multidimensional assessment) measured by the ratings. Then, we aim to select the best applicants defined by the latent trait and evaluate the (mis)classification probabilities due to the measurement error contained in the observed ratings. Ideally, the validity of the observed indicators (and their combination into the overall assessment) would be evaluated directly, thus answering the question of how well the results of selection procedures predict applicants' success. However, such a "gold standard" measure of success needed for directly assessing the validity of a selection procedure is often not available (Lauer and Nakamura, 2015; Moher and Ravaud, 2016; Superchi et al., 2019; also see Grant et al., 2022 for alternatives), either because the success measure is located too far in the future or because it is difficult to agree on the success measure itself (see Lauer et al., 2015; Li and Agha, 2015 for a suggestion to use bibliometric measures in grant reviews, and Fang et al., 2016; Lindner and Nakamura, 2015 for a subsequent critique). Consequently, we are often left to assess the reliability of the selection procedures, as reliability limits the usefulness even of completely valid indicators (Thurstone, 1931). Therefore, our approach results in a lower bound on the corresponding error probabilities, as it does not account for additional misclassifications due to (a lack of) validity.
The paper proceeds as follows: First, we define a minimal measurement model for ratings in the selection procedures and characterize the selection procedures in terms of their outcome using binary classification. Then we propose a quantile approximation connecting the measurement model with the binary classification and show the relationship between IRR and the binary classification metrics. Second, we evaluate the quality of the quantile approximation in a simulation study by comparing the empirical true positive rate to the estimated true positive rate. Third, we demonstrate the outlined methodology on an example of multiple grant peer reviews, where we compare multiple selection procedures in terms of their false positive rate, false negative rate, and $F_1$ score and then show how the false positive rate changes with increasing the number of raters and IRR. Finally, we close with a discussion of limitations and extensions.

Measurement Model
For ease of exposition, we start with a simple selection procedure that evaluates $N$ applicants with the goal of estimating their latent abilities $\gamma_i$, which would traditionally be analyzed with a one-way ANOVA model (i.e., assuming either the lack of rater overlap or lack of rater effect on the ratings). More complicated scenarios, for example, rater effects when the same set of raters performs the ratings, would be accounted for via a two-way ANOVA or more complex models (see Brennan, 2001; Martinková and Hladká, 2023). In the selection process, each applicant is rated $J$ times, resulting in a vector of observed scores $y_{ij}$ for $i = 1, \ldots, N$ and $j = 1, \ldots, J$. The resulting measurement model can be written as

$$y_{ij} = \gamma_i + \epsilon_{ij}, \tag{1}$$

with the standard assumptions that (a) the measurement errors $\epsilon_{ij}$ are independently normally distributed with zero mean and variance $\sigma^2_\epsilon$ and (b) the latent abilities $\gamma_i$ are normally distributed with mean $\mu$ and variance $\sigma^2_\gamma$ (de Leeuw & Meijer, 2008; Searle et al., 2006). Precision of the estimates depends on the appropriateness of the specified model. In the "Simulation" section, we explore the effect of deviations from the specified model on the results.
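The quality of the selection procedure is usually measured as inter-rater reliability (sometimes also referred to as single-rater IRR), which is the intra-class correlation coefficient (ICC(1,1); McGraw & Wong, 1996; Shrout & Fleiss, 1979),

$$\mathrm{IRR}_1 = \frac{\sigma^2_\gamma}{\sigma^2_\gamma + \sigma^2_\epsilon}.$$

When the final decisions are based on the average of $J$ raters, the multiple-rater IRR can be calculated using the Spearman-Brown formula as

$$\mathrm{IRR}_J = \frac{J \, \mathrm{IRR}_1}{1 + (J - 1) \, \mathrm{IRR}_1} = \frac{\sigma^2_\gamma}{\sigma^2_\gamma + \sigma^2_\epsilon / J}.$$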
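As a quick numeric illustration of the Spearman-Brown step, the following Python sketch converts a single-rater IRR into the reliability of the mean of $J$ ratings. The helper name `spearman_brown` is hypothetical and is not taken from the IRR2FPR package.

```python
def spearman_brown(irr_1: float, n_raters: int) -> float:
    """Reliability of the mean of n_raters ratings given the single-rater IRR."""
    if not 0.0 <= irr_1 <= 1.0:
        raise ValueError("irr_1 must lie in [0, 1]")
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * irr_1 / (1.0 + (n_raters - 1) * irr_1)


# Single-rater IRR of 0.45 averaged over 10 ratings
print(round(spearman_brown(0.45, 10), 2))  # 0.89
```

With a single-rater IRR of 0.45 and ten ratings, the sketch gives about 0.89, the largest total reliability considered in the simulation settings.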

Binary Classification
The endpoint of the selection procedure is, however, not estimating the latent abilities $\gamma_i$ per se but selecting the $k$ applicants with the highest latent ability $\gamma_i$. Consequently, we can divide the applicants into two groups: high-ability applicants $A$, consisting of the $k$ applicants with the highest latent ability $\gamma_i$, and "not high-ability applicants" $\neg A$, consisting of the remaining $N - k$ applicants. This is similar to Taylor and Russell (1939), who differentiate between "satisfactory" and "unsatisfactory" applicants. The difference between the described framing and that of Taylor and Russell (1939) is that we are interested in selecting the "best" rather than "satisfactory" applicants, a much more stringent requirement, which simplifies the problem and allows us to evaluate the procedure without knowledge of the "satisfactory" cutoff.
Denoting the selected applicants by $S$ and the remaining applicants by $\neg S$, we can characterize the selection procedure by probabilities of the different types of (mis)classification (Table 1).
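| | Selected ($S$) | Not Selected ($\neg S$) | |
|---|---|---|---|
| High-Ability Applicants ($A$) | $P(S \cap A)$ | $P(\neg S \cap A)$ | $P(A)$ |
| Not High-Ability Applicants ($\neg A$) | $P(S \cap \neg A)$ | $P(\neg S \cap \neg A)$ | $P(\neg A)$ |
| | $P(S)$ | $P(\neg S)$ | 1 |

Table 1: Overview of the selection procedure as a binary classification.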
Since the proportion (i.e., unconditional probability) of selected applicants corresponds, by definition, to the proportion (i.e., unconditional probability) of high-ability applicants, $P(S) = P(A) = k/N$, the whole classification process is determined by the probability of correctly selecting the high-ability applicants, $P(S \cap A)$, i.e., by the probability of true positive classification (Table 2).

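| | Selected ($S$) | Not Selected ($\neg S$) | |
|---|---|---|---|
| High-Ability Applicants ($A$) | $P(S \cap A)$ | $k/N - P(S \cap A)$ | $k/N$ |
| Not High-Ability Applicants ($\neg A$) | $k/N - P(S \cap A)$ | $1 - 2k/N + P(S \cap A)$ | $1 - k/N$ |
| | $k/N$ | $1 - k/N$ | 1 |

Table 2: Classification probabilities of the selection procedure expressed via $P(S \cap A)$ and $k/N$.

Subsequently, if we knew the probability of true positive classification, we could use standard metrics for evaluating binary classification (e.g., Pepe, 2003, pp. 14-33). Since the off-diagonal probabilities of the classification outcomes are equal (Table 2), many commonly specified metrics coincide. For example, true positive rate (TPR, aka sensitivity) and positive predictive value (PPV, aka precision) are equal,

$$\mathrm{TPR} = \mathrm{PPV} = \frac{P(S \cap A)}{k/N},$$

and true negative rate (TNR, aka specificity) is equal to negative predictive value (NPV). We can also compute the false positive rate (FPR, corresponding to type I error rate),

$$\mathrm{FPR} = P(\neg A \mid S) = 1 - \frac{P(S \cap A)}{P(S)}, \tag{3}$$

or the false negative rate (FNR, corresponding to type II error rate),

$$\mathrm{FNR} = P(A \mid \neg S) = \frac{P(A) - P(S \cap A)}{P(\neg S)}. \tag{4}$$

Some more complex functions of classification probabilities, e.g., an $F_1$ score, also become identical to the true positive rate,

$$F_1 = \frac{2 \, \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} = \mathrm{TPR}.$$

The classification probabilities can be used with a utility function specifying the cost of each classification type (e.g., Metz, 1978).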
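The metrics can be computed directly once the true positive classification probability and the proportion of selected applicants are given. The Python sketch below is an illustration only: it assumes that FPR is read as the share of selected applicants who are not high-ability and FNR as the share of non-selected applicants who are high-ability, which is the reading consistent with the values reported in the grant peer review example. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(p_tp: float, prop_selected: float) -> dict:
    """Metrics implied by p_tp = P(S and A) when P(S) = P(A) = prop_selected."""
    if not 0.0 < prop_selected < 1.0:
        raise ValueError("prop_selected must lie strictly between 0 and 1")
    if not 0.0 <= p_tp <= prop_selected:
        raise ValueError("p_tp must lie in [0, prop_selected]")
    tpr = p_tp / prop_selected  # equals PPV and the F1 score in this setting
    fpr = 1.0 - tpr  # assumed reading: P(not A | S)
    fnr = (prop_selected - p_tp) / (1.0 - prop_selected)  # P(A | not S)
    return {"TPR": tpr, "PPV": tpr, "F1": tpr, "FPR": fpr, "FNR": fnr}


# Hypothetical inputs: 5% selected, a quarter of them truly high-ability
metrics = classification_metrics(p_tp=0.0125, prop_selected=0.05)
print(metrics)
```

Because the off-diagonal cells are equal, a single input probability fixes every entry of the table, so the function needs no further arguments.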

Quantile Approximation
The probability of true positive classification is, unfortunately, not directly estimable from the observed scores $y_{ij}$ unless we know the true latent ability $\gamma_i$ of each applicant. To deal with this hindrance, we propose a quantile approximation that allows us to estimate the true positive classification probability indirectly. Specifically, we leverage the distributional assumptions of the measurement model (Equation 1), our ability to estimate the population parameters of the measurement model, and one peculiarity of the selection procedure: the fact that we set the unconditional probability of an applicant having high ability, $P(A)$, and the unconditional probability of an applicant being selected, $P(S)$, when designing the selection procedure.
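In selection procedures, the high-ability applicants are defined by their ranking on the latent ability $\gamma_i$. However, we only measure the observed scores $y_{ij}$ when performing the selection. According to the measurement model, the observed scores $y_{ij}$ are an imprecise reflection of the latent abilities $\gamma_i$, which is disturbed by some measurement error with variance $\sigma^2_\epsilon$. The $J$ observed scores of each applicant are usually aggregated into a mean score $\bar{y}_i$ with the resulting measurement error standard deviation $\sigma_\epsilon / \sqrt{J}$. Importantly, the latent abilities $\gamma_i$ and the random variable of the aggregated scores $\bar{y}_i$ are jointly distributed with bivariate normal density,

$$\begin{pmatrix} \gamma_i \\ \bar{y}_i \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix}, \begin{pmatrix} \sigma^2_\gamma & \sigma^2_\gamma \\ \sigma^2_\gamma & \sigma^2_\gamma + \sigma^2_\epsilon / J \end{pmatrix} \right) \tag{5}$$

(e.g., Searle et al., 2006, pp. 258-259).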
Since the goal of selection procedures is selecting the $k$ high-ability applicants and since the desirability of applicants is defined by their latent ability $\gamma_i$, the high-ability applicants are defined as those with the $k$ highest latent ability scores $\gamma_i$.
Then, we can think about a cut-score on the latent ability $g_c$ that separates the high-ability applicants from the remaining applicants (with the high-ability applicants satisfying the condition $\gamma_i > g_c$). Similarly, since the selection procedure is performed with the observed aggregated scores $\bar{y}_i$ and since we select $k$ applicants according to their observed aggregated scores $\bar{y}_i$, the selected applicants are defined as those with the $k$ highest observed scores $\bar{y}_i$. Then, again, we can think about a cut-score on the observed aggregated scores $\bar{y}_c$ that separates the selected applicants from the remaining applicants (with the selected applicants satisfying the condition $\bar{y}_i > \bar{y}_c$).
In any particular data set, the cut-scores $g_c$ and $\bar{y}_c$ are dependent on the actual latent abilities $\gamma_i$ and observed aggregated scores $\bar{y}_i$ of applicants participating in the given selection procedure. However, under the assumption of the measurement model, we can approximate the cut-scores with marginal quantile functions of the joint distribution (Equation 5), yielding the quantile approximated cut-scores

$$\tilde{g}_c = \mu + \sigma_\gamma \, \Phi^{-1}(1 - k/N), \qquad \tilde{y}_c = \mu + \sqrt{\sigma^2_\gamma + \sigma^2_\epsilon / J} \; \Phi^{-1}(1 - k/N),$$

with $\Phi$ corresponding to the cumulative distribution function of a standard normal distribution and $\Phi^{-1}$ to its inverse (the quantile function).
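The quantile approximated cut-scores are the upper $k/N$ quantiles of the two marginal normal distributions, so they can be sketched with the Python standard library alone. The function and argument names below are hypothetical, and the variance components are assumed to be known or already estimated.

```python
from math import sqrt
from statistics import NormalDist


def approx_cut_scores(mu, var_gamma, var_eps, n_raters, prop_selected):
    """Quantile approximated cut-scores on the latent and observed mean scales."""
    if not 0.0 < prop_selected < 1.0:
        raise ValueError("prop_selected must lie strictly between 0 and 1")
    # Standard normal quantile that leaves prop_selected in the upper tail
    q = NormalDist().inv_cdf(1.0 - prop_selected)
    g_cut = mu + sqrt(var_gamma) * q  # marginal sd of the latent ability
    y_cut = mu + sqrt(var_gamma + var_eps / n_raters) * q  # sd of the mean score
    return g_cut, y_cut


# Hypothetical setting: total variance 1, single-rater IRR 0.30, three ratings
g_cut, y_cut = approx_cut_scores(0.0, 0.30, 0.70, 3, 0.20)
print(g_cut, y_cut)
```

When fewer than half of the applicants are selected, the cut-score on the observed means lies further from the grand mean than the cut-score on the latent abilities, because the mean scores carry additional error variance.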
Subsequently, the true positive classification probability in Table 2, defining the binary classification metrics, can be approximated using the latent abilities $\gamma_i$, their imperfect measurement via the random variable of aggregated observed scores $\bar{y}_i$, and the quantile approximated cut-scores $\tilde{g}_c$ and $\tilde{y}_c$,

$$P(S \cap A) \approx P(\gamma_i > \tilde{g}_c, \; \bar{y}_i > \tilde{y}_c).$$

In other words, the probability of an applicant being amongst both the high-ability and selected applicants can be approximated by the probability of both having latent ability higher than the approximated cut-score on the latent ability and having the observed aggregated score higher than the approximated aggregated cut-score.
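Finally, the true positive classification probability can be approximated solely with the population parameters and integration over the joint bivariate normal density,

$$P(\gamma_i > \tilde{g}_c, \; \bar{y}_i > \tilde{y}_c) = \int_{\tilde{g}_c}^{\infty} \int_{\tilde{y}_c}^{\infty} f(\gamma, \bar{y}) \, \mathrm{d}\bar{y} \, \mathrm{d}\gamma,$$

where $f$ denotes the joint bivariate normal density of the latent abilities and the aggregated scores. With the true positive classification probability at hand, we can assess the selection procedure in the binary classification framework and compute its error rates and other binary classification metrics.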
The true positive classification probability approximation is correct to the extent to which the observed scores $y_{ij}$ are a valid measure of the underlying ability $\gamma_i$. Often, the observed ratings are not only affected by lacking reliability but also by lacking validity. In such a case, the true positive classification probability is overestimated (i.e., additional errors occur due to the lack of validity), and the computed misclassification metrics (FPR and FNR) become a lower bound on the true misclassification metrics.
This approach is similar to the tetrachoric correlation coefficient (e.g., Bonett & Price, 2005;Pearson, 1900), which aims to estimate the correlation between latent continuous variables based on observed dichotomized variables instead of assessing the classification of latent dichotomous variables based on observed continuous variables.

Relationship between IRR and Binary Classification
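We can further utilize the quantile approximation to draw connections between IRR and the true positive classification probability of the binary classification.
We subtract the grand mean $\mu$ from both the latent abilities $\gamma_i$ and the observed aggregated scores $\bar{y}_i$ and standardize the random variables by the total standard deviation of the observed aggregated scores, $\sqrt{\sigma^2_\gamma + \sigma^2_\epsilon / J}$. This transforms the latent abilities into $\theta_i = (\gamma_i - \mu) / \sqrt{\sigma^2_\gamma + \sigma^2_\epsilon / J}$ and the observed aggregated scores into $\bar{z}_i = (\bar{y}_i - \mu) / \sqrt{\sigma^2_\gamma + \sigma^2_\epsilon / J}$. Subsequently, the bivariate normal density of the transformed abilities and the random variable of transformed observed aggregated scores simplifies to

$$\begin{pmatrix} \theta_i \\ \bar{z}_i \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \mathrm{IRR}_J & \mathrm{IRR}_J \\ \mathrm{IRR}_J & 1 \end{pmatrix} \right)$$

and yields the quantile approximated cut-scores of the transformed abilities and observed aggregated measures

$$\tilde{\theta}_c = \sqrt{\mathrm{IRR}_J} \; \Phi^{-1}(1 - k/N), \qquad \tilde{z}_c = \Phi^{-1}(1 - k/N),$$

which also highlights that the two cut-scores are equal under perfect inter-rater reliability, $\mathrm{IRR}_J = 1$.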
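The true positive classification probability can be directly estimated with only $\mathrm{IRR}_J$ and the proportion of selected applicants $k/N$,

$$P(S \cap A) \approx P(\theta_i > \tilde{\theta}_c, \; \bar{z}_i > \tilde{z}_c) = \int_{\tilde{\theta}_c}^{\infty} \int_{\tilde{z}_c}^{\infty} f(\theta, \bar{z}) \, \mathrm{d}\bar{z} \, \mathrm{d}\theta,$$

where $\tilde{\theta}_c$ and $\tilde{z}_c$ are the quantile approximated cut-scores of the transformed abilities and transformed observed aggregated scores and $f$ is their bivariate normal density, which depends on $\mathrm{IRR}_J$ only. This allows us to retrospectively evaluate selection procedures in the binary classification framework without access to the primary data.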
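A minimal Python sketch of this computation follows. It is not the IRR2FPR implementation. It assumes the standardized model described in this section, under which the rescaled latent ability and the standardized mean score are standard normal with correlation $\sqrt{\mathrm{IRR}_J}$ and share the cut-score $\Phi^{-1}(1 - k/N)$. The double integral is reduced to a single integral by conditioning on the observed score and evaluated with Simpson's rule; the function name and the number of integration steps are arbitrary choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tp_probability(irr_j: float, prop_selected: float, n_steps: int = 2000) -> float:
    """Approximate P(S and A) from IRR_J and the proportion selected k/N."""
    if not 0.0 < prop_selected < 1.0:
        raise ValueError("prop_selected must lie strictly between 0 and 1")
    if not 0.0 <= irr_j <= 1.0:
        raise ValueError("irr_j must lie in [0, 1]")
    if irr_j == 1.0:
        return prop_selected  # perfect reliability: every selection is correct
    q = STD_NORMAL.inv_cdf(1.0 - prop_selected)  # shared standardized cut-score
    rho = sqrt(irr_j)  # correlation of latent ability and mean score
    cond_sd = sqrt(1.0 - irr_j)  # sd of rescaled ability given the mean score

    def integrand(v: float) -> float:
        # density of the mean score times P(ability above cut | mean score = v)
        return STD_NORMAL.pdf(v) * (1.0 - STD_NORMAL.cdf((q - rho * v) / cond_sd))

    lower, upper = q, max(q, 0.0) + 10.0  # upper tail beyond this is negligible
    h = (upper - lower) / n_steps  # n_steps must be even for Simpson's rule
    total = integrand(lower) + integrand(upper)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * h)
    return total * h / 3.0


# For IRR_J = 0.25 and half of the applicants selected, the analytic value
# of the orthant probability is 1/4 + asin(0.5) / (2 * pi) = 1/3.
print(tp_probability(0.25, 0.5))
```

With zero reliability the result reduces to $(k/N)^2$, i.e., selection at random, and with perfect reliability to $k/N$, which gives two simple sanity checks for any implementation.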

Simulation
We conducted a simulation study to assess the performance of the quantile approximation. The simulation considers four data-generating scenarios to assess how the quantile approximation is affected by the violation of its assumptions: (1) 'normal', where the data-generating model corresponds to the assumptions of the quantile approximation, (2) 'skew', where the true abilities are positively skewed, (3) 'bias', where a proportion of applicants is rated in a reverse manner, and (4) 'dependent', where ratings across applicants depend on the reviewers. The simulation compared the empirical true positive classification probabilities $P(S \cap A)$, based on the simulated abilities $\gamma_i$ vs. the aggregated ability estimates $\bar{y}_i$ (i.e., the ground truth under complete knowledge), with the true positive classification probability obtained by quantile approximation, $P(\gamma > \tilde{g}_c, \bar{y} > \tilde{y}_c)$, based on $\hat{\sigma}^2_\gamma$ and $\hat{\sigma}^2_\epsilon$ estimates from a linear mixed model (Equation 7). We focus on the true positive classification probabilities rather than the error rates or other classification metrics, as the error rates are derived quantities of the true positive classification probability.
In the 'normal' scenario, the data were simulated from the measurement model defined by Equation (1) with settings inspired by Martinková et al. (2018). We manipulated the single-rater inter-rater reliability (intra-class correlation coefficient), $\mathrm{IRR}_1 \in \{0.15, 0.30, 0.45\}$ (fixing the overall variance $\sigma^2$ to 1; see footnote 2), the number of applicants, $N \in \{100, 300, 1000\}$, and the number of ratings, $J \in \{3, 5, 10\}$. The range of $\mathrm{IRR}_1$ values corresponds to values found in grant peer review evaluations (see Table 3 in the Example section), and the higher numbers of raters create settings where the total inter-rater reliability of the rating reaches $\mathrm{IRR}_J = 0.89$ (setting with $\mathrm{IRR}_1 = 0.45$ and $J = 10$).
The remaining scenarios were generated by the following modifications of the 'normal' scenario: the 'skew' scenario simulated the latent abilities $\gamma_i$ from a Gamma(3, 3) distribution, which generated a notable right skew; the 'bias' scenario reversed ratings for 10% of the applicants, i.e., breaking the relationship between mean ratings and the latent abilities $\gamma_i$; and the 'dependent' scenario generated ratings from non-exchangeable raters where between-rater variance constituted 0.10 of the total variance. We replicated each simulation condition 1000 times and estimated the variance parameters $\sigma^2_\gamma$ and $\sigma^2_\epsilon$ using linear mixed models implemented in the lme4 R package (Version 1.1.35.1, Bates et al., 2015). In the remainder of the section, we only discuss results based on $N = 100$ applicants, but see Appendix A for similar results with $N = 300$ and $N = 1000$.
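The empirical side of the comparison in the 'normal' scenario can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the simulation code of the study: the grand mean is set to 0, the total variance is fixed to 1, the variance components are not re-estimated with a mixed model, and the function name `empirical_tp` and the seeds are hypothetical.

```python
import random


def empirical_tp(n_applicants, n_ratings, irr_1, k, seed=0):
    """Share of applicants that are both truly top-k and selected as top-k."""
    rng = random.Random(seed)
    sd_gamma = irr_1 ** 0.5  # between-applicant sd (total variance fixed to 1)
    sd_eps = (1.0 - irr_1) ** 0.5  # measurement error sd
    abilities = [rng.gauss(0.0, sd_gamma) for _ in range(n_applicants)]
    # Mean of n_ratings noisy ratings for each applicant
    means = [
        sum(g + rng.gauss(0.0, sd_eps) for _ in range(n_ratings)) / n_ratings
        for g in abilities
    ]
    ids = range(n_applicants)
    top_true = set(sorted(ids, key=lambda i: abilities[i], reverse=True)[:k])
    top_obs = set(sorted(ids, key=lambda i: means[i], reverse=True)[:k])
    return len(top_true & top_obs) / n_applicants


# One replication: 100 applicants, 3 ratings, IRR_1 = 0.30, 20 selected
print(empirical_tp(100, 3, 0.30, 20, seed=1))
```

Averaging this quantity over many seeds gives the empirical benchmark against which an approximated true positive classification probability can be compared.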
Figure 1 visualizes the bias of quantile approximated true positive classification probabilities across all possible proportions of selected applicants ($k/N$, x-axis) with different data-generating scenarios encoded with color. The results are presented for different inter-rater reliability coefficients ($\mathrm{IRR}_1$, rows), the number of ratings ($J$, columns), and 100 applicants. The 'normal' (black) scenario showed little to no bias across all conditions. The 'skew' (orange) scenario resulted in a negative bias of the true positive classification probabilities for a small proportion of applicants and a positive bias for a large proportion of applicants for all but the lowest number of ratings and the lowest single-rater inter-rater reliability (due to estimation issues of the between-applicant variance). The 'bias' (blue) scenario showed steadily increasing positive bias for up to half of the selected applicants, with a steadily decreasing bias after that. The 'dependent' (green) scenario resulted in a more distinct bias in the opposite direction to the 'skew' scenario, i.e., a positive bias of the true positive classification probabilities for a small proportion of applicants and a negative bias for a large proportion of applicants, which was extremely pronounced with the lowest number of ratings and the lowest single-rater inter-rater reliability.
Across all but the 'bias' scenario, the magnitude of bias decreased with the increasing number of ratings and the single inter-rater reliability.Increased sample size also reduced the degree of bias although to a smaller degree.
Footnote 2: Whereas the measurement model is defined in terms of the grand mean $\mu$ and variances $\sigma^2_\gamma$ and $\sigma^2_\epsilon$, Equation (8) shows that $\mathrm{IRR}_J$ is the only relevant quantity for the quantile approximation.

Importantly, the quantile approximation estimates the expected classification probabilities for a selection procedure with given population characteristics. In other words, the quantile approximation might provide noisy estimates of the true values for any particular selection procedure, especially when the classification probabilities depend on a small number of selected applicants. Figure 2 visualizes the root mean square error (RMSE) of quantile approximated true positive classification probabilities across all possible proportions of selected applicants ($k/N$, x-axis) with different data-generating scenarios encoded with color. The results are presented for different inter-rater reliability coefficients ($\mathrm{IRR}_1$, rows), number of ratings ($J$, columns), and 100 applicants. Note that the true positive classification probability is bounded at the endpoints: it is zero if no candidate is selected, and it is one if all candidates are selected. As such, the RMSE of the true positive classification probability is necessarily higher around the middle proportion of selected applicants, since it can attain the widest range of values there. We find that the RMSE is comparable in the 'normal', 'skew', and 'bias' scenarios. With 1000 applicants, we can notice an increased RMSE for the 'skew' scenario in high proportions of selected applicants, which is a result of the biased estimates. The 'dependent' scenario produces a much larger RMSE, especially in conditions with a low number of ratings and low single-rater inter-rater reliability. The RMSE of the approximation, again, improves with the increasing number of ratings and single-rater inter-rater reliability as well as with the increasing number of applicants.

Figure 2: Root mean square error of quantile approximated true positive classification probabilities for 100 applicants across all possible proportions of selected applicants (x-axis) with different data-generating scenarios encoded with color. Results are presented for different inter-rater reliability coefficients (rows) and the number of ratings (columns).

Confidence and Prediction Intervals
Dependency on any particular selection procedure can be further illustrated by visualizing the empirical vs. quantile approximated results for a single trial. For example, consider the false positive rate, the probability that an applicant is selected while they are not part of the high-ability group (a transformation of the previously summarized true positive classification probability, Equation 3; see Figure 3). The empirical false positive rate wildly oscillates between 0 and 1, especially when considering the selection of only a small number of applicants (x-axis). These wild oscillations, accompanied by the extremely wide 95% prediction interval (blue dotted line; based on the quantile function of the binomial distribution), are an inherent property of summarizing the proportion of successes of a binomial outcome with a small number of trials. In contrast, the 95% confidence interval of the false positive rate (blue dashed lines), which quantifies the uncertainty of the false positive rate estimate (based on a non-parametric bootstrap), is wider in the lower proportion of the selected candidates and narrower around the endpoints.
This results from the false positive rate estimate being bounded at the endpoints, with higher uncertainty in the lower proportion of selected applicants, as it is a transformation of the true positive classification probability.

Example: Grant Peer Reviews
We illustrate the methodology by estimating binary classification metrics for several grant peer reviews. See Table 3 for characteristics of the grant peer reviews as summarized in Table 1 of Erosheva et al. (2021) and extended by the results reported therein. It is important to note that the reported values are averages, often aggregated across disciplines and years (sometimes accompanied by a range of values). The results should, therefore, be taken as an illustration showcasing the possible interpretation and inferences rather than an evaluation of the funding agencies' grant review process.

[Table 3: Characteristics of the grant peer reviews, listed by study; the table body is not recoverable.]

Table 3 notes: 1 Cole et al. (1978, p. 140). 2 Based on Erosheva et al. (2020). COSPUP = Committee on Science and Public Policy of the National Academy of Sciences, NSF = National Science Foundation, FWF = Austrian Science Fund, AIBS = American Institute of Biological Sciences, NIH = National Institutes of Health. $J$ corresponds to the mean number of raters in Cicchetti (1991), Mutz et al. (2012), and Erosheva et al. (2021), as the number of raters per proposal was unequal.
Figure 4 visualizes the false positive rate (left) and false negative rate (right) estimates for each grant peer-review procedure (different colors) based on quantile approximation. Thin full lines visualize the computed rates across the range of possible proportions of selected applicants, and thick full lines (and points) highlight the ranges (and values) of the actual proportion of selected applicants. The estimated false positive rate ranges from 0.75 for the lowest proportion of selected applicants (5%), lowest inter-rater reliability (0.14), and two raters in the AIBS 2009-11 grant proposals data to 0.18 for the second highest proportion of selected applicants (51%), high inter-rater reliability (0.37), and more than four raters in the NSF & COSPUP (1985) grant proposals data. However, the false negative rate shows the opposite pattern: the lowest false negative rates of 0.029 and 0.040 are for the AIBS 2009-11 grant proposals data with the lowest proportion of selected applicants, and the highest false negative rate of 0.27 is for the FWF (1999-04) data set with the highest proportion of selected applicants (53%).
These results highlight that most differences in false positive and false negative rates can be ascribed to the difference in the proportion of selected applicants. Although the false positive rates differ by 58 percentage points across the example data sets (37 percentage points excluding the lower IRR estimates of NSF & COSPUP and AIBS), comparing the grant proposal selection procedures at an equal proportion of selected applicants (across the 5%-53% range of selected applicants) reduces the maximum difference to 30 percentage points (15 percentage points excluding the lower IRR estimates of NSF & COSPUP and AIBS; both at 5% of selected applicants).
We can combine both types of error into an $F_1$ score, which in the case of selection procedures of the best candidates also corresponds to the true positive rate. Figure 5 visualizes the computed $F_1$ score based on the quantile approximation for each grant peer-review procedure in the same format as the previous figure. The NSF & COSPUP (1985) grant proposals data result in the highest $F_1 = 0.82$ at the second highest proportion of selected candidates. The AIBS 2009-11 grant proposals data result in the lowest $F_1 = 0.24$ at the lowest proportion of selected applicants. Again, the large majority of differences between selection procedures are based on the proportion of selected candidates; however, the minimum difference between the best and worst performing selection procedures according to the $F_1$ score remains 0.14 (0.09 excluding the lower IRR estimates of NSF & COSPUP and AIBS; both at 53% of selected applicants).

Figure 4: Estimates of the false positive and false negative rate for grant peer review procedures based on quantile approximation for published estimates of IRR and number of raters $J$ of individual studies (different colors). Thin full lines visualize the estimated rates across the range of possible proportions of selected applicants, and thick full lines (and points) highlight the ranges (and values) of the actual proportion of selected applicants. For NSF & COSPUP and AIBS, the lower $\mathrm{IRR}_J$ estimate, resulting in a higher false positive rate, is shown as dotted lines / empty circles, whereas the higher $\mathrm{IRR}_J$ estimate is shown as full lines / full circles.

In Detail Assessment of NIH (2014-16)
We can take a closer look at the NIH data set of Erosheva et al. (2021), where the reported $\mathrm{IRR}_1 = 0.34$ was accompanied by a 95% CI [0.31, 0.37]. The corresponding estimate of the false positive rate is 0.397, and as in the simulation study, we can transform the 95% CI of $\mathrm{IRR}_1$ to obtain the 95% CI [0.380, 0.416] and the 95% prediction interval [0.331, 0.465] for the false positive rate. We can also compute the false negative rate estimate (i.e., type II error rate, Equation 4) of 0.087, 95% CI [0.083, 0.091].
We might redesign the selection procedure to achieve a lower false positive rate. This can be achieved by increasing the number of raters (left panel of Figure 6) or modifying the rating protocol to improve the inter-rater reliability $\mathrm{IRR}_1$ (right panel of Figure 6). Even though both changes lead to an increase in the overall inter-rater reliability, $\mathrm{IRR}_J$, and consequently a lower false positive rate, each change corresponds to a different modification of the peer review process (i.e., hiring more raters vs. training the raters).
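The effect of adding raters can be explored with a short Python sketch that chains the Spearman-Brown formula with the quantile approximation. It is an illustration under the normal measurement model, not the IRR2FPR implementation: the function name is hypothetical, the proportion of selected applicants (0.20) is an arbitrary illustrative value rather than the NIH figure, and the single-rater IRR must be strictly below 1 for the conditional standard deviation to be positive.

```python
from math import sqrt
from statistics import NormalDist

NORM = NormalDist()  # standard normal distribution


def fpr_for_raters(irr_1, n_raters, prop_selected, n_steps=2000):
    """Approximate FPR = 1 - P(S and A) / P(S) for the mean of n_raters ratings."""
    # Spearman-Brown: reliability of the mean of n_raters ratings
    irr_j = n_raters * irr_1 / (1.0 + (n_raters - 1) * irr_1)
    q = NORM.inv_cdf(1.0 - prop_selected)  # standardized cut-score
    rho, cond_sd = sqrt(irr_j), sqrt(1.0 - irr_j)
    lower, upper = q, max(q, 0.0) + 10.0
    h = (upper - lower) / n_steps  # n_steps must be even for Simpson's rule

    def f(v):
        # density of the mean score times P(ability above cut | mean score = v)
        return NORM.pdf(v) * (1.0 - NORM.cdf((q - rho * v) / cond_sd))

    simpson = f(lower) + f(upper) + sum(
        (4 if i % 2 else 2) * f(lower + i * h) for i in range(1, n_steps)
    )
    p_tp = simpson * h / 3.0
    return 1.0 - p_tp / prop_selected


# Single-rater IRR of 0.34 with a hypothetical 20% of applicants selected
for n_raters in (1, 2, 4, 8):
    print(n_raters, round(fpr_for_raters(0.34, n_raters, 0.20), 3))
```

The same function can be evaluated over a grid of single-rater IRR values with the number of raters held fixed, which mirrors the second kind of redesign discussed above.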
The black line in Figure 6 [...] positive error rate remains relatively high even when increasing both the number of raters and $\mathrm{IRR}_1$. A similar analysis can further incorporate the cost of each modification and weigh it against the benefit of decreasing the false positive rate of the selection procedure.

Figure 5: Estimates of the true positive rate (equivalent to $F_1$ score) for grant peer review procedures based on quantile approximation for published estimates of IRR and number of raters $J$ of individual studies (different colors). Thin full lines visualize the estimated true positive rate across the range of possible proportions of selected applicants, and thick full lines (and points) highlight the ranges (and values) of the actual proportion of selected applicants. For NSF & COSPUP and AIBS, the lower $\mathrm{IRR}_J$ estimate, resulting in a higher false positive rate, is shown as dotted lines / empty circles, whereas the higher $\mathrm{IRR}_J$ estimate is shown as full lines / full circles.
The calculations presented here are implemented in the IRR2FPR R package which can also be interactively run through the ShinyItemAnalysis application (Martinková & Drabinová, 2018).

Discussion
We outlined an approach for evaluating ratings-based selection procedures in a binary classification framework. This approach allows researchers and stakeholders to assess the quality of selection procedures by linking inter-rater reliability to the actual selection decisions. As a result, the quality of selection procedures does not need to be evaluated only via a coefficient of inter-rater reliability but also via the desired false (true) positive/negative rates and their associated costs.
These results can be used in more complex utility functions, featuring the costs of increasing the number of raters, modifying the rating guidelines, and changing the proportion of selected applicants.
The approach is based on a minimal measurement model, commonly used for assessing inter-rater reliability. We link the measurement model to binary decisions via a quantile approximation, assuming the goal of the selection procedure is selecting the best candidates. We showed how to compute the probability of correct classification and other binary classification metrics. We also showed how the binary classification relates to inter-rater reliability via the quantile approximation relying only on the number of ratings and proportion of selected applicants. The fact that the quality of the selection procedure evaluated via reliability is crucially dependent on the proportion of selected applicants echoes similar calls from past research on validity (Brogden, 1949; Taylor & Russell, 1939).
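The link between reliability and classification can be illustrated numerically. The sketch below assumes that latent ability and the final standardized rating are bivariate normal with correlation equal to the square root of IRR_J, and that the same proportion is cut off on both scales. It is an illustration in the spirit of the quantile approximation, not the IRR2FPR code; the function names are hypothetical, and the false positive rate is taken here as the share of selected applicants who are not truly among the best, which is an assumption about the definition.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def true_positive_rate(irr, prop_selected, steps=4000):
    """P(truly among the best | selected); needs 0 < irr < 1, 0 < prop_selected < 1."""
    rho = sqrt(irr)                              # correlation of rating with latent ability
    cut = STD_NORMAL.inv_cdf(1 - prop_selected)  # common cut-off on both standardized scales
    resid_sd = sqrt(1 - irr)                     # sd of the rating given the ability

    def integrand(a):
        # density of ability a times P(standardized rating > cut | ability = a)
        return STD_NORMAL.pdf(a) * (1 - STD_NORMAL.cdf((cut - rho * a) / resid_sd))

    # composite Simpson rule on [cut, cut + 10]; the tail beyond is negligible
    width = 10 / steps
    total = integrand(cut) + integrand(cut + 10)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(cut + i * width)
    joint = total * width / 3                    # P(ability > cut and rating > cut)
    return joint / prop_selected


def false_positive_rate(irr, prop_selected):
    """Share of selected applicants who are not truly among the best (assumed definition)."""
    return 1 - true_positive_rate(irr, prop_selected)


# The result depends only on the reliability and the proportion selected
for prop in (0.05, 0.18, 0.50):
    print(prop, round(false_positive_rate(0.59, prop), 3))
```

Under these assumptions, lowering the proportion of selected applicants at a fixed reliability makes each selection more error-prone, which is the dependence on the selection proportion noted above.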
The approach, however, evaluates the selection procedures only with regard to reliability, assuming that the ratings are completely valid and that the only limiting factor of the selection procedure is the measurement error (Bornstein, 1991). That is, unfortunately, rarely the case. In cases with lower than perfect validity, the computed false positive rate can be used as a lower bound on the true false positive rate, which is further contaminated by low validity. Furthermore, one might argue against optimizing raters' reliability, as the disagreement between raters might indicate that raters were selected on diverse bases and focused on different aspects of applicants (e.g., Bailar, 1991; Hargens, 1991; Kiesler, 1991; C. J. Lee, 2012).
We evaluated the quantile approximation in a simulation study and found little bias when the model assumptions were met. Breaking the dependency between the observed ratings and the latent abilities resulted in an overestimation of the true positive rate, which would result in the expected underestimation of the false positive rate; however, a positive skew of the true abilities and dependency of the ratings on the raters led to different patterns of bias and increased RMSE. In general, the bias and RMSE decreased with an increasing number of raters, single-rater inter-rater reliability, and number of applicants. Importantly, the estimated classification probabilities and resulting binary classification metrics correspond to the expectation of the selection procedure. The actual classification probabilities for any given selection procedure might be far from the expectations, especially when considering only a small proportion of selected applicants, as the results then depend on a few binary events that are inherently noisy.
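The noisiness of a single selection round can be illustrated with a small simulation. The sketch below is not the simulation design of the study; it assumes normally distributed latent abilities and independent normal rating errors, uses the settings from the caption of Figure 3 (5 ratings, IRR_1 = 0.30, 100 applicants), and picks a hypothetical selection proportion of 10%. The function name `simulate_tpr` is illustrative.

```python
import random
from statistics import mean, stdev


def simulate_tpr(irr_1, n_raters, n_applicants, prop_selected, rng):
    """One simulated selection round; returns the empirical true positive rate."""
    error_sd = ((1 - irr_1) / irr_1) ** 0.5      # latent variance fixed at 1
    ability = [rng.gauss(0, 1) for _ in range(n_applicants)]
    # final score = latent ability plus the mean of n_raters independent errors
    rating = [
        a + sum(rng.gauss(0, error_sd) for _ in range(n_raters)) / n_raters
        for a in ability
    ]
    k = max(1, round(prop_selected * n_applicants))
    ids = range(n_applicants)
    best = set(sorted(ids, key=lambda i: ability[i], reverse=True)[:k])
    chosen = set(sorted(ids, key=lambda i: rating[i], reverse=True)[:k])
    return len(best & chosen) / k                # share of selected who are truly best


rng = random.Random(2024)                        # fixed seed for reproducibility
runs = [simulate_tpr(0.30, 5, 100, 0.10, rng) for _ in range(500)]
print(f"mean = {mean(runs):.2f}, sd = {stdev(runs):.2f}, "
      f"range = [{min(runs):.1f}, {max(runs):.1f}]")
```

With only ten selected applicants per round, each round's rate moves in steps of 0.1, so individual rounds can deviate noticeably from the average across rounds.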
We showed the methodology in an example comparing false positive rates, false negative rates, and F1 scores across multiple grant peer reviews. The results of our example should be interpreted with great caution. First, we based our results on averages across different fields that often vary in the inter-rater reliability, number of raters per proposal, and the number of selected applicants. Second, the grant agencies often base the funding decisions on a combination of the overall rating and additional information (which is not included in our calculation). While acknowledging the limitations, the results still indicate a relatively high false positive rate and better controlled false negative rates due to the generally small proportions of selected applicants.
The outlined approach could be further expanded in multiple directions to accommodate different assumptions about the data-generating process. First, we used a minimal measurement model for assessing the inter-rater reliability of continuous ratings. However, each parameter of the measurement model might differ across groups of applicants (e.g., Bartoš et al., 2019; Martinková et al., 2018, 2023; Mutz et al., 2012). Additional variance components might be used for modeling dependencies in the data (e.g., the effect of raters; Martinková et al., 2018). Different applicants might be rated a different number of times, the measured rankings might not be assumed to be normally distributed (e.g., see Pearce & Erosheva, 2022, for joint modeling of continuous and ordinal data), and the latent abilities might follow a different distribution as well. All these possibilities can be included in the measurement model and further propagated through the quantile approximation by using a mixture of distributions for the latent abilities based on the proportion of groups and group-specific parameters, by specifying a different type of distribution and measurement error for the observed rankings, or by specifying a different type of distribution for the latent abilities.
Second, we focused on classification probabilities and the resulting binary classification metrics themselves. Under the specified measurement model, most erroneous classifications happen close to the selection boundary. A further extension might consider weighting the classification probabilities by a utility function accounting for the degree of the error (i.e., mistakenly refusing an applicant who is right above the classification threshold is less costly than refusing an applicant much higher on the latent ability; see, e.g., Cronbach & Gleser, 1957, for different utility models applied in the context of validity).
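The idea of weighting errors by their degree can be made concrete with a short simulation. The sketch below is an illustration under assumptions, not a proposal for a specific utility model: it measures the cost of wrongly refusing an applicant as the distance of their latent ability above the ability cut-off, and compares these distances with those of all truly best applicants. The function name `boundary_gaps`, the linear cost, and the inputs (reliability 0.59, 2000 applicants, 18% selected) are hypothetical choices.

```python
import random
from statistics import mean


def boundary_gaps(irr, n_applicants, prop_selected, rng):
    """Simulate one round; return ability gaps above the cut-off for all truly
    best applicants and for those among them who were wrongly refused."""
    error_sd = ((1 - irr) / irr) ** 0.5          # latent variance fixed at 1
    ability = [rng.gauss(0, 1) for _ in range(n_applicants)]
    rating = [a + rng.gauss(0, error_sd) for a in ability]
    k = max(1, round(prop_selected * n_applicants))
    ids = range(n_applicants)
    by_ability = sorted(ids, key=lambda i: ability[i], reverse=True)
    by_rating = sorted(ids, key=lambda i: rating[i], reverse=True)
    best, chosen = by_ability[:k], set(by_rating[:k])
    cutoff = ability[by_ability[k - 1]]          # ability of the weakest truly best applicant
    all_gaps = [ability[i] - cutoff for i in best]
    missed_gaps = [ability[i] - cutoff for i in best if i not in chosen]
    return all_gaps, missed_gaps


rng = random.Random(1)                           # fixed seed for reproducibility
all_gaps, missed_gaps = boundary_gaps(0.59, 2000, 0.18, rng)
print(f"share of truly best applicants refused: {len(missed_gaps) / len(all_gaps):.2f}")
print(f"mean ability gap, all truly best:       {mean(all_gaps):.2f}")
print(f"mean ability gap, wrongly refused:      {mean(missed_gaps):.2f}")
```

A count-based metric treats every refused applicant in the second group equally, whereas a gap-weighted cost would discount the many errors that fall just above the cut-off.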
Finally, in our real-data example, we focused on the context of grant proposal ratings; however, the explored connection between IRR and the binary classification framework is also relevant in other areas. For example, in the context of educational measurement, decisions about teacher hiring are based on an assessment of applicant submissions (Goldhaber et al., 2021), medical school admissions may be informed by ratings of personal and interpersonal qualities obtained in simulation-based assessment centers (Ziv et al., 2008), and decisions about teacher promotion may be based on classroom observations (Casabianca et al., 2013; Hill et al., 2012). In the context of psychological assessment and health-related measurements, treatment assignment, as well as employment decisions, may be based on ratings from multiple raters or on multi-item instruments (Rasova et al., 2012). Similarly, ratings from multiple raters inform the selection of journal articles (Marsh & Ball, 1989) and decisions in many other areas.
Despite its limitations, the presented approach provides a straightforward metric for evaluating the quality of selection

Figure 1: Bias of quantile approximated true positive classification probabilities for 100 applicants across all possible proportions of selected applicants (x-axis) with different data generating scenarios encoded with color. Results are presented for different inter-rater reliability coefficients (rows) and the number of ratings (columns).

Figure 3: Example of quantile approximated estimate of the false positive rate (full blue line), the 95% confidence interval (blue dashed line), 95% prediction interval (blue dotted line), and the empirical false positive rate (full black line) from a random simulation with 5 ratings, IRR_1 = 0.30, and 100 applicants.
visualizes changes in the false positive rate when increasing the number of raters (left) or the IRR 1 (right).The added orange and red lines show an additional change in the false positive rate when altering both factors simultaneously.We see that for a selection procedure resulting in the selection of 18% applicants, the false

Figure 6: Change in false positive rate when increasing the number of raters for different levels of IRR_1 (left) or increasing the IRR_1 for different numbers of raters (right) for a selection procedure based on NIH (2014-16) grant peer review. The original trial selected 18% out of 2076 applicants with J = 2.79 raters (black line in the right panel) and IRR_1 = 0.34 (black line in the left panel). A joint increase of IRR_1 and the number of raters is depicted in orange and red.

Table 3: Overview of the grant peer reviews
1 Based on Table46 from