Estimation of the ROC curve and the area under it with complex survey data

Logistic regression models are widely applied in daily practice. Hence, it is necessary to ensure that they have adequate predictive performance, which is usually assessed by means of the receiver operating characteristic (ROC) curve and the area under it (area under the curve [AUC]). Traditional estimators of these parameters were designed for simple random samples and are not appropriate for complex survey data. The goal of this work is to propose new weighted estimators for the ROC curve and AUC based on sampling weights which, in the context of complex survey data, indicate the number of units that each sampled observation represents in the population. The behaviour of the proposed estimators is evaluated and compared with the traditional unweighted ones by means of a simulation study. Finally, weighted and unweighted ROC curve and AUC estimators are applied to real survey data in order to compare the estimates in a real scenario. The results support the use of the weighted estimators proposed in this work to obtain unbiased estimates of the ROC curve and AUC of logistic regression models fitted to complex survey data.


1 | INTRODUCTION
Prediction models are widely used in many different fields. Medical research, meteorology, business, and biology are just a few examples.
Although prediction models can be used for a variety of purposes, they are heavily used for decision-making. For example, in finance, they can be useful for predicting loan defaults (Li et al., 2022); in ecology, particularly in fisheries, they are often used to make conservation decisions (Guisan et al., 2013; Li et al., 2020); and in medicine, another field in which prediction models are widely implemented as a support for decision-making, they can be helpful for deciding whether a patient should be admitted to an intensive care unit, among other purposes (Arostegui et al., 2019). Therefore, given the impact that the use of prediction models can have in many situations, it is necessary to ensure that these models are valid and applicable in practice. Thus, several aspects need to be considered in the development process of these models. A useful checklist can be found in Steyerberg (2009). In particular, when the goal is prediction, ensuring good model performance is essential. In this work, we focus on logistic regression models for dichotomous response variables. The performance of logistic regression models is usually analysed by means of calibration and discrimination ability (Steyerberg, 2009). Calibration measures the agreement between outcomes and predictions (see, e.g., the goodness-of-fit test proposed by Hosmer and Lemeshow (1980)). In this study, we focus on discrimination ability, which measures the ability of the models to distinguish between units with the event of interest and without it. This is usually measured by means of the receiver operating characteristic (ROC) curve, which is defined as the curve formed by the specificity and sensitivity parameters (i.e., the probabilities of properly classifying individuals without and with the event of interest, respectively) across all the possible cutoff points (Green & Swets, 1966; Pepe, 2003; Swets & Pickett, 1982). The area under the ROC curve (AUC) is one of the most widely used summary measures to analyse the discrimination ability of logistic regression models (Pepe, 2003). Bamber (1975) showed the equivalence between the area under the ROC curve and the Mann-Whitney U-statistic, thus offering an interpretation of the AUC as the probability that the model assigns a higher probability of event to an individual with the event of interest than to an individual without it.
Complex survey data are becoming increasingly popular in various fields, including health and social sciences, among others (see, e.g., Fisher et al., 2020). This type of data is collected from the finite population under study by means of some complex sampling design, such as stratification, clustering, or a combination of them in different stages of the sampling process. One of the differences between complex survey data and simple random samples is that, in the former, each sampled observation $i$ is assigned a sampling weight $w_i$, which is defined as the inverse of the probability that unit $i$ is included in the sample $S$, that is, $w_i = 1/\pi_i$, where $\pi_i = P(i \in S)$. The sampling weight assigned to each observation indicates the number of units that this observation represents in the finite population. Due to these particularities, the straightforward application of the most commonly used statistical techniques for the development of prediction models, which assume that the data have been randomly collected and that sampled units are independent and identically distributed, is usually not appropriate for complex survey data and needs to be checked before being implemented in this context (Heeringa et al., 2017; Skinner et al., 1989).
For this reason, the effect of complex survey data on the development of prediction models in general, and logistic regression models in particular, has been widely discussed in the literature in recent years. One of the most discussed topics in the context of complex surveys is the effect of the sampling design on the estimation of model parameters (see, e.g., Binder & Roberts, 2009; Iparragirre et al., 2023; Lumley & Scott, 2017; Scott & Wild, 1986, 2002, as a summary of a large debate on this topic). Similarly, complex survey data have been shown to have a great impact on the development of prediction models, and numerous advances have been made in this field in recent years. Lumley and Scott (2015) proposed new design-based estimators for two widely used model selection criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), considering the sampling design. Iparragirre et al. (2023) proposed a new technique for considering complex sampling designs for variable selection with lasso regression models. Focusing on the evaluation of model performance, Archer et al. (2007) proposed a goodness-of-fit test that considers complex sampling designs to analyse the calibration of the models fitted to complex survey data. In the context of discrimination ability, Yao et al. (2015) proposed a modification of the Mann-Whitney U-statistic in order to consider the sampling design when estimating the AUC of the models, incorporating pairwise sampling weights, which are defined as the inverse of the joint inclusion probability of a pair of observations $(i, j)$, that is, $w_{ij} = 1/\pi_{ij}$, where $\pi_{ij} = P[(i \in S) \cap (j \in S)]$, $\forall i, j \in S$ (Horvitz & Thompson, 1952; Särndal et al., 2003). Finally, Iparragirre et al. (2022) proposed incorporating sampling weights into the estimation process of optimal cutoff points for classifying individuals as units with or without the event of interest.
As mentioned previously, in this work, we aim to focus on the evaluation of the discrimination ability of logistic regression models. Even though Yao et al. (2015) proposed a weighted estimator for the AUC, to our knowledge, there is a lack of proposals for estimating the ROC curve considering complex sampling designs. Therefore, the main goal of this work is to propose a weighted estimator for the ROC curve. In particular, we propose to use the weighted specificity and sensitivity estimators defined in Iparragirre et al. (2022) to define a new weighted estimator for the ROC curve. In addition, we calculate the area under the curve in order to estimate the AUC parameter following Bamber (1975) and Tsuruta and Bax (2006), and finally, we show that this AUC estimator, defined as the area under the weighted estimate of the ROC curve, is equal to the weighted Mann-Whitney U-statistic considering marginal sampling weights $w_i$, $\forall i \in S$, rather than pairwise sampling weights $w_{ij}$, $\forall i, j \in S$, as proposed by Yao et al. (2015). The estimation of the AUC is then a simple weighted expression that can easily be calculated in practice, given that the marginal sampling weights are usually explicitly available when working with complex survey data, in contrast to the pairwise sampling weights, which usually need to be calculated by means of some computational package. The performance of this proposal is analysed by means of a simulation study, in which the weighted and unweighted estimates of the ROC curve and AUC are compared with the true population ones. In addition, the proposed methods are applied to real survey data, and the weighted estimates of the ROC curve and AUC are compared with the unweighted ones.
The rest of the paper is organised as follows. In Section 2, we first set the basic notation. Then, we define the proposed weighted estimator of the ROC curve and we calculate the area under it. Finally, we show the equivalence between the area under the weighted estimate of the ROC curve and the weighted Mann-Whitney U-statistic considering marginal sampling weights. In Section 3, the simulation study conducted in order to analyse the performance of the proposed estimators is described, and the results obtained are depicted and summarised. In Section 4, the proposed estimators are applied to real survey data. Finally, the paper concludes with a discussion in Section 5.

2 | METHODS
The goal of this section is to describe our proposal to estimate the ROC curve of logistic regression models fitted to complex survey data considering sampling weights. We calculate the area under the weighted estimate of the ROC curve in order to estimate the AUC, and we show the equivalence between this area and a modification of the Mann-Whitney U-statistic considering sampling weights, which leads us to conclude that this estimator can be used in order to obtain weighted estimates of the AUC.
The rest of the section is organised as follows. In Section 2.1, we set the basic notation related to the logistic regression model, the ROC curve, and the AUC, as well as to complex survey data. In Section 2.2.1, we define our proposal to consider sampling weights to estimate the ROC curve and the area under the curve (AUC). In Section 2.2.2, we show the equivalence between the area under the weighted estimate of the ROC curve and the Mann-Whitney U-statistic considering sampling weights.

2.1 | Background and basic notation
Let $X$ be the vector of covariates and $Y$ the dichotomous response variable, which takes the value $Y = 1$ for the units with the characteristic of interest (events), and $Y = 0$ otherwise (non-events). Let $p(x_i) = P(Y = 1 \mid x_i)$ be the conditional probability of event for an individual $i$ given the values of its vector of covariates $x_i$. Let $\beta$ indicate the vector of regression coefficients. Then the specific form of the logistic regression model is as follows:

$$p(x_i) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}. \tag{1}$$

Based on the probability $p(x_i)$ and a cutoff point $c$, each individual can be classified as event (if $p(x_i) \geq c$) or non-event (if $p(x_i) < c$). However, this classification may be correct or incorrect depending on the selected cutoff point. The correct classifications, based on a particular cutoff point, are usually quantified by the specificity ($Sp(c)$) and sensitivity ($Se(c)$) parameters, which are defined as the probabilities of correctly classifying the non-events and events, respectively, that is,

$$Sp(c) = P(p(X) < c \mid Y = 0) \quad \text{and} \quad Se(c) = P(p(X) \geq c \mid Y = 1). \tag{2}$$

The discrimination ability of a logistic regression model is usually evaluated by means of the area under the ROC curve (AUC), where the ROC curve is defined as the set of pairs $1 - Sp(c)$ and $Se(c)$ across all the possible cutoff points $c$ (Green & Swets, 1966; Swets & Pickett, 1982).
The AUC ranges from 0.5 (an uninformative model) to 1 (a perfect model in terms of discrimination) (Steyerberg, 2009).
Let $S$ indicate a sample of $n$ observations of the vector of random variables $(Y, X)$, that is, $S = \{(y_i, x_i)\}_{i=1}^{n}$. Let $S_0$ and $S_1$ be its subsamples of sizes $n_0$ and $n_1$ formed by the units without the event of interest and with the event of interest, respectively (note that $S_0 \cap S_1 = \emptyset$ and $S_0 \cup S_1 = S$).
Let $\hat{\beta}$ indicate the vector of estimated regression coefficients, which are usually estimated by maximising the likelihood function in Equation (3), and $\hat{p}_i = \hat{p}(x_i)$ the corresponding estimated probabilities of event, $\forall i \in S$ (McCullagh & Nelder, 1989):

$$L(\beta) = \prod_{i \in S} p(x_i)^{y_i} \left[1 - p(x_i)\right]^{1 - y_i}. \tag{3}$$

In practice, specificity and sensitivity parameters for a particular cutoff point $c$ are estimated as proportions of correctly classified sampled non-events and events, respectively (see, e.g., Pepe, 2003), that is,

$$\widehat{Sp}(c) = \frac{1}{n_0} \sum_{j \in S_0} I(\hat{p}_j < c) \quad \text{and} \quad \widehat{Se}(c) = \frac{1}{n_1} \sum_{k \in S_1} I(\hat{p}_k \geq c), \tag{4}$$

where $I(\cdot)$ denotes the indicator function. Then, the estimated ROC curve is defined by means of each estimated pair of sensitivity and specificity parameters, for each possible cutoff point (Pepe, 2003), as shown in Equation (5):

$$\widehat{ROC}(\cdot) = \left\{ \left(1 - \widehat{Sp}(c), \widehat{Se}(c)\right),\; c \in (-\infty, \infty) \right\}. \tag{5}$$

Bamber (1975) showed that the area under the ROC curve defined in Equation (5) can be estimated ($\widehat{AUC}$) as described in Equation (6), by means of the Mann-Whitney U-statistic:

$$\widehat{AUC} = \frac{1}{n_0 n_1} \sum_{j \in S_0} \sum_{k \in S_1} \left[ I(\hat{p}_j < \hat{p}_k) + 0.5\, I(\hat{p}_j = \hat{p}_k) \right]. \tag{6}$$

However, in the context of complex survey data, sample $S$ is usually obtained by sampling a finite population $U = \{1, \dots, N\}$ of interest for the survey, following some complex sampling design. For each sampled observation $i \in S$, $w_i = 1/\pi_i$ (where $\pi_i = P(i \in S)$) denotes the sampling weight assigned to this observation, indicating the number of units from the finite population it represents. In this context, the regression coefficients and the corresponding probabilities of event are usually estimated ($\hat{\beta}$ and $\hat{p}_i = \hat{p}(x_i)$, $\forall i \in S$) by maximising the pseudo-likelihood function (Binder, 1983; Iparragirre et al., 2023) defined in Equation (7):

$$PL(\beta) = \prod_{i \in S} \left\{ p(x_i)^{y_i} \left[1 - p(x_i)\right]^{1 - y_i} \right\}^{w_i}. \tag{7}$$

In practice, the goal is to evaluate the fitted model's performance in the finite population $U$, that is, the ability of the model fitted following Equation (7) to discriminate individuals with and without the event of interest in the finite population. Let $\{(y_i, x_i)\}_{i=1}^{N}$ be $N$ realisations of the set of random variables $(Y, X)$, and let $N_0$ and $N_1$ indicate the sizes of the subsets formed by the non-events ($U_0$) and events ($U_1$) of the finite population $U$. Then, the finite population ROC curve can be defined as in Equation (5), where sensitivity and specificity parameters are estimated following Equation (4) for all the units in the finite population $U$, that is,

$$\widehat{ROC}_{pop}(\cdot) = \left\{ \left(1 - \widehat{Sp}_{pop}(c), \widehat{Se}_{pop}(c)\right),\; c \in (-\infty, \infty) \right\}, \quad \widehat{Sp}_{pop}(c) = \frac{1}{N_0} \sum_{j \in U_0} I(\hat{p}_j < c), \quad \widehat{Se}_{pop}(c) = \frac{1}{N_1} \sum_{k \in U_1} I(\hat{p}_k \geq c). \tag{8}$$

The fitted model's AUC in the finite population could then be defined as follows:

$$\widehat{AUC}_{pop} = \frac{1}{N_0 N_1} \sum_{j \in U_0} \sum_{k \in U_1} \left[ I(\hat{p}_j < \hat{p}_k) + 0.5\, I(\hat{p}_j = \hat{p}_k) \right]. \tag{9}$$

However, note that the set $\{(y_i, x_i)\}_{i=1}^{N}$ is usually unknown, except for the sampled units $i \in S$, so the finite population ROC curve and AUC need to be estimated based solely on $S$. We believe that, in the context of complex survey data, if the ROC curve and the AUC of the fitted model are estimated based on Equations (5) and (6), which were designed to be applied to simple random samples and do not consider the sampling weights, then biased estimates can be obtained. For this reason, we propose a new estimator for the ROC curve and the AUC which considers the sampling weights. This proposal is described in Section 2.2 below.
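As an illustration of the traditional unweighted estimators described above (the proportions of Equation (4) and the Mann-Whitney form of the AUC in Equation (6)), the following minimal Python sketch may help. The function names `sp_se` and `auc_mann_whitney` and the toy data are hypothetical and are not part of the original study, whose computations were carried out in R.

```python
def sp_se(y, p, c):
    """Unweighted specificity and sensitivity at cutoff c (proportions of
    correctly classified non-events and events)."""
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    sp = sum(pj < c for pj in non_events) / len(non_events)
    se = sum(pk >= c for pk in events) / len(events)
    return sp, se


def auc_mann_whitney(y, p):
    """Unweighted AUC as the Mann-Whitney U-statistic: ties count one half."""
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    total = 0.0
    for pj in non_events:
        for pk in events:
            if pj < pk:
                total += 1.0
            elif pj == pk:
                total += 0.5
    return total / (len(non_events) * len(events))


# Toy data (hypothetical): responses and estimated probabilities of event.
y = [0, 0, 0, 1, 1]
p = [0.1, 0.4, 0.6, 0.4, 0.8]

sp, se = sp_se(y, p, 0.5)             # sp = 2/3, se = 1/2
auc_unweighted = auc_mann_whitney(y, p)  # 4.5 concordant pairs out of 6 -> 0.75
```

Neither function uses the sampling weights, which is the reason these estimators may be biased when the sample comes from a complex design.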

2.2 | Proposal
In Section 2.2.1, we propose an estimator of the ROC curve for logistic regression models fitted with complex survey data, and we estimate the AUC as the area under this curve. In Section 2.2.2, we show the equivalence between the proposed AUC estimator and the Mann-Whitney U-statistic incorporating marginal sampling weights.

2.2.1 | Estimation of the ROC curve and the area under it
We propose to estimate the ROC curve considering the sampling weights, as follows:

$$\widehat{ROC}_w(\cdot) = \left\{ \left(1 - \widehat{Sp}_w(c), \widehat{Se}_w(c)\right),\; c \in (-\infty, \infty) \right\}, \tag{10}$$

for which specificity and sensitivity parameters are estimated by means of the sampling weights (Iparragirre et al., 2022):

$$\widehat{Sp}_w(c) = \frac{\sum_{j \in S_0} w_j\, I(\hat{p}_j < c)}{\sum_{j \in S_0} w_j} \quad \text{and} \quad \widehat{Se}_w(c) = \frac{\sum_{k \in S_1} w_k\, I(\hat{p}_k \geq c)}{\sum_{k \in S_1} w_k}. \tag{11}$$

Therefore, we propose to calculate the area under $\widehat{ROC}_w(\cdot)$ in order to estimate the AUC (Tsuruta & Bax, 2006). Let us denote as $A$ the area under the curve. We now proceed to describe how the area under the ROC curve defined in Equation (10) can be calculated. Note that in practice, we always work with finite sample sizes, and hence, the number of different estimated probabilities is finite. Let us denote as $q$ the total number of different estimated probabilities, that is, $\hat{p}_{(q)} < \dots < \hat{p}_{(1)}$ (where $q \leq n$, with $q = n$ if and only if the estimated probabilities of all the sampled units are different). Note that for every cutoff point chosen between two consecutive ordered probabilities, the same values for the specificity and sensitivity parameters will be obtained, and therefore, the same point will be defined in the ROC curve. Then, the ROC curve will be completely defined with $q + 1$ different cutoff points. Specifically, the smallest possible cutoff point is $c_q < \hat{p}_{(q)}$, which will classify all the sampled units as events, and therefore the estimate of the sensitivity will be 1 and that of the specificity will be 0 (see Equation (11)), that is, the cutoff point $c_q$ will draw the point $(1 - \widehat{Sp}_w(c_q), \widehat{Se}_w(c_q)) = (1, 1)$ in the ROC curve. In the same way, the point drawn in the ROC curve for $c_0 > \hat{p}_{(1)}$ will be the point $(1 - \widehat{Sp}_w(c_0), \widehat{Se}_w(c_0)) = (0, 0)$. Let us denote and sort the remaining $q - 1$ cutoff points as follows:

$$c_q < \hat{p}_{(q)} < c_{q-1} < \hat{p}_{(q-1)} < \dots < \hat{p}_{(2)} < c_1 < \hat{p}_{(1)} < c_0. \tag{12}$$

For ease of notation, $\forall m = 1, \dots, q - 1$, each cutoff point $c_m$ can be defined as the average value of the probabilities $\hat{p}_{(m+1)}$ and $\hat{p}_{(m)}$, that is,

$$c_m = \frac{\hat{p}_{(m+1)} + \hat{p}_{(m)}}{2}. \tag{13}$$

Note that in this way, all the defined cutoff points will be different from the estimated probabilities, and since a cutoff point has been defined between any two different ordered predicted probabilities, only one distinct predicted probability lies in each interval $[c_m, c_{m-1})$, $\forall m = 1, \dots, q$, so that consecutive cutoff points define the consecutive points $(1 - \widehat{Sp}_w(c_{m-1}), \widehat{Se}_w(c_{m-1}))$ and $(1 - \widehat{Sp}_w(c_m), \widehat{Se}_w(c_m))$ of the curve. In this way, the estimated ROC curve will be a polygonal line defined by $q$ segments. Each of these segments will define an area with the abscissa axis. Let us denote as $A_m$, $\forall m \in \{1, \dots, q\}$, each of these areas. A graphical explanation can be seen in Figure 1.
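The construction above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `weighted_roc_points` and `area_under` and the toy data are hypothetical, and the two extreme cutoffs are simply placed one unit beyond the largest and smallest estimated probabilities.

```python
def weighted_roc_points(y, p, w):
    """Vertices (1 - Sp_w(c), Se_w(c)) of the weighted ROC curve, ordered
    from (0, 0) to (1, 1), using the q + 1 cutoff points c_0 > ... > c_q."""
    w0 = sum(wi for yi, wi in zip(y, w) if yi == 0)  # total non-event weight
    w1 = sum(wi for yi, wi in zip(y, w) if yi == 1)  # total event weight
    probs = sorted(set(p), reverse=True)             # p_(1) > ... > p_(q)
    cutoffs = [probs[0] + 1.0]                       # c_0 above p_(1)
    cutoffs += [(a + b) / 2 for a, b in zip(probs[:-1], probs[1:])]  # midpoints
    cutoffs.append(probs[-1] - 1.0)                  # c_q below p_(q)
    points = []
    for c in cutoffs:
        sp = sum(wi for yi, pi, wi in zip(y, p, w) if yi == 0 and pi < c) / w0
        se = sum(wi for yi, pi, wi in zip(y, p, w) if yi == 1 and pi >= c) / w1
        points.append((1.0 - sp, se))
    return points


def area_under(points):
    """Sum of the areas between each segment of the polygonal line and the
    abscissa axis (triangles and right-angled trapezoids)."""
    return sum((x1 - x0) * (s0 + s1) / 2
               for (x0, s0), (x1, s1) in zip(points[:-1], points[1:]))


# Toy data (hypothetical): responses, estimated probabilities, sampling weights.
y = [0, 0, 0, 1, 1]
p = [0.1, 0.4, 0.6, 0.4, 0.8]
w = [10.0, 2.0, 4.0, 5.0, 3.0]

points = weighted_roc_points(y, p, w)
area = area_under(points)  # 103 / 128 = 0.8046875 for these toy values
```

With all weights equal, the same functions return the usual unweighted empirical ROC curve and its area.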
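We now proceed to calculate analytically the area under the ROC curve defined in Equation (10). In particular, as the area $A_1$ is a triangle of base $[1 - \widehat{Sp}_w(c_1)]$ and height $\widehat{Se}_w(c_1)$, it can be calculated as follows:

$$A_1 = \frac{1}{2} \left[1 - \widehat{Sp}_w(c_1)\right] \widehat{Se}_w(c_1). \tag{14}$$

FIGURE 1 Graphical explanation of the empirical weighted receiver operating characteristic (ROC) curve.

For $m = 2, \dots, q$, the areas $A_m$ are right-angled trapezoids, the area of which can be easily calculated as the sum of the triangle $A_m^1$ of base $[\widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m)]$ and height $[\widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1})]$ and the rectangle $A_m^2$ of the same base and height $\widehat{Se}_w(c_{m-1})$:

$$A_m = A_m^1 + A_m^2 = \frac{1}{2} \left[\widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m)\right] \left[\widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1})\right] + \left[\widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m)\right] \widehat{Se}_w(c_{m-1}). \tag{15}$$

Then, the area under the $\widehat{ROC}_w(\cdot)$ curve ($A$) can be calculated as the sum of the areas defined in Equations (14) and (15). Note that $\widehat{Se}_w(c_0) = 0$ and $\widehat{Sp}_w(c_0) = 1$. Then, Equation (14), which defines $A_1$, can be rewritten in terms of those values for convenience, so that it takes the same form as Equation (15) with $m = 1$. Finally, the area under the curve can be easily calculated as follows:

$$A = \sum_{m=1}^{q} \left\{ \frac{1}{2} \left[\widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m)\right] \left[\widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1})\right] + \left[\widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m)\right] \widehat{Se}_w(c_{m-1}) \right\}. \tag{16}$$

2.2.2 | Equivalence between the area under the $\widehat{ROC}_w(\cdot)$ curve and Mann-Whitney U-statistic

We propose to incorporate the marginal sampling weights into the Mann-Whitney U-statistic as follows to estimate the weighted AUC:

$$\widehat{AUC}_w = \frac{\sum_{j \in S_0} \sum_{k \in S_1} w_j w_k \left[ I(\hat{p}_j < \hat{p}_k) + 0.5\, I(\hat{p}_j = \hat{p}_k) \right]}{\sum_{j \in S_0} w_j \sum_{k \in S_1} w_k}. \tag{17}$$

In the following lines, we show that the area under the estimated ROC curve defined in Equation (16) is equivalent to the Mann-Whitney U-statistic considering marginal sampling weights as defined in Equation (17). In order to prove the equivalence between both approaches, our goal is to rewrite Equation (17) in terms of sensitivity and specificity parameters. Let us rewrite it as follows as the first step:

$$\widehat{AUC}_w = \frac{\sum_{j \in S_0} \sum_{k \in S_1} w_j w_k\, I(\hat{p}_j < \hat{p}_k)}{\sum_{j \in S_0} w_j \sum_{k \in S_1} w_k} + \frac{1}{2} \cdot \frac{\sum_{j \in S_0} \sum_{k \in S_1} w_j w_k\, I(\hat{p}_j = \hat{p}_k)}{\sum_{j \in S_0} w_j \sum_{k \in S_1} w_k}. \tag{18}$$

Then, we can rewrite the expressions $I(\hat{p}_j < \hat{p}_k)$ and $I(\hat{p}_j = \hat{p}_k)$ as a function of the previously defined cutoff points. Note that $\forall k \in S_1$, $\exists!\, m \in \{1, \dots, q\} : \hat{p}_k \in [c_m, c_{m-1})$. Then, $\forall j \in S_0$, the inequality $\hat{p}_j < \hat{p}_k$ will be satisfied if and only if $\hat{p}_j < c_m$, as graphically shown in Figure 2. Thus, note that $I(\hat{p}_j < \hat{p}_k)$ can be rewritten as follows:

$$I(\hat{p}_j < \hat{p}_k) = \sum_{m=1}^{q} I(\hat{p}_j < c_m) \left[ I(\hat{p}_k \geq c_m) - I(\hat{p}_k \geq c_{m-1}) \right]. \tag{19}$$

FIGURE 2 This image is intended to help understand Equation (19).

Then, following Equation (19) and the definitions given in Equation (11), let us rewrite the first term of Equation (18) in terms of sensitivity and specificity parameters as follows:

$$\frac{\sum_{j \in S_0} \sum_{k \in S_1} w_j w_k\, I(\hat{p}_j < \hat{p}_k)}{\sum_{j \in S_0} w_j \sum_{k \in S_1} w_k} = \sum_{m=1}^{q} \widehat{Sp}_w(c_m) \left[ \widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1}) \right]. \tag{20}$$

In the same way, we will now proceed to rewrite the expression $I(\hat{p}_j = \hat{p}_k)$. As stated above, $\forall k \in S_1$, $\exists!\, m \in \{1, \dots, q\} : \hat{p}_k \in [c_m, c_{m-1})$. Thus, $\forall j \in S_0$, the equality $\hat{p}_j = \hat{p}_k$ will only be satisfied if $\hat{p}_j$ is in the same range as $\hat{p}_k$, that is, $\hat{p}_j \in [c_m, c_{m-1})$ (see Figure 3). Let us rewrite

$$I(\hat{p}_j = \hat{p}_k) = \sum_{m=1}^{q} \left[ I(\hat{p}_j < c_{m-1}) - I(\hat{p}_j < c_m) \right] \left[ I(\hat{p}_k \geq c_m) - I(\hat{p}_k \geq c_{m-1}) \right]. \tag{21}$$

FIGURE 3 This image is intended to help understand Equation (21).

Following Equation (21) and the definitions given in Equation (11), let us rewrite the second term of Equation (18) in terms of sensitivity and specificity parameters as follows:

$$\frac{1}{2} \cdot \frac{\sum_{j \in S_0} \sum_{k \in S_1} w_j w_k\, I(\hat{p}_j = \hat{p}_k)}{\sum_{j \in S_0} w_j \sum_{k \in S_1} w_k} = \frac{1}{2} \sum_{m=1}^{q} \left[ \widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m) \right] \left[ \widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1}) \right]. \tag{22}$$

Finally, Equation (18) can be rewritten as the sum of Equations (20) and (22):

$$\widehat{AUC}_w = \sum_{m=1}^{q} \left\{ \widehat{Sp}_w(c_m) \left[ \widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1}) \right] + \frac{1}{2} \left[ \widehat{Sp}_w(c_{m-1}) - \widehat{Sp}_w(c_m) \right] \left[ \widehat{Se}_w(c_m) - \widehat{Se}_w(c_{m-1}) \right] \right\}. \tag{23}$$

Note that Equations (16) and (23) are equal: the triangle terms coincide, and the remaining terms differ only by the telescoping sum $\sum_{m=1}^{q} [\widehat{Sp}_w(c_{m-1}) \widehat{Se}_w(c_{m-1}) - \widehat{Sp}_w(c_m) \widehat{Se}_w(c_m)] = \widehat{Sp}_w(c_0) \widehat{Se}_w(c_0) - \widehat{Sp}_w(c_q) \widehat{Se}_w(c_q) = 0$, because $\widehat{Se}_w(c_0) = 0$ and $\widehat{Sp}_w(c_q) = 0$. Hence, we have shown that $A = \widehat{AUC}_w$.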
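The equivalence can also be checked numerically. The sketch below, written in Python under our own assumptions (hypothetical function names, randomly generated toy data with tied probabilities), computes the weighted Mann-Whitney U-statistic with marginal weights and the area under the weighted ROC polygon, and compares the two values.

```python
import random


def auc_weighted_mw(y, p, w):
    """Mann-Whitney U-statistic with marginal sampling weights."""
    neg = [(pi, wi) for yi, pi, wi in zip(y, p, w) if yi == 0]
    pos = [(pi, wi) for yi, pi, wi in zip(y, p, w) if yi == 1]
    num = 0.0
    for pj, wj in neg:
        for pk, wk in pos:
            if pj < pk:
                num += wj * wk
            elif pj == pk:
                num += 0.5 * wj * wk
    return num / (sum(wj for _, wj in neg) * sum(wk for _, wk in pos))


def auc_weighted_area(y, p, w):
    """Area under the weighted ROC polygon, adding one trapezoid per
    distinct estimated probability (from the largest to the smallest)."""
    w0 = sum(wi for yi, wi in zip(y, w) if yi == 0)
    w1 = sum(wi for yi, wi in zip(y, w) if yi == 1)
    cum0 = cum1 = 0.0          # weight of non-events / events at or above the cutoff
    fpr_prev = se_prev = 0.0   # previous vertex (1 - Sp_w, Se_w)
    area = 0.0
    for v in sorted(set(p), reverse=True):
        cum0 += sum(wi for yi, pi, wi in zip(y, p, w) if yi == 0 and pi == v)
        cum1 += sum(wi for yi, pi, wi in zip(y, p, w) if yi == 1 and pi == v)
        fpr, se = cum0 / w0, cum1 / w1
        area += (fpr - fpr_prev) * (se + se_prev) / 2
        fpr_prev, se_prev = fpr, se
    return area


# Toy data (hypothetical): rounding the probabilities forces ties.
random.seed(1)
n = 60
y = [i % 2 for i in range(n)]                      # both classes are present
p = [round(random.random(), 1) for _ in range(n)]
w = [random.uniform(1.0, 50.0) for _ in range(n)]

auc_mw = auc_weighted_mw(y, p, w)
auc_area = auc_weighted_area(y, p, w)
# The two values agree up to floating-point rounding.
```

The double loop in `auc_weighted_mw` is quadratic in the sample size, whereas the area-based computation only needs the cumulative weights at each distinct probability.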

3 | SIMULATION STUDY
The goal of this simulation study is to analyse the performance of the proposed estimators in comparison with the traditional unweighted estimators of the ROC curve and AUC. In Section 3.1, we describe the data generation process and the different scenarios considered throughout the study; in Section 3.2, we describe the simulation study process; and finally, in Section 3.3, we summarise the main results.

3.1 | Data generation and scenarios
In the following lines, the data simulation process is described. Let us define as $N = 10{,}000$ the finite population size. A set of $p = 5$ covariates ($X_1, \dots, X_5$) and two latent variables ($Z_1$ and $Z_2$, which are used to define the response variable and the sampling design, but are not available in the samples when fitting models) have been generated.
A total of three different scenarios have been defined based on different sampling designs. On the one hand, a stratified sampling design without clustering was defined (let us denote this scenario as SH, hereinafter), in which different strata are defined in the finite population, and a number of individuals are sampled from each stratum. On the other hand, we defined a stratified sampling design with clustering (scenario SC), in which different strata are defined in the finite population, a number of clusters or groups of units are selected from each stratum, and finally, a number of individuals are sampled from each selected cluster. In addition, in scenario SC, two situations have been distinguished: first, all the variables have been considered as unit-level variables (we denote this scenario as SC.0, given that there are $d = 0$ cluster-level variables); and, second, one cluster-level variable ($d = 1$) has been considered (scenario SC.1). Note that in scenario SH, all the variables must be defined at unit level ($d = 0$) since there are no clusters. We proceed below to explain the data generation process for each of these scenarios:
1. For $d = 0$ (SH and SC.0) and $d = 1$ (SC.1), $N$ realisations have been made following the Gaussian distribution defined in Equation (24): […] where $\mu_{(p-d)}$ indicates the null vector of dimension $1 \times (p - d)$ and $\Sigma_{(p-d) \times (p-d)}$ a matrix of dimension $(p - d) \times (p - d)$ defined by values of 1 on the diagonal and $\eta = 0.15$ off-diagonal, i.e., $\mu_{(p-d)} = (0, \dots, 0)^T$ and $\Sigma_{(p-d) \times (p-d)} = (1 - \eta) I_{(p-d) \times (p-d)} + \eta J_{(p-d) \times (p-d)}$, where $I_{(p-d) \times (p-d)}$ is the identity matrix and $J_{(p-d) \times (p-d)}$ the matrix of 1s.
2. Let us denote as $\{z_i = (z_{i,1}, z_{i,2})\}_{i=1}^{N}$ the set of $N$ realisations of $Z_1$ and $Z_2$. Data are sorted based on $z_i \beta_Z$, $\forall i = 1, \dots, N$, where […]. Strata are defined by partitioning the ordered population data set into sets of the same size ($H = 10$ strata) in all the scenarios, each stratum being of size $N_h = 1000$, $\forall h = 1, \dots, H$. In addition, in scenarios SC.0 and SC.1, each stratum has been partitioned into $A_h = 10$ clusters, $\forall h = 1, \dots, H$. In this way, a total of $A^* = 100$ clusters of size $N_{h,\alpha} = 100$ are generated, $\forall h = 1, \dots, H$ and $\forall \alpha = 1, \dots, A_h$.
3. If $d = 1$, then $X_1$ is a cluster-level variable (SC.1). We generate it by making $A^* = \sum_{h=1}^{H} A_h$ realisations of $X_1 \sim N(0, 1)$. Note that for two different units in the same cluster, their corresponding cluster-level covariates should take the same value, that is, $\forall i, j$ in the same cluster, $x_{i,1} = x_{j,1}$. Therefore, we repeat each realisation $N_{h,\alpha}$ times.
4. We now have defined the values corresponding to the $X_1, \dots, X_5$ variables for all the units in the finite population: $\{x_i = (x_{i,1}, \dots, x_{i,5})\}_{i=1}^{N}$. Let us define $\beta_X$ as follows: […] Then, we generate the probabilities of event as follows: […] and the value for the response variable $y_i$ is randomly generated by following $\text{Bernoulli}(p(x_i, z_i))$. We set $\beta_0 = -5$, defining in this way a prevalence (i.e., probability of event) of around 25%.
The finite population $U$ is defined as the set of values corresponding to the response variable $y_i$ and the covariates $x_i$ (excluding the latent variables $z_i$), $\forall i = 1, \dots, N$, as well as the strata and cluster indicators corresponding to each of them.
5. Different sampling schemes have been considered in this simulation study. On the one hand, in the scenario in which a stratified sampling design without clustering is defined (SH), the following numbers of units have been sampled from each stratum ($n_h$, $\forall h = 1, \dots, H$): […] On the other hand, in the scenarios in which a stratified sampling design with clustering is considered (SC.0 and SC.1), $a_h = 2$, $\forall h = 1, \dots, H$, clusters have been sampled in the first place. Then, from each sampled cluster of stratum $h$, the following numbers of units have been sampled ($n_{h,\alpha}$, $\forall \alpha = 1, 2$): […] It should be noted that, due to the way in which the sampling design has been defined, the probabilities of event given the covariates are roughly ordered from highest to lowest in the different strata. Therefore, by sampling many units from the strata at the edges, we are sampling more individuals with higher and lower probabilities (scheme (a)). In contrast, when sampling more individuals from the central strata, more individuals with medium probabilities of event are sampled (scheme (b)). The two sampling schemes differ on this point.
6. Depending on the sampling design defined in each scenario, sampling weights are calculated as follows. In the scenario without clustering (SH), the sampling weight for each unit $i$ from stratum $h$ is defined as the total number of units in the population in stratum $h$ ($N_h$) divided by the number of units sampled from it ($n_h$), that is, $w_i = N_h / n_h$. In the same way, in the scenarios which consider clustering (SC.0 and SC.1), the sampling weight for each sampled unit $i \in S$ from cluster $\alpha$ of stratum $h$ is defined as follows: $w_i = (A_h / a_h)(N_{h,\alpha} / n_{h,\alpha})$.
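To make the stratified case (SH) concrete, a minimal Python sketch of drawing a stratified sample and attaching the weights $N_h / n_h$ is given here. The allocation `n_h` is a hypothetical choice for illustration only and is not the allocation used in the simulation study; units are assumed to be already sorted by the stratification score, so that strata are consecutive blocks of unit indices.

```python
import random

random.seed(2023)

N, H = 10_000, 10
N_h = N // H  # 1000 units per stratum

# Hypothetical allocation favouring the strata at the edges (illustration only).
n_h = [100, 50, 25, 25, 25, 25, 25, 25, 50, 100]

sample, weights = [], []
for h in range(H):
    stratum = range(h * N_h, (h + 1) * N_h)   # unit indices of stratum h
    chosen = random.sample(stratum, n_h[h])   # simple random sampling within the stratum
    sample.extend(chosen)
    weights.extend([N_h / n_h[h]] * n_h[h])   # each unit represents N_h / n_h units

total_weight = sum(weights)  # adds up to the population size N
```

Because each sampled unit stands for $N_h / n_h$ population units, the weights within a stratum add up to $N_h$ and the weights of the whole sample add up to $N$.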

3.2 | Setup
Considering the scenarios described in Section 3.1, a finite population was simulated in each scenario. The theoretical model was fitted to the finite population, and the ROC curve and AUC of this model were calculated following Equations (8) and (9) (let us denote as $ROC_{theo}$ and $AUC_{theo}$ the theoretical ROC curve and AUC, corresponding to the finite population model). Note that these parameters measure the performance of the theoretical finite population model. Each population was sampled $R = 500$ times, following in each case the corresponding complex sampling design. In each of the samples, a weighted logistic regression model was fitted and its ROC curve and AUC were estimated, ignoring the sampling weights (unweighted method) and considering them (weighted method). Note that in practice, we aim to analyse how well those estimators estimate the fitted model's ROC curve and AUC in the finite population. Therefore, in order to analyse and compare the performance of both estimators, we compare each of the estimates to the true finite population ROC curve and AUC of the model fitted to the sample (which are calculated by extending the fitted sample model to the finite population), rather than to the theoretical population model parameters.
We will denote these true finite population parameters as $\widehat{ROC}_{pop}$ and $\widehat{AUC}_{pop}$. Note that these parameters indicate the true performance of the fitted sample model in the finite population. This process is described in detail below and summarised in Figure 4. For $r = 1, \dots, R$:
Step 1. Obtain a sample $S^r \subset U$ by means of one of the sampling schemes described in Section 3.1 and calculate the sampling weights $w_i^r$, $\forall i \in S^r$, following the corresponding Equation (29) or (30).
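Step 2. Fit the model to $S^r$ maximising the pseudo-likelihood function in Equation (7), to obtain $\hat{\beta}^r$ and the estimated probabilities of event $\hat{p}_i^r$, $\forall i \in S^r$.
Step 3. Estimate the ROC curve (Equations (5) and (10)) and the AUC (Equations (6) and (17)) of the fitted model based on $S^r$, to obtain unweighted and weighted estimates, respectively. In addition, we estimate the AUC by means of pairwise sampling weights following the proposal of Yao et al. (2015), which will be denoted as $\widehat{AUC}^r_{pairw}$.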
Step 4. By means of the $\hat{\beta}^r$ estimated in Step 2, estimate the probabilities of event for all the units in the finite population, $\hat{p}_i^r$, $\forall i = 1, \dots, N$. Estimate the true ROC curve and AUC in the population following Equations (8) and (9).

FIGURE 4 Graphical explanation of the simulation study setup.

Step 5. Calculate the difference between the unweighted or weighted estimates and the true population AUC (obtained based on the model fitted to the sample):

$$diff^r_{unw} = \widehat{AUC}^r_{unw} - \widehat{AUC}^r_{pop} \quad \text{and} \quad diff^r_{w} = \widehat{AUC}^r_{w} - \widehat{AUC}^r_{pop}. \tag{31}$$

In addition, in order to compare our proposal, which considers marginal sampling weights, to the proposal considering pairwise sampling weights, we define the differences between the pairwise estimates and the true population AUC, and between the pairwise estimates and our proposal, as follows: […]

All computations were performed in (64 bit) R 4.2.2 (R Core Team, 2022) on a MacBook Pro equipped with 16 GB of RAM, an Apple M1 chip, and the macOS Monterey 12.2.1 operating system. Logistic regression models were fitted by means of the survey R package (Lumley, 2010, 2020).
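The logic of Steps 1, 3, and 5 can be illustrated with a simplified Python sketch. Several assumptions are ours and do not come from the study: a fixed score stands in for the predictions of the fitted model (so Step 2 is skipped), the population is smaller ($N = 2000$), only $R = 50$ stratified samples without clustering are drawn, and the allocation `n_h` is hypothetical. The direction of the unweighted bias in this toy setting therefore need not match the one reported for the study's scenarios.

```python
import math
import random


def auc(y, p, w=None):
    """Mann-Whitney AUC; marginal weights w give the weighted version."""
    if w is None:
        w = [1.0] * len(y)
    neg = [(pi, wi) for yi, pi, wi in zip(y, p, w) if yi == 0]
    pos = [(pi, wi) for yi, pi, wi in zip(y, p, w) if yi == 1]
    num = 0.0
    for pj, wj in neg:
        for pk, wk in pos:
            if pj < pk:
                num += wj * wk
            elif pj == pk:
                num += 0.5 * wj * wk
    return num / (sum(wj for _, wj in neg) * sum(wk for _, wk in pos))


random.seed(7)
N, H, R = 2000, 10, 50
N_h = N // H
n_h = [40, 5, 5, 5, 5, 5, 5, 5, 5, 40]  # hypothetical allocation (edges oversampled)

# Finite population sorted by a latent score; strata are consecutive blocks.
z = sorted(random.gauss(0.0, 1.0) for _ in range(N))
p_pop = [1.0 / (1.0 + math.exp(-2.0 * zi)) for zi in z]  # stand-in for fitted probabilities
y_pop = [1 if random.random() < pi else 0 for pi in p_pop]
auc_pop = auc(y_pop, p_pop)  # "true" AUC of the score in the finite population

diff_unw, diff_w = [], []
for r in range(R):
    idx, w = [], []
    for h in range(H):  # Step 1: stratified sample and weights N_h / n_h
        idx.extend(random.sample(range(h * N_h, (h + 1) * N_h), n_h[h]))
        w.extend([N_h / n_h[h]] * n_h[h])
    y_s = [y_pop[i] for i in idx]
    p_s = [p_pop[i] for i in idx]
    # Steps 3 and 5: unweighted and weighted estimates minus the population AUC
    diff_unw.append(auc(y_s, p_s) - auc_pop)
    diff_w.append(auc(y_s, p_s, w) - auc_pop)

mean_unw = sum(diff_unw) / R
mean_w = sum(diff_w) / R
```

In this toy design, `mean_w` is expected to be much closer to zero than `mean_unw`, which mirrors the kind of comparison summarised by the differences defined in Step 5.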

3.3 | Results
In this section, we summarise the main results we obtained from the simulation study. In Figure 5, the theoretical ROC curve of the finite population model, as well as the true population ROC curves and the weighted and unweighted estimates obtained based on the models fitted across $R = 500$ samples, are shown. Figure 6 depicts the boxplots of the differences between the unweighted ($diff^r_{unw}$) and weighted ($diff^r_{w}$) estimates and the true population AUC of the models fitted to the samples (see Equation (31)), while Figure 7 depicts the boxplots of the differences between the AUC estimates obtained by means of the pairwise and marginal sampling weights ($wdiff^r$). Table 1 summarises the numerical results. Due to the large number of results we obtained, we begin by summarising the main conclusions, and then we proceed to analyse the differences between the different scenarios.

FIGURE 5 Unweighted (unw; see Equation (5)) and weighted (w, Equation (10)) estimates of the receiver operating characteristic (ROC) curves, as well as the true population ROC curves (pop) of the models fitted across $r = 1, \dots, 500$ samples, together with the theoretical ROC curve (theo) of the model fitted to the finite population in each scenario drawn in the simulation study.
As shown in Figure 5, the theoretical ROC curve of the population model is above most of the population ROC curves of the models fitted to the samples. Similarly, as can be seen in Table 1, the average true population AUCs are lower than the theoretical AUCs. This indicates that population models have a greater discrimination ability than the models fitted to the samples. Therefore, in order to make fairer comparisons and compare the AUCs of the same models, we compare the ROC curve and AUC estimates obtained with the different methods to the true population parameters rather than the theoretical ones. In general terms, the results of the simulation study show that, under the scenarios that have been considered, the weighted estimates of the AUC are closer than the unweighted ones to the true population AUC. The weighted estimates are slightly optimistic on average, given that AUCs slightly greater than the true ones have been estimated. In contrast, unweighted estimates sometimes overestimate the true finite population AUC and other times underestimate it, depending on the scenario (in any case, showing a greater absolute bias than the weighted estimates). In terms of variability, no major differences have been observed between the two estimators, and, depending on the scenario, one estimator or the other shows more variability. The marginal and pairwise weighted estimators perform quite similarly in all the scenarios, in terms of both bias and variability. However, it should be noted that, as shown in Figure 7, the estimates based on pairwise sampling weights are slightly greater than the ones obtained based on marginal sampling weights. Thus, the estimates based on pairwise weights overestimate the true population AUC slightly more than the estimates based on marginal weights, even though those differences are minimal in terms of bias. In contrast, computation times are considerably improved with the estimator proposed in this work (up to five times more efficient, as can be seen in Table 1), given that the pairwise sampling weights need to be calculated for each particular sampled pair, in contrast to the marginal ones, which are easily available in most cases when working with this kind of data.
We now comment on the results in more detail. Sampling schemes (a) and (b) differ in the number of units sampled from each stratum and, more specifically, in whether more units are sampled with (a) higher and lower predicted probabilities or (b) central predicted probabilities. These differences affect how far the unweighted estimates are from the true population AUC: the unweighted estimator underestimates it in scenarios with sampling scheme (a) and overestimates it in scenarios with sampling scheme (b). In contrast, for the weighted estimates, no great differences with respect to the true population AUC have been observed between sampling schemes (a) and (b). For example, as can be observed in Table 1, in Scenario SC.0 (a) the average difference between the unweighted estimates and the true population AUC is -0.081, while in Scenario SC.0 (b) the average difference is 0.073. For the weighted estimates, under the same scenarios, the average differences are 0.005 and 0.008, respectively. These differences can also be observed in Figure 5, where the unweighted ROC curves are below the true population ROC curves in scenarios (a), while in scenarios (b) the unweighted ROC curves are above the true ones, as well as above the weighted ones, indicating that the unweighted estimates overestimate more than the weighted ones in these scenarios.
FIGURE 6 Boxplots of the difference (see Equation (31)) between the areas under the curve (AUCs) estimated by means of the unweighted (unw, Equation (6)) and weighted (w, Equation (17)) estimators and the true population AUC of the models fitted across r = 1, …, 500 samples in all the scenarios drawn in the simulation study.
However, in terms of variability, the performance of the unweighted and weighted estimates differs under sampling schemes (a) and (b). In scenarios considering sampling scheme (a), the variability of the unweighted estimates is greater than that of the weighted ones, while in scenarios considering sampling scheme (b), the difference is reversed. As shown in Table 1, in Scenario SH (a), the standard deviation of the unweighted estimates is 0.018, slightly greater than that of the weighted estimates, which is 0.014. In contrast, in Scenario SH (b), the standard deviations of the unweighted and weighted estimates are 0.012 and 0.020, respectively. In addition, the variability of the unweighted estimates is greater in (a) than in (b) (for the weighted estimates this difference is less pronounced). For example, in SC.0 (a) the standard deviation of the unweighted estimates (0.035) is 2.5 times the standard deviation in SC.0 (b) (0.014).
Results also show that the performance of the two estimators differs depending on the sampling design. In particular, the weighted estimates are more optimistic in the scenarios with cluster-level variables (SC.1) than in scenarios SC.0 and SH. For example, in Scenario SC.1 (a), the average difference between the weighted estimates and the true population AUC is 0.023, while in Scenario SH (a), the average difference is 0.005. This effect can also be observed in Figure 6. The ROC curves depicted in Figure 5 also show that in Scenarios SC.1 (a) and SC.1 (b), most of the weighted ROC curves are above the true population curves, while in the rest of the scenarios, the true population ROC curves are roughly in the center of the band of weighted ROC curves. This effect has not been observed for the unweighted estimates. In contrast, the sampling design affects the variability of both the unweighted and the weighted estimates. Specifically, the standard deviation of the estimates in scenarios SH is lower than that in scenarios SC.0 which, in turn, is lower than the standard deviation in scenarios SC.1 (see Table 1 for more details). It should also be noted that the standard deviation of the true population AUCs across R = 500 samples is greater in scenarios SC.1 than in the rest of the scenarios (Table 1). This can also be observed in Figure 5, where the true population ROC curves show the greatest variability in scenarios SC.1.
FIGURE 7 Boxplots of the differences between the areas under the curve (AUCs) estimated by means of the AUC estimator based on pairwise sampling weights and the one that considers marginal sampling weights (wdiff; see Equation (32)), when estimating the AUC of the models fitted across r = 1, …, 500 samples in all the scenarios drawn in the simulation study.
The methodology proposed in Section 2 has been applied to the Survey on the Information Society in Companies (ESIE survey), which was described in detail in Iparragirre et al. (2022). This survey was carried out among the companies in the Basque Country (BC) in order to collect information about the use of technology. In particular, the response variable considered here is the same as the one used in the above-mentioned study, that is, a dichotomous response variable indicating whether a company has its own web page (1) or not (0). A sample of n = 7725 companies was considered for this application, and the AUC of the model fitted in Iparragirre et al. (2023) was estimated. The covariates included in the model represent the activity of the company, the number of employees, and the ownership.
The unweighted and weighted AUC estimates and the corresponding Bootstrap 95% confidence intervals (CIs) are shown in Table 2. The 95% CI of the unweighted estimate is calculated by means of the pROC R package (Robin et al., 2011), while the 95% CI of the weighted estimate is calculated by generating Bootstrap resamples based on replicate weights (Rao & Wu, 1988) using the survey R package (Lumley, 2020). Both intervals are based on B = 2000 Bootstrap resamples. The unweighted and weighted ROC curve estimates are depicted in Figure 8. Note that in this case, as we are working with real survey data, the true population ROC curve and AUC are unknown.
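To make the replicate-weight idea more concrete, the following Python sketch builds rescaled bootstrap replicate weights in the spirit of Rao and Wu (1988) and derives a percentile interval for a weighted AUC. It is only an illustration under several assumptions that are not taken from this work: each sampled unit is treated as its own primary sampling unit within its stratum, every stratum contains at least two sampled units, the predicted probabilities are kept fixed across replicates, and the interval is a simple percentile interval. The function names (`weighted_auc`, `rescaled_bootstrap_weights`, `bootstrap_ci`) and the toy data are hypothetical, and the code is not the procedure implemented in the survey R package.

```python
import numpy as np

def weighted_auc(y, p, w):
    """Weighted Mann-Whitney statistic; event/nonevent ties count 1/2."""
    y, p, w = (np.asarray(a, dtype=float) for a in (y, p, w))
    ev, ne = y == 1, y == 0
    pair_w = w[ev][:, None] * w[ne][None, :]      # product of marginal weights
    if pair_w.sum() == 0:                          # no usable event/nonevent pair
        return float("nan")
    diff = p[ev][:, None] - p[ne][None, :]
    return float((pair_w * ((diff > 0) + 0.5 * (diff == 0))).sum() / pair_w.sum())

def rescaled_bootstrap_weights(w, strata, n_rep, rng):
    """Replicate weights: draw n_h - 1 units with replacement in each stratum
    and rescale the weights by n_h / (n_h - 1) times the selection count."""
    w = np.asarray(w, dtype=float)
    strata = np.asarray(strata)
    rep = np.zeros((n_rep, w.size))
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        n_h = idx.size                             # assumed to be >= 2
        for b in range(n_rep):
            draws = rng.integers(0, n_h, size=n_h - 1)
            counts = np.bincount(draws, minlength=n_h)
            rep[b, idx] = w[idx] * counts * n_h / (n_h - 1)
    return rep

def bootstrap_ci(y, p, w, strata, n_rep=2000, level=0.95, seed=0):
    """Percentile interval of the weighted AUC over the replicate weights."""
    rng = np.random.default_rng(seed)
    rep = rescaled_bootstrap_weights(w, strata, n_rep, rng)
    aucs = np.array([weighted_auc(y, p, r) for r in rep])
    alpha = 1.0 - level
    lo, hi = np.nanquantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical toy data: outcome, fitted probability, weight, stratum label.
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
p = np.array([0.10, 0.30, 0.40, 0.60, 0.35, 0.80, 0.70, 0.90])
w = np.array([5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0])
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ci_low, ci_high = bootstrap_ci(y, p, w, strata, n_rep=200)
```

Replicates in which no event or no nonevent receives positive weight yield a missing value and are ignored by `np.nanquantile`; with real survey data, the design information (strata and clusters) would determine how the resampling has to be carried out.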
Even though the differences between the unweighted and weighted estimates are not as large as the ones observed in the simulation study, the unweighted estimate is larger than the weighted estimate, as happens in scenarios (b) of the simulation study. Considering the results of the simulation study, it is reasonable to assume that the weighted estimate is slightly above the true population AUC and, therefore, that the unweighted estimate of the AUC probably overestimates it. In addition, note that the overlap between the two CIs is very slight.
TABLE 1 Numerical results of the minimum value (min), maximum value (max), average (mean) and standard deviation (sd) of the population AUC (pop, Equation (9)), unweighted (unw, Equation (6)), weighted (w, Equation (17)) and pairwise (pairw) (Yao et al., 2015) estimates of the AUC of the models fitted across r = 1, …, 500 samples. Note: The average difference (av.diff) of the unweighted, weighted and pairwise estimates to the true population AUC estimates (see Equations (31) and (32)) with their standard deviations (sd), and the average computational times (av.time, in seconds) of each method with their standard deviations (sd) are also shown. In addition, the theoretical AUC (AUC theo) of the finite population model in each scenario is available. Abbreviation: AUC, area under the ROC curve.
In this work, we propose two new weighted estimators to estimate the ROC curve and AUC of logistic regression models fitted to complex survey data. In addition, we show that the area under the proposed weighted estimator of the ROC curve is equivalent to the weighted Mann-Whitney U-statistic incorporating marginal sampling weights, which are defined as the inverse probability weights for each sampled unit. A simulation study has been conducted in order to analyse the performance of the proposed estimators, and they have also been applied to real survey data.
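The equivalence mentioned above can be checked numerically on a small example. The Python sketch below is an illustration, not the implementation used in this work: it assumes that the weighted ROC curve is formed by weighted sensitivity and specificity (each unit contributing its marginal sampling weight) and that the weighted Mann-Whitney statistic weights each event/nonevent pair by the product of the two marginal weights, with ties counted as one half. The function names and the toy data are hypothetical.

```python
import numpy as np

def weighted_roc(y, p, w):
    """Weighted ROC points (1 - specificity, sensitivity) over all cutoffs."""
    y, p, w = (np.asarray(a, dtype=float) for a in (y, p, w))
    w1, w0 = w * (y == 1), w * (y == 0)   # weights carried by events / nonevents
    cutoffs = np.unique(p)[::-1]          # distinct probabilities, descending
    # A unit is classified as an event when its probability is >= the cutoff.
    tpr = np.array([w1[p >= c].sum() for c in cutoffs]) / w1.sum()
    fpr = np.array([w0[p >= c].sum() for c in cutoffs]) / w0.sum()
    # Prepend the (0, 0) corner, reached with a cutoff above every probability.
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def weighted_auc(y, p, w):
    """Weighted Mann-Whitney statistic with marginal weights; ties count 1/2."""
    y, p, w = (np.asarray(a, dtype=float) for a in (y, p, w))
    ev, ne = y == 1, y == 0
    diff = p[ev][:, None] - p[ne][None, :]        # event minus nonevent probability
    pair_w = w[ev][:, None] * w[ne][None, :]      # product of marginal weights
    return float((pair_w * ((diff > 0) + 0.5 * (diff == 0))).sum() / pair_w.sum())

# Hypothetical toy sample: outcomes, predicted probabilities, sampling weights.
y = np.array([0, 0, 1, 0, 1, 1])
p = np.array([0.1, 0.4, 0.4, 0.6, 0.7, 0.9])
w = np.array([10.0, 2.0, 5.0, 1.0, 4.0, 8.0])

fpr, tpr = weighted_roc(y, p, w)
# Area under the weighted ROC curve by the trapezoidal rule.
area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
auc_w = weighted_auc(y, p, w)                     # equals 211/221 for these data
auc_unw = weighted_auc(y, p, np.ones_like(w))     # unweighted version, 7.5/9
```

For these toy values, the trapezoidal area under the weighted curve and the weighted Mann-Whitney statistic coincide (211/221, roughly 0.955), whereas setting all weights to one gives 7.5/9, roughly 0.833.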
The results of the simulation study suggest the use of the proposed weighted estimators rather than the unweighted ones. The unweighted estimators overestimate or underestimate the true population parameters, depending on the proportion of units sampled from each stratum. In particular, as proportionally more units with extreme (higher and lower) predicted probabilities are sampled, more nonevents with higher predicted probabilities as well as events with lower predicted probabilities are also sampled. This results in a lower estimate of the AUC. In contrast, as more units with central predicted probabilities are sampled, fewer units with extreme (higher and lower) predicted probabilities than necessary to properly represent the finite population are sampled, which, for the same reason, leads to a greater estimate of the AUC. Weighted estimates correct for this bias, providing ROC curve and AUC estimates that are closer to the true finite population parameters, since the sampling weights give each pair of individuals with and without the event of interest the relevance that it should have in representing the finite population.
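The role of the sampling weights can be illustrated with a deliberately simple Python example. It assumes integer sampling weights and a hypothetical pseudo-population in which each sampled unit is repeated as many times as the number of population units it represents; this is a toy construction, not the simulation design of this work, and the names and values are illustrative only.

```python
import numpy as np

def weighted_auc(y, p, w):
    """Weighted Mann-Whitney statistic with marginal weights; ties count 1/2."""
    y, p, w = (np.asarray(a, dtype=float) for a in (y, p, w))
    ev, ne = y == 1, y == 0
    diff = p[ev][:, None] - p[ne][None, :]
    pair_w = w[ev][:, None] * w[ne][None, :]
    return float((pair_w * ((diff > 0) + 0.5 * (diff == 0))).sum() / pair_w.sum())

# Hypothetical sample in which units with extreme predicted probabilities are
# over-represented (small weights), including an event with a low probability
# and a nonevent with a high probability.
y = np.array([0, 1, 0, 1, 0, 1])
p = np.array([0.05, 0.10, 0.45, 0.55, 0.90, 0.95])
w = np.array([2, 1, 6, 6, 1, 2])          # units represented by each sampled unit

# Pseudo-population: repeat every sampled unit w_i times.
y_pop, p_pop = np.repeat(y, w), np.repeat(p, w)

auc_pop = weighted_auc(y_pop, p_pop, np.ones(y_pop.size))  # pseudo-population AUC
auc_w = weighted_auc(y, p, w)                              # weighted sample estimate
auc_unw = weighted_auc(y, p, np.ones(y.size))              # unweighted sample estimate
```

In this toy case the weighted estimate reproduces the pseudo-population AUC (68/81, roughly 0.840), while the unweighted estimate is lower (6/9, roughly 0.667), because the discordant extreme pairs carry more relevance in the unweighted sample than they have in the pseudo-population.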
FIGURE 8 Weighted and unweighted receiver operating characteristic (ROC) curves of the models fitted to the ESIE survey data.
) by means of the covariate values $x_i$ and the sampling weights $w_i^r$, $\forall i \in S_r$ (note that the latent variable values $z_i$ are only considered to define the sampling design and are not considered in the model estimation process). Obtain $\hat{\beta}^r$ and the estimated probabilities of event $\hat{p}_i^r$, $\forall i \in S_r$.
Step 3. Estimate the ROC curve (
TABLE 2 Estimated unweighted and weighted AUCs and the corresponding Bootstrap 95% CI of the model fitted to ESIE survey data.