Variable selection with LASSO regression for complex survey data

Variable selection is an important step in building good prediction models. LASSO regression models are among the most commonly used methods for this purpose, and cross-validation is the most widely applied validation technique to choose the tuning parameter (λ). In a complex survey framework, validation techniques are closely related to "replicate weights". However, to our knowledge, replicate weights have never been used in a LASSO regression context, and applying LASSO regression models to complex survey data can be challenging. The goal of this paper is twofold. On the one hand, we analyze the performance of replicate weights methods to select the tuning parameter when fitting LASSO regression models to complex survey data. On the other hand, we propose new replicate weights methods for the same purpose. In particular, we propose a new design-based cross-validation method that combines traditional cross-validation with replicate weights. The performance of all these methods has been analyzed and compared with that of the traditional cross-validation technique by means of an extensive simulation study. The results suggest a considerable improvement when the proposed design-based cross-validation is used instead of the traditional cross-validation.


1 | INTRODUCTION
Complex survey data are becoming increasingly relevant in a number of fields, including the social and health sciences, among others. In this framework, the finite population of interest is usually sampled following a complex sampling design, which may include techniques such as stratification, clustering, or a combination of them in different stages of the sampling scheme. In this context, a sampling weight is assigned to each sampled unit, indicating the number of units that this observation represents in the finite population. Suppose, for example, that we consider a stratified sampling scheme, in which several strata are defined in the finite population and a number of units or a number of clusters of units (denoted as primary sampling units [PSUs] hereinafter) is sampled from each stratum. In this case, the sampling weight of each sampled unit is usually defined as the inverse of its inclusion probability, that is, the inverse of the probability of that unit being sampled. Due to these particularities, complex survey data do not satisfy the independent and identically distributed conditions, and hence the validity of traditional statistical techniques should be checked before applying them to data collected from complex surveys (see, e.g., Skinner et al., 1989, for more information on this topic). Among other issues, the need for and use of sampling weights and the design effect in the development of prediction models have been widely discussed in the literature (see, e.g., Iparragirre et al., 2023; Lumley & Scott, 2015, 2017; Pfeffermann & Sverchkov, 2009; Smith, 1988, for a summary of this debate, which is still alive).
In this paper, we focus on the development of prediction models and, more specifically, on variable selection for complex survey data in the linear and logistic regression frameworks. Least Absolute Shrinkage and Selection Operator (LASSO) regression (Tibshirani, 1996) is nowadays a widely used technique for variable selection, especially when a large number of predictor variables is available, in order to obtain more parsimonious, and hence more interpretable, prediction models. Very briefly, one goal of LASSO regression models is to set some model coefficients to zero, reducing in this way the dimension of the model by selecting a subset of the available predictor variables. The selection of this subset depends in turn on the choice of a tuning parameter (λ), for which techniques such as the bootstrap or cross-validation can be applied, the latter being the most widely used in practice. These techniques, commonly known as validation methods, are used to select the tuning parameter that minimizes the error of the final model, evaluated in a sample different from the one used to develop the model (see, e.g., Hastie et al., 2009; James et al., 2013). Briefly, these techniques consist of defining different training sets (in which models are fitted considering several tuning parameters) and test sets (in which the error of those models is estimated). The tuning parameter that minimizes the error of the training models in the test sets is selected for fitting the LASSO model to the whole sample.
However, fitting LASSO regression models to complex survey data can be problematic for two reasons. In the first place, the previously mentioned debate about the need for sampling weights when fitting prediction models can be extended to LASSO regression models. In addition, with the traditional validation methods mentioned above, training and test sets are defined at random, without considering the sampling design in the process. This may be a problem when working with complex survey data, given that PSUs could be split between training and test sets, which may lead the training sets to underestimate the variability produced by the sampling process and to underestimate the population error. This problem is usually known as "data leakage" (Kaufman et al., 2012). Both of these problems (weights-related as well as design-related) have recently been discussed in the literature. McConville et al. (2017) proposed incorporating sampling weights into the LASSO linear regression estimation process, and Kshirsagar et al. (2017) extended this proposal to logistic regression models. Nonetheless, both of them applied the traditional K-fold cross-validation, which consists of randomly splitting the sampled units into K subsamples (or folds) and defining K training sets, excluding a different fold (test set) each time. If we apply this method to complex survey data, we may come across two types of problems. In the first place, the sampling weights of the units in neither the training sets nor the test sets properly represent the entire finite population. Besides, and more importantly, as mentioned above, the sampling design is not reflected in the way the folds are defined. Wieczorek et al. (2022) warned about this problem and proposed mimicking the structure of the sample obtained from the finite population in each fold. For example, for stratified sampling designs, Wieczorek et al. (2022) proposed making each fold a stratified sample of PSUs from each stratum, that is, creating simple random sample folds separately within each stratum (with all the elements of a given PSU placed in the same fold) and then combining them across strata. In this way, the weights of the units in the training and test sets represent the finite population properly, and the variability of the data is also represented. However, as pointed out by the authors, the number of folds is then limited by the number of sampled PSUs in each stratum (no more folds can be defined than the smallest number of sampled PSUs in any stratum). In other words, we need at least K PSUs per stratum for the proper application of this method. Furthermore, if we have a different (and not proportional to K) number of PSUs in each stratum, the sampling weights of the training and test sets would also incorrectly represent the finite population.
In complex survey frameworks, other approaches, different from the above-mentioned validation techniques, are usually used to define partially independent subsets of the sample. These approaches are known as "replicate weights" methods. They consist of modifying the sampling weights so as to define new subsamples that replicate the original sample, in such a way that these subsamples, by means of the new weights (i.e., the "replicate weights"), correctly represent the finite population. The most well-known replicate weights methods implemented in the survey R package (Lumley, 2011, 2020) are Jackknife Repeated Replication (JKn), Balanced Repeated Replication (BRR), and the Rescaling Bootstrap (Bootstrap). Note that the term Jackknife is usually used in the variance estimation framework, whereas this scheme is commonly referred to as leave-one-cluster-out (LOCO) (Merkle et al., 2019) or leave-one-group-out (LOGO) cross-validation (Kuhn & Johnson, 2019) when the goal is validation. However, in order to be consistent with the terminology used in the survey R package (Lumley, 2020), we denote this method as Jackknife throughout the paper.
Therefore, the aim of the present work is twofold. On the one hand, we aim to analyze the performance of the above-mentioned replicate weights methods, instead of traditional validation techniques, to select the tuning parameter for fitting LASSO regression models. On the other hand, our goal is to propose new methods to this end based on the idea of replicating weights. In particular, due to the popularity of cross-validation in this context, we propose a new design-based cross-validation method based on replicate weights, which is more flexible than the one proposed by Wieczorek et al. (2022). In addition to cross-validation, we also propose two new techniques (which we denote as split-sample repeated replication [split] and extrapolation [extrap]) to select the tuning parameter for LASSO models. In this study, we aim to analyze (a) the impact of considering complex designs when using validation techniques for the selection of the tuning parameter and (b) the impact of the sampling weights when fitting LASSO models. Therefore, we compare by means of a simulation study the performance of the different proposals based on replicate weights with (a) the traditional K-fold cross-validation that defines the folds ignoring the sampling design but considers sampling weights for fitting LASSO models (weighted simple random sample cross-validation, w-SRSCV), and (b) the unweighted simple random sample K-fold cross-validation that also ignores the weights when fitting LASSO models (unw-SRSCV).
The rest of the paper is organized as follows. In Section 2, the basic notation for linear and logistic regression models and LASSO regression is given, the existing replication methods applied in this work for the selection of the tuning parameter are defined, and new methods based on the idea of replicating weights are proposed. The performance of all the methods is analyzed by means of a simulation study, which is described in Section 3. Finally, we close the paper with the main conclusions in Section 4.

2 | METHODS
This section is divided into three parts. In Section 2.1, the basic notation is set, in Section 2.2 LASSO regression is described, and finally, Section 2.3 describes the replicate weights methods considered in this study together with the new methods proposed by the authors.

2.1 | Basic notation
Let $U$ denote a finite population of $N$ units, for which information on the vector of $p$ covariates $X = (X_1, \dots, X_p)$ is available, that is, $\{x_i = (x_{i,1}, \dots, x_{i,p})\}_{i=1}^{N}$. Let $S$ be a sample obtained from $U$ following some complex sampling design. In addition to the information on the covariates, values of the response variable $Y$ are also known for the sampled units: $y_i$, $\forall i \in S$. Furthermore, each sampled unit $i \in S$ is assigned a sampling weight $w_i = 1/\pi_i$, where $\pi_i$ indicates the inclusion probability, that is, the probability of unit $i$ being sampled, $\forall i \in S$.
For a continuous response variable $Y$, the linear regression model for the observed data is defined as follows:
$$ y_i = x_i \beta + \varepsilon_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \varepsilon_i, \quad \forall i \in S, \tag{1} $$
and the vector of regression coefficients $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is estimated ($\hat\beta$) based on the sample $S$ by minimizing the residual sum of squares (RSS):
$$ \mathrm{RSS}(\beta) = \sum_{i \in S} (y_i - x_i\beta)^2. \tag{2} $$
In a similar way, if $Y$ is a dichotomous response variable, the logistic regression model is defined as
$$ P(Y = 1 \mid X = x_i) = p(x_i), \quad \forall i \in S, \tag{3} $$
where $p(x_i) = \frac{e^{x_i\beta}}{1 + e^{x_i\beta}}$, and $\hat\beta$ is obtained by maximizing the log-likelihood function $\ell(\beta)$ (or, equivalently, minimizing $-\ell(\beta)$):
$$ \ell(\beta) = \sum_{i \in S} \left[ y_i \ln p(x_i) + (1 - y_i)\ln(1 - p(x_i)) \right]. \tag{4} $$
However, when working with complex survey data, sampling weights should be considered when estimating the model coefficients, as proposed by Fuller (1975) and Binder (1983), and the weighted residual sum of squares (WRSS) and the pseudo-log-likelihood ($p\ell$) functions are usually considered instead of (2) and (4), respectively:
$$ \mathrm{WRSS}(\beta) = \sum_{i \in S} w_i (y_i - x_i\beta)^2, \qquad p\ell(\beta) = \sum_{i \in S} w_i \left[ y_i \ln p(x_i) + (1 - y_i)\ln(1 - p(x_i)) \right]. \tag{5} $$
After estimating the regression coefficients, a value of the response variable can be estimated given the covariate values $x_i$ of unit $i \in U$ as $\hat y_i = x_i\hat\beta$ in the linear regression framework and as $\hat p(x_i) = \frac{e^{x_i\hat\beta}}{1 + e^{x_i\hat\beta}}$ in the logistic regression framework.
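For illustration, design-weighted (pseudo-likelihood) estimators such as those in (5) can be obtained, for instance, with svyglm in the survey R package. The following is a minimal sketch; the data frame sample_data and its columns (y, y_bin, x1, x2, stratum, psu, w) are hypothetical placeholders.

```r
library(survey)

# Hypothetical sample data: continuous response y, dichotomous response y_bin,
# covariates x1 and x2, stratum and PSU identifiers, sampling weights w = 1/pi
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~w,
                 data = sample_data, nest = TRUE)

# Weighted linear regression: minimizes the WRSS in (5)
fit_lin <- svyglm(y ~ x1 + x2, design = des)

# Weighted logistic regression: maximizes the pseudo-log-likelihood in (5)
fit_log <- svyglm(y_bin ~ x1 + x2, design = des, family = quasibinomial())

# Predicted response for a new unit, given its covariate values
predict(fit_lin, newdata = data.frame(x1 = 0.5, x2 = -1))
```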

2.2 | LASSO regression for variable selection
When a large number of predictor variables is available, the LASSO regression model is commonly used for variable selection. Briefly, this method forces some regression coefficients to zero, and thus more interpretable models are obtained. This variable selection method is described below.
For a given value of the tuning parameter $\lambda$, linear and logistic LASSO regression models are fitted by minimizing the following functions, respectively:
$$ \sum_{i \in S} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \quad \text{and} \quad -\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{6} $$
In practice, K-fold cross-validation is usually applied to select the optimal value of $\lambda$, that is, the value minimizing the error of the fitted model. Sampled units are randomly split into $K$ subsamples of the same size. For $k = 1, \dots, K$, the $k$th subsample is set as the test set ($S_{test(k)}$), while the remaining $K-1$ subsets form the training set ($S_{tr(k)}$). Then, a grid of $\lambda$ values, $\lambda_1, \dots, \lambda_L$, is defined and, for each of these values, a model is fitted to each training set $S_{tr(k)}$, $\forall k = 1,\dots,K$, following (6) (let us denote this model, either linear or logistic, as $\hat f^{\,l}_{tr(k)}(\cdot)$) and applied to the test set (let $\hat f^{\,l}_{tr(k)}(x_i)$ indicate the predicted value, $\forall i \in S_{test(k)}$). The estimation error for each unit is calculated by means of the loss function
$$ L\big(y_i, \hat f^{\,l}_{tr(k)}(x_i)\big) = \begin{cases} \big(y_i - \hat f^{\,l}_{tr(k)}(x_i)\big)^2, & \text{in the linear regression framework}, \\ -y_i \ln\big(\hat f^{\,l}_{tr(k)}(x_i)\big) - (1 - y_i)\ln\big(1 - \hat f^{\,l}_{tr(k)}(x_i)\big), & \text{in the logistic regression framework}. \end{cases} \tag{7} $$
For $k = 1, \dots, K$, the error in subset $k$ for $\lambda_l$, $\forall l = 1, \dots, L$, is then estimated as
$$ \widehat{\mathrm{Err}}_{(k)}(\lambda_l) = \frac{1}{n_{test(k)}} \sum_{i \in S_{test(k)}} L\big(y_i, \hat f^{\,l}_{tr(k)}(x_i)\big), \tag{8} $$
$n_{test(k)}$ being the size of $S_{test(k)}$, $\forall k = 1,\dots,K$. This process is repeated $K$ times, setting a different subset $k$ as the test set each time. The cross-validated error for $\lambda_l$ is estimated as
$$ \widehat{\mathrm{cvErr}}(\lambda_l) = \frac{1}{K} \sum_{k=1}^{K} \widehat{\mathrm{Err}}_{(k)}(\lambda_l). \tag{9} $$
Finally, the value $\Lambda$ that minimizes the cross-validated error is selected as the penalty parameter, and the model is fitted to the whole sample $S$ including $\Lambda$ as the tuning parameter in expression (6).
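For reference, this standard (design-agnostic) procedure is essentially what cv.glmnet in the glmnet R package implements; its default error measures (mean squared error and binomial deviance) match the losses in (7) up to constants. A minimal sketch, assuming a covariate matrix x and response vectors y (continuous) and y_bin (binary):

```r
library(glmnet)

# Standard K-fold cross-validation over an automatically chosen lambda grid;
# units are split into folds completely at random (no strata, no PSUs)
cv_lin <- cv.glmnet(x, y,     family = "gaussian", nfolds = 10)
cv_log <- cv.glmnet(x, y_bin, family = "binomial", nfolds = 10)

cv_lin$lambda.min               # value of lambda minimizing the cross-validated error
coef(cv_lin, s = "lambda.min")  # selected model: some coefficients are exactly zero
```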
However, in the whole process explained above, the sampling design and the sampling weights are not considered (let us denote this approach as the unweighted simple random sample cross-validation, unw-SRSCV, hereinafter). We believe that when working with complex survey data, the sampling design should be considered in the whole process: (1) when fitting the model, (2) when defining training and test sets, and (3) when estimating the error.
Below, we explain how we propose to address these three points as a whole.
In the first place, when fitting the LASSO regression models, sampling weights should be considered as follows instead of (6), for linear and logistic regression models, respectively (Kshirsagar et al., 2017; McConville et al., 2017):
$$ \sum_{i \in S} w_i (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{10} $$
and
$$ -p\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{11} $$
Second, in Section 2.3, we describe different methods based on replicate weights that can be used to take the sampling design into account when defining training and test sets. Finally, sampling weights should also be considered when estimating the error. In particular, focusing on the above-mentioned cross-validation method, Equation (8) can be rewritten as follows in order to incorporate the weights when estimating the error in subset $k$:
$$ \widehat{\mathrm{Err}}^{\,w}_{(k)}(\lambda_l) = \frac{1}{\sum_{i \in S_{test(k)}} w_i} \sum_{i \in S_{test(k)}} w_i\, L\big(y_i, \hat f^{\,l}_{tr(k)}(x_i)\big). \tag{12} $$
We denote as weighted simple random sample cross-validation (w-SRSCV) the method that considers (10) and (11) to fit the model and (12) to estimate the error, but defines the folds by randomly splitting the sampled units into different subsets, ignoring the sampling design as described previously.
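The w-SRSCV strategy can be sketched with glmnet, which accepts observation weights that enter both the fit and the cross-validated error; note that the folds are still drawn completely at random, so the sampling design itself is ignored. The objects x, y, and w below are assumed to exist; this is a sketch, not the implementation used in the paper.

```r
library(glmnet)

# w-SRSCV: sampling weights enter the fit (10)-(11) and the error estimate,
# but folds are formed by randomly splitting units, ignoring strata and PSUs
cv_w  <- cv.glmnet(x, y, weights = w, family = "gaussian", nfolds = 10)
lam_w <- cv_w$lambda.min

# Final LASSO model fitted to the whole sample with the selected tuning parameter
fit_w <- glmnet(x, y, weights = w, family = "gaussian", lambda = lam_w)
coef(fit_w)
```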

2.3 | Selecting LASSO model's tuning parameter with complex survey data
In this work, we propose to use replicate weights for the selection of the tuning parameter λ. In the following lines, we describe the six replicate weights methods considered in this work to define training and test sets when selecting the tuning parameter for LASSO models. The goal of replicate weights methods is to modify the sampling weights in order to define new partially independent subsamples that replicate the original sample, in such a way that the finite population is properly represented in each subsample by means of the modified weights, known as "replicate weights".
Some of the replicate weights methods described below are commonly applied for other purposes, such as variance estimation, when working with complex survey data. These methods are known as Jackknife Repeated Replication (JKn), Bootstrap, and BRR (see, e.g., Heeringa et al., 2017; Wolter, 2007, for more information about these methods). However, as far as we know, they have never been used for selecting the tuning parameter of LASSO regression models. In this work, we propose to incorporate the above-mentioned replicate weights methods in this context. In addition, we also propose three new methods based on the idea of replicating weights, which we denote as design-based K-fold cross-validation (dCV), split-sample repeated replication (split), and extrapolation (extrap), for the same purpose. Figure 1 depicts a graphical summary of all these methods (note that the figures are not self-explanatory and should be read in combination with the descriptions below for a correct understanding of each method).
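Before describing each method in detail, note that the existing replicate-weight constructions are available in the survey R package. The following is a minimal sketch of how they can be generated; the data frame sample_data and its columns stratum, psu, and w are hypothetical placeholders.

```r
library(survey)

# Hypothetical sample data: one row per sampled unit, with stratum and PSU
# identifiers and the sampling weight w = 1/pi
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~w,
                 data = sample_data, nest = TRUE)

# Replicate-weight versions of the design, as implemented in the survey package
rep_jkn  <- as.svrepdesign(des, type = "JKn")           # Jackknife Repeated Replication
rep_brr  <- as.svrepdesign(des, type = "BRR")           # Balanced Repeated Replication (2 PSUs/stratum)
rep_boot <- as.svrepdesign(des, type = "subbootstrap")  # Rao-Wu rescaling bootstrap

# Matrix of replicate weights: one column per replicate
w_rep <- weights(rep_jkn, type = "analysis")
dim(w_rep)
```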
Let $a_h$ denote the number of sampled PSUs from stratum $h$, $\forall h = 1,\dots,H$, where $\sum_{h=1}^{H} a_h = a$ is the total number of sampled PSUs. In this study, a stratified cluster sampling is considered; thus, the PSUs are the clusters sampled from each stratum. Note, however, that all the methods described below can be extended to one-stage stratified samples in which different numbers of individuals (which would be the PSUs in that case) are sampled from each stratum.
• Jackknife Repeated Replication (JKn): This method leaves one PSU $\alpha$, $\forall \alpha = 1,\dots,a$, out as the test set ($S_{test(\alpha)}$), the corresponding training set ($S_{tr(\alpha)}$) being formed by the remaining $a-1$ PSUs, that is, excluding the $\alpha$th one. A total of $T_{JKn} = a$ training and test sets are defined in this way, excluding a different PSU from the training set each time. For $\alpha = 1,\dots,a$, suppose that $\alpha$ indicates a PSU from stratum $h$, $\forall h = 1,\dots,H$. The replicate weights for the $\alpha$th training set are then defined, $\forall i \in S$, as
$$ w^{*,tr(\alpha)}_{i,JKn} = \begin{cases} 0, & \text{if } i \notin S_{tr(\alpha)}, \\ w_i \cdot \dfrac{a_h}{a_h - 1}, & \text{if } i \in S_{tr(\alpha)} \text{ and } i \text{ is in stratum } h, \\ w_i, & \text{if } i \in S_{tr(\alpha)} \text{ and } i \text{ is not in stratum } h. \end{cases} \tag{13} $$
Even though each test set is formed by the PSU excluded from the corresponding training set, the error is usually estimated considering the whole sample, so the replicate weights corresponding to the test set can be defined as the original sampling weights, that is, $w^{*,test(\alpha)}_{i,JKn} = w_i$.
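A small sketch of how the JKn training replicate weights in (13) can be computed from a sampled data set follows; the function name and the columns stratum, psu, and w are illustrative, not part of the original work.

```r
# JKn training replicate weights when PSU 'alpha0' from stratum 'h0' is left out
jkn_train_weights <- function(d, h0, alpha0) {
  a_h   <- length(unique(d$psu[d$stratum == h0]))  # sampled PSUs in stratum h0
  w_rep <- d$w                                     # other strata keep original weights
  in_h0 <- d$stratum == h0
  w_rep[in_h0 & d$psu == alpha0] <- 0              # excluded PSU gets weight zero
  w_rep[in_h0 & d$psu != alpha0] <-
    d$w[in_h0 & d$psu != alpha0] * a_h / (a_h - 1) # rescale the rest of the stratum
  w_rep
}
```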
• Rescaling Bootstrap (Bootstrap): $T_{Bootstrap} = B$ bootstrap resamples are generated as proposed by Rao and Wu (1988). For each stratum $h = 1,\dots,H$, $a_h - 1$ PSUs are selected with replacement, and these form the $b$th training set, $\forall b = 1,\dots,B$. For $\alpha = 1,\dots,a$, let us denote by $v^{(b)}_{\alpha}$ the number of times that the $\alpha$th PSU is selected to be part of the bootstrap resample $b$. Note that if the $\alpha$th PSU is not selected for resample $b$, then $v^{(b)}_{\alpha} = 0$. Then, $\forall b = 1,\dots,B$, the replicate weights for the $b$th training set are defined as follows. For $i \in S$, suppose $i$ is a unit from the $\alpha$th PSU, which belongs to stratum $h$:
$$ w^{*,tr(b)}_{i,Bootstrap} = w_i \cdot \frac{a_h}{a_h - 1} \cdot v^{(b)}_{\alpha}. \tag{14} $$
The corresponding test set for each training set is the original sample $S$ with its sampling weights, that is, $w^{*,test(b)}_{i,Bootstrap} = w_i$.
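Analogously, one resample of rescaled bootstrap training weights, following the description above, could be computed as in this sketch (column names are illustrative; at least two sampled PSUs per stratum are assumed).

```r
# One bootstrap resample: within each stratum, a_h - 1 PSUs are drawn with replacement
bootstrap_train_weights <- function(d) {
  w_rep <- numeric(nrow(d))
  for (h in unique(d$stratum)) {
    idx   <- d$stratum == h
    psus  <- unique(d$psu[idx])
    a_h   <- length(psus)                            # assumes a_h >= 2
    drawn <- sample(psus, a_h - 1, replace = TRUE)   # resampled PSUs
    v     <- table(factor(drawn, levels = psus))     # times each PSU was drawn
    w_rep[idx] <- d$w[idx] * (a_h / (a_h - 1)) *
      as.numeric(v[as.character(d$psu[idx])])        # rescaled weights as in (14)
  }
  w_rep
}
```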
• Balanced Repeated Replication (BRR): This method was originally designed to be applied to samples with 2 PSUs per stratum. For each stratum $h = 1,\dots,H$, one of the PSUs from the stratum is assigned to the training set while the other is assigned to the test set. There are $2^H$ different possible training and test sets that can be defined in this way, which is usually computationally unfeasible. Instead, a smaller number of sets, $T_{BRR}$ (where $T_{BRR} \leq H + 4$), is usually defined by selecting the PSU splits in a particular way by means of a Hadamard matrix, as proposed by McCarthy (1966). Nowadays, this method has been extended to be applicable when an even number of PSUs per stratum is available (Lumley, 2020; Wolter, 2007). For all $i \in S$, the replicate weights for the $t$th training set are defined as follows, $\forall t = 1,\dots,T_{BRR}$:
$$ w^{*,tr(t)}_{i,BRR} = \begin{cases} 2 w_i, & \text{if the PSU of unit } i \text{ is assigned to the } t\text{th training half-sample}, \\ 0, & \text{otherwise}. \end{cases} \tag{15} $$
Replicate weights for the corresponding test sets ($w^{*,test(t)}_{i,BRR}$) are defined by exchanging the roles of the test and training sets in (15).
• Design-based K-fold cross-validation (dCV): The $a$ sampled PSUs are randomly split into $K$ subsets. For $k = 1,\dots,K$, the $k$th subset is set as the test set ($S_{test(k)}$), the training set ($S_{tr(k)}$) being formed by the remaining $K-1$ subsets. In this way, $T_{dCV} = K$ training and test sets are defined. For each sampled unit $i \in S$ from stratum $h$, we propose to define the replicate weights for the $k$th training set as
$$ w^{*,tr(k)}_{i,dCV} = \begin{cases} 0, & \text{if } i \in S_{test(k)}, \\ w_i \cdot \dfrac{a_h}{a_h - a^*_h}, & \text{if } i \in S_{tr(k)} \text{ and } i \text{ is in stratum } h, \end{cases} \tag{16} $$
$a^*_h$ being the number of PSUs from stratum $h$ in the $k$th test set. Replicate weights for the $k$th test sets (which will be denoted as $w^{*,test(k)}_{i,dCV}$ hereinafter) can be defined in the same way, exchanging the roles of the training and test sets with each other in (16).
Note that our goal is to have at least one PSU from each stratum in every training set (not in every fold), and this will happen as long as no stratum ends up with all its PSUs in the same fold. In order to ensure this, after the PSUs are randomly assigned to folds, dCV needs to check whether this condition is satisfied. If any stratum has all its PSUs in the same fold, the folds are reassigned until the condition is satisfied.
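A minimal sketch of this fold-assignment check follows; the function name and inputs are hypothetical, and at least two sampled PSUs per stratum are assumed.

```r
# Assign PSUs to K folds for dCV, ensuring no stratum ends up with all of its
# PSUs in a single fold (so every training set keeps >= 1 PSU per stratum)
assign_dcv_folds <- function(stratum, psu, K) {
  psu_tab <- unique(data.frame(stratum = stratum, psu = psu))
  repeat {
    psu_tab$fold <- sample(rep_len(1:K, nrow(psu_tab)))   # balanced random folds
    spread <- tapply(psu_tab$fold, psu_tab$stratum,
                     function(f) length(unique(f)) >= 2)  # PSUs spread over >= 2 folds
    if (all(spread)) break                                # condition satisfied
  }
  psu_tab   # one row per PSU, with its stratum and assigned fold
}
```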
Therefore, at least two PSUs per stratum are needed, each of them assigned to a different fold. This is an advantage over the method proposed by Wieczorek et al. (2022), which requires at least K PSUs per stratum.
• Split-sample Repeated Replication (split): A given percentage of PSUs is randomly assigned to the training set and the rest to the test set. This process can be repeated $T_{split}$ times with a different split each time, defining in this way $T_{split}$ training and test sets. Replicate weights for the units in both the training and test sets can be defined in two different ways, $\forall t = 1,\dots,T_{split}$:
  ◦ split-cv: as described in Equation (16) for the dCV ($w^{*,tr(t)}_{i,split\text{-}cv}$, $w^{*,test(t)}_{i,split\text{-}cv}$).
  ◦ split-boot: replicating with replacement the PSUs assigned to the training set until $a_h - 1$ PSUs are obtained in each stratum, and calculating the weights as in Equation (14) for the Bootstrap ($w^{*,tr(t)}_{i,split\text{-}boot}$, $w^{*,test(t)}_{i,split\text{-}boot}$).
• Extrapolation (extrap): A given percentage of strata is set as the training set and the rest as the test set. The process is repeated $T_{extrap}$ times with a different split each time, defining $T_{extrap}$ different training and test sets. In this case, the replicate weights are equal to the sampling weights for units in the training set and 0 for units in the test set when fitting the models, $\forall i \in S$, $\forall t = 1,\dots,T_{extrap}$.
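Putting the pieces together, the following sketch shows how a grid of λ values could be evaluated with any of the above replicate-weight constructions: each training replicate fits the weighted LASSO, each test replicate evaluates a weighted error, and the λ minimizing the average error is returned. The inputs (matrices w_tr and w_test, one column per replicate) are assumed to be produced by one of the methods described above; this is a sketch under those assumptions, not the exact implementation used in the paper.

```r
library(glmnet)

# Select lambda using replicate weights: x (covariate matrix), y (response),
# lambda_grid (candidate values), w_tr / w_test (n x T matrices of training
# and test replicate weights from JKn, dCV, Bootstrap, BRR, split or extrap)
select_lambda_repweights <- function(x, y, lambda_grid, w_tr, w_test,
                                     family = c("gaussian", "binomial")) {
  family <- match.arg(family)
  T_reps <- ncol(w_tr)
  err <- matrix(NA_real_, nrow = length(lambda_grid), ncol = T_reps)

  for (t in seq_len(T_reps)) {
    keep <- w_tr[, t] > 0   # units with zero training replicate weight drop out
    fit  <- glmnet(x[keep, , drop = FALSE], y[keep],
                   weights = w_tr[keep, t], family = family)
    # Predictions on the full sample for every candidate lambda (one column each)
    pred <- predict(fit, newx = x, s = lambda_grid, type = "response")

    for (l in seq_along(lambda_grid)) {
      p_l  <- pmin(pmax(pred[, l], 1e-10), 1 - 1e-10)   # guard the logistic loss
      loss <- if (family == "gaussian") (y - pred[, l])^2
              else -y * log(p_l) - (1 - y) * log(1 - p_l)
      err[l, t] <- sum(w_test[, t] * loss) / sum(w_test[, t])  # weighted test error
    }
  }
  lambda_grid[which.min(rowMeans(err))]   # lambda minimizing the average error
}
```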

3 | SIMULATION STUDY
This section describes the simulation study conducted to analyze the performance of the different methods for selecting the tuning parameter when fitting LASSO regression models. Our goal is to compare the performance of the replication methods proposed in Section 2.3 (i.e., JKn, Bootstrap, BRR, dCV, split-cv, split-boot, and extrap) with that of the methods described in Section 2.2 (unw-SRSCV and w-SRSCV). In particular, we compare the tuning parameters selected by the different methods, the number of covariates that would be selected if the models were fitted with those tuning parameters, and the error that would be obtained with those models. We also compare these results with the "true" results that would be obtained if the finite population were known in practice.
The rest of the section is organized as follows: Section 3.1 describes the process of data simulation and scenarios, Section 3.2 describes the simulation study process, and Section 3.3 depicts and summarizes the main results.

3.1 | Data generation and sampling design
In the following lines, the data simulation process is described. Let us define $N = 100{,}000$ as the finite population size and $p = 75$ as the number of variables, denoted as $X_1,\dots,X_{50}, Z_1,\dots,Z_{25}$. In this simulation study, we consider the variables $Z_1,\dots,Z_{25}$ to be latent variables, which are used to define the response variable but are not available in the samples to fit the models. In this way, we aim to define more realistic scenarios, in which perfect models cannot be fitted. In addition, $Z_1,\dots,Z_{25}$ are used to define the sampling design.
For a given value of $p^*$, where $p^* \leq p$, let $\mu_{p^*}$ denote the null vector of dimension $1 \times p^*$ and $\Sigma_{p^* \times p^*}$ a matrix of dimension $p^* \times p^*$ with values of $\eta = 0.15$ off the diagonal and values of 1 on the diagonal, that is,
$$ \Sigma_{p^* \times p^*} = (1 - \eta)\, I_{p^* \times p^*} + \eta\, J_{p^* \times p^*}, $$
$I_{p^* \times p^*}$ being the identity matrix and $J_{p^* \times p^*}$ the matrix of ones. In addition, a vector of regression coefficients $\beta = (\beta_X, \beta_Z)^T$ is defined, where $\beta_X$ corresponds to the covariates $X_1,\dots,X_{50}$ and $\beta_Z$ to the latent variables $Z_1,\dots,Z_{25}$. We describe below the steps followed to generate the finite populations of this simulation study. Two scenarios are defined by generating two different populations. In Scenario 1 (S1), the $p = 75$ covariates are unit-level variables (i.e., there are $q = 0$ cluster-level variables), while in Scenario 2 (S2), $q = 5$ variables are defined as cluster-level variables and the remaining $p - q = 70$ variables are unit-level. In each population, two response variables have been generated: (a) a continuous response variable (linear regression) and (b) a dichotomous response variable (logistic regression).
1. For $q = 0$ (S1) and $q = 5$ (S2), two finite populations are generated by making $N$ realizations of a multivariate normal distribution with mean vector $\mu_{p^*}$ and covariance matrix $\Sigma_{p^* \times p^*}$, defining the unit-level variables.
2. Strata are defined by partitioning the population data set into sets of the same size ($H = 5$), and clusters by partitioning each stratum into sets of the same size ($A_h = 20$, $\forall h = 1,\dots,H$, $A_h$ being the number of clusters generated in stratum $h$ in the population). In this way, a total of 100 clusters of $N_{h,\alpha} = 1{,}000$ units each are defined.
3. If $q \neq 0$, generate the $q$ cluster-level variables by making one realization per cluster. Note that for two different units in the same cluster, their corresponding cluster-level covariates must take the same values, that is, for all $i, j$ in the same cluster, $(x_{i,1},\dots,x_{i,q}) = (x_{j,1},\dots,x_{j,q})$. Therefore, each realization is repeated $N_{h,\alpha}$ times. At this point, the values corresponding to the variables $X_1,\dots,X_{50}$ are defined for all the units in the finite population: $\{x_i = (x_{i,1},\dots,x_{i,50})\}_{i=1}^{N}$.
4. Generate the values of the response variables as follows:
   a. Linear regression framework: the continuous response $y_i$ is generated from the linear predictor based on $x_i$, $z_i$, and $\beta$, with Gaussian noise.
   b. Logistic regression framework: the event probability $p(x_i, z_i)$ is obtained by applying the logistic function to the same linear predictor, and the value of the response variable $y_i$ is randomly generated following $\mathrm{Bernoulli}(p(x_i, z_i))$. We set $\beta_0 = -10$, defining in this way a prevalence (i.e., probability of event) of around 25%.
   Then, the finite population $U$ is defined as the set of values corresponding to the response variable $y_i$ and the covariates $x_i$ (excluding the latent variables $z_i$), $\forall i = 1,\dots,N$, as well as the strata and cluster indicators corresponding to each of them.
5. The sampling design is defined as a stratified cluster sampling scheme. First, $a_h = 4$, $\forall h = 1,\dots,H$, clusters or PSUs are sampled per stratum. Afterward, a different number of units is sampled from each PSU of each stratum ($n_{h,\alpha}$), with the specific sample sizes differing between scenarios. Note that the names of the scenarios refer to the distribution of the response variable: "G" for the Gaussian distribution, with reference to the linear regression framework, and "B" for the Bernoulli distribution, indicating the logistic regression framework.
6. For $i \in S$, suppose $i$ is a unit sampled from PSU $\alpha$, $\forall \alpha = 1,\dots,a_h$, in stratum $h$, $\forall h = 1,\dots,H$. The sampling weight for unit $i$ is then calculated as
$$ w_i = \frac{A_h}{a_h} \cdot \frac{N_{h,\alpha}}{n_{h,\alpha}}, $$
where $N_{h,\alpha}$ and $n_{h,\alpha}$ denote the number of population units and sampled units in PSU $\alpha$ of stratum $h$, respectively.
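As a small numerical illustration of this calculation (the function name and the sizes used in the example are hypothetical):

```r
# Inverse inclusion probability under the two-stage design described above:
# A_h clusters in stratum h (a_h of them sampled), N_ha units in the sampled
# cluster (n_ha of them sampled)
sampling_weight <- function(A_h, a_h, N_ha, n_ha) {
  (A_h / a_h) * (N_ha / n_ha)
}

sampling_weight(A_h = 20, a_h = 4, N_ha = 1000, n_ha = 25)  # = 200
```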

3.2 | Set up
As explained above, in this simulation study we aim to compare the performance of the methods described in Section 2.3 with that of the w-SRSCV and the unw-SRSCV (both described in Section 2.2). In order to ease the notation, we can define the replicate weights for the training and test sets of the w-SRSCV as the original weights, that is, $w^{*,tr(t)}_{i,\text{w-SRSCV}} = w^{*,test(t)}_{i,\text{w-SRSCV}} = w_i$, $\forall i = 1,\dots,n$, for the $t$th training and test sets. In the same way, the unw-SRSCV is equivalent to setting all the replicate weights for training and test sets to one: $w^{*,tr(t)}_{i,\text{unw-SRSCV}} = w^{*,test(t)}_{i,\text{unw-SRSCV}} = 1$, $\forall i = 1,\dots,n$, for the $t$th training and test sets. Considering this notation, the lines below describe the process of the simulation study for all the methods, including the w-SRSCV and the unw-SRSCV. For $r = 1,\dots,R$:
Step 1. Obtain the sample $S^r$.
Step 2. Define the penalty grid $\{\lambda^r_1,\dots,\lambda^r_L\}$ based on $S^r$, using the default approach in glmnet (Friedman et al., 2010).
Step 3. For each value $\lambda^r_l$, $\forall l = 1,\dots,L$:
Step 3.1. Fit the model to $S^r$ (denoted $\hat f_{r,l}(\cdot)$) considering the vector of covariates $x_i$, the sampling weights $w_i$, $\forall i \in S^r$, and $\lambda^r_l$, following Equation (10) or (11).
Step 3.2. Apply the model $\hat f_{r,l}(\cdot)$ to the finite population and estimate the response: $\hat y_i = \hat f_{r,l}(x_i)$, $\forall i \in U$.
Step 3.3. Calculate the error of the model $\hat f_{r,l}(\cdot)$ in the finite population (i.e., the true population error of the model fitted to $S^r$):
$$ \widehat{\mathrm{Err}}^{\,r}_{true}(\lambda^r_l) = \frac{1}{N} \sum_{i \in U} L\big(y_i, \hat f_{r,l}(x_i)\big), $$
where $L\big(y_i, \hat f_{r,l}(x_i)\big)$ is calculated as in (7).
Step 3.4. Define the "true" optimal tuning parameter (not available in practice):
$$ \Lambda^{r}_{true} = \arg\min_{\lambda^{r}_{l}} \widehat{\mathrm{Err}}^{\,r}_{true}(\lambda^{r}_{l}). $$
Step 4. For each method $m$, where $m \in \{$JKn, dCV, Bootstrap, BRR, split-cv, split-boot, extrap, w-SRSCV, unw-SRSCV$\}$:
Step 4.1. Define training and test sets ($S^{r,m}_{tr(t)}$ and $S^{r,m}_{test(t)}$, $\forall t = 1,\dots,T_m$, respectively) and calculate the corresponding replicate weights for the sampled units: $w^{*,tr(t)}_{i,m}$ and $w^{*,test(t)}_{i,m}$, $\forall i \in S^r$. Based on these training and test sets, the optimal tuning parameter according to method $m$ ($\Lambda^{r,m}_{test}$) is selected, and the corresponding population error $\widehat{\mathrm{Err}}^{\,r}_{true}(\Lambda^{r,m}_{test})$ is computed and compared with the results obtained for $\Lambda^{r}_{true}$.

FIGURE 2: Box-plots of the differences between the logarithm of the true optimal tuning parameter and the one selected by each analyzed method, across R = 500 samples in S1(G) (linear regression with q = 0 cluster-level variables), S1(B) (logistic regression with q = 0 cluster-level variables), S2(G) (linear regression with q = 5 cluster-level variables), and S2(B) (logistic regression with q = 5 cluster-level variables).

FIGURE 3: Box-plots of the number of variables in the models fitted across R = 500 samples from S1(G), S1(B), S2(G), and S2(B), considering the optimal tuning parameters selected by each method m. The gray box-plot (denoted "True") indicates the number of variables of the models fitted to the same samples considering the true optimal tuning parameters that minimize the population error.

TABLE 1: Summary of the numerical results obtained in the simulation study.

4 | CONCLUSIONS
The performance of all these methods for selecting the tuning parameter for LASSO models has been compared by means of an extensive simulation study. The sampling design considered in this study is a stratified cluster sampling in which a different number of units is sampled from each cluster. In order to reduce the number of results shown in the paper, we set the number of folds to K = 10, given that it is the value most commonly used in the literature (see, e.g., Witten et al., 2016). Let us highlight some of the most interesting conclusions of the simulation study in the following lines.
In the first place, the bad performance of the unw-SRSCV, which leads to very complex regression models that select almost all the variables, shows the need to incorporate sampling weights into the estimation process of LASSO regression models. It should be noted that in this paper we have not considered the option of fitting perfect prediction models, for which the sampling design is uninformative given the covariates included in the model. In line with previous works (see, e.g., Pfeffermann, 1993; Scott & Wild, 1986; Sugden & Smith, 1984), if we were able to fit perfect models, sampling weights would not be needed in the estimation process of linear and logistic regression models. These conclusions can also be extended to LASSO models, and hence this method would perform properly in that situation. However, it is important to point out that when working with LASSO regression models, as we are using a sparse shrinkage estimator, the sampling design must be uninformative given, not all the covariates, but the ones that actually end up in the final model, which is even more complicated and beyond the researcher's control. In addition, it should also be noted that when working with real data, we will hardly ever be able to fit "perfect" regression models. Therefore, we would not recommend the use of this method in practice, in order to avoid fitting overly complex regression models with biased estimates of the regression coefficients.
The second point that should be mentioned is the similarities and differences between the performance of the w-SRSCV (which does not consider the sampling design when defining folds) and the new proposal, dCV. It is striking that, for the same sampling design, such different results were obtained across different scenarios. This fact can be explained as follows. In the scenario where no cluster-level variable was incorporated, most of the variability induced by the clusters could be explained by means of the sampling weights. In contrast, the inclusion of cluster-level variables leads to an increase in the effect of the sampling design relative to clustering, thus yielding greater differences between one method and the other. When no cluster-level variable is considered in the model, both methods perform properly and lead to reasonable and parsimonious regression models. In contrast, when cluster-level variables are included in the process, the models selected by those methods differ considerably, those selected by the w-SRSCV being more complex than necessary. This is in line with the results obtained by Lumley and Scott (2015), in which the effect of the sampling design plays an important role in model selection, in particular on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Briefly, that work shows that for samples with a greater design effect, more parsimonious models are selected, given that the design effect penalizes more strongly the incorporation of covariates into the model. Coming back to our study, we have observed that the greater the cluster effect, the greater the differences between the tuning parameters selected for fitting the models and the number of variables selected based on those methods. The w-SRSCV tends to select a larger number of variables than the dCV. Therefore, we recommend the use of the dCV rather than the w-SRSCV, in order to select more parsimonious models when fitting LASSO regression models to complex survey data. In addition, given the similarities between both works, we believe that the trace of the variance-covariance matrix (defined by means of the information matrix and the score vector), which is used to estimate the design-based AIC proposed by Lumley and Scott (2015), could be used to analyze the cluster effect and thus diagnose whether there may be differences between the dCV and the w-SRSCV. Some numerical results related to this issue are shown as Supporting Information.
In summary, in this work new methods based on the idea of replicating weights have been proposed, among others, the dCV. This method can be seen as an extension of the Survey CV method proposed by Wieczorek et al. (2022), which, in combination with replicate weights, allows more flexibility when defining the folds, and is thus valid for more types of designs, for example, when a different number of PSUs per stratum is available or when only a few PSUs per stratum are sampled.
TABLE 1 (Continued) Note: For each scenario, the minimum (min) and maximum (max) values, the average (mean) and standard deviation (sd), and the median and interquartile range (Q1-Q3) are displayed for the logarithm of the optimal tuning parameters ($\Lambda^{r}_{true}$ and $\Lambda^{r,m}_{test}$) and the corresponding population errors ($\widehat{\mathrm{Err}}^{\,r}_{true}(\Lambda^{r}_{true})$ and $\widehat{\mathrm{Err}}^{\,r}_{true}(\Lambda^{r,m}_{test})$) obtained across R = 500 samples for each analyzed method m.

FIGURE 1: Graphical summary of the replicate weights methods considered in this work. Existing methods are shown in (a) Jackknife Repeated Replication (JKn), (b) Rescaling Bootstrap (Bootstrap), and (c) Balanced Repeated Replication (BRR). New proposals are depicted in (d) design-based cross-validation (dCV), (e) Split-sample Repeated Replication (split), and (f) Extrapolation (extrap). Each dot represents a PSU and gray lines delimit the strata.