Objective model selection with parallel genetic algorithms using an eradication strategy

In supervised learning, feature selection methods identify the most relevant predictors to include in a model. For linear models, the inclusion or exclusion of each variable may be represented as a vector of bits playing the role of the genetic material that defines the model. Genetic algorithms reproduce the strategies of natural selection on a population of models to identify the best one. We derive the distribution of the importance scores for parallel genetic algorithms under the null hypothesis that none of the features has predictive power. These distributions provide an objective threshold for feature selection that does not require the visual inspection of a bubble plot. We also introduce the eradication strategy, akin to forward stepwise selection, where the genes of useful variables are sequentially forced into the models. The method is illustrated on real data, and simulation studies are run to describe its performance.


INTRODUCTION
In predictive models, feature selection can boost performance and help avoid overfitting. Classical approaches such as stepwise regression and all-subset selection based on penalized statistics such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are routinely used (see e.g., Draper & Smith, 2014). Penalized regression methods such as the lasso of Tibshirani (1996) select features and fit a model simultaneously. Genetic algorithms have also been proposed (see e.g., Chatterjee et al., 1996), including a parallel version by Zhu & Chipman (2006).
Biomimetics consists of imitating nature to solve complex human problems. For instance, genetic algorithms imitate natural selection. For feature selection, models are individuals whose DNA is a vector of binary markers that indicate whether each feature belongs to the model. Successive generations of models are generated by selecting the fittest parents of the current population and creating offspring whose genes are drawn randomly from both parents, a step called crossover. The richness of the population may be improved when models undergo mutation, where genes are subject to being flipped. While the general steps of a genetic algorithm are clear, the actual recipe for each application offers numerous options.
Genetic algorithms often involve a single population that evolves until the fittest individual is born. A generalization proposed by Zhu & Chipman (2006) considers a fixed number of parallel populations, each evolving independently of the others. After a predetermined number of steps, each feature's importance is defined by its relative frequency in the last generation. A bubble plot may then be used to visually identify where the importance drops, separating essential features from those that should be dismissed.
To illustrate the bubble plot, let us consider the Pollution dataset of McDonald & Schwing (1973), where a linear regression with 15 features on 60 data points predicts a measure of mortality. The left panel of Figure 1 shows a bubble plot, a tool proposed by Zhu & Chipman (2006). The right panel shows the associated ordered plot, which is identical to the bubble plot except that the features appear in order of decreasing importance. The figure suggests the selection of features $x_1$, $x_9$, and $x_{14}$. Although Zhu & Chipman (2006) analyzed the Pollution dataset, Figure 1 was generated with a different genetic algorithm; hence our figure differs from theirs. While the bubble plot is interesting and intuitive, it requires a visual inspection, which makes the method subjective and limits its automation, as human intervention is required to complete the selection.
To make parallel genetic algorithms more objective, we derive the distribution of the importance score of every feature under the null hypothesis that none of them is a good predictor of the response. We use the null distribution of the importance to determine an objective threshold at a global significance level $\alpha$. Drawing the bubble plot then becomes optional, as the importance of each feature may be compared against a given threshold. We also introduce an eradication strategy for parallel genetic algorithms: when a variable is deemed sufficiently important, it is forced into all future models, and the creation of new populations respects that constraint, thereby eradicating the zero gene for that variable. We perform multiple eradication steps until all the remaining features stay below the significance threshold we established.
FIGURE 1: Example of a bubble plot and its associated ordered plot as suggested by Zhu & Chipman (2006). The importance of each feature is measured by its prevalence among the individuals of the last generation in a parallel genetic algorithm.
Feature selection has a rich body of literature. Review papers such as Chandrashekar & Sahin (2014) or Miao & Niu (2016) cover different approaches from a machine learning perspective. Based on their descriptions, filtering methods rank features based on their potential importance in order to dismiss the less worthy. Embedded methods proceed with feature selection while training; the lasso is one example of such methods. Genetic algorithms, as well as classical methods such as stepwise regression, fall in the last category, that of wrapper methods, which consist in selecting a subset of features based on their ability to predict well. Through the years, a number of different methods have been proposed, either in methodological papers like Rahman & Khan (2012) or through applications such as Onan (2015) and Onan & Korukoglu (2017).
The main contribution of this paper is to provide theoretical results on the importance scores of Zhu & Chipman (2006).
The empirical results of this article focus on illustrating the theoretical results derived for the distribution of the importance scores of Zhu & Chipman (2006). As a point of reference, we also provide results obtained with a classical approach, namely the AIC-based stepwise method implemented as the step function in R (R Core Team, 2022).
Section 2 of this article introduces the basics of genetic algorithms for model selection as well as some notation. Our method, which we call the Eradicative Parallel Genetic Algorithm (EPGA), is described in Section 3, along with the mathematical results that support it. Case studies and Monte Carlo simulations are analyzed in Section 4. Section 5 discusses the choice of hyperparameters in an EPGA. Section 6 offers a short conclusion.

GENETIC ALGORITHMS FOR MODEL SELECTION
Genetic algorithms are meta-heuristic optimization techniques mimicking a Darwinian vision of evolution in order to solve complex, numerically intractable problems (see Shonkwiler & Mendivil, 2009). The general functioning of genetic algorithms consists of the successive application of three genetic operators (selection, crossover, and mutation) to an initial population of candidate solutions, or individuals. For model selection, individuals are binary strings indicating whether each feature is present. At each generation, the likelihood of an individual generating an offspring depends on its fitness.
We first describe the genetic operators in the more classic setting of a single-thread genetic algorithm, which requires the evolution of only one population. We then extend this notation to the parallel universes proposed by Zhu & Chipman (2006).

Single-Thread Genetic Algorithm (SGA)
In a single-thread algorithm, individual $i \in \{1, \dots, I\}$ of generation $g \in \{0, 1, \dots, G\}$ is represented by the binary vector $b_{ig} = [b_{i1g}, \dots, b_{iKg}]$, where $b_{ikg}$ is 1 when feature $k \in \{1, \dots, K\}$ is active for this individual and 0 otherwise. The number of individuals $I$ in the population is usually held fixed from generation to generation. A typical application for model selection is to consider linear regression where each explanatory variable is included in the model or not, as encoded by $b_{ig}$.

Initial Population
The initial population of individuals is generated randomly. The $b_{ik0}$'s are independent Bernoulli variables with probability of success $p_0$, the activation rate.
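To make this step concrete, the following R sketch generates such an initial population; the names K, I, and p0 are ours, and the values are arbitrary illustrations rather than recommendations.

    # One individual per row; entry [i, k] is 1 when feature k is active.
    K  <- 15    # number of candidate features
    I  <- 20    # population size
    p0 <- 0.3   # activation rate
    pop <- matrix(rbinom(I * K, size = 1, prob = p0), nrow = I, ncol = K)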

Fitness
In a general genetic algorithm, the fitness function $f(b_{ig})$ measures the ability of an individual to "solve the problem". In a model selection setting, $f$ should measure the predictive ability of the corresponding model. Models with different numbers of features should be treated fairly by an ideal measure of fitness. Zhu & Chipman (2006) used a leave-one-out cross-validated residual sum of squares (RSS), but we prefer to use a validation subset to evaluate an appropriately scaled RSS. Since better individuals should have a larger fitness, our fitness function is a negative power of the rescaled RSS.

Selection
At generation $g$, $I$ individuals are available to become parents. To create an offspring, we select two individuals randomly with replacement (i.e., parthenogenesis is allowed). Zhu & Chipman (2006) determine a survival pool in which all individuals have the same probability of being selected as parents. The survivors also make it directly to the following generation, an operation called elitism. We opt to implement another popular selection operator where the probability of selection of an individual is proportional to its fitness. We do not implement elitism: all members of a generation are offspring born of the previous generation. That is, for $g$ fixed, the probability of selection of individual $i$ is

$f(b_{ig}) / \sum_{j=1}^{I} f(b_{jg})$.
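Continuing the sketch above, fitness-proportional selection of two parents may be written as follows in R; the fitness values are placeholders, since the fitness function is only specified in Section 3.

    # Fitness-proportional selection without elitism; parthenogenesis is
    # allowed because sampling is done with replacement.
    fit     <- runif(I)   # placeholder fitness values for illustration
    parents <- sample(seq_len(I), size = 2, replace = TRUE,
                      prob = fit / sum(fit))
    mother  <- pop[parents[1], ]
    father  <- pop[parents[2], ]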

Crossover
Crossover determines how the genes of the two selected parents are combined to generate one offspring. Zhu & Chipman (2006) used a one-point crossover where an integer is chosen randomly between 1 and $K - 1$ inclusively. The father's genes are kept up to that integer inclusively, and those of the mother prevail for the rest. One disadvantage of this approach is that the arbitrary order of the variables influences the models that are possible to generate. We have a preference for a uniform crossover where each gene of the offspring is taken randomly from its mother or father. Let $C_g$ be a diagonal matrix whose $K$ diagonal entries are independent Bernoulli variates with probability of success 1/2. Then, if individuals $i_1$ and $i_2$ of generation $g$ are selected as the parents of individual $i$ in generation $g+1$, the crossover will first generate the embryo

$b^*_{i,g+1} = C_g b_{i_1 g} + (I_K - C_g) b_{i_2 g}$,

where $I_K$ is the $K \times K$ identity matrix. The embryo $b^*_{i,g+1}$ will become an individual in generation $g+1$ once the mutation step is complete.
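In R, the uniform crossover amounts to drawing the diagonal of $C_g$ and mixing the two parents componentwise, as in this sketch that continues the running example.

    # Each gene of the embryo is copied at random from one of the parents.
    cg     <- rbinom(K, size = 1, prob = 0.5)   # diagonal of C_g
    embryo <- cg * mother + (1 - cg) * father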

Mutation
Mutation contributes to genetic diversity by making each gene of the embryo subject to a random flip. Let $M_{g+1}$ be a diagonal matrix whose diagonal is filled with $K$ independent Bernoulli random variables with parameter $m_{g+1}$, the probability of a mutation, which could vary across generations.
A decreasing $m_g$, for instance, is akin to simulated annealing (see e.g., Kirkpatrick et al., 1983). Individual $i$ of generation $g+1$ may then be obtained as

$b_{i,g+1} = (I_K - M_{g+1}) b^*_{i,g+1} + M_{g+1} (\mathbf{1} - b^*_{i,g+1})$,

where $\mathbf{1}$ is a vector of 1s.
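The mutation step may be sketched in the same way; each gene of the embryo is flipped independently with probability $m_{g+1}$, here set to the common default $1/K$ discussed in Section 5.

    # Random flips turn the embryo into an individual of generation g+1.
    mg    <- 1 / K
    flip  <- rbinom(K, size = 1, prob = mg)     # diagonal of M_{g+1}
    child <- (1 - flip) * embryo + flip * (1 - embryo)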

Evolution
The process just described is repeated recursively. While it is possible to iterate until satisfying a convergence criterion, we continue for a fixed number of generations $G$. For a single-thread algorithm, the fittest individual of the last generation, $\arg\max_i f(b_{iG})$, is usually output as the solution.
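To fix ideas, the three operators may be assembled into the compact single-thread algorithm sketched below in R; it assumes a user-supplied fitness function f returning positive values and is an illustration rather than the implementation used for our experiments.

    # A bare-bones SGA: evolve for G generations, return the fittest individual.
    sga <- function(f, K, I = K, G = 10, p0 = 0.5, m = 1 / K) {
      pop <- matrix(rbinom(I * K, 1, p0), nrow = I)
      for (g in seq_len(G)) {
        fit     <- apply(pop, 1, f)
        new_pop <- matrix(0L, nrow = I, ncol = K)
        for (i in seq_len(I)) {
          par    <- sample(seq_len(I), 2, replace = TRUE, prob = fit)
          cross  <- rbinom(K, 1, 0.5)                      # uniform crossover
          embryo <- cross * pop[par[1], ] + (1 - cross) * pop[par[2], ]
          flip   <- rbinom(K, 1, m)                        # mutation
          new_pop[i, ] <- (1 - flip) * embryo + flip * (1 - embryo)
        }
        pop <- new_pop
      }
      pop[which.max(apply(pop, 1, f)), ]
    }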
As a meta-heuristic algorithm, there exists a vast number of variants for the genetic operators and their hyperparameters. Our description of the operators focuses on the method that we have developed. Readers who are interested in learning more about the numerous uses of genetic algorithms and the many variants of genetic operators may consult Mitchell (1998), Cantú-Paz (2000), or Haupt & Haupt (2004).

Parallel Genetic Algorithm (PGA)

Zhu & Chipman (2006) suggest the creation of $U$ universes in which populations evolve in parallel. New indices are required to account for the universe. They are as follows:

Genes: $b_{igu} = [b_{i1gu}, \dots, b_{iKgu}]$ is a binary vector with the genetic code of individual $i$ of generation $g$ in universe $u \in \{1, \dots, U\}$.

Fitness: Since we use out-of-sample validation, the fitness functions in parallel universes use different holdout datasets. The notation $f_u$ shows this dependence of the fitness function on the universe.
Selection, crossover, and mutation: The genetic operators are identical in parallel universes. They are applied independently in each of the parallel populations in the same manner as previously described.
With parallel worlds, the output is not a single fittest individual but an importance score based on the frequency of each gene in the final generation of all populations. For feature $k$, the importance score is the proportion

$\hat{\pi}_k = \frac{1}{UI} \sum_{u=1}^{U} \sum_{i=1}^{I} b_{ikGu}$,

and we denote by $\hat{\pi} = [\hat{\pi}_1, \dots, \hat{\pi}_K]$ the vector of those importance scores for all features. The bubble and ordered plots of Figure 1 are a graphical representation of $\hat{\pi}$.
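Given the final generations of the $U$ universes, the importance scores are simple gene frequencies; a sketch with placeholder populations follows, where random matrices stand in only so that the code runs on its own.

    # `finals` stands for the list of U final populations (I x K matrices).
    finals <- replicate(25, matrix(rbinom(20 * 15, 1, 0.5), nrow = 20),
                        simplify = FALSE)
    pi_hat <- Reduce(`+`, lapply(finals, colMeans)) / length(finals)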
In the next section, we describe how to derive a threshold for the values of $\hat{\pi}$. We use it to determine which features to retain and which to dismiss.

ERADICATIVE PARALLEL GENETIC ALGORITHMS (EPGAS)
Using a PGA for feature selection yields importance scores for the variables rather than a single solution, as an SGA would. The visual inspection of a figure drives the final decision of which variables to retain. The purpose of this article is to offer an objective and automatable way to make that final decision.
Let us focus on a linear model setting where predictor $x_k$ has a parameter $\beta_k$ and where $\beta_k = 0$ means that the feature has no effect on the target. To establish a threshold value, we derive the distribution of $\hat{\pi}_k$ for $k \in \{1, \dots, K\}$ under a null hypothesis, which implies some symmetry between individuals. That is, under $H_0: \beta_1 = \cdots = \beta_K = 0$, the fitness of any individual is assumed to be approximately equal since they all have an equally low ability to predict the target variable. At the selection step, this means that all individuals are equally fit and hence have an equal probability of being selected. This hypothesis of symmetry is reinforced through careful choices in the design of the genetic algorithm.

Holdout Validation Sample
In each universe, the set of $n$ observations is split into a training set and a validation set. The $\tilde{n}$ observations from the training set form the vector of responses $\tilde{Y}_u$, and the predictors are encoded in the $\tilde{n} \times (K+1)$ design matrix $\tilde{X}_u$, whose first column, filled with 1s, accounts for the intercept. The rest of the data becomes the validation set, and we denote the corresponding values $\breve{n}$, $\breve{Y}_u$, and $\breve{X}_u$. When working with a massive dataset, sample sizes with $\breve{n} + \tilde{n} < n$ could be considered, with each universe drawing possibly different data.
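In R, the per-universe split may be sketched as follows, assuming hypothetical objects X (an n x K matrix of predictors) and y (a length-n response); the 20% validation fraction is an arbitrary choice, not a recommendation.

    n       <- nrow(X)
    n_valid <- round(0.2 * n)          # size of the validation set
    valid   <- sample(n, n_valid)      # a fresh draw in every universe
    X_train <- X[-valid, , drop = FALSE]; y_train <- y[-valid]
    X_valid <- X[ valid, , drop = FALSE]; y_valid <- y[ valid]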

Choice of Fitness Function
The distribution of the importance scores assumes nearly equal fitness for all models under the null hypothesis. Using out-of-sample validation helps to yield a fitness that is not influenced by the number of active features. The fitness we propose uses a rescaled sum of squared residuals arising from the validation sample, but other choices would also be acceptable.
Consider an arbitrary individual $b$ with $k$ active features, such that $\|b\|^2 = k$. Let $B$ be a $(K+1) \times (k+1)$ matrix generated by removing some columns from the $(K+1) \times (K+1)$ identity matrix. That is, column 1 always remains for the intercept, and the following columns are matched to the binary values in $b$, all columns associated with a 0 being removed. The $B$ matrix allows us to select appropriate columns from the design matrices, and the usual matrix notation for regression estimates may be adjusted by adding a $B$ beside every design matrix. In universe $u$, the estimate of the regression slope parameters is then given by $(B^\top \tilde{X}_u^\top \tilde{X}_u B)^{-1} B^\top \tilde{X}_u^\top \tilde{Y}_u$, which may then be used to calculate the out-of-sample predictions $\hat{Y}_u(b) = \breve{X}_u B (B^\top \tilde{X}_u^\top \tilde{X}_u B)^{-1} B^\top \tilde{X}_u^\top \tilde{Y}_u$ and the residuals $\varepsilon_u(b) = \breve{Y}_u - \hat{Y}_u(b)$. Assuming a linear regression model with independent homoscedastic normal errors, and considering the independence of the errors in the training and validation datasets, this vector of residuals is multivariate normal with mean zero and covariance

$\Sigma_u(b) = \sigma^2 \{ I_{\breve{n}} + \breve{X}_u B (B^\top \tilde{X}_u^\top \tilde{X}_u B)^{-1} B^\top \breve{X}_u^\top \}$,   (1)

where $I_{\breve{n}}$ is the $\breve{n} \times \breve{n}$ identity matrix. Since the selection is based on the relative value of the fitness, the actual value of $\sigma^2$ is irrelevant in the calculation of the fitness function

$f_u(b) = [ \varepsilon_u(b)^\top \{ \Sigma_u(b)/\sigma^2 \}^{-1} \varepsilon_u(b) ]^{-\lambda}$,   (2)

where $\lambda$ is a positive scalar that can enhance the peakedness of the function. We found empirically that $\lambda = 1.5$ works well. Since the residuals are asymptotically normal, the distribution of the expression inside the brackets in Equation (2) is, up to a multiplicative constant, chi-square with $\breve{n}$ degrees of freedom. Most importantly, the degrees of freedom do not depend on $k$, the number of active features.
Remark 1. Using the Woodbury matrix identity and some algebra, the inverse required in Equation (2) may be written in the equivalent form

$\{ \Sigma_u(b)/\sigma^2 \}^{-1} = I_{\breve{n}} - \breve{X}_u B \{ B^\top (\tilde{X}_u^\top \tilde{X}_u + \breve{X}_u^\top \breve{X}_u) B \}^{-1} B^\top \breve{X}_u^\top$,

which may be used to evaluate Equation (2) with one less matrix inversion. This result is similar in nature to Problem 7.3a of Friedman et al. (2001).
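The following R sketch evaluates the fitness of Equation (2) for linear regression, using the identity of Remark 1 to avoid inverting the residual covariance; it is an illustration of the formulas above under our notation, not the code used for the experiments.

    fitness_lm <- function(b, X_train, y_train, X_valid, y_valid, lambda = 1.5) {
      cols <- c(TRUE, b == 1)                  # intercept plus active features
      Xt   <- cbind(1, X_train)[, cols, drop = FALSE]
      Xv   <- cbind(1, X_valid)[, cols, drop = FALSE]
      beta <- solve(crossprod(Xt), crossprod(Xt, y_train))
      eps  <- y_valid - Xv %*% beta            # out-of-sample residuals
      # Remark 1: {Sigma/sigma^2}^{-1} = I - Xv (Xt'Xt + Xv'Xv)^{-1} Xv'
      A    <- diag(length(eps)) -
              Xv %*% solve(crossprod(Xt) + crossprod(Xv), t(Xv))
      drop(t(eps) %*% A %*% eps)^(-lambda)     # rescaled RSS to the power -lambda
    }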
There are numerous other possible fitness functions in the literature. As long as a fitness function is sufficiently unbiased to offer an approximately equal fitness score to all models under the null hypothesis, the results of the next section should hold.

Distribution Under the Null Hypothesis
In generation $g = 0$, all genes are generated at random as independent Bernoulli variates with probability of success $p_0$. As the populations evolve, these genes recombine into new offspring. We look into the distribution of those genes at generation $G$, when the evolution of all populations terminates. Note that under the null hypothesis, the expected value of the genes of an embryo $b^*$ is equal to that of its parents since all individuals have the same probability of being selected. The mutation step, however, has an effect on the expected proportion of 1's. Fixing $k$ and $u$, and writing $p_g = E(b_{ikgu})$, conditioning on the presence of a mutation yields the recurrence

$p_{g+1} = p_g (1 - m_{g+1}) + (1 - p_g) m_{g+1}$,   (3)

which may be written alternatively as

$p_G = \frac{1}{2} + \left( p_0 - \frac{1}{2} \right) \prod_{g=1}^{G} (1 - 2 m_g)$,

which tends to 1/2 as $G \to \infty$ for most sequences with $m_g \in (0, 1/2)$. In practice, except for the trivial case where the mutation is eventually disabled, the values of $m_g$ are likely to be bounded away from 0 and 1/2, yielding that $\sum_{g=1}^{\infty} \log(1 - 2 m_g)$ diverges to $-\infty$, so that the product above vanishes and $p_G \to 1/2$. Many other limits are, however, possible if $m_g$ tends to zero fast enough to make the above sum converge.

Remark 2. If the probability of mutation is fixed for all generations such that $m_g = m$, then Equation (3) simplifies into $p_G = 1/2 + (p_0 - 1/2)(1 - 2m)^G$.

Since there is no exchange between parallel universes, bits $b_{i_1 k g u_1}$ and $b_{i_2 k g u_2}$ will always be independent when they come from different universes $u_1 \neq u_2$. By virtue of the uniform crossover as well as the assumption that all models are equally likely to be selected, genes in different positions $k_1$ and $k_2$ should also be uncorrelated. For fixed $g$ and $u$, two different individuals may, however, be correlated as they are likely to share common ancestors. In the process of creating generation $g = 1$, we may condition on having inherited gene $k$ from a common parent to get $c_0^* = \mathrm{cov}(b^*_{i_1 k 1 u}, b^*_{i_2 k 1 u}) = p_0 (1 - p_0)/I$ for the embryos or, more generally, when creating the embryos of generation $g+1$ from the parents of generation $g$,

$c_g^* = \mathrm{cov}(b^*_{i_1 k, g+1, u}, b^*_{i_2 k, g+1, u}) = \{ v_g + (I - 1) c_g \} / I$,   (4)

where $v_g = \mathrm{var}(b_{ikgu}) = p_g (1 - p_g)$ and $c_g = \mathrm{cov}(b_{i_1 kgu}, b_{i_2 kgu})$ for arbitrary $i_1 \neq i_2$, $k$, and $u$, and with $c_0 = 0$ for the initial population. Let $M_{k,g+1}$ identify the $k$th diagonal element of $M_{g+1}$, and note for simplicity $b_1 = b_{i_1 k, g+1, u}$ and $b_1^* = b^*_{i_1 k, g+1, u}$, and similarly for $b_2$ and $b_2^*$ of individual $i_2$. We can then write

$c_{g+1} = \mathrm{cov}(b_1, b_2) = (1 - 2 m_{g+1})^2 \, \mathrm{cov}(b_1^*, b_2^*) = (1 - 2 m_{g+1})^2 c_g^*$,   (5)

explicitly, after simplifications due to the independence of the mutation binary markers and properties of the covariance. Substituting Equation (4) in (5) yields the recurrence

$c_{g+1} = (1 - 2 m_{g+1})^2 \{ v_g + (I - 1) c_g \} / I$.

Despite their unwieldy expressions, the sequences $v_g$ and $c_g$ are easy to determine numerically with a simple loop.
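The loop in question may be written in a few lines of R; this sketch follows Equations (3) to (5) directly, with m holding the mutation rates $m_1, \dots, m_G$.

    null_moments <- function(p0, m, I) {
      p <- p0; v <- p0 * (1 - p0); c <- 0
      for (mg in m) {
        cstar <- (v + (I - 1) * c) / I         # Equation (4)
        p     <- p * (1 - mg) + (1 - p) * mg   # Equation (3)
        v     <- p * (1 - p)
        c     <- (1 - 2 * mg)^2 * cstar        # Equation (5)
      }
      list(pG = p, vG = v, cG = c)             # moments at generation G
    }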
The decision of which features to retain is made from the importance scores $\hat{\pi}_k$, which can be seen as the average of $U$ independent and identically distributed random variables. As the number of parallel universes $U$ increases, the central limit theorem (see Casella & Berger, 2002) guarantees the convergence of $\sqrt{U}(\hat{\pi}_k - p_G)$ to a normal distribution with mean 0 and variance

$\sigma_{\pi}^2 = \{ v_G + (I - 1) c_G \} / I$.

The decision to retain any given variable may therefore be made from a normal quantile. Since all variables are simultaneously tested, and since their individual tests are uncorrelated, a Šidák correction (see Šidák, 1967) is applied to account for multiple comparisons. After setting a global level $\alpha$, the user should therefore keep those variables for which

$\hat{\pi}_k > p_G + z_{(1-\alpha)^{1/K}} \, \sigma_{\pi} / \sqrt{U}$,   (6)

where $z_q$ denotes the quantile of order $q$ of the standard normal distribution and $(1-\alpha)^{1/K}$ reflects the Šidák-corrected individual level. Except for $\hat{\pi}_k$ itself, none of those values needs to be estimated. They come from deterministic functions of the hyperparameters of the genetic algorithm. This result does not depend on the fitness function either, as long as it evaluates all models as approximately equally fit under the null hypothesis. In particular, a constant fitness function would yield this result but would display no ability to detect relevant variables when they are present. Empirical evidence is provided in Section 4 to show that the fitness function in Equation (2) behaves adequately.
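In R, the threshold of Equation (6) follows from the null_moments() sketch above; the hyperparameter values below are placeholders in the spirit of the illustrative example of Section 4.1 and serve only as an example.

    K <- 20; I <- 20; G <- 8; U <- 25; alpha <- 0.05
    mom      <- null_moments(p0 = 0.3, m = rep(1 / K, G), I = I)
    sigma_pi <- sqrt((mom$vG + (I - 1) * mom$cG) / I)
    thr      <- mom$pG + qnorm((1 - alpha)^(1 / K)) * sigma_pi / sqrt(U)
    # Retain feature k whenever pi_hat[k] > thr.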

Eradication
It should be clear that useful features give their models higher fitness, become more prevalent in their populations, and, as a consequence, obtain higher importance scores. The distribution of $\hat{\pi}$ is derived under the hypothesis that all features are unrelated to the target variable, but model selection makes sense only when some features are useful predictors.
We introduce an eradication strategy, which consists of performing multiple runs of PGA, after which important features are forced into all future models, hence eradicating the zero gene for those positions. After the first run of PGA, variables that exceed the threshold in Equation (6) are deemed important. Let $S_1$ be the set of indices of the retained variables. A new set of universes is then generated where $b_{ikgu} = 1$ for all $k \in S_1$. In other words, features in $S_1$ are systematically active and not prone to mutation. In practice, this means that we now run a PGA with $K - \#S_1$ potential features and that we use this new number of features to compute the threshold in Equation (6). Any features that exceed the threshold are then added to the set of forced features to form $S_2$, the union of $S_1$ with that new list of indices. Additional PGAs add variables to the model until all remaining features are within the bounds defined by Equation (6). At that stage, the argument of symmetry in Section 3.3 holds, which means that there are no other useful features.
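A skeleton of the strategy could look as follows in R, with run_pga standing for a hypothetical wrapper that runs one PGA on the free features and returns their importance scores together with the matching threshold.

    eradicate <- function(run_pga, K, alpha = 0.05) {
      forced <- integer(0)
      repeat {
        free <- setdiff(seq_len(K), forced)
        if (length(free) == 0) break
        res  <- run_pga(free, forced)        # pi_hat and thr for the free features
        hits <- free[res$pi_hat > res$thr]   # features exceeding Equation (6)
        if (length(hits) == 0) break         # symmetry holds; stop eradicating
        forced <- union(forced, hits)
      }
      sort(forced)
    }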
Remark 3. The default eradication strategy described above adds multiple features at each step. Features could also be added one at a time: the feature with the largest importance gets added at each step if it exceeds the threshold of Equation (6). The computational cost then increases significantly, especially if a large number of features are active in the final model, as each retained feature requires one run of a PGA.

Remark 4. Multiple runs of a PGA involve an increased number of tests. One could modify the level $\alpha$ to account for this. In our numerical examples, we keep $\alpha$ fixed through the multiple runs of a PGA for two reasons. Unless we add features one at a time, only a few iterations of a PGA are typically required. Also, the parameter $\alpha$ provides some form of control over the complexity of the chosen model, and an exact significance level is not as critical here as it is in inference.

Extension to Generalized Linear Models
Let us consider the generalized linear model (GLM) as described in McCullagh & Nelder (1989) for a response variable $Y$ that follows an exponential family, whose density may be written as

$f(y; \theta, \phi) = \exp[ \{ y\theta - b(\theta) \} / a(\phi) + c(y, \phi) ]$,

where $\theta$ is called the canonical or location parameter, $\phi$ the dispersion parameter, and the functions $a$, $b$, and $c$ are distribution-specific known functions. Following simple calculations, as per Section 2.2.2 of McCullagh & Nelder (1989), the following expressions for the mean and variance may be derived: $E(Y) = \mu = b'(\theta)$ and $\mathrm{var}(Y) = b''(\theta) a(\phi)$. These equations are used to find the variance function $V(\mu) = \mathrm{var}(Y)/a(\phi)$, which must be expressed as a function of $\mu$.
Actual values of those functions for known distributions may be found in different references, including Table 2.1 of McCullagh & Nelder (1989).
In the definition of the GLM, a link function $\ell$ connects an independent observation $Y$ from an exponential family with a linear combination $\eta$ of the predictors. Reversing the expression yields $\mu = h(\eta)$, where $h = \ell^{-1}$ is the inverse link function. With the convention that $h$ is applied componentwise to a vector, we can write $\mu = h(X\beta)$ for a model with all features, where $\beta = (\beta_0, \beta_1, \dots, \beta_K)^\top$ is the vector of parameters of the model. For some families, $\phi$ may be a known constant, but in other cases it is a nuisance parameter.
For an exponential family GLM, the estimate of $\beta$ may be found through maximum likelihood estimation (MLE), retaining the usual properties thereof. In the context of this article, we only consider canonical link functions, which provide additional simplifications, including the fact that $\theta = \eta$.
Moving to the context of PGA with the notation previously introduced, universe $u$ has the training sets $\tilde{Y}_u$ and $\tilde{X}_u$, and the $k$ active features in $b$ are encoded through the selection of appropriate columns in $B$. The maximum likelihood equation is then based on the relation $E(\tilde{Y}_u) = h(\tilde{X}_u B B^\top \beta)$, which depends only on a subset of $k+1$ elements of $\beta$, namely $B^\top \beta$. The MLE $\hat{\beta}_u(b)$ is therefore an asymptotically normal vector of $k+1$ elements with limiting mean $B^\top \beta$ and variance $\{ B^\top \tilde{X}_u^\top \tilde{V}_u(b) \tilde{X}_u B \}^{-1}$, where $\tilde{V}_u(b)$ is a diagonal matrix with the variance function $V$ applied componentwise to $\tilde{\mu}_u(b) = h(\tilde{X}_u B B^\top \beta)$ on its diagonal. The dependence on $b$ does not change the dimension of $\tilde{V}_u(b)$, but it affects the values therein. The same applies to $\breve{V}_u(b)$, which is based on $\breve{\mu}_u(b) = h(\breve{X}_u B B^\top \beta)$. The distributional result follows from the properties of the MLE and some algebra to determine the Fisher information for $B^\top \beta$. Applying the chain rule for second-order derivatives helps those calculations, which are further simplified by properties emerging from the choice of the canonical link. In practice, $\mu$ may need to be estimated by replacing $B^\top \beta$ with its MLE, $\hat{\beta}_u(b)$.
Building toward a fitness function based on a rescaled RSS, we may write the out-of-sample predictions as

$\hat{Y}_u(b) = h(\breve{X}_u B \hat{\beta}_u(b))$.

From an application of the multivariate delta method (see Section 3.4 of DasGupta, 2008), $\hat{Y}_u(b)$ is asymptotically multivariate normal with mean $\breve{\mu}_u(b)$ and variance

$\breve{V}_u(b) \breve{X}_u B \{ B^\top \tilde{X}_u^\top \tilde{V}_u(b) \tilde{X}_u B \}^{-1} B^\top \breve{X}_u^\top \breve{V}_u(b)$.

The Jacobian of the inverse link function appears as $\breve{V}_u(b)$ because of their equality resulting from the use of the canonical link. The unscaled residuals $\varepsilon_u(b) = \breve{Y}_u - \hat{Y}_u(b)$ then have mean 0 and variance $\Sigma_u(b)$, which is equal to

$\Sigma_u(b) = a(\phi) [ \breve{V}_u(b) + \breve{V}_u(b) \breve{X}_u B \{ B^\top \tilde{X}_u^\top \tilde{V}_u(b) \tilde{X}_u B \}^{-1} B^\top \breve{X}_u^\top \breve{V}_u(b) ]$,   (7)

which will be required up to a multiplicative constant only, and hence the nuisance parameter (or constant) $\phi$ may be ignored. Pearson's $\chi^2$ statistic uses a sum of squared rescaled residuals to measure the goodness of fit. With out-of-sample validation, Equation (7) provides an appropriate factor. For GLM, we therefore use the same fitness function shown in Equation (2), but with the modified values for $\varepsilon_u(b)$ and $\Sigma_u(b)$, which must be calculated up to a multiplicative constant.
Remark 5. The identity in Remark 1 and some algebra may be used to express the term in brackets in Equation (7) as the inverse of

$\breve{V}_u(b)^{-1} - \breve{X}_u B \{ B^\top ( \tilde{X}_u^\top \tilde{V}_u(b) \tilde{X}_u + \breve{X}_u^\top \breve{V}_u(b) \breve{X}_u ) B \}^{-1} B^\top \breve{X}_u^\top$,

which allows evaluating the fitness with one less matrix inversion, considering that inverting the diagonal matrix $\breve{V}_u(b)$ is much less computationally intensive than inverting a comparable full matrix.
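For logistic regression (canonical logit link, so $V(\mu) = \mu(1-\mu)$), the GLM fitness may be sketched in R as below, using Remark 5 to avoid the explicit inversion of the bracketed term of Equation (7); again, this is an illustration under our assumptions rather than the code used for the experiments.

    fitness_logit <- function(b, X_train, y_train, X_valid, y_valid, lambda = 1.5) {
      cols <- c(TRUE, b == 1)                  # intercept plus active features
      Xt   <- cbind(1, X_train)[, cols, drop = FALSE]
      Xv   <- cbind(1, X_valid)[, cols, drop = FALSE]
      beta <- glm.fit(Xt, y_train, family = binomial())$coefficients
      mu_t <- plogis(drop(Xt %*% beta))
      mu_v <- plogis(drop(Xv %*% beta))
      eps  <- y_valid - mu_v                   # unscaled residuals
      Vt   <- mu_t * (1 - mu_t)                # variance function, training
      Vv   <- mu_v * (1 - mu_v)                # variance function, validation
      # Remark 5: inverse of the bracketed term in Equation (7)
      A    <- diag(1 / Vv) -
              Xv %*% solve(crossprod(Xt, Vt * Xt) + crossprod(Xv, Vv * Xv), t(Xv))
      drop(t(eps) %*% A %*% eps)^(-lambda)
    }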

Notable Differences with PGA
While we adopt the parallel populations of Zhu & Chipman (2006), some of the choices they make in the hyperparameters of their genetic algorithms are not suitable for EPGA, as they violate the assumptions that we use to derive the null distribution of the importance scores. That is, the following elements are different.

Crossover: Zhu & Chipman (2006) use a one-point crossover but, as we mentioned in the description of crossover, our preference is for uniform crossover to make the arbitrary order of the features irrelevant.
Early stopping: Using a finite, predetermined number of generations yields importance scores that are averages of $U$ independent values of equal variance. Early stopping means that the number of generations depends on the diversity of each population. While the theoretical complications are significant, the benefits of early stopping are less clear.
Elitism: Zhu & Chipman (2006) copy the fittest half of a generation directly to the following generation. The derivation of the null distribution for the importance scores assumes no such elitism.
Selection: At each generation of Zhu & Chipman (2006), half the population survives, and all parents come from that survival pool. The expected value and variance that we derive for the importance score use a different mechanism in which all individuals may be parents and their probability of selection is proportional to their fitness.

Fitness function: To enhance the symmetry between all possible models under the null, we use out-of-sample validation. One challenge with in-sample measures of fitness is to ensure that they do not display a systematic preference (e.g., a bias towards models with more active features).
In Section 4, case studies and Monte Carlo methods are used to explore the behaviour of the importance scores under the null hypothesis.

EMPIRICAL EXPLORATION
We use real datasets and Monte Carlo studies to explore the behaviour of EPGA in practice. The definition of the "best" model is subject to debate and could be based, for instance, on the predictive abilities of the selected model according to different metrics. We prefer to adopt the same view as Zhu & Chipman (2006) and look at the ability of the method to detect the true features, which we call activated features in the context of the simulations. Those should, in turn, yield the best possible performance in realistic scenarios. We also designed part of the exploration to empirically verify the distribution of $\hat{\pi}$ derived in Section 3.3.
Although the scope of this article is to present the objective threshold that we developed for the approach of Zhu & Chipman (2006), the simulations include one common classical method as a point of comparison. For all the scenarios that we present, we also provide the results obtained with the step function in R, which implements an AIC-based stepwise selection approach.
PGAs have heavy computational requirements, but they can be parallelized. Our implementation splits the $U$ universes among eight threads. Reported runtimes were obtained with an Intel i7-3770 CPU with 32 GB of RAM running under Ubuntu 20.04.4 LTS. Times are provided for the step function as a reference, but comparisons should be made carefully, as the time complexity of genetic algorithms is proportional to $UIG$ while the AIC-based stepwise algorithm has no such hyperparameters.

Illustrative Data
In the illustrative example of Zhu & Chipman (2006), 20 features, labelled $x_1$ to $x_{20}$, are simulated as independent normal variates with mean 0 and variance 1. We took the liberty of increasing the sample size from 40 to $n = 200$ points, which were generated along with the response through the linear model

$Y = \beta_5 x_5 + \beta_{10} x_{10} + \beta_{15} x_{15} + \epsilon$,   (8)

where the $\epsilon$ are independent normal variates with mean 0 and variance 1. We used the same hyperparameters as Zhu & Chipman (2006) for the genetic algorithm, namely $I = 20$, $G = 8$, $U = 25$, and $m_g = 1/K$ for all $g$. The value of the activation rate does not seem to be reported in their paper; we used $p_0 = 0.3$. The validation set has $\breve{n} = 40$, which leaves $\tilde{n} = 160$ for training.
Figure 2 displays the boxplots of the importance scores obtained by one run of PGA on each of 1000 datasets generated following Equation (8). The horizontal solid line shows $p_G$, the expected importance. On the left panel, setting all regression coefficients to zero in the same equation when simulating $Y$ yields the null model. The solid dots show the empirical 95th quantile of the importance scores for each feature and can be compared to the dashed line, which shows their theoretical values under the null. On the right panel, features 5, 10, and 15 are activated, and the dashed line is the threshold used to determine which features to select after one run of PGA, taking into consideration the correction for multiple comparisons.
Under the null, both the expected value and the 95th quantile of the importance scores match their theoretical values remarkably well. When a signal is present, we note that the correct relevant variables are always detected, but occasionally other features get selected too, at rates between 0.005 and 0.017.
The eradication strategy described in Section 3.4 involves running additional PGAs where important features are sequentially forced into the genes of the populations. EPGA found the three active variables for all of the 1000 generated datasets. As reported in Table 1, additional variables appeared wrongly in the selected models at a rate between 0.013 and 0.032. The alternative strategy of adding variables one by one with EPGA also found the three activated features in all the 1000 samples, and the rate of occurrence of inactive variables decreased to values between 0.001 and 0.012. The latter strategy, however, involves a higher computational cost.
In terms of computation time, one run of PGA takes about 0.78 s on the machine that we described previously. EPGA requires 1.85 s when adding variables in batches but 3.17 s when they are added one by one. In comparison, running the step function takes about 0.11 s. While the step function found the three active features for all 1000 repetitions of the simulation, it also wrongly included inactive features at rates between 0.155 and 0.189, showing a much poorer performance than the genetic contenders.
In real data, covariates are typically correlated. We ran the same simulation where the covariates $x_1$ to $x_{20}$ are multivariate normal with a compound symmetry covariance matrix. Pearson correlation values $\rho$ of 0.7 and 0.9 were used. Under the null, the plots of importance scores, which we do not reproduce here, are indistinguishable from the left panel of Figure 2. Table 1 shows the frequency of selection of the different features under the two compound symmetry scenarios. The rates at which inactive features are wrongly included in the models remain unchanged despite a correlation as high as 0.9. In this latter scenario, the contribution of the weakest feature, $x_5$, sometimes remains undetected.

Illustrative Data for Logistic Regression
We reuse the illustrative example to demonstrate PGA and EPGA for a GLM. The target variable is Bernoulli-distributed with a probability of success based on the logit of Equation (8) without its error term. A logistic regression is then fitted to samples of size $\tilde{n} = 320$, with an additional $\breve{n} = 80$ observations kept for validation. Figure 3 shows boxplots of the importance scores obtained.
On the left panel, shuffling the values of $Y$ before fitting the logistic regression simulates the null hypothesis. We note that the empirical 95th quantiles match their theoretical values very well.

TABLE 1: Frequency of selection (out of 1000 repetitions) for the different features under the "illustrative example" scenario when the covariates are multivariate normal with a compound symmetry covariance and different levels of correlation $\rho \in \{0, 0.7, 0.9\}$. To summarize the frequency of selection of the 17 inactive features, only their minimum and maximum values are reported.
FIGURE 3: Importance scores obtained from the PGA of 1000 different samples generated from the GLM version of the "illustrative example" scenario. The left panel shows the importance scores under the null hypothesis, where $Y$ is a binary response independent of the features. The right panel shows the importance scores when the binary response comes from the logit of Equation (8) without its error term. While the solid line shows the expected importance $p_G$, the dashed line displays the threshold $p_G + z_{0.95} \sigma_{\pi} / \sqrt{U}$ on the left panel and its Šidák-adjusted equivalent on the right panel. The dots show the empirical 95th quantile of the importance scores for each variable to compare against that theoretical value. The boxplots of the activated features are in grey.

A Harder Problem
Let us now consider the hard problem that Zhu & Chipman (2006) attribute to George & McCulloch (1993). We preserved the sample size of 120 for the training sets by simulating samples of 150 observations, of which 20% are dedicated to validation. A total of 60 features of variance 2 with a compound symmetry correlation structure are simulated; a correlation of 0.5 is present between any two predictors. The error term has mean 0 and variance 4. The regression parameters take four different values, with $\beta_k = \lfloor (k-1)/15 \rfloor$, making 45 of the 60 features activated. Figure 4 displays the boxplots of the importance scores of 1000 samples, under the null and with a signal, respectively. The hyperparameters of the genetic algorithm were set to $I = 60$, $G = 15$, $U = 100$, $p_0 = 0.9$, and $m_g = 1/K$ for all $g$.
Under the null, the empirical 95th quantiles match their theoretical values fairly well, as does the expected importance. On the right panel, the 45 activated features are all found, with very few exceptions where the smallest signals are left out (less than 4% of the time). The inactivated variables sometimes get picked up, but not more often than under the null, with a false positive rate of about 3% when using the Šidák-corrected level. These results are as good as those in Figure 3 of Zhu & Chipman (2006). Interestingly, they require a large $p_0$. Smaller values show excellent performance as well, but they are less stellar for variables 16-30, which see their error rates increase. The ideal $p_0$ seems linked to the unknown true number of active features. If the initial population has too few or too many active features, the genetic algorithm is more likely to miss its target. In the absence of information about the likely number of features, comparing the results obtained with large and small values of $p_0$ may be a good option. We ran the PGA of Zhu & Chipman (2006) using their crossover and selection mechanisms, but with a smaller $p_0$. Intuitively, their one-point crossover could help in detecting the inactivated features, which all come first. Empirically, this did not seem to be a driving factor, as we observed a similar decrease in performance for smaller values of $p_0$.
The results from EPGA on 1000 samples with either the bulk or the one-by-one approach show that variables $x_{31}$ to $x_{60}$ belonged to the selected model 100% of the time. For variables $x_{16}$ to $x_{30}$, the rates of inclusion ranged from 0.979 to 1. For variables $x_1$ to $x_{15}$, the rates of inclusion were systematically lower for the one-by-one approach: they ranged from 0.047 to 0.085 for each of the 15 variables, compared to 0.064 to 0.105 when EPGA adds variables in bulk. With 60 variables, the one-by-one approach is, however, much slower, as more than 45 runs of PGA are required to get a final answer, taking more than 1800 s compared to 113 s when the variables are added in bulk. A single run of PGA takes about 39.5 s and, for reference, the step function took 0.50 s to run. This classical method found all the active variables at all times but also incorrectly included inactive features at rates between 0.254 and 0.298.

Pollution Data
Let us consider the pollution data used in Example 4.1 of Zhu & Chipman (2006) and based on data used by Miller (2002) but initially published by McDonald & Schwing (1973). Let us assume that the three-variable model suggested in Figure 1 is correct. The same genetic hyperparameters are used as in Figure 1, namely $I = 50$, $G = 20$, $U = 75$, $p_0 = 0.3$, and $m_g = 1/K$ for all $g$. The right panel of Figure 5 displays the boxplot of the importance scores of each feature obtained from 1000 datasets with the original values of the covariates and a new $Y$ generated from a linear regression model where all regression coefficients equal 0 except for $\beta_0 = 796.5$, $\beta_1 = 2.347$, $\beta_9 = 2.961$, and $\beta_{14} = 0.3911$. The standard deviation of the error term is 38.58. Those parameters were determined from a fit on the entire original dataset and then deemed true values for the simulation. The lines and solid points have the same meaning as in Figure 2 presented in Section 4.1. The left panel under the null is obtained by randomly permuting the values of $Y$.

The theoretical mean under the null matches the empirical median closely but, contrary to the illustrative example, we observe discrepancies between the theoretical and empirical values of the 95th quantiles. While the correlation between the features and the presence of some outliers may play a role, an investigation led us to conclude that the small sample size was the real culprit here. We first ran the same simulation but with features simulated as multivariate normal with the same mean and covariance as the original data; the ensuing plot was qualitatively identical to Figure 5. The small sample size causes even smaller training and validation sets, which are a lot more prone to spurious correlations, and the assumption of equal fitness fails within some universes. Although the variables are equally likely to be favoured by chance, they are globally picked more often in their respective wrongly favourable universes, hence the increased importance scores. In cases where the sample is small, it may be advisable to consider a fitness function that does not rely on an out-of-sample validation set. Our distributional results depend on the approximate equality of the fitness under the null, not on the specific choice that we propose. As we may expect from the right panel of Figure 5, EPGA manages to detect the activated features most of the time. For the bulk approach, variables 1, 9, and 14 appear in the selected model at rates 0.952, 0.982, and 0.993, respectively. Other variables do appear at rates ranging from 0.256 to 0.542 as a consequence of their artificially increased importance. The one-by-one approach selects variables 1, 9, and 14 at rates 0.908, 0.895, and 0.966 but also includes the other variables at rates ranging from 0.218 to 0.256.
For the Pollution data, the step function finds variables 1, 9, and 14 in proportions 0.903, 0.897, and 0.923, respectively, out of 1000 repetitions. It also wrongly includes inactive features at rates between 0.223 and 0.343, which means that it also struggles with this trickier dataset. While each run of the step function requires only 0.053 s of runtime, a run of PGA requires 2.07 s, and EPGA requires 6.4 s when variables are added in batches or 13.8 s when they are added one by one.

TUNING OF HYPERPARAMETERS
Genetic algorithms depend on many hyperparameters whose values must be selected somewhat arbitrarily, and PGA is no exception. Experience helps in choosing appropriate values, as do occasional simulations to explore the behaviour of the method. There are, however, general guidelines that may help.

Fitness Function
Under the null hypothesis that no features are useful to predict the target variable, the fitness function must yield an approximately equal probability of selection for all possible models. The construction of Equation (2) was designed with this property in mind, and the Monte Carlo studies in Section 4 verified that the assumption is valid in realistic settings. A similar empirical endeavour could contribute to validating newly proposed fitness functions. As long as this property stands, the choice of fitness has no bearing on the distribution of selection ratios under the null hypothesis. Of course, the ability of the fitness score to identify desirable models outside of the null hypothesis is also key to good performance. The peakedness parameter $\lambda$ highlights fitter individuals further, thus helping to speed up evolution. As such, it may help when computational resources are limited. A large $\lambda$ may, however, also highlight naturally occurring fluctuations, so larger values are not systematically better. While the reported results used $\lambda = 1.5$, we also experimented with values as high as $\lambda = 4$, which reminded us of the ARCX4 algorithm of Breiman (1996), but a fourth power did not have the same beneficial effect here.

Rates
The generation and evolution of populations depend on random elements that appear at some manually set rates.
Activation rate ($p_0$): The value of $p_0$ affects the expected value of the importance scores. One should avoid values close to 0 or 1, as they lower the diversity of the initial population, which must then rely on mutations to explore numerous models. A rate of $p_0 = 0.5$ appears to be a good choice that maximizes initial diversity but, in practice, the best results occur when $p_0$ is close to the true proportion of activated features in the models. With prior information about the expected number of active features, it makes the most sense to set $p_0$ in the neighbourhood of that value. Since models may often be sparse, smaller values of $p_0$ then appear more appropriate.
Mutation rates ($m_g$, $g \in \{1, \dots, G\}$): Mutation breaks stagnation by infusing diversity into each generation and helps in exploring the space of all candidate models, but it also precludes the convergence of a population. A lack of convergence is detrimental for single-thread genetic algorithms, but it helps PGA, which seeks diversity between the universes and looks for a common signal in the aggregation of all individuals. With a finite number of generations, a large mutation rate is appropriate, but it should stay far below $m_g = 1/2$, which would lend dominance to randomness rather than heredity. As Zhu & Chipman (2006) pointed out, many instances in the literature suggest $m_g = 1/K$, which we also found to work well.
Although we allow for a changing rate of mutation, this appears to be especially appropriate for single-thread genetic algorithms, where diversity has value in the first generations but convergence eventually requires low mutation rates. Hence, for PGA, we typically kept the mutation rate constant.

Number of Individuals, Universes, and Generations
The computational cost of a PGA depends highly on the number of individuals generated, namely $UIG$. If we denote by $\tau(n, K)$ the time cost of fitting one model and assessing its fitness, then the time cost of one run of PGA should be $O\{UIG \, \tau(n, K)\}$.
Number of parallel universes ($U$): Universes are conditionally independent by the design of the algorithm, and a larger number of universes improves the accuracy of the selection ratios.
To justify the central limit theorem, at least 30 universes are advised, but that number could be far greater if the computational resources are available. Note also that the computation of the universes is embarrassingly parallel, so it provides a trivial way to scale the algorithm on multicore platforms.
Number of generations ($G$): A large number of generations may be detrimental because the covariance between two individuals ($c_g$) will tend to increase as $g$ increases, thus yielding a larger threshold. On the other hand, too few generations might not allow sufficient time for the genetic operators to detect the best models under the alternative, when some features are good predictors. To complicate things further, the speed of convergence to fit individuals also depends on the probability of activation and the mutation rates, especially if the mutation varies with $g$. Values of 10-15 appeared to work well for $G$.

Population size ($I$): Contrary to intuition, larger population sizes do not yield better results. In a vast population, the probability of selection of an individual twice as fit as anybody else may be so diluted that its genes will not survive to the next generation. In a smaller population, its relative competitive advantage is much more significant. The population size $I$ should thus be kept as small as possible, but not to the point of being detrimental to the exploration of the possible models. Zhu & Chipman (2006) suggest having a population size of the same magnitude as the dimension of the data ($K$), and we found that this guideline worked well, although $I < K$ also has merits when the number of dimensions gets large.
With the total computational cost of PGA being $O\{UIG \, \tau(n, K)\}$, it appears advisable to find a moderate value of $I$ and then make $U$ as large as possible.

CONCLUSION
With independent universes that can be generated on different machines, the parallel genetic algorithm of Zhu & Chipman (2006) is an "embarrassingly parallel" approach to feature selection. It computes importance scores for each variable from a summary of all models in several universes that are purposefully not fully evolved, to enhance diversity.
We derived the distribution of the importance score when none of the covariates is a good predictor, which yields an objective criterion for variable selection. Simulations show that the distributional results hold well under the null hypothesis for both linear regression and logistic regression. Model selection is most useful when some features are good predictors. We introduced eradication strategies that involve a sequence of PGAs where features previously deemed important are forced into all future models. For the last iteration, none of the remaining features is a good predictor; hence, the distribution we derived for the importance scores applies. Simulations showed that EPGA has an excellent ability to detect activated features.
The distribution of the importance scores under the null depends not on the actual fitness function but on its ability to assess models fairly. The out-of-sample RSS that we proposed works well for linear regression. Other fitness functions are indeed possible and could be the topic of future research. More importantly, our results apply to feature selection for any model, linear or not, as long as an appropriate fitness function is used.

FIGURE 4: Importance scores obtained from the PGA of 1000 different samples generated from the same "hard problem" scenario. The left panel shows the importance scores under the null hypothesis, where $Y$ is independent of the features. The right panel shows the importance scores when $\beta_k = \lfloor (k-1)/15 \rfloor$ for feature $k$. While the solid line shows the expected importance $p_G$, the dashed line displays the threshold $p_G + z_{0.95} \sigma_{\pi} / \sqrt{U}$ on the left panel and its Šidák-adjusted equivalent on the right panel. The solid dots show the empirical 95th quantile of the importance scores for each variable to compare against that theoretical value. The boxplots of the activated features are in grey.

FIGURE 5: Importance scores obtained for the PGA of 1000 different samples generated from the same scenario based on the Pollution dataset. The left panel shows the importance scores under the null hypothesis, where $Y$ is independent of the features. The right panel shows the importance scores when the signal depends on covariates 1, 9, and 14. While the solid line shows the expected importance $p_G$, the dashed line displays the threshold $p_G + z_{0.95} \sigma_{\pi} / \sqrt{U}$ on the left panel and the Šidák-adjusted equivalent on the right panel. The solid dots show the empirical 95th quantile of the importance scores for each variable to compare against that theoretical value. The boxplots of the activated features are in grey.
FIGURE 2: Importance scores obtained from the PGA of 1000 different samples generated from the same "illustrative example" scenario. The left panel shows the importance scores under the null hypothesis, where $Y$ is independent of the features. The right panel shows the importance scores when the signal described in Equation (8) is present. While the solid line shows the expected importance $p_G$, the dashed line displays the threshold $p_G + z_{0.95} \sigma_{\pi} / \sqrt{U}$ on the left panel and its Šidák-adjusted equivalent on the right panel. The dots show the empirical 95th quantile of the importance scores for each variable to compare against the theoretical value. The boxplots of the activated features are in grey.