Error‐controlled feature selection for ultrahigh‐dimensional and highly correlated feature space using deep learning

Deep learning has been at the center of analytics in recent years due to its impressive empirical success in analyzing complex data objects. Despite this success, most existing tools behave like black‐box machines, thus the increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature‐selected deep learning has emerged as a promising tool in this realm. However, the recent developments do not accommodate ultrahigh‐dimensional and highly correlated features or high noise levels. In this article, we propose a novel screening and cleaning method with the aid of deep learning for a data‐adaptive multi‐resolutional discovery of highly correlated predictors with a controlled error rate. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum.


Introduction
In modern applications (e.g., genetics and imaging studies), the investigator is often interested in uncovering the true pattern between a quantitative response and a large number of features. The key working assumption, oftentimes, is that there is an underlying sparsity pattern buried in the high-dimensional data setting. Selecting the essential features aids further scientific investigation by offering improved interpretability and explainability, reduced computational cost for prediction and estimation, and less memory usage, since lower-dimensional manifolds of the feature space are estimated. Under the linear model (LM) framework, this problem has been extensively studied over the past few decades, producing popular algorithms such as the Lasso, Elastic net, SCAD, and MCP. A detailed review of this literature can be found in Fan and Lv [12] and is thus omitted here. However, regardless of their ubiquitous applications, LMs have limited usage, especially when the underlying mechanism is highly nonlinear, with potential interaction effects. Relaxing the linearity assumption, Artificial Neural Network (ANN) models are well known for efficiently approximating complicated functions. From an information-theoretic viewpoint, [10] established that deep neural networks (DNNs) provide an optimal approximation of a nonlinear function, covering a wide range of functional classes used in signal processing. This property has promoted the use of Deep Learning (DL) models for feature selection, an approach that has generated much research interest over the past few years. A major caveat, however, is that DL models are often used as a black box in many applications. Following the intriguing arguments in [32], caution must be exercised regarding the application of DL models for decision-making in real-world problems. Employing only the relevant predictors to construct a predictive model is obviously a step in the right direction toward explainable machine learning. However, as suggested in [14], the feature importance in DL-based algorithms often varies drastically under small perturbations in the input or in the presence of added noise.
As a solution to this problem, we focus on reproducible nonlinear variable selection using DL models with error control. We adopt the False Discovery Rate (FDR), first proposed by [3] and well suited for large-scale multiple testing problems. To formally define the FDR, consider the random variable FDP, the False Discovery Proportion: FDP = (e_0 / N_+) 1(N_+ > 0), where e_0 is the number of falsely selected variables and N_+ is the total number of discoveries. The FDR is then defined as FDR = E(FDP). Estimating this expectation poses a unique challenge for the model-free variable selection problem, which many authors have tried to solve from various perspectives. For example, a p-value approach has been proposed as a feature importance criterion in the multiple testing literature; see [38, 42, 23, 21] for a more detailed overview. However, for DL models, generating interpretable p-values is still an open research problem. To circumvent this limitation, the knockoff framework was proposed by [7]. Essentially, this is a model-free variable selection algorithm with provable FDR control, assuming one has prior knowledge of the predictors' distribution. [28] further proposed the DeepPINK algorithm by integrating the knockoff framework with a DL architecture for improved explainability of DL models. However, in real-world applications, the predictors' distribution must be estimated to generate the knockoff variables, which adds another layer of uncertainty to the analysis. Recently, [1] showed that the knockoff framework might yield inflated false discoveries, consistent with the error incurred in estimating the predictors' distribution. This problem is exacerbated for highly correlated features. An empirical illustration is provided in Section 3 of the supplementary material (SM), showing how the model-X knockoff [7] typically fails to control the FDR under a simplistic setting with high multicollinearity. In some cases, it may be possible to have prior knowledge of the correlation pattern among the features. For example, in genetics studies, the notion of linkage disequilibrium helps specify the dependence pattern among alleles at polymorphisms ([33]). In other domain sciences, however, this information is typically unavailable, so any model-specific knockoff generation [7, 33] would be inefficient in those contexts. Recently, DL-based flexible knockoff-generating algorithms have been proposed [27, 18, 31]; however, they are trained in a typical big-n-small-p setting, and it is unclear how they perform when the sample size n is significantly smaller than the covariate dimension p and the predictors are highly correlated. We next discuss the multicollinearity issue in detail.
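As a quick illustration of the definitions above, the FDP of a selected set and a Monte-Carlo estimate of the FDR can be computed as follows (a minimal sketch; the index sets below are toy examples, not data from the paper):

```python
import numpy as np

def false_discovery_proportion(selected, true_support):
    """FDP = e_0 / N_+ when N_+ > 0, and 0 when nothing is selected."""
    selected, true_support = set(selected), set(true_support)
    n_plus = len(selected)            # N_+ : total number of discoveries
    if n_plus == 0:
        return 0.0
    e0 = len(selected - true_support) # e_0 : falsely selected variables
    return e0 / n_plus

# FDR = E(FDP); estimated here by averaging the FDP over replications.
fdps = [false_discovery_proportion([1, 2, 7], [1, 2, 3]),  # one false discovery of three
        false_discovery_proportion([1, 3], [1, 2, 3])]     # no false discoveries
fdr_estimate = float(np.mean(fdps))
```

Averaging the FDP over independent replications is the standard empirical surrogate for the expectation defining the FDR.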
In many modern high-dimensional datasets arising in genetics and imaging studies, the other challenge is extreme multicollinearity: the predictors are typically correlated among themselves in a complex manner, often with pairwise sample correlations exceeding 0.99. Because extremely correlated features become almost indistinguishable, in a regression setup it would be unrealistic to claim that only one feature of a cluster is associated with the outcome. Hence, accounting for this uncertainty, it is pragmatic to aim for group-level variable selection and claim that at least one variable from a densely correlated group is important for the outcome. In this context, the term 'true discovery' implies that the selected cluster serves as a good proxy for at least one element of the true index set of significant features. A complication of this approach, however, is that the notion of FDR becomes non-trivial. For this reason, following [36], we adopt the cluster version of the FDR, the expected value of the "proportion of clusters that are falsely declared among all declared clusters", denoted cFDR henceforth. Looking at the extreme multicollinearity problem from a slightly different angle, several algorithms have been proposed in the hierarchical testing literature, including CAVIAR [17], SUSIE [40], and KnockoffZoom [34]. While the knockoff-based procedures are limited by having to generate knockoffs from an unknown distribution with a very small sample size, the other methods lack applicability in nonlinear-nonparametric setups as they typically depend on p-values.
Our contribution. To address the complications in variable selection mentioned above and the unexplored gap in applying DL, we propose SciDNet (Screening & Cleaning Incorporated Deep Neural Network), a novel method for reproducible high-dimensional nonlinear-nonparametric feature selection with highly correlated predictors. The screening step is a dimension reduction step: we screen out most of the null features and create a set of multi-resolution clusters that collectively contain, with high probability, proxy variables for all the truly significant features. In the cleaning step, using a properly tuned DL model under an appropriate resampling scheme, we propose an estimator of the number of falsely discovered clusters and construct a surrogate of the FDR. Finally, we select clusters of highly correlated predictors by controlling this surrogate FDR. To this end, 'FDR observed for SciDNet' implicitly means the value of the cFDR discussed here.
To the best of our knowledge, no other method in the literature accommodates, in a high-dimensional setting, the multicollinearity issue via data-adaptive cluster formation followed by nonlinear-nonparametric controlled feature selection integrated with DL. Our thorough empirical study demonstrates the proposed method's validity as a proof of concept, achieving higher power, controlled FDR, and higher prediction accuracy. The proposed approach does not rely on any modeling assumptions and is completely free of p-values, unlike existing state-of-the-art methods, providing a better understanding of the sparse relationship between the outcome and the high-dimensional predictors. While theoretically guaranteed FDR control in the DL framework is still an active area of research, the current study opens avenues for a rigorous theoretical investigation of its generalizability and broader validity.
The rest of the article is organized as follows. In Section 2, we describe the proposed screening and cleaning method, followed by an extensive simulation study in Section 3 and the analysis of two real-world gene expression datasets in Section 4. Finally, Section 5 concludes with a summary and future directions.

Notation and assumptions
Under the supervised learning framework, let Y denote a continuous response variable and X = (X_1, ..., X_p) denote p continuous covariates. Let F_y(·) denote the CDF of the response Y and F_k(·) the CDF of the predictor X_k. Given a sample of size n, we consider the ultrahigh-dimensional setting where p = O(exp(n^τ)), τ > 0. We assume no specific functional relationship between the outcome Y and the predictors X, but we impose a high-level assumption on the distribution of X. In the spirit of [26], we assume that the predictors follow a nonparanormal distribution; i.e., there exist unknown functions f(X) = {f_j(X_j), j ∈ {1, 2, ..., p}} such that f(X) ∼ N(μ_{p×1}, Σ_{p×p}). The nonparanormal family covers a wide range of parametric distributions, and its main appeal lies in the fact that f(X) preserves the conditional dependency structure of the original variables X. Under the sparsity condition, we assume there exists a subset S_0 ⊂ {1, 2, ..., p} with |S_0| = O(1) such that, conditional on the features in S_0, the response Y is independent of the features in S_0^c. In other words, S_0 = {k : f(y|X) depends on X_k}, where f(y|X) is the conditional density of y given X. Our goal is to learn the sparsity structure by estimating S_0.
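The nonparanormal transformation with a truncated (Winsorized) empirical CDF can be sketched as follows. The truncation level delta_n below is one common choice from the nonparanormal literature and should be treated as an assumption rather than the paper's exact setting:

```python
import numpy as np
from scipy.stats import norm

def nonparanormal_transform(x):
    """Map a univariate sample to approximate standard-normal scores via
    Phi^{-1} of the Winsorized empirical CDF (nonparanormal estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # empirical CDF evaluated at each observation: rank_i / n
    ecdf = (np.argsort(np.argsort(x)) + 1.0) / n
    # Winsorize to avoid Phi^{-1}(0) or Phi^{-1}(1) at the extremes;
    # this delta_n is a standard (assumed) truncation rate
    delta_n = 1.0 / (4.0 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    ecdf = np.clip(ecdf, delta_n, 1.0 - delta_n)
    return norm.ppf(ecdf)

rng = np.random.default_rng(0)
z = nonparanormal_transform(rng.exponential(size=500))  # heavily skewed input
# z is approximately standard normal despite the skewed input
```

Applying this transformation columnwise to (Y, X_1, ..., X_p) yields the (T̂_y(Y), T̂_k(X_k)) scores used in the screening step.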

Screening Step
Under the assumption that the cardinality of S_0 is much smaller than the feature-space dimension p, most of the features belong to S_0^c. Hence, in the screening step, we focus primarily on finding an active set Ŝ_n with |Ŝ_n| << p such that P(S_0 ⊂ Ŝ_n) → 1 as n → ∞. This property, called the sure screening property [11], ensures that all the significant predictors are retained in Ŝ_n, while the other predictors {X_j, j ∈ Ŝ_n^c} are eliminated from the remaining analysis. As these active variables are highly correlated among themselves, in a second step we further cluster them by exploiting their conditional dependency structure.

Finding the active set of variables
To find the active set, we first apply the nonparanormal transformation to (Y, X) and then perform Henze-Zirkler's (HZ) test on the transformed variables. The first step transforms all the variables into jointly Gaussian variables while maintaining their conditional covariance structure; the second step confirms, by pairwise testing, whether there is significant dependence between the transformed response and each predictor. This workflow has been proposed by [26], [16], [43]. The strategy proceeds as follows:

1. Nonparanormal transformation: We first consider the transformation T_y(Y) = Φ^{-1}(F_y(Y)) and T_k(X_k) = Φ^{-1}(F_k(X_k)), k = 1, ..., p, where Φ(·) denotes the CDF of the standard Gaussian distribution. In practice the CDFs of Y and X_k are unknown, and we estimate them by the truncated empirical CDF as suggested by [26]. Henceforth, let (T̂_y(Y), T̂_k(X_k)) denote the corresponding estimated transformations.

2. HZ test: By basic properties of the CDF, (T_y(Y), T_k(X_k)) jointly follows the bivariate Gaussian distribution N_2(0, I_2) if and only if Y is independent of X_k. This can be tested using the HZ test [16], where the test statistic for predictor X_k can be expressed as

w_k = ∫_{R²} |ψ_k(t) − exp(−½ tᵀt)|² φ_β(t) dt,  k = 1, 2, ..., p,

where ψ_k(t) is the characteristic function of (T_y(Y), T_k(X_k)), exp(−½ tᵀt) is the characteristic function of N_2(0, I_2), and φ_β(t) is a Gaussian weight function with smoothing parameter β. The statistic measures the disparity between the joint distribution of (T_y(Y), T_k(X_k)) and N_2(0, I_2), and is expected to be high for the non-null predictors X_j, j ∈ S_0, indicating significant evidence against the independence of the transformed pair. As in practice we proceed with (T̂_y(Y), T̂_k(X_k)), we calculate the empirical HZ test statistic as

w*_k = (1/n²) Σ_{i,j=1}^n exp(−(β²/2) d_ij) − (2/n)(1+β²)^{-1} Σ_{i=1}^n exp(−β² d_i / (2(1+β²))) + (1+2β²)^{-1},   (1)

where d_ij = (T̂_k(x_ki) − T̂_k(x_kj))² + (T̂_y(y_i) − T̂_y(y_j))² and d_i = T̂_k(x_ki)² + T̂_y(y_i)². Consistent with the existing literature, we choose the smoothing parameter β = (1/√2)(1.25n)^{1/6}, which corresponds to the optimal bandwidth for a nonparametric kernel density estimator with a Gaussian kernel in two dimensions ([16]). The observed test statistic w*_k converges to w_k, as shown in [43]. The active set Ŝ_n contains all the predictors significantly correlated with the response marginally. Under very mild regularity conditions on the signal strength of the non-null predictors, namely min_{k∈S_0} w_k ≥ 2cn^{−κ} with c a constant and 0 ≤ κ ≤ 1/4, the screening process enjoys the sure screening property, i.e., P(S_0 ⊂ Ŝ_n) → 1 as n → ∞. More details on the theoretical guarantee can be found in [43]. A common practice is to set the active set size |Ŝ_n| at ν_n = [n/log(n)]. However, as we further cluster the active variables in the next step, our proposed method is fairly robust with respect to |Ŝ_n| as long as most of the significant variables are retained; we therefore propose to select a slightly bigger active set with size proportional to ν_n.
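A minimal sketch of the empirical HZ statistic, assuming the transformed scores are already available. The normalization (without an extra factor of n) matches the form in which w*_k converges to the population functional w_k; implementations elsewhere may differ by that factor:

```python
import numpy as np

def hz_statistic(ty, tk):
    """Empirical Henze-Zirkler statistic for the bivariate sample (ty_i, tk_i),
    measuring its disparity from N2(0, I2); a sketch of the standard HZ form."""
    z = np.column_stack([ty, tk])
    n, d = z.shape                                   # d = 2 here
    # optimal Gaussian-kernel bandwidth: (1/sqrt(2)) * ((2d+1)n/4)^{1/(d+4)}
    beta = ((2 * d + 1) * n / 4.0) ** (1.0 / (d + 4)) / np.sqrt(2.0)
    d_ij = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    d_i = (z ** 2).sum(-1)                                 # squared norms
    term1 = np.exp(-0.5 * beta**2 * d_ij).sum() / n**2
    term2 = 2.0 * (1 + beta**2) ** (-d / 2) * np.exp(
        -beta**2 * d_i / (2 * (1 + beta**2))).sum() / n
    term3 = (1 + 2 * beta**2) ** (-d / 2)
    return term1 - term2 + term3

rng = np.random.default_rng(1)
w_null = hz_statistic(rng.standard_normal(300), rng.standard_normal(300))
x = rng.standard_normal(300)
w_dep = hz_statistic(x, 0.9 * x + 0.1 * rng.standard_normal(300))  # dependent pair
# w_dep is markedly larger than w_null, flagging the dependent predictor
```

Screening then amounts to computing w*_k for every predictor and keeping the top ν_n (or a proportional number of) predictors.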

Clustering the active predictors using the precision matrix
Algorithm 1: Finding clusters and their representatives. Estimate the precision matrix Σ̂^{-1} = (σ̂^{ij})_{i,j ∈ {1,2,...,p}} using nodewise Lasso, then define the clusters as described below.

As the Henze-Zirkler test focuses on the (pairwise) marginal correlation between predictors and response, it typically includes null predictors strongly associated with a significant predictor; such predictors are thus highly correlated among themselves. Hence, to reduce the high correlation within the active set, our strategy is to exploit the conditional dependency structure and divide the active variables {X_j : j ∈ Ŝ_n} into p_c (<< p) non-overlapping clusters C_1, C_2, ..., C_{p_c}. By the sure screening property, with asymptotically high probability, S_0 ⊂ ∪_{j=1}^{p_c} C_j. The use of a sparse precision matrix to understand the dependence structure of a high-dimensional feature space is well established in the statistics literature (e.g., [19], [35]) due to its scalability, and in some contexts it brings more insight than an analysis of the simple covariance matrix. For example, in the human brain, two separate regions can be highly correlated with no direct relation, only through their strong interaction with a common third region. Understanding the conditional dependence structure and using it to cluster brain regions is therefore more informative in the context of functional connectivity [8]; simple correlation-based clustering would instead result in huge, less interpretable groups of brain regions.
To this end, in order to estimate the precision matrix, we implement the nodewise Lasso algorithm [39] on the transformed variables (T̂_y(Y), T̂_k(X_k)), k ∈ Ŝ_n. Nodewise Lasso regression is commonly used to estimate a sparse precision matrix in the context of Gaussian graphical models by performing a Lasso regression of each predictor on all the others. The tuning parameter of each nodewise Lasso is typically selected by cross-validation. More details on this algorithm and its theoretical guarantees can be found in [30]. Let Σ̂^{-1} be the precision matrix estimated by nodewise Lasso, and let ρ(Z, Z') denote any correlation metric for two random variables Z and Z'. Algorithm 1 summarizes the clustering step.
Here we not only cluster the active predictors but also select an appropriate representative from each cluster. First, for each active predictor X_i ∈ Ŝ_n, we collect all the other active predictors conditionally dependent on X_i to form cluster C_i. Although clustering via conditional dependence produces smaller clusters, there might be some overlap owing to the complex associations in the original predictor space. Hence, to reduce excessive inter-cluster correlation, we merge all clusters whose maximum pairwise correlation exceeds some pre-specified threshold r (we typically set r = 0.9). Next, each cluster is updated by adding all the other features conditionally dependent on the existing cluster members. Finally, to find the cluster representatives, we use the HZ test statistic w*_k in (1), which measures the extent of resemblance between the distribution of each nonparanormally transformed variable in a cluster and the null distribution N_2(0, I_2). For cluster C_i, we select the variable R_i = argmax_{j∈C_i} {w*_j}, the one with the strongest association with the response among the predictors in the cluster.
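The core clustering idea can be sketched as follows. Here scikit-learn's GraphicalLasso is used as a stand-in for nodewise Lasso (both produce a sparse precision-matrix estimate), and the penalty level and toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def conditional_dependence_clusters(X_active):
    """Group active predictors into non-overlapping clusters via the support
    of a sparse precision-matrix estimate (connected components of the
    conditional-dependence graph)."""
    theta = GraphicalLasso(alpha=0.2, max_iter=200).fit(X_active).precision_
    p = theta.shape[0]
    adj = (np.abs(theta) > 1e-8) & ~np.eye(p, dtype=bool)  # dependence graph
    clusters, unseen = [], set(range(p))
    while unseen:                      # depth-first search for components
        stack, comp = [unseen.pop()], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            nbrs = set(np.flatnonzero(adj[v])) & unseen
            unseen -= nbrs
            stack.extend(nbrs)
        clusters.append(sorted(comp))
    return clusters

rng = np.random.default_rng(2)
# two independent blocks of strongly correlated variables
z1, z2 = rng.standard_normal((200, 1)), rng.standard_normal((200, 1))
X = np.hstack([z1 + 0.3 * rng.standard_normal((200, 2)),
               z2 + 0.3 * rng.standard_normal((200, 2))])
clusters = conditional_dependence_clusters(X)
# the two latent blocks {0,1} and {2,3} are recovered as separate clusters
```

The representative of each recovered cluster would then be the member with the largest HZ statistic w*_j.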

Cleaning with Deep Neural Network (DNN)
We start the cleaning step by modeling the response Y with the cluster representatives X_{Ŝ_n} obtained in Section 2.2.2. To perform error-controlled variable selection, each representative is assigned an importance score, followed by a resampling algorithm to finally control the FDR.
While it is possible to adopt any other generic sparsity-inducing DNN procedure, here we focus on the LassoNet algorithm recently proposed by [22] for its elegant mathematical framework, which naturally sets the stage for nonlinear feature selection. To approximate the unknown functional connection, it considers the class of fully connected feed-forward residual neural networks

F = { f_{θ,W} : x ↦ θᵀx + g_W(x) }.

Here, W denotes the network parameters of the fully connected network g_W, K denotes the size of the first hidden layer, W^{(0)} ∈ R^{p×K} denotes the first-hidden-layer parameters, and θ ∈ R^p denotes the residual layer's weights. To minimize the reconstruction error, the LassoNet objective can be formulated as

min_{θ,W} L(θ, W) + λ ||θ||_1  subject to  ||W_j^{(0)}||_∞ ≤ M |θ_j|, j = 1, ..., p,

with L(θ, W) = (1/n) Σ_{i=1}^n l(f_{θ,W}(x_i), y_i) the empirical loss on the training data, x_i the vector of cluster representatives observed for the i-th individual, and M the hierarchy coefficient. While the main feature sparsity is induced by the L_1 norm on the residual-layer parameter θ, the second constraint controls the total amount of nonlinearity allowed for each predictor. As noted in [22], LassoNet can be viewed as an extension of the celebrated Lasso algorithm to nonlinear variable selection.
In the L_1 penalization framework, the importance of a specific feature is naturally embedded in the highest penalization level up to which it survives in the model. So, to measure the importance of each representative, the LassoNet algorithm is executed over a long range of tuning parameter values λ_1 < λ_2 < ... < λ_r. In practice, a small value λ_1 is fixed at which all the variables are present in the model; we then gradually increase the tuning parameter and stop at λ_r, where no variables remain in the model. Next, the importance score for the j-th cluster is defined as λ̂_j = the maximum value of λ up to which the j-th representative remains in the model, and the following rank statistic is computed: I_j = Σ_{j'≠j} 1(λ̂_j ≤ λ̂_{j'}) for j = 1, 2, ..., |C|. A lower I_j means that the j-th cluster representative stays in the model up to a higher value of λ, implying its high potential as a significant cluster, whereas a higher I_j indicates that the corresponding cluster leaves the model even for small λ, as expected of a collection of null features. Hence, we should focus only on the clusters with lower ranks. Additionally, in order to control the FDR, it is important to understand the behavior of the predictors under the null distribution. In traditional FDR-controlling algorithms this is typically done by generating p-values. Here, as a p-value-free algorithm, we propose a resampling-based approach whose final step proceeds as follows: the F̂DR is sequentially estimated at the thresholds δ = Ī_(1), Ī_(2), ..., Ī_(|C|), where Ī_j denotes the importance rank of cluster j averaged over the resamples, and the optimal threshold is Δ* = max{δ > 0 : F̂DR(δ) < q} for some pre-specified FDR control level q. The final selected set of clusters with controlled FDR is D̂_n = {C_j, j = 1, 2, ..., |C| : Ī_j ≤ Δ*}.

The proposed method thus closely resembles an FDR-controlling approach. The notion of a false discovery is incorporated into the algorithm via the resampling: if a null predictor receives a relatively high importance score, it is most likely due to a spurious relation created by that specific bootstrap sample, which will not persist across the other bootstrap samples. On the other hand, all the bootstrap samples should consistently produce high importance scores for the significant predictors. As a consequence, the variability in the ranks of the importance scores is much higher for the null predictors than for their non-null counterparts. This notion was introduced in the statistics literature over the last twenty years as bagging [4, 5] for reducing the variance of a black-box prediction; the proposed method exploits this phase transition in the feature selection framework to effectively identify the false discoveries. An empirical illustration is provided in Section 2 of the SM.
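The rank statistic and the threshold scan can be sketched as follows. Here `fdr_hat` is a hypothetical stand-in for the resampling-based F̂DR estimator, whose intermediate steps are not reproduced; the λ̂ values are toy inputs:

```python
import numpy as np

def importance_ranks(lam_hat):
    """Rank statistic I_j = #{j' != j : lam_hat[j] <= lam_hat[j']}:
    representatives surviving to larger lambda get lower (better) ranks."""
    lam_hat = np.asarray(lam_hat, dtype=float)
    return np.array([np.sum(lam_hat[j] <= np.delete(lam_hat, j))
                     for j in range(len(lam_hat))])

def select_clusters(mean_ranks, fdr_hat, q=0.15):
    """Scan cutoffs delta over the sorted averaged ranks and keep the largest
    delta with estimated FDR below q; `fdr_hat` is an assumed estimator."""
    best = None
    for delta in np.sort(np.asarray(mean_ranks, dtype=float)):
        if fdr_hat(delta) < q:
            best = delta          # largest passing threshold survives the scan
    if best is None:
        return np.array([], dtype=int)
    return np.flatnonzero(mean_ranks <= best)

ranks = importance_ranks([0.9, 0.7, 0.05, 0.02])  # first two survive longest
selected = select_clusters(ranks, lambda d: d / 10.0, q=0.15)
# ranks are [0, 1, 2, 3]; the toy fdr_hat admits delta up to 1, selecting {0, 1}
```

In the actual algorithm the ranks would be averaged over bootstrap resamples before the scan, exploiting the higher rank variability of null predictors.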

Numerical Illustrations
In this section, the finite-sample performance of SciDNet is investigated over a wide spectrum of simulation scenarios. We first consider the single-index model as the data-generating mechanism, a straightforward yet flexible example of a nonlinear model, in which the response is related to a linear combination of the features through an unknown nonlinear, monotonic link function: y = g(xᵀβ) + ε. We choose the following two link functions: (1) Polynomial: g(x) = x³/10 + 3x/10 and (2) ReLU: g(x) = max(0, x). We set n = 400 and p = 1000. The coefficient vector β ∈ R^p is sparse with true nonzero locations S_0 = {50, 150, 250, 350, 450}, s = |S_0| = 5, where β_{S_0^c} = 0 and β_{S_0} ∼ N_5(uβ_0, 0.1 I_{5×5}) with u ∈ {±1}^5. The value of β_0 is set at β_0 = 2, 4 to incorporate varying signal strength. The random error ε ∼ N(0, σ²), with three increasing noise levels σ² = 1, 5, 10. The high-dimensional predictors are generated from X ∼ N_p(0, Σ), where the covariance matrix Σ is a Toeplitz matrix with Σ_ij = ρ^{|i−j|}; we set ρ = 0.95. In Table 1 we compare the performance of SciDNet with the basic prediction-optimal LassoNet algorithm. The results indicate the need for a 'cleaning' step when using a DL-based model for feature selection. Our method enjoys a quick recovery in power as the error-controlling threshold q is gradually increased from 0.01 to 0.15, while keeping the number of false discoveries below the required level.
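The data-generating process of this simulation can be sketched as follows; the AR(1) recursion is an exact sampler for the Toeplitz covariance Σ_ij = ρ^{|i−j|}, avoiding a p × p Cholesky factorization:

```python
import numpy as np

def simulate_single_index(n=400, p=1000, beta0=2.0, sigma2=1.0, rho=0.95,
                          link="relu", seed=0):
    """Generate (X, y) from the single-index design of the simulation study:
    Toeplitz-correlated Gaussian features and y = g(x'beta) + eps."""
    rng = np.random.default_rng(seed)
    S0 = np.array([49, 149, 249, 349, 449])  # 0-based true support {50,...,450}
    # AR(1) recursion gives exactly Cov(X_i, X_j) = rho^{|i-j|}
    eps = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = eps[:, 0]
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * eps[:, j]
    beta = np.zeros(p)
    u = rng.choice([-1.0, 1.0], size=5)
    beta[S0] = u * beta0 + np.sqrt(0.1) * rng.standard_normal(5)
    xb = X @ beta
    g = (lambda t: np.maximum(0.0, t)) if link == "relu" \
        else (lambda t: t**3 / 10 + 3 * t / 10)
    y = g(xb) + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y, S0

X, y, S0 = simulate_single_index()
```

Varying `beta0`, `sigma2`, and `link` reproduces the grid of signal strengths, noise levels, and link functions described above.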
Additionally, in Table 2 we present a more data-oriented statistical evaluation to further endorse SciDNet's discoveries. From the prediction perspective, one would expect a prediction model built only on the handful of features selected by a successful feature selection algorithm to match the performance of a model built on the whole feature space, and in some cases to improve upon it. To validate this, we randomly split the whole data 8:2 into training and testing sets. First, SciDNet is applied to the training part, and then two separate prediction models are fitted using only the selected features: (1) an MLP, a feed-forward multi-layer perceptron with two hidden layers, and (2) a bagged regression tree [4]. These two experiments are henceforth called "SciDNet+MLP" and "SciDNet+RT". Next, the test data are used to assess the out-of-sample prediction accuracy. The reward for a successful feature selection by SciDNet is the large gain in test accuracy, observed as a lower Mean Squared Error (MSE) on the test data. This experiment demonstrates that a black-box predictive model produces more accurate results when applied to the features selected by SciDNet rather than to the whole feature space, indicating SciDNet's potential for both feature selection and prediction.
A further simulation study considers several nonlinear models as data-generating processes with Gaussian and non-Gaussian features. The results are promising and demonstrate that SciDNet maintains a satisfactory power-FDR balance for various complicated nonlinear models, with and without interaction terms. The results and more details are presented in Section 2 of the SM; the hyperparameter selection and further implementation details of SciDNet are relegated to Sections 1 and 4 of the SM, respectively. Additionally, to motivate the need for cluster-level feature discovery following a screening step, we implemented two recently proposed DL-based FDR-controlled feature selection methods: DeepPINK [28] and SurvNet [37]. Our empirical study demonstrates that these algorithms are efficient and competitive given abundant training data and moderate correlation among the features. However, they typically fail, with lower power and excessive false discoveries, under the current setting of low sample size with correlated features. This is somewhat expected, as these methods are not tailored for such extreme multicollinearity. For example, in the ReLU setting considered above with β_0 = 2, σ² = 1, ρ = 0.95, DeepPINK and SurvNet achieved powers of 0.067 (0.125) and 0.440 (0.314), with observed FDR at 0.111 (0.194) and 0.742 (0.169), respectively, at the q = 15% FDR control level (standard errors in parentheses). The detailed results are presented in Section 4 of the SM.

Real Data Analysis
In addition to the simulation studies, we apply the proposed algorithm SciDNet to the following two publicly available gene expression datasets. We substantiate the findings in two ways. First, we provide supporting evidence from the domain research. Second, as a more data-aligned validation, we demonstrate that several generic prediction models gain significantly in test accuracy when applied only to the few features selected by SciDNet, compared to predictions based on the whole feature space. For this purpose, consistent with other genomic studies, we use the prediction correlation Corr(Y_Pred, Y_Test) in addition to the test MSE as a metric of test performance. To overcome the extra burden of low sample size and ultrahigh dimensionality in these datasets, we consider 50 independent replications in which the data are divided into training and testing sets in an 8:2 ratio. The final estimate of the test MSE is obtained by averaging over all replications, and similarly for the correlation metric.
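The repeated-split evaluation protocol can be sketched as follows, with a bagged regression tree in the "SciDNet+RT" role; the toy selected-feature matrix below stands in for SciDNet's output:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

def repeated_split_eval(X_sel, y, n_rep=50, seed=0):
    """Average test MSE and prediction correlation over repeated 8:2 splits,
    fitting a bagged regression tree on the (assumed given) selected features."""
    mses, corrs = [], []
    for r in range(n_rep):
        Xtr, Xte, ytr, yte = train_test_split(X_sel, y, test_size=0.2,
                                              random_state=seed + r)
        model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                 random_state=r).fit(Xtr, ytr)
        pred = model.predict(Xte)
        mses.append(np.mean((pred - yte) ** 2))
        corrs.append(np.corrcoef(pred, yte)[0, 1])
    return float(np.mean(mses)), float(np.mean(corrs))

rng = np.random.default_rng(3)
X_sel = rng.standard_normal((200, 5))          # toy stand-in for selected features
y = X_sel[:, 0] ** 2 + X_sel[:, 1] + 0.1 * rng.standard_normal(200)
mse, corr = repeated_split_eval(X_sel, y, n_rep=10)
```

Averaging both metrics over the replications mirrors the 50-replication protocol used for the real datasets; swapping in an MLP gives the "SciDNet+MLP" counterpart.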

Selection of Drug Sensitive Genes using CCLE dataset
A recent large-scale pharmacogenomics study, the Cancer Cell Line Encyclopedia (CCLE), investigated multiple anticancer drugs over hundreds of cell lines. Its main objective is to untangle the response mechanism of anticancer drugs, which is critical to precision medicine. The dataset consists of dose-response curves for 24 different drugs across over n = 400 cell lines. For each cell line, it contains the expression of p = 18,926 genes, which we consider as features. For the response, we used the activity area [2] to measure the sensitivity of a drug for each cell line. Here we seek to uncover the set of genes associated with sensitivity to the following five anticancer drugs: Topotecan, 17-AAG, Irinotecan, Paclitaxel, and AEW541. These drugs have been used to treat ovarian cancer, lung cancer, and other cancer types. Previous research on these drugs and the related gene expression data can be found elsewhere [2].
SciDNet produces multi-resolution clusters of genes for each of the five drugs considered, which are interpretable from the domain science perspective. For example, SciDNet discovers SLFN11 as the top drug-sensitive gene for Topotecan and Irinotecan, consistent with previous findings: [2, 45] reported the gene SLFN11 to be highly predictive for both drugs. For the drug 17-AAG, SciDNet discovers NQO1 as the most important gene, which is known to be highly sensitive to 17-AAG [15]. The full table of genes selected by SciDNet at the q = 15% error-control level, together with relevant findings from previous genomic studies, is relegated to Section 5 of the SM. Furthermore, as in the simulation study of Section 3, we separately implement LassoNet on the training data for its simultaneous sparsity-inducing, prediction-optimal characteristics. The results are summarized in Table S8 and indicate several interesting points. First, to obtain a prediction-optimal result, LassoNet fails to capture the sparsity and discovers a huge number of genes. This is somewhat expected, as most prediction-optimal sparse methods tend to select a larger set of features to maintain prediction quality [41]. On the other hand, SciDNet produces only ∼10 clusters with an average cluster size of ∼2.5. Even with this drastic dimension reduction, the added gains in test MSE and Corr(Y_Pred, Y_Test) show that SciDNet successfully retains all the significant predictors. One further notices that SciDNet+RT achieves the most stable performance, consistently across all the drugs and in our simulation study as well.

Selection of associated genes in Riboflavin production data set
We further apply SciDNet in the context of riboflavin (vitamin B2) production with Bacillus subtilis, a publicly accessible dataset available in the 'hdi' package in R. Here the continuous response is the logarithm of the riboflavin production rate, observed for n = 71 samples along with the logarithms of the expression levels of p = 4088 genes, which are treated as predictors. Unlike the previous CCLE data, significant multicollinearity is present in most parts of the riboflavin dataset, as demonstrated in Figure 1. In determining which genes are important for riboflavin production, SciDNet finds 9 clusters comprising a total of 160 correlated genes at the q = 15% FDR control level, an average cluster size of ∼17.78, much bigger than in the previous analysis. SciDNet discovered the gene YCIC_at as one of the expressive genes related to riboflavin production, which was identified by [6] as a causal gene in this context. The full list of gene clusters selected by SciDNet is relegated to Section 5 of the SM.
The results of the empirical evaluation of SciDNet's feature selection are presented in Table 4. Here, however, SciDNet+MLP performs poorly, as the inflated cluster dimensions make the input layer of the MLP comparatively large while the number of training data points is ≈ 57. This necessitates a sparse model, and we adopt the idea of the relaxed Lasso, first proposed by [29]: we implement LassoNet again on the features selected by SciDNet, which clearly improves the prediction accuracy. Consistent with the previous experiments, SciDNet+RT effectively maintains its prediction performance. This further consolidates the need for applying an apt feature selection method prior to fitting a predictive model for an explainable research outcome.

Discussion
While explainable AI is the need of the hour, statistical models coupled with cutting-edge ML techniques should push this agenda forward owing to their solid theoretical foundations combined with principled algorithmic advancement. The proposed method, SciDNet, efficiently exploits several existing tools from the statistics and ML literature to circumvent some of the complexities that current DL-based models fail to address properly. The basic intuition and encouraging empirical results of SciDNet on simulated and real datasets open avenues for further research. For example, one may be interested in developing a theoretical foundation for this 'screening' and 'cleaning' strategy with provable FDR control. It is worth mentioning that although we used sure independence screening with the HZ-test and LassoNet as the main tools, SciDNet puts forward a more generic framework and can be implemented with any other model-free feature screening method and sparsity-inducing DL algorithms like [13]. In the screening part, a further methodological extension would consider relaxing the assumption of nonparanormally distributed features for a more flexible approach. Additionally, after the screening step, as the dimensionality is reduced, it would be interesting to implement a model-free knockoff generating algorithm like [31] in the cleaning step as a further algorithmic development. One limitation is that we mainly focus on the regression setup with a continuous outcome because of the requirements of the HZ sure independence test used in the screening step. For a classification task, any model-free feature screening method like [44] can be applied in a more general framework.
Hence, for a fixed cutoff δ, our proposed error bound can be considered a surrogate estimate of an upper bound on the FDR. This is further supported by our empirical study on simulated and real datasets: SciDNet successfully retains all the important variables while minimising the number of false discoveries, and it maintains the prediction accuracy even after tremendous dimension reduction. We discuss a practical way to tune all these hyperparameters here:
1. To choose the size of the active set, we propose selecting a bigger active set with size proportional to ν_n = [n/log(n)]. As we further cluster the active variables in the clustering step, a slightly bigger active set will boost the confidence in the sure screening property; see Section D for an example.
2. After clustering, the intra-cluster correlation bound r should be fixed at some high value (usually 0.9 or 0.95); otherwise the cluster sizes will be inflated.
3. In the cleaning step, a thorough grid search is done over λ, considering λ_1 ≤ λ_2 ≤ · · · ≤ λ_r; in practice, a small value is fixed for λ_1, at which all the variables are present in the model. The value of the tuning parameter is then gradually increased up to λ_r, at which no variables remain in the model. The other hyperparameter for LassoNet is the hierarchy coefficient M, for which we follow the path considered in [22] and set M = 10. However, a more flexible approach would be a parallel grid search over M as well.
4. Finally, choosing an appropriate value of κ has a significant effect on the performance of SciDNet. A higher value of κ might lead to weaker control over the inclusion of false discoveries, whereas a small κ will tighten the error control, resulting in reduced power. However, we propose an effective way to tune κ with the aid of the phase transition in the ranks of the importance scores Ī_j of the cluster representatives. For illustration, in Figure 2 a simple nonlinear additive model (see Section B.1 for more details of the model) is considered, with 122 active representatives obtained from the screening step. The first 5 representative features are the only relevant predictors (indicated by the vertical dotted red line), and the ranks of their importance scores are presented along the y-axis. Here the dense lines are the bootstrap ranks I^b_j, observed over 50 bootstrap replications, and the solid blue line represents the averaged rank Ī_j for all 122 representatives. One observes a clear phase transition in the bootstrap distribution of the ranks. For the significant features, the ranks are low with extremely precise estimates, whereas for the remaining null features, the averaged ranks possess much higher values coupled with huge variability. Hence, for a compact neighbourhood N(Ī_j, κ) to capture only the small variability in the bootstrap ranks of the significant features, we simply fix κ = K* (in Figure 2), the phase transition point for the averaged rank.
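The κ = K* rule above can be automated. Below is a minimal sketch (our illustration, not the authors' released code) that places κ at the largest jump in the sorted averaged ranks, one natural reading of the phase-transition point:

```python
import numpy as np

def phase_transition_kappa(avg_ranks):
    """Heuristic kappa: the last averaged rank before the largest jump
    in the sorted sequence, i.e., the phase-transition point K*."""
    s = np.sort(np.asarray(avg_ranks, dtype=float))
    jumps = np.diff(s)                 # gaps between consecutive sorted ranks
    k = int(np.argmax(jumps))          # position of the widest gap
    return s[k]

# Toy example: 5 stable low ranks, then a jump to noisy high ranks.
avg = [1.1, 2.0, 2.9, 4.2, 5.0, 41.7, 48.3, 55.0, 60.2, 61.0]
kappa = phase_transition_kappa(avg)    # -> 5.0, the rank just before the jump
```

In practice one would still inspect the rank plot (as in Figure 2) to confirm that a single dominant gap exists before trusting this heuristic.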

A.3 Phase transition observed for the CCLE data
The main reason for the phase transition is that, for a null predictor X_j, j ∈ S_0^c, different bootstrap replicates reshuffle its feature importance each time, whereas for a non-null predictor X_j, j ∈ S_0, the feature importance is much more stable across bootstrap replicates. SciDNet effectively captures this characteristic to identify the null features. As a demonstration, we present in Figure 3 the bootstrap distributions of the ranks of the importance scores for the top 25 important cluster representatives via box plots. The green and purple colors respectively indicate whether the cluster representatives are selected or rejected by SciDNet. We observe the phase transition consistently for all five drugs, and SciDNet selects only those important representatives with reduced variability over the bootstrap replicates.
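This stability contrast is easy to reproduce in simulation. The sketch below is illustrative only (the signal magnitudes and feature counts are our assumptions, not the paper's data): it draws bootstrap importance scores for 2 signal and 8 null features and checks that the bootstrap ranks of the signals have far smaller spread than those of the nulls:

```python
import numpy as np

rng = np.random.default_rng(1)
B, p = 50, 10
# Signal features (columns 0, 1) get a large, stable importance offset;
# nulls get exchangeable noise, so their bootstrap ranks reshuffle each time.
lam = rng.uniform(0.0, 1.0, size=(B, p))
lam[:, :2] += 5.0

# I^b_j = #{j' != j : lam[b, j] <= lam[b, j']}  (smaller rank = more important)
ranks = np.array([(row[None, :] >= row[:, None]).sum(axis=1) - 1 for row in lam])

spread = ranks.std(axis=0)
assert spread[:2].max() < spread[2:].min()   # signals stable, nulls noisy
```

The box plots in Figure 3 visualise exactly this spread, per cluster representative, on the real CCLE data.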

D Model implementation details and Sensitivity Analysis
In this section, we mention the implementation details of SciDNet used for the simulation study and real data analysis. To select the size of the active set Ŝ_n in the screening step, consistent with [43], we set |Ŝ_n| = [2n/log(n)] by selecting the predictors with the top |Ŝ_n| Henze-Zirkler test statistics w*_k, where [z] denotes the integer part of z. In all our simulation scenarios, we set r = 0.9, the hyperparameter for the intra-cluster correlation bound, to further integrate highly correlated, conditionally dependent clusters. In the cleaning step, a 100-dimensional one-hidden-layer feed-forward neural network is used for LassoNet; the detailed model architecture can be found in the appendix of [22]. For creating the compact neighbourhood in the cleaning step, each time we choose the value of κ utilizing the phase transition property mentioned in Section 2.2 of the main manuscript. The feature selection performance of SciDNet is demonstrated by calculating the average power and cFDR, along with their standard errors, over 50 Monte Carlo replications. Each dataset is randomly divided into train, validation, and test sets with a 70-10-20 split. To assess the prediction performance, the test Mean Square Error (MSE) before and after the variable selection is shown as part of the simulation study. For the prediction model, a 40-dimensional two-hidden-layer feed-forward neural network with ReLU and linear activation functions is considered, with Adam as the optimizer. For the regression tree, we use bagging for further stabilization, as mentioned in [4]. The numbers of leaves and nodes are chosen by minimising the MSE on the validation set.
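Two of the bookkeeping quantities above are simple to compute; a minimal sketch (the function names are ours, and the seeding details of the original code may differ):

```python
import math
import numpy as np

def active_set_size(n):
    """|S_n| = [2n / log(n)], the screened active-set size used here."""
    return int(2 * n / math.log(n))

def train_val_test_split(n, seed=0):
    """Random 70-10-20 index split used throughout the experiments."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_tr, n_val = int(0.7 * n), int(0.1 * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]

# e.g. n = 500 samples keep the top active_set_size(500) = 160 predictors
tr, va, te = train_val_test_split(100)   # index sets of sizes 70, 10, 20
```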
To assess the error bars for the sensitivity analysis, we generate a typical dataset using the polynomial setup (Section 3 in the main manuscript, with β = 2, σ² = 1), rerun SciDNet 50 times on the same data, and set q = 0.1 as the FDR-control threshold. The means and standard deviations from these 50 replications are as follows: power = 0.99 (0.01), observed FDR = 0.03 (0.03), test error by LassoNet = 1.048 (0.101), test error by SciDNet+RT = 0.710 (0.002). Regarding computational complexity, after the screening, the bootstrapped LassoNet can be run in a parallel loop; we conduct all the experiments on a high-performance computing facility with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz and 4 Tesla V100S GPUs. The code is available at an anonymous repository (https://anonymous.4open.science/r/SciDNet-3CA8).

E Important clusters of genes discovered by SciDNet
E.1 For CCLE dataset
The following Table S8 presents all the clusters of genes selected by SciDNet for the five anticancer drugs considered. The genes belonging to a single cluster are enclosed in "{}". Previous research on this gene-expression data has revealed several genes as biologically associated with the corresponding drugs. SciDNet successfully discovers these genes as the top-most important genes associated with the drugs. In Table S8, the selected genes that are confirmed by previous domain research are highlighted, and the corresponding references are given in column 3.

3. Next, we select the active set of predictors Ŝ_n according to the larger values of w*_k, i.e., Ŝ_n = {1 ≤ k ≤ p : w*_k > c n^{-κ}}, where c and κ are predetermined threshold values.
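Operationally, thresholding at c n^{-κ} amounts to keeping the predictors with the largest screening statistics; a sketch, assuming the HZ statistics w*_k are already computed:

```python
import numpy as np

def screen_active_set(w, size):
    """Indices of the `size` largest screening statistics w*_k,
    i.e., the active set S_n for a threshold c*n^(-kappa) that
    admits exactly `size` predictors."""
    w = np.asarray(w)
    return np.sort(np.argsort(w)[-size:])

active = screen_active_set([0.1, 0.9, 0.5, 0.7], size=2)   # indices [1, 3]
```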

1. Generate B bootstrap versions of the data {Y^b, X^b_{S̃n}}_{b=1}^B, considering only the cluster representatives S̃_n. For each bootstrap version, run the LassoNet algorithm in parallel and calculate the importance of each representative by measuring λ̂^b_j = the maximum value of λ up to which the j-th predictor exists in the model for the b-th bootstrap version, and then the ranks I^b_j = Σ_{j'≠j} 1(λ̂^b_j ≤ λ̂^b_{j'}). The averaged rank is therefore Ī_j = (1/B) Σ_{b=1}^B I^b_j.
2. For an arbitrary threshold δ, we would select the cluster representatives with averaged rank Ī_j lower than δ; so we define N_+(δ) = Σ_{j∈S̃_n} 1(Ī_j ≤ δ), representing the number of selected clusters with respect to the cutoff δ.
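The rank computation in steps 1-2 can be sketched directly from the formulas, with the λ̂^b_j values stored as a B × p matrix (a transcription for illustration, not the released implementation):

```python
import numpy as np

def bootstrap_ranks(lam):
    """lam[b, j] = largest penalty at which feature j is still in the
    model in bootstrap b.  I^b_j = #{j' != j : lam[b, j] <= lam[b, j']},
    so a smaller rank means a more important feature."""
    lam = np.asarray(lam, dtype=float)
    return np.stack([(row[None, :] >= row[:, None]).sum(axis=1) - 1
                     for row in lam])

def averaged_rank(lam):
    """I_bar_j = (1/B) * sum_b I^b_j."""
    return bootstrap_ranks(lam).mean(axis=0)

def n_plus(lam, delta):
    """N+(delta): number of representatives with averaged rank <= delta."""
    return int((averaged_rank(lam) <= delta).sum())

lam = [[3.0, 1.0, 2.0],
       [2.5, 0.5, 2.8]]
# per-bootstrap ranks: [[0, 2, 1], [1, 2, 0]]; averaged: [0.5, 2.0, 0.5]
```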

3. Next, to estimate the expected number of falsely discovered clusters, define R^b = {j : I^b_j ≤ δ}, the set of cluster representatives whose importance score is high enough that the corresponding rank falls below the cutoff δ in the b-th bootstrap version. Additionally, define a neighbourhood N(Ī_j, κ) = {l ∈ {1, 2, . . ., |C|} : |Ī_j − l| ≤ κ}, for some specific small number κ.
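The neighbourhood in step 3 is just a symmetric band of rank positions around Ī_j; a one-function sketch:

```python
def neighbourhood(avg_rank, kappa, n_clusters):
    """N(I_bar_j, kappa) = {l in {1, ..., |C|} : |I_bar_j - l| <= kappa}."""
    return [l for l in range(1, n_clusters + 1) if abs(avg_rank - l) <= kappa]

band = neighbourhood(3.0, 1.0, 10)   # -> [2, 3, 4]
```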
Two metrics are calculated to measure the feature selection performance of the algorithms: (1) Power = |D̂_n ∩ S_0| / |S_0|, and (2) the cluster-level FDR (cFDR), the proportion of falsely discovered clusters among all discovered clusters.
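Both metrics can be computed in a few lines, given the discovered set D̂_n and the true active set S_0 (the cluster-level bookkeeping is elided here; the helper name is ours):

```python
def selection_metrics(discovered, true_active):
    """Power = |D_n ∩ S_0| / |S_0|; the FDR-type proportion is the
    fraction of discoveries outside S_0 (cFDR at the cluster level)."""
    discovered, true_active = set(discovered), set(true_active)
    power = len(discovered & true_active) / len(true_active)
    fdp = len(discovered - true_active) / max(len(discovered), 1)
    return power, fdp

power, fdp = selection_metrics({1, 2, 3}, {1, 2, 4, 5})   # -> (0.5, 1/3)
```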

Figure 1: A snapshot of correlation strength for the first 100 genes considered for the drug Topotecan in the CCLE dataset (left) and the riboflavin dataset (right)

Figure 2: Illustration of the phase transition in ranks of feature importance scores

Table 1: Power and empirical FDR with standard errors in parentheses, showing the feature selection performance

Table 2: Test MSE with standard errors in parentheses, showing the prediction performance before and after the feature selection

4. Further, we estimate the number of falsely discovered clusters, and hence an estimator of the FDR can be constructed as F̂DR(δ) = ê_0(δ) / N_+(δ), where

Table 3: Drug-sensitive genes identified by SciDNet and related prediction performance

Table 4: Prediction performance of SciDNet for the riboflavin production dataset (columns: Algorithm, Test MSE, Corr(Y_Pred, Y_Test))

Table S5: Empirical power and observed FDR of SciDNet with standard errors in parentheses for Gaussian features

Recently developed Deep Learning (DL) models are generally governed by several hyperparameters, and properly tuning them is necessary to obtain effective results. The proposed SciDNet relies on the following hyperparameters: (1) the size of the active set Ŝ_n, (2) the intra-cluster correlation bound r, (3) the LassoNet tuning parameters λ and M, and (4) κ, used in the neighbourhood selection in the cleaning step. SciDNet is fairly robust to most of these hyperparameters.

Table S7: Empirical power and observed FDR of various feature selection algorithms with standard errors in parentheses