Sufficient dimension reduction based on distance‐weighted discrimination

In this paper, we introduce a sufficient dimension reduction (SDR) algorithm based on distance‐weighted discrimination (DWD). Our method is shown to be robust to the dimension p of the predictors, and it utilizes recent computational results in the DWD literature to give an algorithm that is computationally faster than previous classification‐based algorithms in the SDR literature. In addition to theoretical results analogous to those of similar methods, we prove the consistency of our estimator for fixed p. Finally, we demonstrate the advantages of our algorithm using simulated and real datasets.


INTRODUCTION
Sufficient dimension reduction (SDR) is a class of feature extraction techniques introduced in regression settings with high-dimensional predictors. Let X be a p-dimensional predictor vector and Y be a response variable (assumed univariate for the time being). In linear SDR our aim is to reduce the dimension of the predictors, X, without losing information about the conditional distribution of Y | X. In other words, we are trying to find a p × d (d < p) matrix β such that the following conditional independence model holds: Y ⫫ X | βᵀX. The space spanned by the columns of β is called a Dimension Reduction Subspace (DRS). The intersection of all possible DRS's, if it is itself a DRS, is called the Central Dimension Reduction Subspace (CDRS) or the Central Space (CS) and is denoted by S_Y|X. The conditions for existence of the CS are mild (see Yin, Li, & Cook, 2008), and therefore we assume it exists throughout this paper. The literature on SDR includes, but is not limited to, Sliced Inverse Regression (SIR) by Li (1991), Sliced Average Variance Estimation (SAVE) by Cook and Weisberg (1991), principal Hessian directions (pHd) by Li (1992), Contour Regression by Li, Zha, and Chiaromonte (2005) and Directional Regression by Li and Wang (2007), among others.
In recent years there has been growing interest in nonlinear SDR, where we extract linear or nonlinear functions of the predictors. That is, we work under the nonlinear conditional independence model Y ⫫ X | φ(X), where φ : R^p → R^d denotes linear or nonlinear functions of the predictors. Some examples include the work by Wu (2008) and by Yeh, Huang, and Lee (2009), which introduced Kernel SIR, the work by Fukumizu, Bach, and Jordan (2009), which used kernel regression, and the work by Li, Artemiou, and Li (2011), who used Support Vector Machine (SVM) algorithms to achieve linear and nonlinear dimension reduction under a unified framework. The idea of using SVM and related algorithms has since been extended in a number of directions. Artemiou and Dong (2016) used LqSVM, which ensures the uniqueness of the solution, Zhou and Zhu (2016) used a minimax variation for sparse SDR, Shin and Artemiou (2017) replaced the hinge loss with a logistic loss to achieve the desired result, Shin, Wu, Zhang, and Liu (2017) used a weighted SVM approach for binary responses, and Artemiou and Shu (2014) and Smallman and Artemiou (2017) focused on removing the bias due to imbalance. One of the most interesting variations of SVM was proposed by Marron, Todd, and Ahn (2007) and is known as Distance-Weighted Discrimination (DWD). The interest of DWD lies in the fact that it works much better than SVM as the dimension of the predictors X increases. This is due to the fact that SVM suffers from data piling when the dimension of the predictor space is large. Data piling occurs in high-dimensional low sample size (HDLSS) settings, and it describes the tendency of the data in each class to project to a single point on a discriminant direction. In the upper plot of Figure 1, we can see that for p = 100, SVM and DWD project the data in a similar manner.
However, the lower plot in Figure 1 shows that when p = 500, SVM (black points) projects almost all the points to 1 for one class and to −1 for the other class while DWD spreads the data in each class on a wider range of values. In a classification setting, data piling makes generalization of the classification results difficult.
In this paper we investigate whether DWD has similar advantages over SVM in the SDR framework as the ones it has in the classification framework. We create a method similar to the one in Li et al. (2011), with the difference that the objective function of DWD replaces the objective function of SVM. We call our method Principal DWD (PDWD), following a pattern similar to Li et al. (2011) calling their method Principal SVM (PSVM). Interestingly, results show that DWD actually works better than SVM for low-dimensional problems, and as the dimension increases PSVM gets closer to the performance of PDWD. Thus, data piling seems to help in the dimension reduction framework in the regression setting. This observation may be explained by the fact that in the regression setting we are more interested in hyperplane alignment than in reducing misclassification error. Therefore, data piling may help "stabilize" the alignment of the hyperplane on the correct direction for PSVM.

FIGURE 1 Density of projections for n = 1,000. Top panel: p = 500; bottom panel: p = 1,000. The datasets consist of the projections of the points after discretizing the response in two slices under the model Y = X₁ + ε.
The paper is organized as follows. In Section 2 we discuss DWD and introduce Principal DWD, and in Section 3 we discuss its asymptotic properties. In Section 4 we present nonlinear feature extraction and in Section 5 our numerical studies. We close with a discussion section.

PRINCIPAL DWD
In this section we develop the idea of using DWD for SDR. We discuss first DWD as it was presented by Marron et al. (2007) and then we demonstrate how it can be incorporated into the SDR framework, giving some theoretical results, a sample estimation algorithm and a method for determining the dimension of the central subspace.

Review of DWD
Let (Xᵢ, Yᵢ), i = 1, …, n be an iid sample of (X, Y). Denote X̄ = n⁻¹ Σᵢ Xᵢ and Σ = var(X). Now suppose Y is a binary random variable taking values ±1. DWD is defined by the following optimisation problem:

minimize Σᵢ (1/rᵢ) + C Σᵢ ξᵢ over (ω, t, ξ), subject to rᵢ = Yᵢ(ωᵀXᵢ − t) + ξᵢ ≥ 0, ξᵢ ≥ 0 and ||ω|| ≤ 1,

where r is the vector of all rᵢ's and ξ is the vector of all ξᵢ's. Here C > 0 is a tuning parameter, also called the cost (or misclassification penalty), and ξ is a penalization vector, where ξᵢ = 0 for correctly classified points and ξᵢ > 0 for misclassified points. The above optimization problem can be written slightly differently using the following loss-plus-penalty form (for details, see Qiao & Zhang, 2015):

ωᵀω + λ n⁻¹ Σᵢ f(Yᵢ(ωᵀXᵢ − t)), where f(u) = 1/u if u ≥ 1/√λ and f(u) = 2√λ − λu otherwise,

where the first term comes from the constraint ||ω|| ≤ 1 and the rest from eliminating the slack terms ξᵢ through the positive-part operator (·)₊. In the dimension reduction framework we are interested in the population version of the above objective function, which is

L(ω, t) = ωᵀω + λE[f(Y(ωᵀX − t))].

There have been a number of extensions of the DWD algorithm. Some include the weighted DWD approach by Qiao, Zhang, Liu, Todd, and Marron (2010) and the sparse DWD approach by Wang and Zou (2016). Marron et al. (2007), as well as the extensions discussed above, used cone programming to solve the optimization problem in (3) (or the respective one for each extension). More recently, Wang and Zou (2018) proposed the generalized DWD algorithm, which allows for a faster computational algorithm. In this paper we utilize their idea, and thus our estimation algorithm is much faster than previous methodology in the SVM-based SDR framework. Another computationally fast algorithm appeared in Lam, Marron, Sun, and Toh (2018).

DWD for SDR
One of the tricks that many classic SDR algorithms use is slicing/discretizing the response, which in most regression settings is a continuous random variable (e.g., Li, 1991; Li et al., 2011). When the response is discrete this step is skipped, as each discrete value is considered a slice. To unify the notation, and following the idea of Li et al. (2011), we define Ω_Y to be the support of Y and A₁ and A₂ to be disjoint (not necessarily exhaustive) subsets of Ω_Y. Then one can define Ỹ = 1 if Y ∈ A₁ and Ỹ = −1 if Y ∈ A₂. Replacing this into the population objective function of DWD, we get the following objective function in the SDR framework:

L(ψ, t) = ψᵀΣψ + λE[f(Ỹ[ψᵀ(X − E(X)) − t])].

Following Li et al. (2011), we note that we have also inserted Σ into the first term, to ensure the resulting DWD estimate is unbiased and to provide a unified framework for nonlinear SDR. Assuming E(X) = 0 without loss of generality, and setting u = Ỹ[ψᵀX − t], we can simplify the above objective function to

L(ψ, t) = ψᵀΣψ + λE[f(u)].

The following lemma establishes the convexity of the loss f. This lemma is crucial in proving the theorem which shows that the normal vector of the optimal hyperplane developed by PDWD is indeed in the CS.
Lemma 1. Let λ > 0. The function f defined by f(u) = 2√λ − λu for u < 1/√λ and f(u) = 1/u for u ≥ 1/√λ is convex on R.

Proof. To prove convexity we need to show that f(γu₁ + (1 − γ)u₂) ≤ γf(u₁) + (1 − γ)f(u₂) for all u₁, u₂ ∈ R and γ ∈ [0, 1]. First note that for u ≥ 1/√λ we have 2√λ − λu ≤ 1/u, and that for u₁ ≤ u₂ we have f(u₁) ≥ f(u₂), since f is a decreasing function. We need to consider three cases:

(i) When u₁ < 1/√λ and u₂ < 1/√λ, we have γu₁ + (1 − γ)u₂ < 1/√λ, and hence, since f is linear on this region, f(γu₁ + (1 − γ)u₂) = 2√λ − λ(γu₁ + (1 − γ)u₂) = γf(u₁) + (1 − γ)f(u₂).

(ii) When u₁ < 1/√λ ≤ u₂, since the gradient of f is equal when approaching 1/√λ from the left and from the right, we can assume without loss of generality that γu₁ + (1 − γ)u₂ < 1/√λ, and so f(γu₁ + (1 − γ)u₂) = γ(2√λ − λu₁) + (1 − γ)(2√λ − λu₂) ≤ γf(u₁) + (1 − γ)f(u₂), where the inequality uses 2√λ − λu₂ ≤ 1/u₂ = f(u₂).

(iii) When u₁ ≥ 1/√λ and u₂ ≥ 1/√λ, we have γu₁ + (1 − γ)u₂ ≥ 1/√λ. In this case we can simply note that the second derivative of f(u) = 1/u, namely f″(u) = 2/u³, only takes positive values on this region, so f is convex there.

Hence f(γu₁ + (1 − γ)u₂) ≤ γf(u₁) + (1 − γ)f(u₂) for all u₁, u₂ ∈ R, and therefore f is convex. ▪

Having verified the convexity of the objective function, one can prove the following theorem, which demonstrates that the normal vector of the hyperplane is in S_Y|X. This follows directly from the proof in Li et al. (2011), due to the fact that the hinge loss in SVM is replaced with another convex function, and, as Li et al. (2011) note, their proof holds for every convex function.
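As a quick numerical sanity check on Lemma 1, the following Python sketch evaluates the loss f on a grid (with the illustrative value λ = 1; the grid and the mixing weight γ are arbitrary choices) and verifies the convexity inequality pointwise.

```python
import numpy as np

def dwd_loss(u, lam=1.0):
    """DWD loss: linear below the kink at 1/sqrt(lam), 1/u above it."""
    thresh = 1.0 / np.sqrt(lam)
    # np.maximum guards the 1/u branch against division issues; np.where
    # only selects that branch where u >= thresh anyway.
    return np.where(u < thresh,
                    2.0 * np.sqrt(lam) - lam * u,
                    1.0 / np.maximum(u, thresh))

# Midpoint-style convexity check on pairs straddling the kink
u1 = np.linspace(-2.0, 3.0, 200)
u2 = np.linspace(-1.0, 4.0, 200)
gamma = 0.3
lhs = dwd_loss(gamma * u1 + (1.0 - gamma) * u2)
rhs = gamma * dwd_loss(u1) + (1.0 - gamma) * dwd_loss(u2)
print(bool(np.all(lhs <= rhs + 1e-12)))  # True: the inequality holds on the grid
```

At the kink u = 1/√λ both branches give the value √λ, matching the continuity used in the proof.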

Theorem 1. Suppose E(X | βᵀX) is a linear function of βᵀX, where β is defined as in (1). If (ψ*, t*) minimizes the objective function (13), then ψ* ∈ S_Y|X.
Proof. It is important to note that, under the conditions of the theorem, we can write the conditional expectation E(ψᵀX | βᵀX) = ψ̃ᵀX, where ψ̃ = P_β(Σ)ᵀψ and P_β(Σ) = β(βᵀΣβ)⁻¹βᵀΣ is the projection onto span(β) with respect to the Σ inner product. Beginning with the first term, we have

ψᵀΣψ = var(ψᵀX) = var[E(ψᵀX | βᵀX)] + E[var(ψᵀX | βᵀX)] ≥ var(ψ̃ᵀX) = ψ̃ᵀΣψ̃. (9)

Now let us look at the second term. Again, we can write E[f(Ỹ(ψᵀX − t))] = E{E[f(Ỹ(ψᵀX − t)) | βᵀX, Ỹ]}. Since f is a convex function, we can use Jensen's inequality as follows:

E[f(Ỹ(ψᵀX − t)) | βᵀX, Ỹ] ≥ f(Ỹ[E(ψᵀX | βᵀX) − t]) = f(Ỹ(ψ̃ᵀX − t)).

Thus, combining this with (9), we get L(ψ, t) ≥ L(ψ̃, t). (10) If ψ does not belong to S_Y|X, then var[ψᵀX | βᵀX] > 0 and the inequality in (9) becomes strict. Hence the inequality in (10) is strict. Therefore, such a ψ cannot be the minimizer of L(ψ, t). ▪

Sample estimation algorithm
Having established the theoretical properties of the minimizer of the PDWD objective function, we now look into the sample estimation algorithm of our method. Before giving the algorithm, we review available packages for solving the optimization problem of DWD. Since the available packages solve the objective function of DWD, which does not include Σ in the first term, we demonstrate below that by standardizing the data the objective function of PDWD becomes equivalent to the objective function of DWD, and therefore available packages can be used.
As mentioned above, the objective function of DWD is

ωᵀω + λE[f(Y(ωᵀZ − t))], (11)

and the one for PDWD is

ψᵀΣψ + λE[f(Ỹ(ψᵀ(X − X̄) − t))]. (12)

Now, if we let ω = Σ^{1/2}ψ and Z = Σ^{−1/2}(X − X̄), and substitute these into (12), we have

ωᵀω + λE[f(Ỹ(ωᵀZ − t))], (13)

which is of the same form as (11). Hence, as stated above, standardizing X modifies the PDWD objective in the desired way. We emphasize that this fact allows us to use existing DWD algorithms in the literature to estimate the PDWD solution. Hence, in our algorithm below, we require the standardization of the data.
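The standardization step can be sketched as follows; the simulated predictors and dimensions are illustrative, and Σ̂^{−1/2} is computed through the eigendecomposition of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated predictors

# Sample mean and sample covariance matrix
Xbar = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Sigma^{-1/2} via the eigendecomposition of Sigma
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

# Standardized predictors: Z = Sigma^{-1/2} (X - Xbar)
Z = (X - Xbar) @ Sigma_inv_sqrt

print(np.allclose(np.cov(Z, rowvar=False), np.eye(p)))  # True: identity covariance
```

A direction ω̂ estimated by a DWD package on Z is then mapped back to the PDWD direction as ψ̂ = Σ̂^{−1/2}ω̂.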
To solve (13), Wang and Zou (2018) proposed a fast algorithm that iteratively updates the hyperplane until convergence, with the solution at iteration (m + 1) computed from the solution at iteration m through a closed-form updating formula. This iterative process replaces the quadratic programming step which was used in PSVM, and therefore the PDWD algorithm becomes computationally much faster. For more details, the interested reader is referred to Wang and Zou (2018).
We will define two methods for generating Ỹ, which were first proposed in Li et al. (2011). These are named left versus right (LVR), which is more appropriate when the response is continuous or discrete with a sense of ordering between the values, and one versus another (OVA), which is more appropriate when the response is categorical with no sense of ordering between the values. For LVR, one chooses dividing points q_r, r = 1, …, h − 1, where h is the number of slices, and sets Ỹ(q_r) = 1 if Y > q_r and Ỹ(q_r) = −1 otherwise. OVA follows a similar idea, but Ỹ = 1 for observations in one slice and Ỹ = −1 for observations in another slice. The estimation procedure is as follows:

1. Compute the sample mean X̄ and sample variance matrix Σ̂.
2. Find the minimizers using the algorithm in Wang and Zou (2018). In more detail: (LVR) Let q_r, r = 1, …, h − 1, be h − 1 dividing points and let Ỹᵢ(q_r) = 1 if Yᵢ > q_r and −1 otherwise, for i = 1, …, n. Then, using DWD, let (ψ̂_r, t̂_r) be the minimizers of the sample objective for the rth dichotomized response. (OVA) Apply DWD to each pair of slices from the h slices, and let (ψ̂_rs, t̂_rs) be the corresponding minimizers.
3. Let v̂₁, …, v̂_d be the d leading eigenvectors of one of the matrices

M̂ = Σ_{r=1}^{h−1} ψ̂_r ψ̂_rᵀ (LVR) or M̂ = Σ_{r<s} ψ̂_rs ψ̂_rsᵀ (OVA). (14)

We can now estimate S_Y|X using the subspace spanned by v̂ = (v̂₁, …, v̂_d).
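As an illustration of the procedure above, the following Python sketch implements the LVR variant end to end on simulated data with a single true direction. For simplicity, the DWD solutions are obtained here by directly minimizing the sample objective ωᵀω + λ n⁻¹ Σᵢ f(Ỹᵢ(ωᵀZᵢ − t)) with a general-purpose optimizer rather than the faster Wang and Zou (2018) updates; the model, sample size, number of slices and the value λ = 1 are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def dwd_loss(u, lam=1.0):
    # DWD loss: linear below the kink at 1/sqrt(lam), 1/u above it
    thresh = 1.0 / np.sqrt(lam)
    return np.where(u < thresh,
                    2.0 * np.sqrt(lam) - lam * u,
                    1.0 / np.maximum(u, thresh))

def solve_dwd(Z, ytil, lam=1.0):
    # Minimize w'w + lam * mean f(ytil * (Z w - t)) over (w, t)
    n, p = Z.shape
    def obj(theta):
        w, t = theta[:p], theta[p]
        return w @ w + lam * dwd_loss(ytil * (Z @ w - t), lam).mean()
    res = minimize(obj, np.r_[np.full(p, 0.1), 0.0], method="BFGS")
    return res.x[:p]

rng = np.random.default_rng(0)
n, p, h = 200, 10, 5
X = rng.normal(size=(n, p))
Y = X[:, 0] + 0.2 * rng.normal(size=n)      # true direction e1, so d = 1

# Step 1: standardize the predictors
Xbar = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(Sigma)
S_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
Z = (X - Xbar) @ S_inv_half

# Step 2 (LVR): one DWD fit per dividing point, aggregated into M
M = np.zeros((p, p))
for q in np.quantile(Y, np.arange(1, h) / h):
    ytil = np.where(Y > q, 1.0, -1.0)
    psi = S_inv_half @ solve_dwd(Z, ytil)   # back to the original scale
    M += np.outer(psi, psi)

# Step 3: leading eigenvector of M estimates the central space
v = np.linalg.eigh(M)[1][:, -1]
print(abs(v[0]))  # dominant first coordinate: e1 is recovered
```

The leading eigenvector loads almost entirely on the first coordinate, recovering the true direction.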

Order determination
Now we turn our attention to the estimation of the dimension d. Developing an effective method for determining the dimension is vital when developing methods for SDR and plays an important role in the performance of such methods. For PSVM, Li et al. (2011) opted to use a method based on a cross-validated Bayesian information criterion. We propose instead a relatively new approach to order determination developed by Luo and Li (2016), called the ladle estimator. The ladle estimator is a combination of the scree plot method and the Ye–Weiss plot developed by Ye and Weiss (2003). Let M̂ be defined as one of the matrices in (14), and let λ̂ᵢ denote the ith eigenvalue of M̂. Since M̂ is a consistent estimator of M, and M has rank d, we can establish that λ̂_{d+1} will be much smaller than λ̂_d. Using this, the following function is defined:

φ_n(k) = λ̂_{k+1} / (1 + Σ_{i=1}^{p} λ̂ᵢ), k = 0, 1, …, p − 1.

The eigenvalues have been shifted so that φ_n takes a small value at k = d rather than at k = d + 1.
Next we turn our attention to the Ye–Weiss plot. Let F be the distribution of (X, Y) and let F_n be the empirical distribution based on the sample S = (X₁, Y₁), …, (X_n, Y_n). Let (λ̂ᵢ, v̂ᵢ) and (λ*ᵢ, v*ᵢ) be the eigenvalues and eigenvectors of M̂ and of its bootstrap version M*, respectively. For each k < p, let B̂_k = (v̂₁, …, v̂_k) and B*_k = (v*₁, …, v*_k), and define the function

f⁰_n(k) = m⁻¹ Σ_{i=1}^{m} [1 − |det(B̂_kᵀ B*_{k,i})|],

where B*_{k,i} denotes the estimate from the ith of m bootstrap samples. From Ye and Weiss (2003), it can be established that the function f⁰_n(k) gives a measure of the variability of the bootstrap estimates around the full-sample estimate B̂_k. The range of f⁰_n is [0, 1], where 0 indicates that each B*_{k,i} spans the same column space as B̂_k, and 1 occurs when the B*_{k,i} span spaces orthogonal to B̂_k. If we then define the normalized function

f_n(k) = f⁰_n(k) / (1 + Σ_{j=0}^{p−1} f⁰_n(j)),

Ye and Weiss (2003) determined that f_n is small for k = d and larger for k > d.
Lastly, the ladle estimator of the rank d is defined to be

d̂ = argmin_k g_n(k), where g_n(k) = φ_n(k) + f_n(k).

Consider the regression model in (19). Choosing n = 100 and p = 10, Figure 2 shows the ladle plot for model (19). As we can see, the ladle plot correctly estimates d to be 2. We have repeated this simulation many times, with approximately 98% accuracy. More detailed simulations are discussed in the numerical studies section.

FIGURE 2 Ladle plot of model (19) with n = 100 and p = 10.
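The ladle estimator can be sketched in a few lines of Python. For brevity, the candidate matrix below is a simple slice-mean (SIR-type) matrix rather than the PDWD matrix M̂ of (14), and the model, sample size and number of bootstrap samples are illustrative choices; only the ladle logic itself follows the description above.

```python
import numpy as np

def candidate_matrix(X, Y, h=5):
    # SIR-type candidate matrix: weighted outer products of slice means
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    edges = np.quantile(Y, np.arange(1, h) / h)
    labels = np.searchsorted(edges, Y)
    M = np.zeros((p, p))
    for s in range(h):
        idx = labels == s
        if idx.any():
            m = Z[idx].mean(axis=0)
            M += idx.mean() * np.outer(m, m)
    return M

def ladle(X, Y, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kmax = p - 1
    lam, vec = np.linalg.eigh(candidate_matrix(X, Y))
    lam, vec = lam[::-1], vec[:, ::-1]           # descending eigenvalues
    # Bootstrap variability of the leading k-dimensional eigenspaces
    f0 = np.zeros(kmax + 1)                      # f0(0) = 0 by convention
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        vb = np.linalg.eigh(candidate_matrix(X[idx], Y[idx]))[1][:, ::-1]
        for k in range(1, kmax + 1):
            f0[k] += 1.0 - abs(np.linalg.det(vec[:, :k].T @ vb[:, :k]))
    f0 /= n_boot
    fn = f0 / (1.0 + f0.sum())                   # normalized Ye-Weiss part
    phi = lam[:kmax + 1] / (1.0 + lam.sum())     # shifted scree part
    return int(np.argmin(fn + phi))              # ladle estimate of d

rng = np.random.default_rng(1)
n, p = 400, 8
X = rng.normal(size=(n, p))
Y = X[:, 0] + 0.1 * rng.normal(size=n)           # true dimension d = 1
print(ladle(X, Y))
```

With this strong single-direction signal the minimizer of g_n lands at k = 1, the true dimension.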

ASYMPTOTIC ANALYSIS OF PDWD
In this section we discuss the asymptotic properties of PDWD. We find the Hessian matrix and the influence function before proving consistency. We demonstrate consistency when p is fixed, as well as when p tends to infinity, although we still require p to be less than n. To make the proofs easier to read, we use the following notation. Let θ = (ψᵀ, t)ᵀ, Z = (Xᵀ, Ỹ)ᵀ, X* = (Xᵀ, −1)ᵀ and Σ* = diag(Σ, 0); then u = ỸθᵀX*, and thus the objective function can be written as E[m(θ, Z)], where

m(θ, Z) = θᵀΣ*θ + λf(ỸθᵀX*).

Let Ω_Z be the support of Z and let h : Θ × Ω_Z → R⁺ be a function of (θ, Z). Let D_θ denote the (p + 1)-dimensional column vector of differential operators (∂/∂θ₁, …, ∂/∂θ_{p+1})ᵀ. Before we consider the gradient of the DWD objective function, we prove that the function f is differentiable at all points.

Lemma 2. The function f, as defined in Lemma 1, is differentiable at all points.
Proof. Since f is clearly differentiable away from 1/√λ, we need only prove that the gradient of f as we approach 1/√λ from below is equal to the gradient as we approach from above. We have

lim_{u↑1/√λ} f′(u) = −λ and lim_{u↓1/√λ} f′(u) = −1/u² |_{u=1/√λ} = −λ,

so the two one-sided derivatives agree. ▪

The next theorem gives the gradient of the DWD objective function E[m(θ, Z)]. The proof follows directly from Lemma 2 and is therefore omitted. Let D²_θ denote the operator D_θ D_θᵀ; thus D²_θ m(θ, Z) is the (p + 1) × (p + 1) matrix whose (i, j)th entry is ∂²m/∂θᵢ∂θⱼ.

Theorem 2. The gradient of E[m(θ, Z)] takes the form

D_θ E[m(θ, Z)] = 2Σ*θ + λE[f′(ỸθᵀX*) Ỹ X*].
The next step is to find the Hessian matrix. Before doing so, we state some helpful results using the following notation. Let n(θ, z) = D_θ m(θ, z) and, for each θ ∈ Θ, let N_θ(n) be the set of z for which the function n(·, z) is not differentiable at θ. That is,

N_θ(n) = {z ∈ Ω_Z : n(·, z) is not differentiable at θ}.

Lemma 3. Suppose that (1) P[Z ∈ N_θ(n)] = 0 for every θ ∈ Θ, and (2) n(θ, z) is locally Lipschitz in θ with an integrable Lipschitz constant. Then D_θ[n(θ, Z)] is integrable, E[n(θ, Z)] is differentiable, and D_θ E[n(θ, Z)] = E[D_θ n(θ, Z)].
Lemma 4. For c > 0 we have the following identity. We now have the necessary results, which will be helpful in finding the Hessian matrix, as the following theorem states.
Proof. Let H(θ, a) denote the hyperplane {x : θᵀx = a}. We first need to verify the two conditions of Lemma 3. In our case, n(·, z) fails to be differentiable at θ only when x falls on a hyperplane of this form. Since the Lebesgue measure of H is 0 for ỹ ∈ {−1, 1}, by assumption 1 of the theorem the above probability is 0. Thus, condition 1 of Lemma 3 is satisfied.
Let n₁(θ, z) = Σ*θ and n₂(θ, z) = λf′(ỹθᵀx*) ỹ x*. Then n(θ, z) = 2n₁(θ, z) + n₂(θ, z). Since n₁ is nonrandom and differentiable, it obviously satisfies E[D_θ n₁(θ, Z)] = D_θ E[n₁(θ, Z)]. To verify that n₂ is Lipschitz, let θ₁, θ₂ ∈ R^{p+1}. From Lemma 4 we obtain a bound of the form ||n₂(θ₁, z) − n₂(θ₂, z)|| ≤ c(z)||θ₁ − θ₂|| with E[c(Z)] < ∞. This verifies condition 2 of Lemma 3. Finally, by direct calculation we find the expression for D_θ n(θ, z) for z ∉ N_θ(n). The theorem now follows from Lemma 3. ▪ The following theorem gives the influence function of PDWD. A similar result in the SVM literature can be found in Jiang, Zhang, and Cai (2008).
where H is the Hessian matrix defined previously.
Proof. Let θ_a = (ψ_aᵀ, t_a)ᵀ. Since H is the Hessian of a convex function, we can establish that it is symmetric and positive definite.
We can see that A_n(s) is convex with respect to s and is therefore minimized by √n(θ̂ − θ₀). Now we can decompose A_n(s) into a quadratic term plus remainder terms, where r_{n,0}(s) = o(||s||²) → 0 for fixed s and the stochastic remainder converges to 0 in probability, since it has mean zero and variance o(||s||²).
Since H is positive definite and the covariance matrix var[X] is finite, it follows from the basic corollary of Hjort and Pollard (1993) that (23) holds. Then it can be shown that D(θ_{0r}, z) = −n⁻¹ F_r Σ_{i=1}^{n} B̂ᵢ(z) + o_p(n^{−1/2}).
We start by considering the first term on the right-hand side. Before we consider the remaining terms of the above equation, we first note that for a, b ∈ R and c > 0 the identity of Lemma 4 applies, and the last terms can be bounded accordingly.

NONLINEAR FEATURE EXTRACTION
Following the definition in Li et al. (2011), we say that a function φ ∈ H is unbiased for nonlinear SDR if it has a version that is measurable with respect to σ{X}. Using this, we prove the following theorem, which shows that the minimizer of the objective function (31) estimates the CS.

Theorem 6. Suppose the mapping H → L₂(P_X), φ ↦ φ, is continuous. If (φ*, t*) minimizes (31) among all (φ, t) ∈ H × R, then φ*(X) is unbiased.
Proof. Similarly to Theorem 1, using the convexity of f and Jensen's inequality, if there is no version of φ that is measurable with respect to σ{X}, then the corresponding inequality becomes strict. Since H ⊂ L₂(P_X) and the conditional expectation of φ belongs to L₂(P_X), for any ε > 0 there is a φ₁ ∈ H such that ||φ₁ − E(φ | ·)||_{L₂(P_X)} < ε.
By Lemma 5, we can choose ε sufficiently small so that Λ(φ, t) > Λ(φ₁, t), which means φ cannot be φ*. ▪

Estimation algorithm
Let  be a linear space of functions from Ω X to R spanned by  n = { 1 , … , n }. These functions are chosen, such that, Hence the sample version of (31) becomeŝ where c ∈ R n and i the ith column of . This problem differs from the kernel objective function, given in Wang and Zou (2018), where T is replaced by the kernel matrix K n = { (i, j) ∶ i, j = 1, … , n} for some positive definite bivariate mapping ∶ Ω X × Ω X → R. For the function class , the reproducing kernel Hilbert space is based on the mapping . Many choices of exist, some of the more popular choices are the radial basis kernel, the polynomial kernel and many more. To be more exact Wang and Zou (2018) proposed the use of the following equation iteratively until convergence for solving their objective function in the classification framework: Finally, This of course was to address the classification problem that DWD is proposed for. Since we are interested for dimension reduction, our objective function is different and takes the form (32). By replacing K n with T we have the formulas for the dimension reduction framework to be: Finally, As we can see the solution now does not depend on K n but rather on T . Now, let Q n = I n − J n /n, where I n is the n × n identity matrix and J n is an n × n matrix with entries 1. The following Proposition was used in Li et al. (2011) to show that the eigenfucntions of the operator Σ n can be instead estimated by using the eigenvectors of Q n K n Q n .
The following statements are equivalent:

1. v is an eigenvector of the matrix Q_nK_nQ_n with eigenvalue λ.
2. The corresponding function w is an eigenfunction of the operator Σ_n with eigenvalue λ/n.
If λ ≠ 0, then either statement implies that (φ(X₁), …, φ(X_n))ᵀ can be computed from v. As mentioned above, in the derivations for our problem we need ΓᵀΓ, and one can estimate Γ using the above proposition as Γ = W = (v₁, …, v_n). Since each vᵢ is an eigenvector of Q_nK_nQ_n, ΓᵀΓ becomes the identity matrix, and the objective function in (32) becomes independent of X. For this reason, we propose a slight modification that does not affect our theoretical results: we replace Γ in (32) with Γ̃ = K_n^{1/2}W, to reintroduce the dependence of the problem and its solution on X. Therefore, the objective function we solve is (32) with Γ replaced by Γ̃. Notice that, with this modification in DWD, we achieve two things in comparison with the PSVM algorithm. First, we remove one tuning parameter by not having to choose k (less than n) basis functions; that is, our matrix is an n × n matrix and not a k × n matrix. Second, we gain the ability to estimate the sufficient predictors directly, removing one step from the algorithm. Therefore, the kernel PDWD estimation algorithm is as follows:

1. (Optional) Marginally standardize X₁, …, X_n. This step can be omitted if the components of Xᵢ have similar variances.
2. Choose a kernel κ and create the kernel matrix K_n. Find the eigenvectors v₁, …, v_n of Q_nK_nQ_n and calculate Γ̃ = K_n^{1/2}W.
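A minimal Python sketch of step 2, assuming a Gaussian kernel with an illustrative bandwidth γ; K_n^{1/2} is obtained from the eigendecomposition of the (positive semidefinite) kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))

# Gaussian (radial basis) kernel matrix K_n; gamma is an illustrative bandwidth
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq)

# Centering matrix Q_n = I - J/n and the doubly centered kernel Q K Q
Q = np.eye(n) - np.ones((n, n)) / n
evals, W = np.linalg.eigh(Q @ K @ Q)
W = W[:, ::-1]                      # descending eigenvalue order

# K^{1/2} via the eigendecomposition of K, then Gamma_tilde = K^{1/2} W
kvals, kvecs = np.linalg.eigh(K)
K_half = kvecs @ np.diag(np.clip(kvals, 0.0, None) ** 0.5) @ kvecs.T
Gamma_tilde = K_half @ W

print(Gamma_tilde.shape)  # (50, 50)
```

Since the columns of W are orthonormal, ΓᵀΓ = WᵀW = I_n, which is exactly why the rescaling by K_n^{1/2} is needed to reintroduce dependence on X.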

NUMERICAL STUDIES
In this section we demonstrate the advantages of PDWD over PSVM through a simulation study and through a real data experiment.

Simulation studies
We use the following three synthetic models (Models I–III), where X ∼ N(0, I_p), ε ∼ N(0, 1) and σ = 0.2. We choose n = 100 and h = 20 unless stated otherwise. Although all models use only two predictors, we add noise to the data by introducing an appropriate number of inactive predictors, so that p takes the values p = 20, 30, 50, 100. Also, notice that for the first model the effective dimension is d = 1, and for the other two models d = 2.
We will use the distance measure defined in Li et al. (2005) to assess the performance of the algorithms. Let β ∈ R^{p×d} denote a basis of the central space and let β̂ be its estimator. Then we measure the performance of β̂ with the distance

D(β̂, β) = ||P_β̂ − P_β||,

where P_β = β(βᵀβ)⁻¹βᵀ is the projection matrix onto span(β) and ||·|| is the Frobenius norm.
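This distance is straightforward to compute; the following sketch (with arbitrary example bases) returns 0 when the two matrices span the same subspace and √(2d) when they span orthogonal d-dimensional subspaces.

```python
import numpy as np

def projection(B):
    """Projection matrix onto the column space of B."""
    return B @ np.linalg.inv(B.T @ B) @ B.T

def subspace_distance(B_hat, B):
    """Frobenius norm of the difference of the two projection matrices."""
    return np.linalg.norm(projection(B_hat) - projection(B), "fro")

# The distance is 0 for any two bases of the same space ...
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
B_same = B @ np.array([[2.0, 1.0], [0.0, 3.0]])  # different basis, same span
print(subspace_distance(B_same, B))  # ≈ 0

# ... and sqrt(2d) = 2 for orthogonal d = 2 dimensional subspaces
B_orth = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(subspace_distance(B_orth, B))  # 2.0
```

The measure is basis-invariant because it compares projection matrices rather than the bases themselves.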
We compare our method with PSVM, and the results are shown in Table 1. The results show that PDWD and PSVM have similar performance for values of p close to n or close to 0, but for values in between PDWD has a clear advantage. In the classification literature (see Marron et al., 2007), it was shown that DWD clearly outperforms SVM for larger p, due to SVM suffering from data piling. We believe the fact that here the two methods become equivalent as p tends to n is due to the different nature of the problem. Recall that, while in classification the performance of a classifier is measured by the percentage of correctly classified points, which is hindered by data piling, here we are interested in dimension reduction through hyperplane alignment. It seems that in the dimension reduction framework data piling actually "hinders" the performance of both PSVM and PDWD by causing them to overfit the data, which is why the performance of the two algorithms becomes equivalent as p gets closer to n.

Computational time
As mentioned earlier, using the newly developed algorithm for DWD by Wang and Zou (2018) gives a computational advantage: the computational cost of Principal DWD is much lower than that of PSVM. We emphasize that, when Li et al. (2011) proposed PSVM, they identified the need for quadratic programming, with its higher computational cost, as probably the only disadvantage of PSVM compared with earlier methods based on inverse moments. As Figure 3 indicates, there is a huge difference in time as n increases (with p constant), while the difference stays relatively constant as p increases (with n constant).

Order determination
We consider the three models discussed before. As mentioned earlier, Model I has effective dimension 1 and Models II and III have effective dimension 2. We run 1,000 simulation experiments with n = 300, σ = 0.2 and h = 20. Table 2 shows the proportion of correct estimates as p increases. This is a very promising result, as it demonstrates that the performance of the algorithm does not suffer much when the dimension is increased; as p increases, the proportion of correct estimates for Models II and III decreases slightly but remains high.

Kernel PDWD
We consider Model III and the following new model:

Model IV: Y = (X₁² + X₂²)^{1/2} log((X₁² + X₂²)^{1/2}) + ε,

where X ∼ N(0, I_p) and ε ∼ N(0, 1). In this section we compare kernel PDWD (KPDWD) with kernel PSVM (KPSVM). In the same format as the nonlinear comparisons in Li et al. (2011), we use the absolute value of Spearman's correlation to measure the closeness of the estimated predictors to the true predictors. We choose n = 100, λ = 1, p = 10, 20, 30, and h = 20. The absolute Spearman correlation takes values between 0 and 1, with larger values indicating better performance. Using the Gaussian kernel basis, Table 3 shows that kernel PDWD outperforms kernel PSVM for both models. It is also clear that the performance of kernel PDWD remains good as p increases.

Real dataset: Concrete slump test
We now turn our attention to real data analysis. Our aim is to assess the effect of introducing noise variables to the data. Consider the Concrete slump data analyzed in Yeh (1998). We use Compressive Strength as the response variable. There are seven predictor variables: cement, slag, fly ash, water, superplasticizer (SP), coarse aggregate, and fine aggregate.

TABLE 4 Distances as extra predictors are added to the dataset. Each column adds a different number of predictors, and we report the distance of the estimated Central Space (CS) from the "oracle" CS, that is, the one estimated when only the original predictors are used.

Using the original predictors, we first calculate the directions which span the CS estimated by each method. Then we add extra predictors to the dataset, randomly drawn from a standard normal distribution, and calculate the new ψ's that span the central space using the two methods. We calculate the distance of each new estimator from the original one, that is, the one calculated from the original predictors. Table 4 shows these distances for each of the two methods, PDWD and PSVM, and for different numbers of added predictors (3, 10, 30, 50, 90). We can see that the PDWD estimator moves a lot less than the PSVM estimator. The third line of results in Table 4, labeled "Compared", shows the distance between the estimated PDWD and PSVM directions. It is clear that the two directions start far apart for p = 3 and get closer (meaning they estimate similar directions) as p increases, before moving apart again at p = 90.

DISCUSSION
In this paper we propose a different classification-based algorithm for SDR. The newly proposed algorithm is based on DWD, proposed by Marron et al. (2007). The main advantages of the new method are, first, that its performance is not affected by extra noninformative predictors in the dataset and, second, that it is computationally faster than previously proposed SVM-based algorithms. The theoretical properties of the new method are studied in detail; in particular, we prove that the asymptotic theory holds for fixed p.
While DWD was proposed in the classification framework to address the data piling problem that SVM suffers from in very high dimensions, we can see in this paper that the same motivation does not carry over to the SDR framework that PDWD and PSVM were proposed for. Instead, PDWD outperforms PSVM for low-dimensional problems, while the estimates of the two algorithms come closer together as the dimension of the problem increases. Also, the computational advantage of PDWD as the number of observations increases can be crucial in an era where massive datasets are becoming increasingly common.