Residual's influence index (RINFIN), bad leverage and unmasking in high dimensional L2‐regression

In linear regression of Y on X (∈ R^p) with parameters β (∈ R^{p+1}), statistical inference is unreliable when observations are obtained from the gross-error model, F_{ε,G} = (1 − ε)F + εG, instead of the assumed probability F; G is the gross-error probability, 0 < ε < 1. The residual's influence index (RINFIN) at (x, y) is introduced, with components measuring also the local influence of x on the residual; a large value flags a bad leverage case (from G) and thus provides unmasking. Large sample properties of RINFIN are presented to confirm the significance of the findings, but often the large difference in the RINFIN scores of the data is indicative. RINFIN is successful with microarray data, with simulated high dimensional data, and with classic regression data sets. RINFIN's performance improves as p increases, and it can be used in multiple response linear regression.


INTRODUCTION
Tukey [29, p. 60] wrote: "Procedures of diagnosis, and procedures to extract indications rather than extract conclusions, will have to play a large part in the future of data analyses and graphical techniques offer great possibilities in both areas." This philosophy is widely adopted nowadays in Data Science, realizes ASA's hope for a "post-P-value era" [31] and motivates this work.
Data cleaning should precede statistical analysis. In linear regression of Y on X with parameters β, it is often erroneously assumed that the data follow probability F instead of the gross-error model F_{ε,G} = (1 − ε)F + εG [19]; G is the gross-error probability, 0 < ε < 1, Y ∈ R, X ∈ R^p, β ∈ R^{p+1}. A case (x, y) with x far away from the bulk of F's factor space is called a "leverage" case [25] and influences the statistical analysis. In particular, a "bad" leverage case (x, y) from G forces the regression hyperplane determined by F (the F-regression) and the associated F-residuals to change drastically when x becomes more remote. The goal of this work is to provide a simple and easy to implement procedure extracting indications for flagging bad leverage cases (from G) in least squares (L2) regression, without the use of P-values and hypothesis testing that lead to a hard "yes-no" decision.
A tool indicating the influence of (x, y) on the value of a statistical functional, T(F), is T's influence function, IF(x, y; T, F) [13]. In linear L2-regression, the empirical influence function of a non-robustified estimator, β̂_i, of β_i measures the change of β̂_i when (x, y) is added to the sample, but suffers from the masking effect that is due to neighboring cases of (x, y) in the sample. For example, from (10), in simple linear L2-regression with sample (x_1, y_1), …, (x_n, y_n), the empirical influence function of the slope at (x, y) has the form C ⋅ r ⋅ (x − x̄_n); r is the residual of (x, y), C is independent of x, y. If (x, y) is one of a few neighboring bad leverage cases in the sample, the difference (x − x̄_n) will have large absolute value whereas r may be near zero, thus |r ⋅ (x − x̄_n)| may take a moderate value, masking (x, y). To the contrary, from (12), the x-derivative of the slope's influence function measures local influence of (x, y) (see (7)) and separates the factors of the influence function, obtaining instead the sum of C ⋅ r and −C ⋅ β̂_1(x − x̄_n), which has large absolute value even when x is masked and r is near 0; β̂_1 is the L2-estimate of the slope. The index, RINFIN, introduced to flag bad leverage cases in multiple linear L2-regression, depends on the x-partial derivatives of the influence functions of the regression coefficients and shares the advantage of the factors' separation in β̂_i's influence function, i = 0, …, p. The same holds for L2-regression with a diagonal matrix of weights, W, independent of x, r. A numerical illustration of this unmasking follows.
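As an illustration (ours, with all constants chosen only for display), the following Python sketch plants a small cluster of bad leverage cases and compares the empirical influence function of the slope with its x-derivative; the cluster's small residuals can mask the former but not the latter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean sample from F: y = 2 + 0.5 x + e.
n = 100
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, n)

# Plant 5 neighboring bad leverage cases: remote in x, offset from the F-line.
x[:5] = rng.normal(8.0, 0.05, 5)
y[:5] = 2.0 + 0.5 * x[:5] - 1.5

b1, b0 = np.polyfit(x, y, 1)              # L2-fit on the contaminated sample
r = y - b0 - b1 * x                       # residuals of the fit
C = 1.0 / x.var()                         # common factor 1/Var(X)

emp_if = C * r * (x - x.mean())           # empirical IF of the slope, cf. (10)
d_emp_if = C * (r - b1 * (x - x.mean()))  # its x-derivative, cf. (12)

# The cluster attracts the fit, so its residuals are small and |emp_if| need
# not stand out (masking); the additive term -b1*(x - x_bar) keeps the
# derivative large at remote x (unmasking).
print("clean max |IF| :", np.abs(emp_if[5:]).max(),
      "  bad max |IF| :", np.abs(emp_if[:5]).max())
print("clean max |IF'|:", np.abs(d_emp_if[5:]).max(),
      "  bad min |IF'|:", np.abs(d_emp_if[:5]).min())
```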
The x-partial derivatives of the regression coefficients' influence functions appear naturally in changes of regression residuals for small x-perturbations under models F and F_{ε,x,y}; see Section 4.2. The so-obtained (population) index is

RINFIN(x, y; ε, β) = ε² Σ_{i=1}^p [IF_i + IF′_{x_i,0} + Σ_{j=1}^p x_j IF′_{x_i,j}]²;   (1)

IF_j denotes the influence function of the jth regression coefficient and IF′_{x_i,j} its x_i-partial derivative, j = 0, 1, …, p. Note that RINFIN uses also the information in the influence functions. For simple linear regression, (1) becomes ε²[2r(x − EX)/Var X − β_1(1 + (x − EX)²/Var X)]², with L2-residual (r), slope (β_1), mean (EX), and variance (Var X) all under F. RINFINABS is obtained by replacing in the sum in (1) the squares by absolute values. Using G's averages, (x̄, ȳ), RINFIN(x̄, ȳ; ε, β) is calculated. Asymptotic properties of RINFIN(x, y; ε, β̂_n) are also presented; β̂_n is β's L2-estimate.
In practice, the sample RINFIN score, RINFIN(x, y; 1/n, β̂_n), is obtained for every (x, y) in the sample. Large RINFIN scores provide indications for bad leverage cases. Since the percentage of G-cases in F_{ε,x,y} is expected to be 10% or less, potential bad leverage cases in the sample are those (x, y) with the 10% largest RINFIN scores (alternatively, the sum next to ε² in (1) is divided by p), and especially those with the same ordering in RINFINABS scores. The spacings of the ordered RINFIN scores are also informative. The 10% threshold value can be replaced by another value. Once a case is flagged, it can be grouped with neighboring cases and, using their average, (x̄, ȳ), and the group's proportion in the sample, ε, RINFIN scores can be re-calculated as described in Appendix C. Comparison of the findings with the ordering of the scores obtained with other indices, or with results of methods that extract conclusions using statistical significance, is informative. Proposition 5 can also be used to determine significance of the RINFIN scores. A sample-level sketch follows.
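The following Python sketch (ours; the function name and defaults are assumptions) implements the sample scores through the matrix form (33) and (34) of Section 4.3, estimating β and ℳ from the full sample for simplicity (Remark 5 suggests using the rest of the sample instead), together with the 10% flagging rule and the spacings:

```python
import numpy as np

def rinfin_scores(X, y, eps=None):
    """Sample RINFIN scores; a sketch of (33)-(34):
    IF = M^{-1} x~ r and IF'_{x_i} = M^{-1}(r e_i - beta_i x~)."""
    n, p = X.shape
    eps = 1.0 / n if eps is None else eps          # eps = 1/n for sample scores
    Xt = np.column_stack([np.ones(n), X])          # x~ = (1, x_1, ..., x_p)
    Minv = np.linalg.inv(Xt.T @ Xt / n)            # estimate of M^{-1}
    beta, *_ = np.linalg.lstsq(Xt, y, rcond=None)  # L2-estimate of beta
    res = y - Xt @ beta                            # residuals r
    scores = np.empty(n)
    for k in range(n):
        xt, r = Xt[k], res[k]
        IF = Minv @ xt * r                         # influence functions at (x_k, y_k)
        total = 0.0
        for i in range(1, p + 1):
            e = np.zeros(p + 1)
            e[i] = 1.0
            dIF = Minv @ (r * e - beta[i] * xt)    # (33): x_i-partials of the IFs
            total += (xt @ dIF + IF[i]) ** 2       # squared ith influence, as in (34)
        scores[k] = eps ** 2 * total
    return scores

# Flagging with the (adjustable) 10% rule, plus the ordered-score spacings:
# scores = rinfin_scores(X, y)
# order = np.argsort(scores)[::-1]
# flagged = order[: int(np.ceil(0.10 * len(scores)))]
# spacings = -np.diff(scores[order])
```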
When n is smaller than p, sample RINFIN values are calculated sequentially, for the y-response and sub-vectors of x-covariates with dimension q < n; p is a multiple of q. For each case, the total of its p/q sample RINFIN values is its RINFIN score; a sketch follows. RINFIN can also be used with multiple response linear regression, adding for (x, y) the sample RINFIN scores for each response. The RINFIN approach can be used for LASSO and ridge regression with the influence functions derived in Öllerer et al. [24].
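Under the same assumptions as the sketch above, the sequential computation could read as follows (when p is not a multiple of q, the slicing simply yields a shorter last block):

```python
def rinfin_scores_blocks(X, y, q):
    """n < p: total of the block-wise RINFIN scores per case (sketch)."""
    n, p = X.shape
    total = np.zeros(n)
    for j in range(0, p, q):                 # disjoint blocks of q covariates
        total += rinfin_scores(X[:, j:j + q], y)
    return total
```

For multiple response regression, the analogous step adds the scores obtained for each response column.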
RINFIN provides satisfactory results with high dimensional data. It is successful with the microarray data used in Zhao et al. [37, 38], for which n = 120 and p = 1500. In simulations with gross-error normal mixtures F, G and fixed sample size n, the misclassification proportion of G-cases using RINFIN(x, y; 1/n, β̂_n) decreases to zero as p increases, p < n. This blessing of high dimensionality is due to the "separation" of the mixture's components, measured, for example, by their Hellinger distance, as p increases ([34, 35], Section 8, Proposition 8.1). RINFIN also identifies bad leverage cases in classic regression data sets.
Due to the flood of Big Data, new influence measures have been recently introduced. In Genton and Ruiz-Gazen [11], an observation is influential "whenever a change in its value leads to a radical change in the estimate" and the hair-plot is used for visual identification. Local and global influence measures are proposed using partial derivatives of the estimate. She and Owen [28] have as goals outlier identification and robust coefficient estimation, both achieved using a non-convex sparsity criterion. Zhao et al. [37] propose a high dimensional influence measure (HIM) based on marginal correlations between the response and the individual covariates and the leave-one-out observation idea [32]. For a particular regression model, Zhao et al. [38] propose a novel procedure for multiple influential point detection (MIP).
In Section 2, RINFIN applications are presented. In Section 3, the derivative of the influence function is introduced for measuring local influence of (x, y); it is the main tool to obtain RINFIN scores. In Section 4, local influence of (x, y) on the L2-regression residual is studied, RINFIN and RINFINABS are defined and asymptotic properties of RINFIN's estimate are obtained. In Appendix A, the 𝒜-matrix is introduced to obtain a simple and insightful form for RINFIN when the X-covariates are uncorrelated. Proofs follow in Appendix B. Directions for the use of RINFIN with data are in Appendix C.

RINFIN and simulations, p < n
Data (X, Y) from probability F follow a linear regression model with p parameters, β = (1.5, 0.5, 0, 1, 0, 0, 1.5, 0, 0, 0, 1, 0, …, 0); when p < 11, β's first p coordinates are used. X is obtained from the p-dimensional normal distribution, 𝒩(0, Σ), with Σ's entries Σ_{i,j} = 0.5^{|j−i|}, 1 ≤ i, j ≤ p, as in Alfons et al. [1, p. 11]. For the gross-error model, F_{ε,G}, the proportion ε is 10%. For each contaminated X (from G) the first [κ ⋅ p] coordinates are independent, normal with mean μ and variance 1; 0 < κ ≤ 1, [x] denotes the integer part of x. Various values for μ, p, and κ are used, and p is smaller than the sample size n. The regression errors are independent, standard normal random variables. Each of the N = 100 simulated samples has size n = 100. Cases 1-10 are contaminated and are compared with those having the 10 largest sample RINFIN scores for calculating the misclassification proportion; a data-generation sketch follows.
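A minimal Python sketch of one such sample (the function name, the defaults, and the symbols mu/kappa, standing in for Greek letters lost in the text, are ours):

```python
import numpy as np

def simulate_sample(n=100, p=50, mu=5.0, kappa=0.5, eps=0.10, seed=0):
    """One gross-error sample following the Section 2 design (a sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.array(([1.5, 0.5, 0, 1, 0, 0, 1.5, 0, 0, 0, 1] + [0] * p)[:p], float)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = 0.5^|i-j|
    X = rng.multivariate_normal(np.zeros(p), Sigma, n)
    m = int(eps * n)                                     # contaminated cases 1..m
    k = int(kappa * p)                                   # first [kappa*p] coordinates
    X[:m, :k] = rng.normal(mu, 1.0, (m, k))              # G: N(mu, 1) coordinates
    y = X @ beta + rng.standard_normal(n)                # standard normal errors
    return X, y
```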
In Table 1, the misclassification proportion decreases as p increases, except for an anomaly when p = 90 due to its proximity to p = n = 100, for which L2-regression breaks down. By increasing n to 150 cases this anomaly disappears; for example, for μ = 1 the misclassification proportion is 0.105.
In Table 2, for fixed contamination proportion κ in the first [κ ⋅ p] x-coordinates, the RINFIN misclassification proportion decreases as p increases. The anomaly near p = n is still present; a driver for this experiment is sketched below.
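A hypothetical driver, reusing the sketches above, that mirrors how the misclassification proportion is computed (our code, not the paper's):

```python
def misclassification(N=100, **kw):
    """Average fraction of the 10 contaminated cases missed by the
    top-10 RINFIN rule over N simulated samples (sketch)."""
    missed = 0
    for s in range(N):
        X, y = simulate_sample(seed=s, **kw)
        top10 = set(np.argsort(rinfin_scores(X, y))[::-1][:10])
        missed += len(set(range(10)) - top10)   # cases 1..10 are contaminated
    return missed / (10 * N)
```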

RINFIN and real, high dimensional data, p > n
RINFIN is used for the microarray data in Zhao et al. [38], obtained from Chiang et al. [6] and previously analyzed by Zhao et al. [37]: 120 twelve-week-old male rats were selected for tissue harvesting from the eyes. The microarray contains over 30,000 different probe sets. Probe gene TRIM32 is used as the response and the covariates are the 1500 genes most correlated with it.
Since n = 120 < p = 1500, RINFIN values are calculated for the response TRIM32 and each group of 100 x-covariates partitioning the microarray data, with coordinates 100(j − 1) + 1, …, 100j, 1 ≤ j ≤ 15. For each of the 120 cases, the total of its 15 RINFIN values is its score; with the hypothetical helpers sketched in the Introduction, the computation is outlined below. In Table 3, the cases with the 16 highest RINFIN scores are provided; more than 10% of the cases are presented in order to get an idea of the spacings of the successive scores.
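With the block-wise helper sketched in the Introduction, this computation would read (variable names are ours and hypothetical):

```python
# X_micro: 120 x 1500 expression matrix; y_trim32: TRIM32 response.
scores = rinfin_scores_blocks(X_micro, y_trim32, q=100)  # 15 blocks of 100 covariates
top16 = np.argsort(scores)[::-1][:16]                    # cases reported in Table 3
```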
Indications for leverage cases from G in the gross-error model are given for cases 80, 95, 32, 120, and 59, after which the spacings in the RINFIN scores are reduced. In Table 4, the 16 highest RINFINABS scores are provided. Cases 80, 95, 32, 120, and 59 still have the same order as in Table 3, but the order of the remaining cases changes.
Cases 80, 95, 32, 120, and 59 are also supported by diagnostics HIM and MIP, which are based, respectively, on analyses of correlations and covariances, whereas RINFIN is based on perturbations of residuals. According to Leng [23], diagnostic HIM [37] identifies in addition cases 28 and 75. Case 28 appears in Tables 3 and 4, at the top 12% of the 120 RINFIN scores. Case 75 is identified by both RINFINs as the case with the 21st largest RINFIN score, at the top 17.5% of the RINFIN scores. RINFIN values of case 75 were not in the top 10 RINFIN values in any of the 15 groups partitioning the microarray data. RINFIN values of case 28 were ranked 9th in the group of coordinates 1-100 and 10th in the group of coordinates 101-200. The total square distances of cases 75 and 28 from the means of the p = 1500 coordinates were at the 80th percentile of all the distances. However, their maximum square distances over all coordinates were between the 40th and 50th percentiles of all cases. RINFIN targets bad leverage cases, and cases 28 and 75 do not fall in this category.

RINFINABS and classic regression data sets
The top 6-8 RINFINABS scores are obtained using (35), with a sum of absolute values instead of RINFIN's sum of squares, for 6 known data sets; those without references are in Rousseeuw and Leroy [25]. In the Kootenay River data, case 4 is remote and has a large RINFINABS score compared with the rest (see Table 5).
In the Hadi and Simonoff [12] data, cases 1-3, 4, and 17 are more distant from the origin than the remaining cases. RINFINABS scores, obtained for each case and after grouping, indicate that cases 1-3 are bad leverage cases; Hadi and Simonoff [12] identify these as the true outliers (see Table 7).
In the education data, case 50 (Alaska) is far from the origin and its RINFINABS score, compared with those of the rest, indicates bad leverage (see Table 8). In the salinity data, Carroll and Ruppert [5] indicate that the remote case 16 and the other influential case 3 are masking case 5. RINFINABS scores support the findings for case 16 but not the masking of case 5 (see Table 9).
In the modified wood data, cases 4, 6, 8, 19 are neighboring and remote in each x-coordinate. RINFINABS scores, obtained for each case separately and with cases 4, 6, 8, 19 as a group, support that these are bad leverage cases (see Table 10).

LOCAL INFLUENCE AND THE DERIVATIVE OF THE INFLUENCE FUNCTION
To study local perturbation of the residual, r, which provides RINFIN, the derivative of r's influence function is used. Hampel [13] introduced the influence function of T at F,

IF(x; T, F) = lim_{ε→0} [T((1 − ε)F + εΔ_x) − T(F)]/ε,

when this limit exists; x(∈ R^p), F is a probability, Δ_x is the probability with all its mass at x, 0 < ε < 1. T(F) is usually a parameter of model F, for example, the expected value of a random variable X from F with T(F) = E_F(X). IF(x; T, F) determines the "bias" in the value of T at F when using instead (1 − ε)F + εΔ_x:

T((1 − ε)F + εΔ_x) − T(F) ≈ ε ⋅ IF(x; T, F);

"≈" is used since o(ε) terms are neglected. Hampel [14, p. 389] introduced also local-shift-sensitivity,

λ* = sup_{x≠y} ||IF(x; T, F) − IF(y; T, F)||/||x − y||,

as "a measure for the worst (approximate) effect of wiggling the observations"; ||⋅|| is a Euclidean distance in R^p.
Local-shift-sensitivity was never fully exploited. One reason is that, in reality, it is a "global" measure, as a supremum over all x, y. Thus, λ* cannot be used to study the bias in the value of T at (1 − ε)F + εΔ_x for x's small perturbation, from x to x + h, with ||h|| small, that is,

T((1 − ε)F + εΔ_{x+h}) − T((1 − ε)F + εΔ_x).   (6)

Local influence (6) is measured by the partial derivatives of the influence function, as the next lemma indicates for x ∈ R and is confirmed for the residuals in Proposition 2 with x ∈ R^p.

Lemma 1. Assume that F is defined on the real line and that (6) is evaluated at the neighboring points x, x + h. Then

lim_{h→0} lim_{ε→0} [T((1 − ε)F + εΔ_{x+h}) − T((1 − ε)F + εΔ_x)]/(εh) = IF′(x; T, F),   (7)

when the limit exists. The limits in (7) can be interchanged without affecting the limit, for example, for any function g for which the derivative g′ exists and

T(F) = ∫ g(y)dF(y);

then IF(x; T, F) = g(x) − T(F) and IF′(x; T, F) = g′(x).
IF′(x; T, F) is used to approximate local influence (6) for small ε, |h|:

T((1 − ε)F + εΔ_{x+h}) − T((1 − ε)F + εΔ_x) ≈ εh ⋅ IF′(x; T, F).   (8)

(8) is the main tool used to approximate L2-residuals of gross-error models and determine RINFIN. When (8) is used, the influence function's derivative is always evaluated at F.

Example 1. Consider a simple, linear regression model, Y = β_0 + β_1X + e, with error e having mean zero and finite second moment; F is the joint distribution of (X, Y).
The influence functions for the L2-parameters β_0(F), β_1(F) at (x, y) are

IF(x, y; β_0, F) = r ⋅ (EX² − xEX)/Var X,   (9)
IF(x, y; β_1, F) = r ⋅ (x − EX)/Var X,   (10)

with r = y − β_0(F) − β_1(F)x; EU and Var(U) denote, respectively, U's mean and variance. The x(-partial)-derivatives of (9), (10) are

IF′_x(x, y; β_0, F) = −[β_1(EX² − xEX) + rEX]/Var X,   (11)
IF′_x(x, y; β_1, F) = [r − β_1(x − EX)]/Var X.   (12)

Observe in (9) and (10) the multiplicative effects of r with (x − EX) and EX² − xEX and their conversions to additive effects in (11) and (12).
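The conversions (11) and (12) can be checked symbolically; the following sympy sketch is our verification, not part of the paper's derivations:

```python
import sympy as sp

# Symbolic check of the derivatives (11) and (12) of (9) and (10).
x, y, b0, b1, EX, EX2 = sp.symbols('x y beta0 beta1 EX EXsq')
VarX = EX2 - EX**2
r = y - b0 - b1 * x                           # L2-residual at (x, y) under F
IF0 = r * (EX2 - x * EX) / VarX               # (9)
IF1 = r * (x - EX) / VarX                     # (10)
d0 = -(b1 * (EX2 - x * EX) + r * EX) / VarX   # claimed (11)
d1 = (r - b1 * (x - EX)) / VarX               # claimed (12)
assert sp.simplify(sp.diff(IF0, x) - d0) == 0
assert sp.simplify(sp.diff(IF1, x) - d1) == 0
```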

The model and the assumptions

The regression model is

Y = β_0 + β_1X_1 + · · · + β_pX_p + e,   (13)

with the assumptions:
(1) The error, e, has mean zero and finite second moment.
(2) Case (x, y) is mixed with cases from model F with probability ε (model F_{ε,x,y}).
The L2-regression coefficients are obtained minimizing Ee²; E denotes expected value.
RINFIN has a simple form providing insight when an additional assumption is used:
(3) X_1, …, X_p are uncorrelated random variables.
Assumption (3) is not necessary to use RINFIN in practice; see (34) and Remark 5.

Notation
The jth regression coefficient obtained by L2-minimization at model F_{ε,u,v} is denoted by β_j(F_{ε,u,v}), j = 0, 1, …, p, and their vector by β(F_{ε,u,v}). The residual of (x, y) under F is

r(x, y; F) = y − β_0(F) − Σ_{j=1}^p β_j(F)x_j;   (14)

r is also used to denote r(x, y; F).
Add h in the ith component of x, with h ∈ R, |h| small, to obtain

x_{i,h} = (x_1, …, x_{i−1}, x_i + h, x_{i+1}, …, x_p),   (15)

such that (x_{i,h}, y), (x, y + h) are small perturbations of (x, y).
The influence function of β_j is evaluated at (x, y) for F, thus use

IF_j = IF(x, y; β_j, F),  IF′_{v,j} = ∂IF(x, y; β_j, F)/∂v,  v ∈ {x_1, …, x_p, y};   (16)

that is, in words, IF′_{v,j} is the partial derivative of IF_j with respect to v, j = 0, 1, …, p.

Influence functions
Influence functions of L2-regression coefficients at F are solutions of the system:

IF_0 + IF_1EX_1 + · · · + IF_pEX_p = r(x, y; F),   (17)
IF_0EX_i + IF_1EX_iX_1 + · · · + IF_jEX_iX_j + · · · + IF_pEX_iX_p = x_i r(x, y; F),  i = 1, …, p.   (18)

Equations (17) and (18) are obtained by interchanging in the normal equations the expected value with the partial derivatives and, after evaluation at the models H = F and H = (1 − ε)F + εΔ_{(x,y)}, subtracting the equations for the ith partial derivative for both models, dividing by ε and taking the limit as ε converges to zero.

The influence functions in (17) and (18) are now provided when, in addition, (3) holds. With an additional assumption on the error, e, influence functions of L1-regression coefficients have also been obtained ([36], Proposition 3.2).

Proposition 1. For regression model (13) with assumptions (1)-(3) and notation (16), the influence functions of L2-regression coefficients at (x, y) for model F are:

IF_0 = r(x, y; F)[1 − Σ_{j=1}^p EX_j(x_j − EX_j)/Var X_j],   (19)
IF_j = r(x, y; F)(x_j − EX_j)/Var X_j,  j = 1, …, p.   (20)
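A sympy sketch (ours, independent of the Appendix B proofs) checking, for p = 2 and uncorrelated covariates, that the Proposition 1 forms solve the system (17)-(18):

```python
import sympy as sp

x1, x2, r = sp.symbols('x1 x2 r')
E1, E2, S1, S2 = sp.symbols('EX1 EX2 EX1sq EX2sq')   # EX_i and EX_i^2
V1, V2 = S1 - E1**2, S2 - E2**2                      # Var X_i

M = sp.Matrix([[1,  E1,      E2],
               [E1, S1,      E1 * E2],               # EX1X2 = EX1*EX2 under (3)
               [E2, E2 * E1, S2]])
q = sp.Matrix([r, x1 * r, x2 * r])
IF = M.LUsolve(q)                                    # solution of (17)-(18)

IF0 = r * (1 - E1 * (x1 - E1) / V1 - E2 * (x2 - E2) / V2)   # (19)
IF1 = r * (x1 - E1) / V1                                    # (20), j = 1
IF2 = r * (x2 - E2) / V2                                    # (20), j = 2
for lhs, rhs in zip(IF, [IF0, IF1, IF2]):
    assert sp.simplify(lhs - rhs) == 0
```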

Perturbations of L2-residuals for models F and F_{ε,x,y}
The goal is to compare small (x, y)-residual changes in L2-regressions for F_{ε,x,y} and F: (i) when x is perturbed to x_{i,h} and (ii) when y is perturbed to y + h. A lemma follows that is used repeatedly to calculate the residuals' differences (i) and (ii).

Lemma 2. For regression model (13) with (1), (2) and ε, |h| both small, it holds:

β_j(F_{ε,x_{i,h},y}) − β_j(F_{ε,x,y}) ≈ εh ⋅ IF′_{x_i,j},  β_j(F_{ε,x,y+h}) − β_j(F_{ε,x,y}) ≈ εh ⋅ IF′_{y,j},  j = 0, 1, …, p.   (21)
Proposition 2. For regression model (13) with (1), (2), x_{i,h} the perturbation of x (see (15)) and for ε and |h| both small:
a. the difference of the (x_{i,h}, y)- and (x, y)-residuals at F_{ε,x_{i,h},y} and F_{ε,x,y}, respectively, is:

r(x_{i,h}, y; F_{ε,x_{i,h},y}) − r(x, y; F_{ε,x,y}) ≈ −h[β_i + ε(IF_i + IF′_{x_i,0} + Σ_{j=1}^p x_j IF′_{x_i,j})];   (22)

b. the difference of the (x, y + h)- and (x, y)-residuals at F_{ε,x,y+h} and F_{ε,x,y}, respectively, is:

r(x, y + h; F_{ε,x,y+h}) − r(x, y; F_{ε,x,y}) ≈ h[1 − ε(IF′_{y,0} + Σ_{j=1}^p x_j IF′_{y,j})].   (23)

Remark 3. The right side of (22) involves influence functions and their derivatives. An index using it to detect bad leverage is less affected by masking than diagnostics based solely on influence functions, as explained in the Introduction.

Residual's influence index, RINFIN(x, y; ε, β)
Local influence of (x, y) is determined using the distance of the residuals' partial derivatives at (x, y) for model F and gross-error model F_{ε,x,y}. The larger this distance is, the larger the local influence of (x, y).

Local x-influence on L2-residuals
For (x_{i,h}, y) and (x, y) both under model F, the residuals' x_i-partial derivatives coincide (both equal −β_i), so their difference vanishes:

∂r(x_{i,h}, y; F)/∂x_i − ∂r(x, y; F)/∂x_i = 0.   (27)

For gross-error models F_{ε,x,y}, F_{ε,x_{i,h},y}, the difference in partial derivatives of residuals is obtained from (22) for small ε,

−ε[IF_i + IF′_{x_i,0} + Σ_{j=1}^p x_j IF′_{x_i,j}].   (28)

From (27) and (28), the right side of (28) measures the influence of x's ith coordinate in the residual's derivative and provides the motivation for defining influence.

Definition 1. For gross-error model F_{ε,x,y},
a. the influence of x's ith coordinate on the L2-residual is

ε[IF_i + IF′_{x_i,0} + Σ_{j=1}^p x_j IF′_{x_i,j}],  i = 1, …, p;   (29)

b. RINFIN(x, y; ε, β) is the sum of the squares of the influences (29),

RINFIN(x, y; ε, β) = ε² Σ_{i=1}^p [IF_i + IF′_{x_i,0} + Σ_{j=1}^p x_j IF′_{x_i,j}]².   (30)

Equations (17) and (18) can be written in matrix notation,

ℳ ⋅ IF = q;   (31)

ℳ is the symmetric matrix of EX_i, EX_iX_j and 1, 1 ≤ i, j ≤ p, IF is the vector of the β-influence functions and q = (r(x, y; F), x_1r(x, y; F), …, x_pr(x, y; F))^T.
Using the notation, it is shown in Lemma B.2 for regression model (13) under (1), (2), (x, y) ∈ R^{p+1}, that

(IF′_{x_i,0}, IF′_{x_i,1}, …, IF′_{x_i,p})^T = ℳ^{−1}[r(x, y; F)e_i − β_i x̃],   (33)

with x̃ = (1, x_1, …, x_p)^T and e_i the (i + 1)th coordinate vector of R^{p+1}, and (30) is written

RINFIN(x, y; ε, β) = ε² Σ_{i=1}^p [x̃^T ℳ^{−1}(r(x, y; F)e_i − β_i x̃) + IF_i]².   (34)

Remark 5. To calculate the RINFIN value (34) of (x_i, y_i), the rest of the sample is used to estimate β, ℳ and r, i = 1, …, n. If one |β̂_i| is near zero, the effect of "bad leverage" remains in another term in (34).
Assuming in addition (3) and using (B2), a more accessible and insightful form of RINFIN is obtained.

Local y-influence on L2-residuals
For (x, y + h) and (x, y) both under model F, the residuals' y-partial derivatives coincide, whereas from (23) the difference of the residuals' y-partial derivatives at F_{ε,x,y+h} and F_{ε,x,y} is of size

ε[IF′_{y,0} + Σ_{j=1}^p x_j IF′_{y,j}].   (38)

Remark 6. From (38), the y-influence index is ε[IF′_{y,0} + Σ_{j=1}^p x_j IF′_{y,j}], which under (3) equals ε[1 + Σ_{j=1}^p (x_j − EX_j)²/Var X_j]; it is maximized for cases in the extremes of the x-coordinates. Thus, RINFIN is restricted to the influence of factor space cases.

Large sample properties of RINFIN(x, y; ε, β̂_n)
Consistency of RINFIN(x, y; ε, β̂_n) and its asymptotic distribution follow from properties of the least squares estimate β̂_n. For Proposition 5 the notation is changed: X(∈ R^{p+1}) has 1 as first coordinate, ℳ is EXX^T and x(∈ R^p) still denotes a factor space vector.

Proposition 5 (provided for completeness and for interested readers, even though P-values are not used herein). Let (X, Y), (X_1, Y_1), …, (X_n, Y_n) be independent, identically distributed random vectors following regression model (13), and let β̂_n be the least squares estimate of β. Then RINFIN(x, y; ε, β̂_n) is a consistent estimate of RINFIN(x, y; ε, β) and, suitably normalized, it is asymptotically normal.
Remark 7. RINFIN's advantage, that is, making additive the effects of x and r, remains for L2-regression with a diagonal weight matrix, W, independent of x, r; Proposition 5 still holds with known V(W) in (42). When W depends on x, r, the decomposition of the influence function in Dollinger and Staudte [10, Theorem 3, Equation (2)] indicates that RINFIN's advantage may not hold, depending on the form of the weights.

ACKNOWLEDGMENTS
Many thanks are due to Professor Bertrand Clarke, Editor, the AE, and the referees for comments that improved the presentation of the paper. Many thanks are also due to Professor Douglas Hawkins, for the useful comments and the encouragement about this work, and to Professor Chenlei Leng, who kindly communicated to us earlier the results in Zhao et al. [38] and provided the microarray data used in the applications. Thanks are due to Miss Eleni Yatracos, for confirming in 2017 properties of the 𝒜-matrix and that a referee's comment about RINFIN was erroneous.

DATA AVAILABILITY STATEMENT
Data source appears in the references.

APPENDIX A. -MATRIX
To prove Proposition 1, the general form of a symmetric, (n + 1) by (n + 1) matrix 𝒜_n is introduced. 𝒜_n's entries are motivated by the expected values in the equations' system (17) and (18): under assumption (3), the coefficients in the system of Equations (17) and (18) form an 𝒜_n-matrix; n is the covariates' dimension. As an illustration, for real numbers a, b, c, A, B, C,

𝒜_3 =
| 1  a   b   c  |
| a  A   ab  ac |
| b  ba  B   bc |
| c  ca  cb  C  |

For 𝒜_3, the corresponding linear regression model with uncorrelated covariates X_1, X_2, X_3 provides a = EX_1, b = EX_2, c = EX_3 and A = EX_1², B = EX_2², C = EX_3².

Definition A.1. The 𝒜_n-matrix with real entries, generated by the vector (1, a_1, …, a_n), has form:

𝒜_n =
| 1    a_1     a_2     …  a_n    |
| a_1  A_1     a_1a_2  …  a_1a_n |
| a_2  a_2a_1  A_2     …  a_2a_n |
| …                              |
| a_n  a_na_1  a_na_2  …  A_n    |

Notation: 𝒜_{n,−k} denotes the matrix obtained from 𝒜_n by deleting its kth column and kth row, 2 ≤ k ≤ n + 1.
Property of  n -matrix: Deleting the kth row and the kth column of  n -matrix, the obtained matrix  n,−k is  n−1 matrix formed by {1, a 1 , … , a n } − {a k−1 }, 2 ≤ k ≤ n + 1.
The cofactors of the 𝒜_n-matrix are needed to solve the system of Equations (17) and (18).
Observe that for 2 ≤ j ≤ n, cofactor C_{n+1,j} is obtained from a matrix whose last column is a multiple of its first column by a_n; thus, C_{n+1,j} = 0, j = 2, …, n.
For the matrix in cofactor C_{n+1,1}, observe that in its last column a_n is a common factor and, if taken out of the determinant, the remaining column is the vector generating 𝒜_{n−1}, that is, (1, a_1, …, a_{n−1}). With n − 1 successive interchanges to the left, this column becomes first and 𝒜_{n−1} appears. Thus, C_{n+1,1} = (−1)^{n+2}(−1)^{n−1} ⋅ a_n|𝒜_{n−1}| = −a_n|𝒜_{n−1}|.
For C_{i+1,1}, i > 0, after deletion of row (i + 1) in 𝒜_n, the remaining part of column (i + 1) in the cofactor's matrix is a multiple by a_i of the basic vector creating 𝒜_{n,−(i+1)}. Column 1 of 𝒜_n is also deleted, and for column (i + 1) in the cofactor's matrix to become first, (i − 1) exchanges of columns are needed. Thus, C_{i+1,1} = (−1)^{i+2}(−1)^{i−1} ⋅ a_i|𝒜_{n,−(i+1)}| = −a_i|𝒜_{n,−(i+1)}|. For C_{1,1}, we express |𝒜_n| as a sum of cofactors along the first row of 𝒜_n:

C_{1,1} + a_1C_{1,2} + · · · + a_nC_{1,n+1} = |𝒜_n| ⇒ C_{1,1} = Π_{k=1}^n (A_k − a_k²) + a_1² Π_{k≠1}(A_k − a_k²) + · · · + a_n² Π_{k≠n}(A_k − a_k²).
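The cofactor identities can be spot-checked symbolically; the following sympy sketch (ours) verifies the C_{1,1} formula and a vanishing cofactor for n = 3:

```python
import sympy as sp

a1, a2, a3, A1, A2, A3 = sp.symbols('a1 a2 a3 A1 A2 A3')
A = sp.Matrix([
    [1,  a1,      a2,      a3],
    [a1, A1,      a1 * a2, a1 * a3],
    [a2, a2 * a1, A2,      a2 * a3],
    [a3, a3 * a1, a3 * a2, A3],
])
d1, d2, d3 = A1 - a1**2, A2 - a2**2, A3 - a3**2
expected = d1 * d2 * d3 + a1**2 * d2 * d3 + a2**2 * d1 * d3 + a3**2 * d1 * d2
assert sp.simplify(A.cofactor(0, 0) - expected) == 0   # C_{1,1} formula
assert sp.simplify(A.cofactor(3, 1)) == 0              # C_{n+1,j} = 0, here C_{4,2}
```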