New Highly Efficient High-Breakdown Estimator of Multivariate Scatter and Location for Elliptical Distributions

High-breakdown-point estimators of multivariate location and shape matrices, such as the MM-estimator with smooth hard rejection and the Rocke S-estimator, are generally designed to have high efficiency at the Gaussian distribution. However, many phenomena are non-Gaussian, and these estimators can therefore have poor efficiency. This paper proposes a new tunable S-estimator, termed the S-q estimator, for the general class of symmetric elliptical distributions, a class containing many common families such as the multivariate Gaussian, t-, Cauchy, Laplace, hyperbolic, and normal inverse Gaussian distributions. Across this class, the S-q estimator is shown to generally provide higher maximum efficiency than other leading high-breakdown estimators while maintaining the maximum breakdown point. Furthermore, its robustness is demonstrated to be on par with these leading estimators while also being more stable with respect to initial conditions. From a practical viewpoint, these properties make the S-q broadly applicable for practitioners. This is demonstrated with an example application -- the minimum-variance optimal allocation of financial portfolio investments.

1. Introduction Huber (1964) introduced what is the most common class of estimators, M-estimators.Although originally applied to the location case, Maronna (1976) expanded the definition to include multivariate location and scatter.After the sample median, perhaps the most common robust M-estimators are those using the general rho functions such as the Huber or Tukey bisquare functions.
However, the drawback of using general rho functions is that they have limited efficiency when applied to parameter estimation of nonideal probability distributions.To address this, various M-estimator approaches have been taken to iteratively reweight maximum likelihood estimator (MLE) weights based on the estimated probability density function (PDF) (e.g., Windham, 1995;Basu et al., 1998;Choi and Hall, 2000;Ferrari and Yang, 2010).
Even with these improvements, multivariate M-estimators inherently have limited robustness.For example, Maronna (1976) showed that the upper-bound on the breakdown point for p-dimensional M-estimators is (p + 1) −1 , which converges to zero with large p.To combat this weakness, Rousseeuw and Yohai (1984) introduced regression S-estimators, which Davies (1987) expanded to multivariate location and scatter.Davies also showed that the asymptotic breakdown point of S-estimators can be set to 1/2, which is the theoretical maximum of any equivariant estimator.
In practical scenarios however, estimators may have large bias at considerably lower contamination levels than the breakdown point.For many years, the Tukey bisquare was the standard rho function for S-estimators (for example, see Lopuhaä, 1989;Rocke, 1996).However, in the context of multivariate S-estimators, the bisquare is not tunable, so its robustness falls off with increasing p.For this reason, Rocke (1996) introduced the tunable biflat and translated biweight rho functions.Maronna et al. (2006, sec. 6.4.4) slightly modified the biflat, proposing the Rocke rho function.The Rocke S-estimator (shortened here to S-Rocke) is currently the recommended high-breakdown estimator for large dimensions (p ≥ 15) (Maronna and Yohai, 2017;Maronna et al., 2019, sec. 6.10).The recommended estimator for lower dimensions is the MM-estimator with the smoothed hard rejection function (MM-SHR).
There are two major shortcomings of the S-Rocke estimator that will be discussed in this paper.Firstly, it has low efficiency for small dimension, p.Although this is an inherent disadvantage of all S-estimators, it is exceptionally acute for the S-Rocke.Secondly, the S-Rocke has poor efficiency for most common non-Gaussian distributions.This is a common problem for general-purpose estimators such as the Rocke and bisquare Sestimators, the MM-SHR, and the Huber and bisquare M-estimators.Examples of common phenomena that are frequently modeled by non-Gaussian distributions include stock returns, radar sea clutter, and speech signals, which approximately follow generalized hyperbolic (Konlack Socgnia and Wilcox, 2014), K- (Ward et al., 1990), and Laplace distributions  Hyperbolic with univariate marginals ψ(χ + d) , where x ∈ R p , the location µ ∈ R p , and the p × p positive definite symmetric (PDS(p)) scatter Σ ∈ PDS(p).When the PDF is defined, it has the form f X (x) = α p |Σ| −1/2 φ (d (x, µ, Σ)) , for some generating function φ(d), and where α p is a constant that ensures f X (x) integrates to one.Table 1 lists common generating functions.When the covariance exists, it is proportional to the scatter matrix, Σ.The corresponding shape matrix is commonly defined as The PDF of d (x, µ, Σ) is given by (Kelker, 1970) where β p = α p π p/2 /Γ(p/2).Hereafter, all densities, f (d), refer to the density of d (x, µ, Σ) in ( 2), so the subscript D will be omitted.It is also generally assumed that p > 2.

S-Estimators
Given a set of n p-dimensional samples, {x 1 , ..., x n }, S-estimators of location and shape are defined as (Maronna et al., 2006, Sec. 6.4.2) for some scalar rho function, ρ(t).A proper S-estimator rho function should be a continuously differentiable, nondecreasing function in t ≥ 0 with ρ(0) = 0, and where there is a point c such that ρ(t) = ρ(∞) for t ≥ c.For simplicity, and without loss of generality, the rho functions will be normalized so ρ(∞) = 1.The parameter b is a scalar that affects the efficiency (see Section 4) and robustness of the estimator.The purpose of S-estimators is to achieve high robustness, so they are usually configured with b = 1/2 − (p + 1)/(2n), which achieves the maximum theoretical breakdown point that any affine equivariant estimator may have (see Section 5.1).To understand the derivation of the proposed estimator in the next section, note that σ in the constraint is an M-estimator of the scale of d (µ, Ω) .Local solutions of (3) can be found iteratively using the weighted sums n i=1 w where the weight function w(t) = ρ ′ (t), and where Ω is re-normalized with each iteration.For the empirical results in this paper, the estimators will all be solved using this weighted-sum algorithm.
To estimate the scatter metrix, a separate estimator of |Σ| 1/p can then be used to scale Ω using (1).Maronna et al. (2006, p. 186) discussed a simple estimator to scale Ω to Σ.When x is normally distributed, d has a chi-squared distribution with p degrees of freedom.Therefore, they suggested using Σ = Median d x 1 , µ, Ω , . . ., d x n , µ, Ω χ 2 p (0.5) −1 Ω, where χ 2 p (0.5) is the 50 th percentile of the chi-squared distribution.For the general case of elliptical distributions, we propose extending this to where F (d) is the distribution function corresponding to (2), and therefore F −1 (0.5) is the 50 th percentile of the distribution.
For the location and shape matrices, the S-estimator formulation in (3) is equivalent to the alternative one given by which requires that σ be defined such that b = E [ρ (d (x; µ, Σ) /σ)] for a consistent estimator of Σ at an assumed elliptical distribution (Rocke, 1996).While the first formulation is better for understanding the derivation of the proposed S-q estimator, this second formulation is better for defining and understanding its properties (for example, see Lopuhaä, 1989).The scale parameters in the two formulations are related asymptotically by σ = |Σ| −1/p E [ σ] , at the assumed distribution.The two most common multivariate S-estimators are the bisquare and Rocke (Maronna et al., 2019, sec. 6.4.2, 6.4.4).The S-bisquare is given by ρ bisq (t) = min {1, 1 − (1 − t) 3 } and w bisq (t) = 3(1 − t) 2 I(t ≤ 1), which does not have a tuning parameter to control efficiency and robustness.The S-Rocke is given by where the parameter γ ∈ (0, 1] tunes the estimator's efficiency and robustness.The Rocke's maximum efficiency is generally limited at γ = 1, which is extremely restricting for small p.Both ρ bisq (t) and ρ γ (t) are generic functions that do not depend on the underlying distribution.In the following section, an alternative S-estimator is defined that accounts for the underlying distribution and that generally has better performance across the most common elliptical distributions.It also does not have the same inherent restrictions for small p as ρ γ (t).

Elliptical Density-Based S-q Estimator
The rho function corresponding to the maximum likelihood estimator, σ, of the scale of d (µ, Ω) , or equivalently d (µ, Σ) , is ρ mle (t) = −t f ′ (t) /f (t) .We propose weighting this by the power transform of the density, ρq (t) = f (t) 1−q ρ mle (t), where the scalar q ≤ 1 controls the estimator robustness, with q = 1 corresponding to the maximum likelihood function, and with decreasing q increasing the estimator robustness.In most cases, this rho function is not monotone, as required by S-estimators, so it is denoted with a tilde.This rho function is equivalent to the M-Lq and other M-estimators proposed, for example, by Windham (1995); Basu et al. (1998); Choi and Hall (2000); and Ferrari and Yang (2010).However, in this particular case of estimating the scale of the squared Mahalanobis distance of an elliptically distributed random vector, the density and rho function do not need to be regenerated with each numerical iteration, i, based on the estimates µ (i) and Ω (i) .Substituting the PDF from (2), ρq (t) = − (β p φ(t)) sq t spsq t φ ′ (t) where s p = p/2 − 1 and s q = 1 − q.Taking the derivative of ρq (t), the corresponding weight function is given by wq For simplicity, when q < 1, the scalar β p can be dropped from the calculation of ρq (t) and wq (t) in ( 5) and ( 6).When φ(t) is only positive over a finite domain (e.g.Pearson Type II distribution), then we define ρq (t) and wq (t) to be zero outside this domain.
For the common elliptical distributions listed in Table 1, ρq (t) is monotone in its central region between its global extrema when using appropriate values for q (defined below).The first extremum is the minimum, which we label point a, and the second is the maximum, labeled c.The distance between a and c varies monotonically with respect to q.We use this to define a tunable, double-hard-rejection S-estimator rho function.The value of ρq (t) is held constant between zero and a at value ρq (a), which hard rejects inliers, and the value of ρq (t) is held constant above c at value ρq (c), which hard rejects outliers.The resulting monotonic function is then scaled and shifted so it ranges from zero to one.This defines the S-q estimator.Definition 1. Assuming φ(t) is twice continuously differentiable over its region of support and has one or two zeros in t ∈ (0, ∞) for q < 1, the S-q estimator is the S-estimator with the rho function given by where s 1 = (ρ q (b) − ρq (a)) −1 .The S-q estimator of Type I is the case with one zero (i.e. a = 0), and the Type II S-q estimator is the case with two zeros.
For most distributions, lim q→1 c = ∞, or at q = 1, ρq (t) is not bounded.Therefore, we do not scale or shift ρq (t) in this case, and ρ q (t) is not a proper S-estimator rho function.However, when q = 1 and b = 1, the MLE of the scale of d is obtained.The S-q weight function is the derivative of ρ q (t) and is given by Table 2 lists expressions for the inlier rejection point, a, and the outlier rejection point, c, for the common elliptical distributions in Table 1.For most of these distributions, the equation wq (t) = 0 is quadratic, which provides a closed-form solution for the values of a and c.
The asymptotic rejection probability (ARP) is defined as P r(d/ σ ≥ c) (Rocke, 1996).Table 2 can be used to determine q from a desired ARP using F −1 (ARP ).However, since w q (t) is very tapered (i.e.applying little weight to values just below c), practitioners may choose alternative approaches to tuning that allow for higher estimator efficiencies.For Pearson type VII a, c = s example, the approach used in this paper as well as in Maronna and Yohai (2017) is to tune the estimators to a desired expected efficiency, which is defined in the next section.The general definition in (7) specifies that q ≤ 1.In a few particular cases, however, there are some minor restrictions on q (when q < 1) in order to ensure that a and c are in the support of f (d).Table 3 lists these restrictions.
Figure 1 illustrates examples of the S-q functions ρq (t), ρ q (t), and w q (t) for the fivedimensional Gaussian (S-q Type II) and Laplace (S-q Type I) distributions and for various values of q.As q is decreased, the region of positive weights (area between points a and c) narrows, corresponding to increased robustness.The PDF is also plotted, illustrating how w q (t) roughly follows f (t) in the central region.
Figure 2 compares the S-q asymptotic weights with those of the MM-SHR, S-Rocke, S-bisquare, and maximum likelihood estimators, and with the corresponding PDF.The underlying model is a 10-dimensional standard Gaussian distribution.The MM-SHR and Sq estimators have been tuned to 80% asymptotic efficiency relative to the MLE.The S-Rocke estimator is tuned to its maximum efficiency, which is 77% in this instance.The estimators have been set to the maximum breakdown point, with b = 1/2, which results in the shifts of the peaks of the weight curves relative to the PDF.
From the figure, it is clear that the Gaussian MLE (i.e.sample estimator) gives uniform weight to all samples, no matter how improbable.The S-Rocke has a quadratic weight Pearson type VII q ≤ 1 Generalized hyperbolic * q ≤ 1 unless χ = 0 and λ < 1, then unknown < q ≤ 1 * Empirically inferred.Computational precision restricts q / ∈ (0.998, 1), approximately.Example S-q Rho and Weight Functions for Gaussian (Type II S-q) and Laplace (Type I S-q) Distributions.Rho functions ρ q (t) (top) and weight functions w q (t) (bottom) are plotted for the Gaussian distribution (left) and the Laplace distribution (right) for q ∈ {−2, 0.5, 0.9} and p = 5.On the top, the dotted lines depict the corresponding ρq (t) functions, scaled and shifted to match ρ q (t).On the bottom, the dashed line depicts the density function.The solid dots indicate points a, when ρq (t) = 0, and c, when ρq (t) = 1.For the 10dimensional Gaussian distribution, the plot depicts the asymptotic weights for the S-q (w q ) and MM-SHR (w shr ) estimators tuned to 80% asymptotic relative efficiency, the S-Rocke (w γ ) estimator tuned to its maximum efficiency (77% for this case), and the non-tunable MLE (w mle ) and S-bisquare (w bisq ) estimators.The estimators are set to their maximum breakdown points.
function, which is a hard cutoff that cannot capture the tails of f (d).The SHR weight function is cubic, and its shape better captures the shape of the right-half of the PDF.
However, the SHR function is designed to approximate a step function, which is poorly suited for many distributions (c.f.w shr (t) in Figure 2 with the Laplace f (d) in Figure 1).Only the S-q weight function follows the general shape of the PDF-giving less weight to less probable observations.

Consistency Properties of the S-q Estimator
As an S-estimator, the S-q estimator inherits properties from its parent class, such as affine equivariance.This section briefly summarizes properties related to its consistency.For more detailed discussion on these, see (Davies, 1987).Here, we use the alternative S-estimator formulation given by (4) under the assumptions (A1) that x ∼ f X (x, µ, Σ, φ(d)) , where x i are i.i.d., φ(d) is non-increasing, and φ(d) and −ρ q (d/σ) have common point(s) of decrease.

Asymptotic Distribution and Relative Efficiencies
In this section, the asymptotic distribution of the S-q estimator is provided.From this, measures of efficiency are then defined.Finally, the efficiency of the S-q estimator is compared with leading high-breakdown point estimators.

Asymptotic Distribution
For the asymptotic distribution of the S-q estimate, we continue to use the alternative Sestimator formulation given by (4).Lopuhaä (1997) derived the distribution of S-estimators with assumptions appropriate for the S-q estimator, that is Here, we use the following notation.The matrix I p 2 is the p 2 × p 2 identity matrix, K p 2 is the p 2 × p 2 commutation matrix, ⊗ is the Kronecker product operator, and the operator vec (Σ) stacks the columns if Σ into a column vector.
Theorem 4 (Asymptotic distribution).Given (A1) and (A2), the asymptotic distribution of the S-q estimate of µ n , Σ n is given by The vector a ∼ N (0, Γ µ ) where with with Proof.See (Lopuhaä, 1997, Corollary 2).Frahm (2009) derived the asymptotic distribution of shape matrix estimates for affine equivariant estimators.This enables us to state the asymptotic distribution of the S-q shape estimate, which is applicable using either S-estimator formulation, (3) or (4).
Theorem 5 (Shape asymptotic distribution).Given (A1) and (A2), the asymptotic distribution of the S-q estimate of shape is given by where with ζ 1 defined as in Theorem 4.

Measures of Efficiency
The asymptotic efficiency of an estimator, at an assumed distribution, is defined as the ratio of the asymptotic variance of the maximum likelihood estimate to the variance of the estimator under consideration.For multivariate estimation, this definition of efficiency is of large dimension-p × p for location and p 2 × p 2 for shape and scatter.However, for affine equivariant estimation of location and scatter of elliptical distributions, the covariance of the estimate depends only on a scalar.Specifically, ( 9), (10), and (11) are general, with only the scalars ω 1 /ω 2 2 (Bilodeau and Brenner, 1999), and ζ 1 and ζ 2 (Tyler, 1982) depending on the estimator and the generating function φ(d).Therefore, the asymptotic efficiency of the estimate µ can be alternatively defined as , and the asymptotic efficiency of the estimate Ω can alternatively be defined as It is common to define asymptotic efficiency this way (for example, see Tyler, 1983;Frahm, 2009).
Comparing the S-q estimator's efficiency to another estimator can likewise be achieved analytically using, for example, ζ 1,γ /ζ 1,q for the S-Rocke estimator, which when the quotient is greater than one, indicates that the S-q has higher asymptotic efficiency than the S-Rocke estimator.For other S-estimators, the asymptotic distribution parameters ω 1 , ω 2 , and ζ 1 are calculated the same as in Theorems 4 and 5 but using their respective weight functions.MM-estimators have the same asymptotic variance and influence function as S-estimators (Rousseeuw and Hubert, 2013).For MM-estimators, however, σ is effectively the tuning parameter, and it can be set accordingly.
In general, finite-sample performance measures are difficult to derive analytically.Instead, it is common to characterize finite-sample performance by empirically characterizing the behavior of metrics derived from the Kullback-Leibler divergence between the estimated and true distribution (for example, see Huang et al., 2006;Ferrari and Yang, 2010).For t-distributions, which includes the Gaussian distribution, the Kullback-Leibler divergence between t ν (µ, Σ) and t ν µ, Σ is given by (Abusev, 2015 Following Maronna and Yohai (2017), we then define the joint location and scatter finitesample relative efficiency as where µ mle and Σ mle are the location and scatter matrices corresponding to the maximum likelihood estimate, and where the expectation is calculated empirically using the sample mean over m Monte Carlo trials.The location and the scatter finite-sample relative efficiencies are then respectively defined as eff n ( µ, Σ; µ mle , Σ) and eff n µ, Σ; µ, Σ mle .Likewise, we define the shape matrix finite-sample relative efficiency as

Comparison of Estimator Efficiency
Any estimator must provide a good estimate in the absence of contamination and when tuned to its maximum efficiency.This section compares the maximum achievable efficiencies of the S-q, S-Rocke, and MM-SHR estimators when set to their maximum breakdown point.The results below cover large swaths of the most common elliptical families in Table 1 for a moderate dimension of p = 20.These swaths were specifically chosen to cover everyday distributions: Gaussian, Cauchy, Laplace, hyperbolic, and normal inverse Gaussian distributions.
Robust scatter matrix estimation is generally "more difficult" than the estimation of location (Maronna et al., 2019), and as Maronna and Yohai (2017) demonstrated, divergence and efficiency metrics for scatter matrix estimators are generally much worse than for the corresponding estimators of location.Likewise, due to the high dimensionality of the estimate, the underlying shape matrix is the most difficult part of estimating the scatter matrix.Additionally, many practical applications such as multivariate regression, principal components analysis, linear discriminant analysis, and canonical correlation analysis only require the shape matrix, and not the full scatter or covariance matrices (Frahm, 2009).Therefore, unless otherwise noted, the performance results in this paper are for the shape matrix, with metrics given by ( 12), ( 13), and D µ, Ω; µ, Ω .
The maximum efficiencies of the S-q and S-Rocke generally occur when their parameters q and γ are set to one-although the maximum breakdown point of the S-q is only achieved when q < 1.However, the maximum efficiency of the MM-SHR must be determined by a search as depicted in Figure 3, which plots, as an example, asymptotic efficiency versus tuning parameter for the estimators for the 20-dimensional Cauchy distribution.At the limit, as the MM-SHR parameter is increased toward infinity, all samples receive equal weight, which is the MLE for the Gaussian distribution, but not for distributions such as the Cauchy.In general, for each tunable estimator, its efficiency decreases while its robustness increases as its parameter is decreased.At the lower limit of its parameter, its weight function is a delta function that may reject all the samples and may result in zero efficiency.At this point, the robustness is high, but the weighted-sum solution depends entirely on the initial estimates µ (0) and Ω (0) .
It should be noted that although generally of high efficiency, the S-q estimate at its limit with q = 1 is not necessarily the maximum likelihood estimate for location and scatter.The MLE weight function for location and scatter is given by (Tyler, 1982)  , the MM-SHR parameter is in (0, ∞), and the S-q parameter is in (−∞, 1) for the maximum breakdown point.
whereas at q = 1, (8) gives Theorem 6 (Relation to MLE efficiency).Assuming b = E [ρ q (d (x; µ, Σ))] , the asymptotic S-q estimate with q = 1 is the maximum likelihood estimate for the location and scatter matrices for distributions where for some value y.Therefore, the S-q estimator can asymptotically achieve the Cramér-Rao lower bound for these distributions.
Proof.The S-estimator scaling of b = E [ρ q (d (x; µ, Σ))] results in an estimate that is invariant to scaling of the weight function.Therefore, this theorem follows directly from proportionally equating ( 14) to (15).
Remark 1.Although this theorem inherently assumes the alternative S-estimator formulation given by (4), it still holds true for location and shape matrices using the primary S-estimator formulation in (3).

S-q
Figure 4: Estimator Maximum Achievable Asymptotic Efficiency for Kotz Type Distribution versus Parameter s.Maximum achievable asymptotic shape efficiencies for the maximum breakdown point are plotted for parameters N = 0, and r = 1/2.The maximum absolute asymptotic shape efficiency of the S-q estimator for q = 1 is also shown.
Remark 2. If lim q→1 ρq (c) is finite, and b ≤ 1/2, then both the Cramér-Rao lower bound (maximum efficiency) and the maximum breakdown point can occur in the limit as q → 1 (see Corollary 1 in the next section).
An example family that satisfies this theorem is the Kotz type with parameter N = 0. Note, however, that lim q→1 ρq (c) = ∞, so high breakdown cannot be achieved simultaneously.This is illustrated in Figure 4, which provides the estimators' maximum achievable asymptotic shape efficiencies for the Kotz type distribution with parameters N = 0 and r = 1/2 as a function of parameter s.In this example, the S-q efficiency is plotted for its maximum absolute efficiency with q = 1 and for it approximate maximum high-breakdown efficiency with q = 0.99.As seen in the figure, the high cost of high-breakdown is particularly acute for large s.
The remainder of this paper will focus on maximum efficiency at the maximum breakdown point.Figure 4 also provides the S-Rocke and MM-SHR estimator's maximum efficiencies at their maximum breakdown points.The MM-SHR efficiency peaks at s = 1, which is expected since this is the Gaussian distribution, and the S-Rocke efficiency peaks just above this point.Their efficiencies fall off precipitously for larger and smaller values of s.The efficiency of the S-q, conversely, increases toward unity for smaller s.
The estimators' maximum achievable asymptotic efficiencies for the t-distribution as a function of the distribution parameter, ν, are plotted on the left of Figure 5.When ν = 1, the t-distribution corresponds to a Cauchy distribution, and when ν → ∞, it corresponds to the Gaussian distribution.The S-q estimator offers the highest efficiency of the three estimators for thicker tails.
The maximum achievable small-sample relative efficiencies using n = 3p are plotted on the right of Figure 5.The initial estimates were made using the Peña and Prieto (2007) kurtosis plus specific directions (KSD) estimator as recommended and provided by Maronna and Yohai (2017).Comparing these finite-sample results with the asymptotic ones on the left, it is seen that the relative results are similar.This general similarity implies that the relative performance of the asymptotic efficiencies can often be a good surrogate for the relative performance of the finite-sample efficiencies when there is no closed-form expression for the divergence in (13).
The estimators' maximum achievable asymptotic efficiencies for the variance gamma distribution with ψ = 2 are plotted as a function of parameter λ on the left of Figure 6.The plots highlight the Laplace (λ = 1) and multivariate hyperbolic (λ = (p+1)/2) distributions.The S-q exhibits good performance for the hyperbolic and remarkably good performance for the Laplace.
The estimators' maximum achievable asymptotic efficiencies for the generalized hyperbolic distribution with ψ = 2 and χ = 1 are plotted as a function of parameter λ on the right of Figure 6.The plots highlight the normal inverse Gaussian (λ = −1/2) and hyperbolic (λ = (p + 1)/2) distributions.The S-q again exhibits good performance for the hyperbolic, and it exhibits remarkably good performance for the normal inverse Gaussian.

Robustness Analysis
The robustness of the S-q estimator is now explored.First, the breakdown point is provided.The influence function is then explored.Finally, finite-sample simulation results are provided to further illustrate the robustness of the high-breakdown estimators.

Breakdown Point
The finite-sample breakdown point of a multivariate estimator of location or scatter is defined as the fraction of the samples, ǫn, that can be set to either drive µ = ∞ or drive an eigenvalue of Σ to either zero or infinity.Unlike multivariate M-estimators, which only achieve an asymptotic breakdown point of (p + 1) −1 (Maronna, 1976), S-estimators are able to achieve the maximum possible finite-sample breakdown point that any affine equivariant estimator may have (Davies, 1987, Th. 6).For the following theorem, the term samples in general position means that no more than p samples are contained in any hyperplane of dimension less than p.
Proof.As discussed above Section 2.3, q < 1 ensures a proper S-estimator with finite value c and bounded rho function.See (Davies, 1987, Th. 5).

Influence Function
The influence function (IF) of an estimator characterizes its sensitivity to an infinitesimal point contamination at z ∈ R p , standardized by the mass of the contamination, ǫ.The influence function for estimator T , at the nominal distribution F , is defined as where ǫ is the proportion of samples that are a point-mass, ∆ z , located at z.
Theorem 8 (Influence function).Assuming (A1) and (A2), the influence functions of for the S-q estimates of µ and Σ are given by where z c = z − µ and d z = z T c Σ −1 z c , and where the scalars ω 2 , λ 1 , and λ 2 were defined in Theorem 4.
By definition of S-estimators with normalized rho function, the magnitude of first term of ( 17) is clearly bounded to no more than λ −1 2 Σ.Therefore, to compare the influence functions of the S-q, S-Rocke, and MM-SHR estimators, we focus on the second term.From this term, define α ) for each estimator.Figure 7 plots α Σ (d z ) at the 10-dimensional Gaussian distribution for the estimators as depicted in Figure 2.
By definition, all highly-robust estimators have bounded influence functions, and for the three estimators considered here, their influence functions are continuous.This means that small amounts of contamination have limited effects on their estimates.The grosserror sensitivity of an estimator is the maximum of IF (z) , and in this example, the S-q demonstrates a lower gross-error sensitivity than the S-Rocke and MM-SHR estimators.By its definition, the MM-SHR has a inlier rejection point of zero, meaning inliers can negatively influence its estimates.However, proper Type II S-q functions have positive inlier rejection points, which provide robustness against inliers.
Relative to the S-Rocke and MM-SHR estimators, the S-q often has larger outlier rejection points.This is the cost of its generally higher efficiency and ability to reject inliers.However, due to its continuity, the influence near this point is still greatly attenuated.

Finite-Sample Robustness
To empirically compare the finite-sample robustness of the estimators, we employed the simulation method used by Maronna and Yohai (2017) and plot the shape matrix divergence, D µ, Ω; µ, Ω , versus shift contamination value k.For a contamination proportion ǫ, the first element of each of the ⌊ǫn⌋ contaminated samples was replaced with the value k, that is x 1 = k.The initial estimates of the weighted algorithm were determined with the KSD estimator.Figure 8 provides divergence plots for normally distributed data with ǫ = 10% contamination, for dimensions p = 5 and p = 20, and for sample sizes n = 5p and n = 100p.For the cases where p = 20, the estimators were tuned to 90% uncontaminated relative efficiency.When p = 5, the S-Rocke has poor maximum efficiency, so the estimators were tuned to match the maximum S-Rocke efficiency.
These plots show that the robustness of the S-q is on par with the other two estimators.Consistent with the results in Maronna and Yohai (2017), the relative worst-case performance of the estimators vary by such factors as dimension, sample size, and contamination percentage.For example, the S-q performs the best here for p = 5, n = 100p, but the MM-SHR is the best for p = 5, n = 5p.

Computational Aspects
This section explores computational aspects of the S-q and other high-breakdown estimators when using their weighted-sum algorithms.The estimators' stabilities are first assessed by comparing their sensitivities to the initial estimates, µ (0) and Ω (0) .The computational efficiency is then evaluated by comparing the computational convergence rates of the estimators.

Stability
The primary criticism of high-breakdown estimators is that their solutions are highly sensitive to the initial estimates µ (0) and Ω (0) due to the non-convexity of their objective functions.The S-q estimator helps mitigate this with a generally wider weight function (see for example Figure 2).To demonstrate that the S-q is more stable with respect to the initial estimates, m Gaussian Monte Carlo simulation trials were run where for each trial, Ω was estimated twice for each estimator using different initializations.For the first estimate, Ω was set to the MLE using all n samples before contamination.For the second estimate, Ω (0) was set to the MLE using just 25% of the samples before contamination, resulting in a larger expected variance of Ω 0) .The sample mean of the divergence between the two final estimates, D Ω 1 , Ω 2 , was then calculated.The same values of p, n, and ǫ were used as in the Section 5.3 simulations.The contamination method was also the same, and the value of k was set to the worst-case value for that estimator and for the values of p, n, and ǫ (see Figure 8).The results are presented in the center of Table 4.It is seen that the S-q estimator was consistently the most stable of the three estimators, and the MM-SHR was generally the most sensitive.For the near-asymptotic cases, the S-q exhibited no measurable differences between the two estimates, unlike the MM-SHR.Like the S-q, the S-Rocke had no measurable differences between the two estimates for the uncontaminated near-asymptotic cases, but under contamination, its mean divergence was roughly on par with the MM-SHR.

Computational Efficiency
To compare the relative computational efficiencies of the high-breakdown estimators, we calculated the median number of iteration required for the estimators to converge for normally distributed data for various values of p, n, and ǫ.All three estimators were set to use the same tight convergence criteria that D Ω (i) , Ω (i−1) < 10 −10 .The initial estimates  (i) , Ω (i−1) < 10 −10 .
were determined with the KSD estimator, and the estimators were tuned as in the Section 5.3 simulations.The contamination method was also the same, and the value of k was set to the worst-case value for that estimator.
The results are presented on the right of Table 4.For the large-sample (n = 100p) simulations, the S-q converges approximately as fast as the other two estimators (except for the one case where p = 20, ǫ = 10%, where the S-Rocke performs notably better).The S-Rocke estimator consistently converges fastest for all of the small-sample (n = 5p) cases, and the small-sample convergence of the S-q estimator is relatively consistent-albeit at the upper-end of the spectrum.The small-sample convergence of the MM-SHR is on par with the S-Rocke for small p, but worse than the others for large p.

Application to Financial Portfolio Optimization
A common financial application of mean and covariance matrices is in modern portfolio theory for the optimal allocation of portfolio investments.Under modern portfolio theory's mean-variance framework, a minimum-variance portfolio aims to minimize the risk (i.e.variance) of the portfolio return subject to a desired expected return (Markowitz, 1952).Mathematically, this is expressed as where α is a normalized vector of portfolio allocation for each asset, Ω r is the shape (or equivalently covariance) matrix for the asset returns, µ r is the expected returns of each asset, µ p is the desired expected portfolio return, and 1 is a vector of ones.The solution is given by (Roy, 1952;Merton, 1972) where s r is a scalar that ensures the elements of α sum to one.In this section, the performances of the MM-SHR, S-Rocke, and S-q estimators are compared for the optimal allocation of investment in the component stocks of the DOW Jones Industrial Average.For each estimator, the parameters Ω r and µ r were estimated for the daily returns from the component stocks.Then, using a desired portfolio daily return of µ p = .038%(corresponding to 10% annual return), the optimal allocations, α, were calculated using (18).Using α for each estimator, the portfolio return was then calculated for each business day of the verification period, assuming a daily re-balance of investments.Finally, each estimator's performance was characterized by the variance of these daily returns.This variance is a measure of the volatility of the portfolio.
For the S-q estimator, we noted that Konlack Socgnia and Wilcox (2014) showed that the generalized hyperbolic distribution is a good model for stock returns, and specifically the variance gamma subclass has good parameter stability over time.Although their analysis is for log returns, daily log returns are generally close to one, so the variance gamma model should also fit well for gross (i.e.linear) returns.For the variance gamma S-q estimator, a density-weighted M-estimator was used to estimate the model parameters λ and ψ.
To demonstrate the robustness of the S-q estimator, we begin by noting that the first quarter of 2020 contained a once-in-a-generation period of extremely high volatility due to the COVID-19 pandemic, as depicted in Figure 9.This volatility started on approximately February 21.Each estimator's performance was assessed by estimating the parameters Ω r and µ r using all the returns from the first quarter, then comparing the variances of the daily portfolio returns for only the pre-pandemic (prior to February 21) period.Each estimator was set to its maximum breakdown point.Each estimator was then tuned it to its maximum asymptotic efficiency with respect to the variance gamma distribution with parameters estimated using a maximum likelihood approach and using the daily returns for the years 2016-2019.
Table 5 summarizes the results, listing the variances of the daily returns.The S-q estimator performed the best with the lowest variance, which indicates high robustness.The MM-SHR performed the second best, followed by the S-Rocke.The sample estimator of mean and covariance was also included to demonstrate its poor robustness.Next, to demonstrate estimator efficiency, variances of daily returns were compared for a non-volatile period: 2016 through 2019.Using the same methodology and configuration as before, for each year and each estimator, Ω r and µ r were estimated.Then, α was calculated and applied to each day of that year.The sample variances of each year's daily portfolio returns are listed in Table 6.The S-q estimator resulted in the lowest portfolio variance for three of the four years, and the lowest variance on average, indicating high estimator efficiency.On average, the performance of the MM-SHR estimator was behind that of the S-q estimator, and the S-Rocke demonstrated substantially worse performance.

Conclusion
The S-q estimator has been introduced as a new tunable multivariate estimator of location, scatter, and shape matrices for elliptical probability distributions.This new estimator is a subclass of S-estimators, which achieve the maximum theoretical breakdown point.The S-q estimator has been compared with the leading high-breakdown estimators.Across elliptical distributions, the S-q has generally higher efficiency and stability, and its robustness is on par with these other leading estimators.Additionally, the S-q provides a monotonic and upper-bounded efficiency tuning parameter, which provides simpler tuning than the MM-SHR.The S-q is therefore a broadly applicable estimator, providing practitioners with a good general high-breakdown multivariate estimator that can be used across a broad range of practical applications, such as the optimal portfolio example.

Figure 2 :
Figure 2: Example Comparison of Weight Functions for Various Estimators.For the 10dimensional Gaussian distribution, the plot depicts the asymptotic weights for the S-q (w q ) and MM-SHR (w shr ) estimators tuned to 80% asymptotic relative efficiency, the S-Rocke (w γ ) estimator tuned to its maximum efficiency (77% for this case), and the non-tunable MLE (w mle ) and S-bisquare (w bisq ) estimators.The estimators are set to their maximum breakdown points.

Figure 3 :
Figure3: Estimator Asymptotic Relative Efficiency versus Tuning Parameter.The relative efficiencies of the estimators are plotted as a function of tuning parameter for the Cauchy distribution with p = 20.All estimators are set to the maximum breakdown point.The S-Rocke parameter is in [0, 1], the MM-SHR parameter is in (0, ∞), and the S-q parameter is in (−∞, 1) for the maximum breakdown point.

Figure 5 :
Figure 5: Estimator Maximum Achievable Efficiency for t-Distribution versus Distribution Parameter, ν.Maximum achievable asymptotic (left) and small-sample (n = 3p; right) efficiencies for the maximum breakdown point are plotted.

Figure 6 :
Figure 6: Estimator Maximum Achievable Asymptotic Efficiency for Generalized Hyperbolic Distribution versus Parameter λ.Maximum achievable asymptotic efficiencies for the maximum breakdown point are plotted for the variance gamma distribution with parameter ψ = 2 (left) and for the generalized hyperbolic distribution with parameters ψ = 2 and χ = 1 (right).

Figure 7 :
Figure 7: Example Comparison of Influence Function Parameter α Σ (d z ).For the 10dimensional Gaussian distribution, α Σ (d z ) is plotted for estimators depicted in Figure 2.

Figure 8 :
Figure 8: Estimator Divergence versus Contamination Value k for Gaussian Distribution.Gaussian shape matrix divergences are plotted for p = 5 (left) and p = 20 (right), and for small sample (n = 5p; top) and large sample (n = 100p; bottom) sizes.

Figure 9 :
Figure 9: Daily Returns of Down Jones Industrial Average and Component Stocks for 2016-2020.There are 25 component stocks in the index throughout the depicted period and plotted in the background.The overall index value is plotted on top.

Table 1 :
Summary of Common Elliptical Distributions

Table 2 :
S-q Inlier and Outlier Rejection Points for Common Elliptical Distributions

Table 3 :
Restrictions on Parameter q for Common Elliptical Distributions

Table 4 :
Estimator Stability and Computational Efficiency for Normally Distributed Data

Table 6 :
Sample Variances of Achieved Daily Returns by Year