A one-class peeling method for multivariate outlier detection with applications in phase I SPC

In phase I of statistical process control (SPC), control charts are often used as outlier detection methods to assess process stability. Many of these methods require estimation of the covariance matrix, are computationally infeasible, or have not been studied when the dimension of the data, p, is large. We propose the one-class peeling (OCP) method, a flexible framework that combines statistical and machine learning methods to detect multiple outliers in multivariate data. The OCP method can be applied to phase I of SPC, does not require covariance estimation, and is well suited to high-dimensional data sets with a high percentage of outliers. Our empirical evaluation suggests that the OCP method performs well in high dimensions and is computationally more efficient and robust than existing methodologies. We motivate and illustrate the use of the OCP method in a phase I SPC application on an N = 354, p = 1917 dimensional data set containing Wikipedia search results for National Football League (NFL) players, teams, coaches, and managers. The example data set and the R functions OCP.R and OCPLimit.R, which compute the respective OCP distances and thresholds, are available in the supplementary materials.


Background
Statistical process control (SPC) applications are often divided into two phases. In phase I, the stability of a process is assessed, and a critical component of this assessment is often the application of an outlier detection method. Traditionally, control charts have been used in phase I to identify outlying observations or subgroups that deviate from the normal operations of the process. These outlying observations or subgroups are signals of a potential out-of-control (OC) event.
Overviews of phase I methods in SPC are given in Jones-Farmer et al. 1 and Chakraborti et al. 2 In phase II, the process is monitored over time for departures from the phase I conditions.
The detection of outliers is a critical component of phase I and, indeed, of any data analysis process. Nearly all "real" data may be contaminated by unusual observations; thus, preprocessing the data should precede any data analysis exercise. In some cases, such as fraud detection, network intrusion, or crime identification, outlier detection may be the end goal of the analysis. In other cases, as in phase I of SPC, outlier detection may comprise only an initial exploratory phase of an analysis.
One of the goals of a phase I analysis is to establish an in-control (IC) baseline sample; however, a requirement for many phase I and outlier detection methods to work is knowledge of the data distribution. Recently, several authors have proposed distribution-free phase I methods to identify outliers, OC subgroups, or changes in multivariate processes, including Bell et al, 3 Cheng and Shiau, 4 and Capizzi. 5 Each of these methods has been studied for multivariate processes with relatively low dimensions. For example, Bell et al 3 and Capizzi 5 considered only dimensions up to p = 10, while Cheng and Shiau 4 considered cases of p = 3. In many modern processes, such as real-time health monitoring, 6 additive manufacturing, and image monitoring, 7 much higher-dimensional data are common.
The goal of this article is to develop the one-class peeling (OCP) method to detect outliers in high dimensions that works efficiently and competitively with wide data (p > N), does not require estimation of a covariance matrix, and can be applied in phase I of SPC to establish an IC baseline sample. Like many methods, the OCP method performs best when the underlying data distribution is known; however, the method is somewhat robust to distribution misspecification. We also offer an approach for determining outliers that has suitable performance when the underlying data distribution is unknown. We apply the OCP method to detect outliers in a p = 1917 dimensional data set (N = 354) containing Wikipedia page hits for the National Football League (NFL). In addition, we conduct a simulation study to show the benefits of using the OCP method versus existing methods for outlier detection in high dimensions.

Motivating example
This article is motivated by the need for methods to determine an outlier-free baseline phase I sample for high-dimensional multivariate processes that can be used to develop a phase II monitoring scheme. We consider an updated version of the NFL Wikipedia search data that originally appeared in Weese et al. 8 Monitoring of web searches and social media data for changes in volume, content, or sentiment is becoming increasingly important; however, much of this data is high dimensional, and few methodologies exist which address the retrospective outlier detection problem in this context.
The data are based on Wikipedia activity of NFL-related pages. A dictionary of NFL team names, coaches, managers, and active players was gathered as of 15 September 2014. The number of Wikipedia searches per hour for any Wikipedia page can be downloaded from http://dumps.wikimedia.org/other/pagecounts-raw, and we did so for all terms in our dictionary for the period ranging between 1 September 2014, 00:00 Coordinated Universal Time (UTC), and 14 September 2014, 17:00 UTC. These dates were chosen because the 2014 season had several high- and low-profile controversies that could be retrospectively identified (see Table C2). For example, Adrian Peterson, former running back of the Minnesota Vikings, was indicted on child abuse charges in September 2014. The data consist of p = 1917 variables (pages) and N = 354 observations (hours).
These data were first analyzed in Weese et al, 8 who modified the K 2 chart of Sukchotrat et al. 9 for use in phase I, and applied it to the first week of the data in order to establish a baseline sample. The phase II K 2 chart was then applied to the second week of the data, but many of the events known to occur early in this NFL season were not detected. The goal of the analysis in this article is to establish an outlier-free baseline sample in phase I so that a phase II monitoring scheme can be developed.

Background literature
The literature on multivariate outlier detection methods is vast. Although many of these methods are developed for cross-sectional data, some can be adapted to data that occurs over time and may be appropriate for use in determining a baseline sample in phase I.
Many multivariate, distance-based outlier detection methods follow two main steps: (a) robust estimation of the center and scale of the data and (b) evaluation of a measure of "outlyingness," often a distance measure. One challenge with this approach is that the existence of outliers can negatively affect the sample estimates of the center and scale, making it difficult to correctly identify unusual observations. A common solution is to assume the multivariate normal model and use robust estimators of the center, μ, and spread, Σ, together with a robust distance measure. 10-17 A well-known method is to use the minimum covariance determinant (MCD) or the minimum volume ellipsoid (MVE) estimators with robust squared distance measures to identify unusual observations. 14,18 These methods can be computationally intensive in high dimensions, 16,19,20 and the statistical properties of the distance measures are asymptotic, which can limit their ability to detect outliers in finite samples. 21,22 The finite sample reweighted minimum covariance determinant (FSRMCD) method was developed by Cerioli 23 and is based on the reweighted version of the MCD. 16,24 The FSRMCD performs well in finite samples under conditions of multivariate normality. Like other methods based on squared statistical distance measures, 19,25 the FSRMCD cannot be implemented on wide data.
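The two-step recipe above (robust center/scale, then distances) can be illustrated with a small concentration-style sketch in Python. This is a toy stand-in for the MCD/MVE estimators cited here, not an implementation of them; the `crude_robust_distances` helper, its trimming fraction, and the iteration count are our own illustrative choices.

```python
import numpy as np

def crude_robust_distances(X, subset_frac=0.5, n_iter=5):
    """Two-step recipe: (a) estimate center/scale from a trimmed subset,
    refined concentration-style, then (b) compute Mahalanobis-type
    distances.  A toy stand-in for MCD/MVE, not an implementation."""
    n, p = X.shape
    h = int(np.ceil(subset_frac * n))
    # start from the h points closest to the coordinatewise median
    idx = np.argsort(np.linalg.norm(X - np.median(X, axis=0), axis=1))[:h]
    for _ in range(n_iter):
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False) + 1e-8 * np.eye(p)
        d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(cov), X - mu)
        idx = np.argsort(d2)[:h]        # keep the h most central points
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:5] += 6.0                            # plant 5 gross outliers
d = crude_robust_distances(X)
```

Because the center and scale are refit only on the most central half of the points, the planted outliers cannot drag the estimates toward themselves and receive the largest distances.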
Most of the distance-based methods require covariance estimation and cannot be directly applied when p > N without the use of principal components. A regularization term for the MCD approach that can be used when p > N is suggested in Fritsch et al, 26 but this method is only evaluated under multivariate normality. A robust minimum diagonal product (RMDP) estimator of μ and Σ is suggested by Ro et al. 27 In this method, the statistical distances used to measure outlyingness are computed from only the diagonal elements of the sample covariance matrix. This method can be applied when p > N and works well when the distribution is close to multivariate normal and the covariance matrix is sparse but has a high false positive rate when these conditions are not met. Another approach to analyzing wide data is to reduce the dimensionality, often by means of principal component analysis (PCA). 28,29 For example, Filzmoser et al 20 developed a procedure known as PCOut that combines robust PCA with weighted distance measures to identify outliers.

Data depth was introduced by Tukey 30 as a way to measure the depth or centrality of a point within a sample. The data depth of x ∈ R^p measures how deep x is with respect to a certain probability distribution F. There are a number of data depth functions that transform x from a p-dimensional vector to a univariate depth value. These depth functions vary with respect to computational complexity and mathematical properties. For outlier detection, depth values are often ranked. Ranks of simplicial depth were applied by Liu 31 to phase II SPC and studied by Stoumbos and Jones. 32 Rank-based phase I charts for subgrouped data based on data depth were considered by Bell et al. 3 While their method can be used with any measure of data depth, they specifically considered Mahalanobis depth and simplicial depth.
The method in Bell et al 3 was better able to correctly classify a higher percentage of location shifts when using Mahalanobis depth than when using simplicial depth. Although Mahalanobis depth is easy to compute, it requires computation of the covariance matrix and is not feasible for cases when p > N. The order of computation, O(N^(p+1)), for simplicial depth makes it impractical for applications in large samples with high dimensions. Although there have been successful attempts at decreasing the computational complexity of simplicial depth, this progress has been made in lower dimensions (p = 2). 33

A number of authors have proposed cluster-based algorithms for outlier detection, and some of these methods have been applied to SPC. For example, Thissen et al 34 considered the application of Gaussian mixture models to describe nonnormally distributed data as a mixture of clusters. A multistep algorithm for SPC to overcome the problem of masking caused by multiple outliers that affect the parameter estimates in a retrospective analysis was developed by Jobe and Pokojovy. 35 This method, which requires estimation of the covariance matrix, combines a modified moving average to smooth the data, multivariate density estimation, and mixture-based clustering to identify the cluster with the largest number of observations. Simulation results suggest that their method detects outliers with a higher probability than Hotelling's T^2 chart when there are multiple outliers and the shift size is large. The work of Jobe and Pokojovy 35 was extended by Jobe and Pokojovy 36 by including the FSRMCD estimator of Cerioli. 23 This cluster-based method requires two sets of simulated limits to control the family-wise and per-comparison error rates, which are determined from multivariate standard normal data. Because these methods require estimation of the covariance matrix, they are not directly applicable when p > N.
An outlier detection approach related to those based on statistical distance relies on density estimation. Support vector data description (SVDD) was introduced by Tax and Duin 37 as a method of multivariate kernel density estimation and applied to the outlier detection problem. SVDD, which is discussed in detail in Section 2.1, finds a flexible minimum volume boundary with radius, R, around a multivariate set of data. Observations whose distance from the center, a, exceeds the radius of the boundary, ||x_i − a||^2 > R^2, are considered outlying. The SVDD method was adapted to SPC by Sun and Tsung, 38 who developed the k chart. It is shown in Weese et al 39 that the k chart does not perform well in distinguishing IC from OC observations in phase I.

OCP overview
Our proposed method, the OCP, borrows from both the kernel-density and the statistical-distance approaches. The method begins with the iterative application of one-class algorithms, such as the SVDD methodology proposed by Tax and Duin, 37,40 to estimate the center of the data. Scaled kernel distances of each observation from the estimated center are then used to identify potential outlying observations. The OCP method does not require estimation of the covariance matrix. In addition, OCP is data driven and can be used with any kernel function that is appropriate for the data. The rest of the paper is organized as follows. In Section 2, we introduce the OCP method. In Section 3, we apply the OCP method to the NFL Wikipedia search data set. We evaluate and compare the performance of the OCP method with the modified RMDP method of Ro et al 27 using simulations in Section 4. In Section 5, we introduce a robust approach for determining limits for the OCP when the distribution of the data is unknown. Finally, in Section 6, we discuss the flexibility and potential limitations of the proposed method and provide our concluding remarks.

OCP METHOD
The OCP method includes robust estimation of the center and the use of a kernel distance measure between each observation and the center. To detect outliers, we recommend using a threshold applied to the kernel distance measures. Observations that are the most dissimilar from the estimated center are flagged as potential OC events. Algorithm 1 gives the pseudocode for the OCP method.

Estimating the center of the multivariate data
The first step in using the OCP method is to determine the center of the multivariate data. We determine the center of the data using an iterative peeling approach with the boundaries derived from SVDD, similar in principle to that of convex hull peeling. 41 The mean estimated by convex hull peeling is an affine equivariant estimator, but it is not robust to outliers. The robustness of estimators is often measured in terms of the finite sample replacement breakdown point (FSRBP). The FSRBP of a location estimator, t_N, based on a data set, S, is the smallest fraction, m/N, of outliers that can take the estimate "over all bounds" 42 :

FSRBP(t_N, S) = min{ m/N : sup_{S_O} ||t_N(S_O) − t_N(S)|| = ∞ },

where the supremum is taken over all possible corrupted samples, S_O = {y_1, …, y_m}, for y_i ∈ R^p, obtained by replacing m points from S with arbitrary values. It is shown in Donoho and Gasko 43 that convex hull peeled means can have a FSRBP of no more than 1/(p + 1), which is not robust for data in very high dimensions. We introduce a more robust method for estimating the multivariate center using boundaries determined from SVDD.
In Tax and Duin, 40 SVDD is introduced as a method for both outlier detection and classification. SVDD is a one-class method that finds a flexible minimum volume boundary around a multivariate data set via optimization. Similar to support vector classification, 44 the SVDD boundary is defined only by a few points based on their proximity to the center of the data. These points are referred to as the support vectors. One obtains a SVDD hypersphere with radius R for data vectors, x_i ∈ R^p, i = 1, …, N, by minimizing

R^2 + C Σ_{i=1}^{N} ξ_i subject to ||x_i − a||^2 ≤ R^2 + ξ_i, ξ_i ≥ 0, i = 1, …, N. (1)

Here, a is the center of the hypersphere, ξ_i is an error term, and C is a parameter that controls the fraction of observations outside of the boundary. The parameter is set as C = 1/(Nq), where q is the desired fraction of observations rejected. For the OCP method, we set q = 0.0001, limiting the number of observations allowed to be outside the hypersphere at each peel to practically zero.
Using Lagrange multipliers, Tax and Duin 40 showed that incorporating the constraints from Equation (1) results in the need to maximize

L = Σ_i α_i (x_i · x_i) − Σ_i Σ_j α_i α_j (x_i · x_j) (2)

with respect to the α_i, such that 0 ≤ α_i ≤ C. Maximizing Equation (2) results in a set of α_i such that when an observation satisfies the inequality constraint in Equation (1) strictly, ||x_i − a||^2 < R^2, the corresponding α_i = 0, and these observations are inside of the boundary. When ||x_i − a||^2 = R^2, α_i > 0, and these observations are the support vectors providing the description of the data and, thus, the one-class boundary. Observations where ||x_i − a||^2 > R^2 are beyond the boundary and are also considered to be support vectors but belong to the outlier class.
To construct a boundary with a flexible shape, the inner products in Equation (2) can be replaced by a kernel (similarity) function, K(x_i, x_j) = Φ(x_i) · Φ(x_j), where Φ(x) maps the vector x to a different feature space. 45 Commonly, SVDD is implemented using the Gaussian kernel function, defined as

KS_G(x_i, x_ℓ) = exp( −||x_i − x_ℓ||^2 / s ), (3)

where s is a scale parameter that must be specified in order to complete the optimization in Equation (2). Several authors have considered data-dependent methods for determining appropriate values of s for SVDD using the Gaussian kernel, including Tax and Duin 40 and Ning and Tsung. 46 Weese et al 39 show that for a specified value of C, s ≈ p works well when the data are scaled to approximate unit variance. We present boundaries created with s = p for the remainder of the paper.
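One way to see why s ≈ p is a sensible default for standardized data: two independent observations with unit-variance coordinates satisfy E||x_i − x_ℓ||^2 = 2p, so dividing by s = p keeps a typical similarity near exp(−2) as p grows. A small Python check (our own illustration, not code from the paper):

```python
import numpy as np

def ks_gauss(x, y, s):
    """Gaussian kernel similarity of Equation (3): exp(-||x - y||^2 / s)."""
    return np.exp(-np.sum((x - y) ** 2) / s)

rng = np.random.default_rng(1)
typical = {}
for p in (10, 100, 1000):
    X = rng.normal(size=(100, p))   # data already scaled to ~unit variance
    # E||x - y||^2 = 2p for unit-variance coordinates, so s = p keeps the
    # typical pairwise similarity near exp(-2) as the dimension grows.
    sims = [ks_gauss(X[a], X[b], s=p)
            for a in range(50) for b in range(50, 100)]
    typical[p] = float(np.mean(sims))
```

The average similarity stays in a narrow band across two orders of magnitude in p, which is the stabilizing effect the s = p choice is after.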
To compute the proposed one-class peeled mean, μ̂_OCP, one constructs sequential one-class boundaries, peeling away the support vectors that create the boundaries at each iteration. After many successive peels, the remaining final observations are averaged to estimate the center of the data. Figure 1 shows the successive boundaries generated using the OCP method on toy data generated from a bivariate normal distribution along with 5 outlier points (shown as circles) from an outlying distribution. Notice in the second graph in Figure 1 that there are two separate boundaries around the outlying observations; thus, these observations were removed in the first two iterations. The remaining SVDD boundaries peeled the support vectors from the outside, resulting in two remaining observations. We refer to the minimum number of observations remaining as n. The analyst can specify any number, n < N, for the size of the reduced data set upon which to base the robust estimate. Our results suggest that peeling to a reduced set of n = 2 gives the highest FSRBP values in the cases we considered (see Appendix A).
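The peeling loop can be sketched as follows. Fitting a true SVDD boundary requires a quadratic-programming solver, so this sketch substitutes a simpler peel that removes, at each iteration, the points with the largest Gaussian kernel distance from the current mean; the authors' OCP.R peels actual SVDD support vectors, so treat `ocp_center`, `peel_frac`, and the stand-in peeling rule as illustrative assumptions only.

```python
import numpy as np

def ocp_center(X, n_min=2, peel_frac=0.05):
    """Sketch of the peeling idea behind mu_hat_OCP.  The real OCP peels
    the support vectors of a fitted SVDD boundary; lacking a QP solver,
    we peel the points farthest (in Gaussian kernel distance) from the
    current mean, which mimics removing the boundary points."""
    S = X.copy()
    s = X.shape[1]                      # kernel width s = p
    while S.shape[0] > n_min:
        mu = S.mean(axis=0)
        ks = np.exp(-np.sum((S - mu) ** 2, axis=1) / s)
        kd = np.sqrt(2.0 * (1.0 - ks))  # kernel distance to current mean
        k = max(1, int(peel_frac * S.shape[0]))
        S = S[np.argsort(kd)[:-k]]      # peel the k most distant points
    return S.mean(axis=0)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(80, 5)),           # in-control points
               rng.normal(loc=8.0, size=(20, 5))])  # 20 gross outliers
center = ocp_center(X)
```

Even with 20% gross contamination, the peeled center stays near the origin, while the ordinary sample mean is pulled far toward the outlying cluster.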
Although it is not possible to analytically derive the FSRBP of the OCP estimate of the mean, we perform simulations to study the empirical breakdown properties of this estimator for samples from a variety of multivariate normal, lognormal, and t distributions. For a baseline comparison, the simulations also contain mean estimates from convex hull peeling. The results, given in Appendix A, suggest that empirically, the OCP estimator does not begin to break down until the sample contamination is around 30% to 35% for the dimensions and sample sizes considered. Note that the convex hull peeling method breaks down empirically at less than 10% contamination.

Measuring distance from the estimated center
The second step in the OCP method is to determine how close observations are to the estimated center of the data. To measure the distance between each observation, x_i, and the center of the data, μ̂_OCP, we propose the use of the Gaussian kernel distance given by

KD(x_i, μ̂_OCP) = [ 2( 1 − KS_G(x_i, μ̂_OCP) ) ]^{1/2}. (4)

It is clear from Equation (3) that the Gaussian kernel similarity, KS_G, is a decreasing function of the Euclidean distance between two arbitrary points, x_i and x_ℓ, scaled by the width parameter s, and KS_G(x_i, x_ℓ) = 1 for i = ℓ.
The distance in Equation (4) inherits the properties of a distance metric from the positive definiteness of the Gaussian kernel. 47 There are, however, some limitations that the end-user needs to be aware of. Most importantly, the Gaussian kernel similarity assumes that either the marginal distributions of the variables are similar or the data have been standardized so that variables with higher variances do not dominate the similarity metric.

Determining outlyingness
To determine outlyingness, we recommend rescaling the kernel distances in Equation (4) using the following robust linear transformation:

sRKD_i = ( KD_i − med(KD) ) / MAD(KD), (5)

where KD_i = KD(x_i, μ̂_OCP). Here, med(KD) is the median of the vector of N kernel distances, where N is the sample size, and MAD(KD) is the median absolute deviation of the N kernel distances. The MAD(KD) has the best possible breakdown (50% FSRBP) and is defined by 48

MAD(KD) = 1.4826 × med_i | KD_i − med(KD) |,

where the constant 1.4826 makes the MAD consistent for the standard deviation under normality. To detect potential OC observations, we recommend a threshold for the sRKD_i from Equation (5). Because the distribution of the sRKD_i varies with the underlying data distribution, the threshold values, h, can be empirically determined to achieve an approximate type I error rate using a bisection method (see Algorithm B1 in Appendix B). Finding the threshold, h, using Algorithm B1 requires some knowledge of the underlying data distribution. If the underlying data distribution is unknown, an approximate robust value for h can be used as described in Section 5.
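A minimal sketch of the robust rescaling and flagging step, assuming the kernel distances have already been computed; the threshold h = 3 below is purely illustrative, since the paper calibrates h by simulation:

```python
import numpy as np

def srkd(kd):
    """Robustly scaled kernel distances of Equation (5):
    (KD_i - med(KD)) / MAD(KD), with the 1.4826 consistency constant
    folded into the MAD."""
    med = np.median(kd)
    mad = 1.4826 * np.median(np.abs(kd - med))
    return (kd - med) / mad

rng = np.random.default_rng(3)
kd = np.concatenate([rng.normal(1.0, 0.05, size=95),  # bulk of distances
                     np.full(5, 1.4)])                # 5 distant points
flags = srkd(kd) > 3.0   # h = 3 is illustrative, not a calibrated limit
```

Because the median and MAD ignore the tail, the five distant points cannot inflate the scale estimate, and they stand out clearly after rescaling.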
In the next section, we illustrate the use of the OCP method, which includes a graph to help visually determine outliers. The visual implementation of this method strongly enhances its use in practice, and we recommend this approach. The OCP method is summarized as follows:
1. Determine a threshold value, h, using the bisection method (see Algorithm B1 in Appendix B). If the joint distribution of the x_i is unknown, use a robust limit calculation (see Section 5).
2. Compute the robust estimate of the center of the data, μ̂_OCP, using SVDD with the Gaussian kernel function to derive the boundaries, removing the support vectors at each peeling step.
3. Compute the kernel distance between each vector of observations and the robust estimate of the center of the data, μ̂_OCP.
4. Robustly scale the distances using Equation (5).
5. Flag observations with robustly scaled kernel distances larger than h as potential outliers.
The R code implementation of the method is included in the supplementary material of this article as OCP.R. Algorithm 1 shows the pseudocode of the OCP method and describes the steps for flagging outliers in our simulation, starting with a contaminated sample, S. The OCP.R function takes as parameters the unscaled data, S; the minimum number of observations remaining, n, which defaults to two; and the threshold value, h, which defaults to the robust limit of Section 5 but can instead be computed with the OCPLimit.R function (described in Algorithm B1 in Appendix B).

ANALYSIS OF THE NFL DATA
The NFL data consist of the number of Wikipedia hits on p = 1917 pages representing NFL teams, coaches, managers, and active players between 1 September 2014, 00:00 UTC, and 15 September 2014, 17:00 UTC. Because of a daily cyclical pattern in the data, we first preprocessed the data and used the residuals from an additive Holt-Winters model lagged by 24 hours. 49 Our goal is to determine a baseline data set free of unusual events that could disrupt the normal Wikipedia traffic related to the NFL. This baseline sample can then be used to establish a method for monitoring for an increase or decrease in future traffic. Events that disrupt the normal Wikipedia traffic include:
• Games: interest in players, teams, coaches, and managers increases as the teams play regular and playoff games, and consequently, the Wikipedia pages have higher than usual traffic.
• Controversies: including player suspensions because of the use of illegal substances, performance-enhancing drugs, domestic disturbances, controversial postures, and other high-profile events.
• Breaking news: including updates to controversies and new developments.
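For readers unfamiliar with the preprocessing step, the additive Holt-Winters recursions with a 24-hour season can be sketched directly; the smoothing constants and the initialization below are our own illustrative choices, as the paper does not report them. On a noiseless trend-plus-daily-cycle series the one-step residuals are essentially zero, so a disruptive event would surface as a large residual at its hour.

```python
import numpy as np

def hw_additive_residuals(y, m=24, alpha=0.2, beta=0.05, gamma=0.1):
    """One-step-ahead residuals from an additive Holt-Winters model with
    seasonal period m.  Smoothing constants and initialization are
    illustrative choices, not values from the paper."""
    y = np.asarray(y, dtype=float)
    b = (y[m:2 * m].mean() - y[:m].mean()) / m       # initial trend
    level = y[:m].mean() + b * (m - 1) / 2.0         # level at t = m - 1
    season = list(y[:m] - (y[:m].mean() + b * (np.arange(m) - (m - 1) / 2.0)))
    resid = []
    for t in range(m, len(y)):
        s = season[t - m]
        resid.append(y[t] - (level + b + s))         # one-step residual
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + b)
        b = beta * (new_level - level) + (1 - beta) * b
        season.append(gamma * (y[t] - new_level) + (1 - gamma) * s)
        level = new_level
    return np.array(resid)

t = np.arange(24 * 10)
clean = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 24)  # trend + daily cycle
resid = hw_additive_residuals(clean)
```

The 24-hour seasonal component absorbs the daily cycle the same way the lag-24 model in the paper does, leaving residuals that reflect only departures from the routine pattern.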
In this example, we are comparing the OCP method to an existing outlier detection method for high-dimensional data to determine how well it performs in identifying several disruptive events known to occur during the 2-week study period. We used a structured methodology, described in Table C1 in Appendix C, to identify time periods associated with games, controversies, and breaking news. For example, game events were identified as 1 hour prior to kickoff of the first NFL game until 1 hour after the final down of the last game on NFL game days. All times were converted to UTC for consistency.
In total, 105 out of the 354 hours in the NFL data set are considered events of interest. We apply the proposed OCP method to the labeled data to determine if the flagged observations correspond to the predetermined events. To determine outlyingness using the OCP method, we use an empirically determined threshold described in Algorithm B1 in Appendix B. The data we are evaluating are prewhitened by the application of an additive Holt-Winters model. An additive model was chosen because some of the p = 1917 variables had some zero counts. The residuals from the Holt-Winters model show moderate departures from normality and are right skewed. There is slight correlation among the p = 1917 residual vectors; the first quartile of the absolute value of the Pearson correlation coefficients is Q_1 = .017, and the third quartile is Q_3 = .073. Although the distributions of the residuals are skewed in most cases, we evaluate outlyingness using thresholds generated from multivariate normal, lognormal, and t distributions in order to compare the performance across a variety of distributional assumptions. We select the threshold values from simulated cases with an average correlation near zero (N = 354, p = 1917, ρ = 0) using the methodology described in Algorithm B1 in Appendix B (see the table of empirical threshold values in Appendix D). We emphasize that no exact statistical model exists for the data we are analyzing. The OCP method is intended to be an approximate, data-driven method that flags unusual events while retaining the majority of the IC data.
For comparative purposes, we include an analysis using the RMDP outlier detection method of Ro et al 27 with a modified threshold value generated according to Algorithm B1 in the Appendix. The RMDP is a classical statistical distance method, which evaluates observations based on their distance from the center of the data. The RMDP distances are based on a minimum diagonal product estimator of the covariance matrix using only an outlier-free subset of data and the diagonal elements of the covariance matrix. An asymptotic cutoff value for the statistical distance is recommended by Ro et al 27 to determine outlyingness; however, we used empirically derived thresholds to cast the RMDP method in the best possible light. The distribution of the RMDP distances is asymptotically normal in p when the underlying data distribution is multivariate normal. We found that the threshold recommended by Ro et al 27 resulted in type I error probabilities that were much higher than desired in many cases, especially for nonnormally distributed data, as originally pointed out by Ro et al. 27 Thus, we used empirically derived thresholds for the case N = 354, p = 1917, ρ = 0 that gave simulated type I error rates of 5% under the multivariate normal, lognormal, and t distributions (see the table of empirical threshold values in Appendix D). Figure 2 illustrates how the asymptotically derived limits proposed by Ro et al 27 can result in a large number of type I errors for a two-dimensional correlated t-distributed example with 20% outlier contamination. Notice the larger number of "+" signs in the plot on the right compared with the plots on the left.
Throughout the remainder of the paper, we use empirically derived limits for the RMDP distances instead of the asymptotic limits; we still refer to this as the RMDP method. As a result, the reader should keep in mind that the RMDP performance presented in this paper does not necessarily correspond to the performance of the RMDP as proposed by Ro et al 27 and, in fact, improves upon it, especially in nonnormal, correlated distributions. We chose the RMDP method for comparison because it works for wide data, that is, when p > N, and was shown to have superior performance against several competing methods in Ro et al. 27 Figure 3 gives a graphical display of the OCP and RMDP methods with the thresholds based on the multivariate t distribution, which we feel most closely resembles the distribution of the residuals. When using the threshold values from the t distribution with d = 10 degrees of freedom, the OCP method correctly classifies 88.70% of the observations and detects 65.71% of the events of interest. These results compare favorably against the RMDP method, which correctly classifies 84.18% of the observations and detects only 50.48% of the events of interest. If the OCP method were used to identify outliers for this example, 281 observations would be retained for a baseline sample, of which 245 would be IC observations. By comparison, if the RMDP method were used to identify outliers in this case, 297 observations would be retained for the baseline sample with 245 as IC observations. Table 1 gives a summary of the results of the analysis for the thresholds derived from the three different multivariate distributions. It is important to note that, in this example, the OCP has similar threshold values and performs similarly in terms of correct classification rate, regardless of the assumption of the underlying distribution.
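The reported OCP figures are internally consistent, which can be verified with a little arithmetic from the counts given above (354 hours, 105 labeled events, 281 observations retained, 245 of the retained observations in-control):

```python
# Consistency check of the reported OCP results on the NFL data:
# 354 hours, 105 labeled events, 281 retained, 245 of those in-control.
N, events, retained, retained_ic = 354, 105, 281, 245
ic_total = N - events            # 249 in-control hours in the data
flagged = N - retained           # 73 hours flagged as outlying
FP = ic_total - retained_ic      # in-control hours incorrectly flagged
TP = flagged - FP                # event hours correctly flagged
pcc = 100 * (TP + retained_ic) / N   # percent correctly classified
dr = 100 * TP / events               # detection rate
```

Working backwards from the retained counts gives 4 false positives and 69 detected event hours, which reproduces the 88.70% correct classification and 65.71% detection rates quoted in the text.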
On the other hand, the performance of the RMDP method is quite sensitive to the assumption of the underlying data distribution, changing substantially between the three distributions. For example, the number of false positives for the OCP method is 8, 4, and 4 under the assumptions of normal, lognormal, and t-distributed data, respectively. The number of false positives for the RMDP method is 145, 9, and 4 under the same respective cases. Inspection of the limits in Appendix D shows that for a given sample size, dimension, and average correlation, the OCP limits are relatively stable across distributions. For example, the limits range from h = 2.448 when N = 354, p = 1917, and ρ = 0 for normally distributed observations to h = 4.535 in the same case for observations from a multivariate t distribution. For the same cases, the empirically derived RMDP limits range from h = 1.463 to h = 52.500. The stability of the OCP across distributions seems to hold when the distributions are symmetric or the correlations among the variables are lower (ρ ≤ .5) but may not be true in skewed distributional cases with high correlation. Our simulation study in Section 4 will bring more clarity to the performance of the OCP method.
In terms of runtime, we measured the time in seconds to complete both the OCP and RMDP methods for the NFL example using an Intel Core i7-6700 clocked at 3.40 GHz without parallel multicore processing. The OCP method uses 15.00 seconds of total computational time, while the RMDP method uses 93.17 seconds of total computational time. The RMDP method is generally about 6 times slower than the OCP method on average for all the simulations considered in this paper. The reader should note that the runtime comparison is not intended to measure the theoretical computational complexity of the OCP method but merely to compare the time it takes to run each method under the same circumstances with fixed physical computational power.

SIMULATION STUDY
To study the performance of the OCP method in a variety of scenarios, an extensive simulation study was conducted.

Study design
The simulation study compares the OCP method to the RMDP method with empirically derived thresholds. The thresholds for both methods were determined using Algorithm B1 in Appendix B so that the type I error rate is approximately 5% for each of the distributions and cases considered. The limits for both the OCP and RMDP methods are given in Appendix D.
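The bisection calibration of Algorithm B1 can be sketched generically: simulate scaled distances from in-control data and bisect on h until the simulated false-alarm fraction matches the target. The `calibrate_h` helper and the standard normal stand-in for the simulated statistics are illustrative assumptions; the paper simulates full data sets from the assumed multivariate distribution and recomputes the OCP or RMDP distances at each step.

```python
import numpy as np

def calibrate_h(simulate_stats, alpha=0.05, lo=0.0, hi=50.0, tol=1e-3):
    """Bisection in the spirit of Algorithm B1: find the threshold h
    whose simulated false-alarm fraction is approximately alpha."""
    stats = np.concatenate([simulate_stats() for _ in range(200)])
    while hi - lo > tol:
        h = (lo + hi) / 2.0
        if np.mean(stats > h) > alpha:
            lo = h        # too many false alarms: raise the threshold
        else:
            hi = h
    return (lo + hi) / 2.0

rng = np.random.default_rng(4)
# stand-in: pretend the in-control scaled distances are standard normal
h = calibrate_h(lambda: rng.normal(size=100))
```

For the standard normal stand-in, the calibrated h settles near the upper 5% point of the simulated distances, exactly the behavior the empirically derived limits in Appendix D rely on.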
The study considers observations generated from multivariate normal, lognormal, and t distributions. Because of the differences in the distributions and to maintain similar shift sizes across vastly different dimensions, we shifted the OC observations probabilistically. For a given probability distribution, F_X(·), a p-dimensional IC sample, S_I = {x_1, …, x_{N−m}}, of size (N − m) × p, with mean, μ_0 = [μ_01, …, μ_0p]^⊤, and covariance, Σ, is generated. Then, a p-dimensional OC sample, S_O = {y_1, …, y_m}, of size m × p from F_Y(·), which has a shifted mean of μ_1 = μ_0 + δ, is generated. The parameter, δ, is chosen such that the mean of the OC distribution, μ_0 + δ, is unlikely to originate from F_X(·). For this study, we chose a small tail probability, α = .023, under the corresponding IC distribution. The two-dimensional plots in Figure 4 show the location of μ_1 for α = 0.023 under the distributions considered. The location of the equicoordinate quantiles (μ_0 + δ) satisfying α = 0.023 under each distribution was obtained numerically using the R package mvtnorm. 50,51 The multivariate shift can be compared with approximately a 2σ mean shift in a univariate normal model.
For each case, the shift is evenly distributed over the p dimensions. The contaminated sample, S, is then defined as S = S_I ∪ S_O, with contamination level m/N = 0%, 5%, or 20%. Table 2 gives a summary of the simulation conditions. For each possible combination of sample size, distribution, shift type, correlation, and percent outliers, 1000 replications were performed.
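As an illustration, the contaminated-sample construction above can be sketched in Python (the paper's supplementary code is in R). The per-coordinate shift `delta` is a hypothetical stand-in for the equicoordinate shift δ that the study obtains numerically with the R package mvtnorm:

```python
import numpy as np

rng = np.random.default_rng(1)

def contaminated_sample(N, p, contamination, delta):
    """Build S = S_I U S_O: N - m IC observations with mean 0 and
    m OC observations with every coordinate shifted by delta."""
    m = int(round(contamination * N))
    s_in = rng.standard_normal((N - m, p))          # IC sample from F_X
    s_out = rng.standard_normal((m, p)) + delta     # OC sample with mean 0 + delta
    return np.vstack([s_in, s_out]), m

# 20% contamination with roughly a "2 sigma" per-coordinate shift
S, m = contaminated_sample(N=100, p=50, contamination=0.20, delta=2.0)
```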
In the IC scenarios, the primary performance measure is the average of the empirical type I error percent. We compute α̂_i = FP_i/N, where FP_i is the number of false positives in each of the i = 1, …, 1000 replicates for each sample size (N) and dimension (p) case. The α̂_i values are averaged, giving ᾱ = Σ_{i=1}^{1000} α̂_i/1000. In the OC scenarios, the primary performance measures are the average percent of observations correctly classified (%CC) as either IC or OC and the average empirical detection rate (DR) of OC observations. For each case, we compute %CC_i = 100(TP_i + TN_i)/N and DR_i = 100 TP_i/(TP_i + FN_i). Here, TP_i is the number of true positives, TN_i is the number of true negatives, and FN_i is the number of false negatives for the ith replicate. These are averaged, giving %CC = Σ_{i=1}^{1000} %CC_i/1000 and DR = Σ_{i=1}^{1000} DR_i/1000.

Table 3 shows the IC performance of the OCP and RMDP methods. As previously mentioned, the OCP and RMDP methods were designed to achieve a desired type I error of 5%. Both methods achieve IC performance near the desired level when using the empirically derived threshold values. Figures 5 and 6 give a graphical summary of the OC performance of the OCP and RMDP methods with 5% and 20% contamination when the data follow multivariate lognormal, normal, and t distributions with different levels of correlation. Figure 5 shows the %CC of the OCP and RMDP methods. Ideally, a method with a high value of %CC will preserve the majority of the observations in phase I while correctly signaling those observations that should be flagged as potential outliers. In the case of 5% contamination, the OCP and RMDP methods performed similarly in terms of %CC, with the OCP method performing slightly better in most cases. In the case of 20% contamination, the OCP method has higher values of %CC in most cases. An exception is when the data are skewed (lognormal), highly correlated, and the sample size is small.
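For concreteness, the per-replicate measures can be computed with a small helper like the one below (an illustrative Python sketch, not part of the paper's supplementary R code); it assumes the counts TP, TN, FP, and FN are known for a single replicate:

```python
def phase1_metrics(tp, tn, fp, fn):
    """Per-replicate phase I performance measures: empirical type I
    error percent (FP / N), percent correctly classified (%CC), and
    detection rate (DR) among the true OC observations."""
    n = tp + tn + fp + fn               # total sample size N
    alpha_hat = 100.0 * fp / n          # type I error percent
    pcc = 100.0 * (tp + tn) / n         # %CC
    dr = 100.0 * tp / (tp + fn)         # DR
    return alpha_hat, pcc, dr
```

Averaging these values over the 1000 replicates yields the summary measures reported in the tables and figures.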
The performance of both methods generally deteriorates at higher correlation levels (ie, ρ = 0.75). This is partly because, as the correlation increases, the mean shift (controlling for sample size and dimensionality) needed to capture a given value of a decreases. See Figure 4 for visual intuition in two-dimensional examples. Figure 6 shows the DR of the OCP and RMDP methods. In all but the highly correlated (ρ ≥ 0.25), lower dimensional (p ≤ 100) cases when the data are lognormally distributed, the DR of the OCP method is higher than that of the RMDP method.

Simulation results
In all of the cases considered, the OC observations were generated to be probabilistically unlikely. It is also interesting to consider how the OCP method compares with the RMDP method in detecting less extreme OC observations. To investigate this, we used the same methodology described in Section 4.1 but generated shifts with differing degrees of outlyingness. In all prior results, we used a value of a = 0.023, which corresponds to the 97.7th percentile, or about a 2σ shift in a univariate normal distribution. In Figure 7, we consider 0 < a < 0.5 for the case of N = 100, p = 100, and the multivariate lognormal, normal, and t distributions. We see that the DR for both the OCP and RMDP methods declines as the shift size gets smaller, that is, as a gets larger. The %CC behaves similarly. We also see that the OCP method has equivalent or higher values of DR in all but the lognormal case with 20% contamination.

FIGURE 6 Summary of the detection rate as a percentage of OC samples. OCP, one-class peeling; RMDP, robust minimum diagonal product

FIGURE 7 Summary of the percent correctly classified and detection rate for varying shift sizes. OCP, one-class peeling; RMDP, robust minimum diagonal product

Results summary
The proposed OCP method outperforms the RMDP method in terms of correctly classifying observations and detecting OC observations in most of the cases considered. A central goal of a phase I method is to correctly distinguish IC observations from OC observations. The correct classification of the IC and OC observations allows the practitioner to investigate the true outlying observations while retaining the largest possible sample as a baseline for use in phase II.
The OCP method has higher %CC and DR values than the RMDP method in most cases, especially in higher dimensions. While the RMDP method does a better job of detecting OC conditions when the data are skewed, highly correlated, and of lower dimension, as in the lognormal case with 20% contamination, the OCP method outperforms the RMDP method in all other cases considered, especially in the higher dimensional cases. It is also important to reiterate that the RMDP method has been modified here so that the method obtains better performance in nonnormal, correlated settings.

ROBUST LIMITS
A requirement of many methods for retrospective evaluation of outlyingness is knowledge of the data distribution. In practice, it is often impossible to know the distribution of the data, especially in higher dimensions. The OCP method (like the RMDP and other methods) works best when the distribution is known. In the absence of distributional knowledge, practitioners can use the OCP threshold values given in Appendix D as approximations for symmetric, heavy-tailed, and skewed distributions. Alternatively, we offer a simple, robust, data-driven method for determining the OCP threshold limits that works reasonably well in many situations.
We recommend using the upper fence of the traditional boxplot as an approximate robust threshold for the OCP method. We also test this approach with the RMDP method. In other words, this sample-driven method calculates the threshold as

threshold = Q3 + 1.5 × IQR. (8)

Here, Q3 is the third quartile and IQR is the interquartile range of the distances from either the OCP or RMDP method. Figure 8 shows the %CC, and Figure 9 shows the DR, of the OCP and RMDP methods using the limits generated from Equation (8) for a variety of correlations and contamination levels. When the contamination rate is low (5%), both methods perform comparably and almost identically in most cases, correctly classifying nearly 100% of the observations when using these robust thresholds. However, it is important to note that in the case of very high contamination (say, greater than 25%), the robust thresholds are affected by the outliers, and the percent of correctly classified observations could be reduced, because the IQR would most likely be affected. Another limitation is that by using robust limits we can no longer control the type I error probability. Appendix E gives the empirical type I error rates for the OCP and RMDP methods using robust limits in some example cases. These results are based on 1000 simulations.

FIGURE 8 Summary of the percent correctly classified using robust threshold limits. OCP, one-class peeling; RMDP, robust minimum diagonal product

FIGURE 9 Summary of the detection rate using robust threshold limits. OCP, one-class peeling; RMDP, robust minimum diagonal product
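A minimal Python sketch of this robust threshold (the supplementary OCPLimit.R computes thresholds for the OCP method; the fence below is the distribution-free alternative described above):

```python
import numpy as np

def robust_threshold(distances):
    """Upper boxplot fence, Q3 + 1.5 * IQR, applied to the vector
    of OCP (or RMDP) distances as in Equation (8)."""
    q1, q3 = np.percentile(distances, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# Observations whose distance exceeds the fence are flagged as outliers.
d = np.array([1.1, 0.9, 1.3, 1.0, 1.2, 5.0])
limit = robust_threshold(d)
flags = d > limit
```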

CONCLUDING REMARKS
We have proposed a new OCP method to detect outlying observations in phase I. Our proposed method works well in high dimensions and does not require covariance estimation. Because most traditional methods require estimation of the covariance matrix, they cannot be used with wide data. The OCP method allows a practitioner to perform a phase I analysis with fewer samples than would be required for a traditional method.
The OCP method uses a kernel-density approach to estimate the center of the data, a distance approach to measure the distance of each observation from the center, and a simple threshold to detect OC observations. The method works well when there is a high percentage of contamination in the data set. When implemented using the Gaussian kernel, we recommend the bandwidth parameter s = p, provided the data are scaled to unit variance. Optimizing s for a given data set is outside the scope of this paper but could further improve the results of the OCP method, especially in highly correlated cases. As with many multivariate methods, the SVDD boundaries can be dominated by a specific dimension if the scales of the variables are quite different. To avoid this, the practitioner might consider standardizing the data when using the OCP method. In the supplementary materials, we provide an R function, OCP.R, to compute the OCP distances and an R function, OCPLimit.R, to find custom thresholds for determining outlyingness. The supplementary materials can be found at https://github.com/martinwg/OCP.git. In addition, we have suggested a robust method for determining an outlyingness threshold when the distribution of the data is unknown.
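As a rough Python illustration of the distance step only (not the paper's OCP.R implementation, and omitting the SVDD peeling iterations), one can compute each observation's squared feature-space distance from the kernel mean of the sample under a Gaussian kernel with bandwidth s = p:

```python
import numpy as np

def gaussian_kernel(X, Y, s):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / s)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / s)

def kernel_distances(X, s):
    """Squared distance of each observation from the kernel mean:
    k(x, x) - 2 * mean_i k(x, x_i) + mean_{i,j} k(x_i, x_j),
    with k(x, x) = 1 for the Gaussian kernel."""
    K = gaussian_kernel(X, X, s)
    return 1.0 - 2.0 * K.mean(axis=1) + K.mean()

X = np.random.default_rng(0).standard_normal((50, 5))
d = kernel_distances(X, s=X.shape[1])   # bandwidth s = p for unit-variance data
```

Observations far from the bulk of the data have kernel values near zero against every other point, so their distance approaches its maximum; thresholding these distances is the final step of the method.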
Although there are no existing phase I methods to serve as a benchmark for cases when p > N, we adapted the RMDP method of Ro et al 27 for use in phase I. Our results show that the OCP method results in the highest percentage of observations correctly classified as either IC or OC in most of the cases considered.
The OCP method is based on iterative peeling of boundary support vectors and requires quadratic optimization at each iteration; therefore, computational time increases with N. However, empirical tests show that the OCP method is computationally faster than the RMDP method. A practitioner using the OCP method should be aware that the computational time is higher for cases where N >> p. Because the OCP method is fairly robust to the choice of q, the desired fraction of observations rejected, when the sample size N is large, the practitioner can choose q = 0.05 to speed computation.
Natural extensions of this research include investigating the OCP method with kernels other than the Gaussian kernel to estimate the center of the data and determine the distance from the center. More flexible kernels that apply to a variety of data types, specifically kernels appropriate for categorical and mixed-type data, as well as kernels for unstructured data such as images and text, may prove to be useful extensions of the OCP method to a new class of phase I problems. In addition, it would be interesting to explore other methods for determining an outlyingness threshold with the OCP method beyond those described in our paper.
APPENDIX A: BREAKDOWN PROPERTIES OF THE ONE-CLASS PEELED MEAN

The OCP method is novel in that it departs from the traditional approach of estimating the mean and covariance and measuring the distance from the center using a statistical distance. The traditional approach is restricted to near-normal models for continuous observations. Using kernel methods as in the OCP provides a new paradigm and the opportunity for flexibility in terms of the size and types of data considered. Donoho and Gasko 43 showed that convex hull peeled means can have an FSRBP of no more than 1/(p + 1), so the FSRBP does not depend on the number of observations, N, but instead on the dimensionality of the problem, with the FSRBP deteriorating as the dimension p increases. For the OCP estimator, μ̂_OCP, it is not possible to derive the FSRBP analytically. FSRBP properties are based on the premise that S is in general position, that is, N ≥ p. Therefore, we perform simulations to study and estimate the breakdown properties of μ̂_OCP for samples from multivariate normal, lognormal, and t distributions. Obtaining estimates of the FSRBP properties through simulation is not new in the literature and has been used where analytical solutions are not available. 19 Our simulation protocol to estimate the FSRBP is as follows: for a given probability distribution, F_X(·), we generate a p-dimensional inlier sample, S_I, of size N − m, with mean μ = 0 and covariance I for uncorrelated distributions. For correlated distributions, the correlation matrix is generated randomly such that it is positive semidefinite, with correlations ranging uniformly over [−1, 1]. For each case considered, 500 replications are summarized to obtain the breakdown percentages. We generate the p-dimensional outlier sample, S_O, of size m from F_Y(·), with a shifted mean μ + δ, where δ_i = 20σ_ii for i = 1, …, p for the multivariate normal and t distributions, and δ_i = e^(20σ_ii) for the multivariate lognormal distribution. The value of δ_i was chosen such that the means of the inlier and outlier distributions are as far apart as numerically measurable. We consider breakdown when

is the OCP estimate of the mean of F_X(·). The same process is used to estimate breakdown for the convex hull peeling estimator, μ̂_CHP. Figure A1 gives the percentage of cases where breakdown occurred for several dimensions and distributions when N = 50 for the OCP estimator μ̂_OCP, while Figure A2 illustrates the breakdown performance of μ̂_CHP for cases where N ≥ p. The results suggest that the OCP estimator, μ̂_OCP, begins to break down at around 30% to 35% contamination for the cases considered, while μ̂_CHP breaks down more rapidly and its estimated FSRBP depends on the dimensionality p. Similar results were obtained with other sample sizes and dimensions not shown here. Figure A3 illustrates the typical behavior of how μ̂_OCP breaks down when the OCP method peels to different values of n. The results suggest that the OCP method attains the highest FSRBP values when n = 2. Simulations with other sample sizes and dimensions not shown here also support this conclusion.
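The simulation protocol can be sketched as follows. This is an illustrative Python sketch only: the coordinate-wise median is a hypothetical stand-in for the OCP estimate of the mean, and the breakdown criterion used here (estimate pulled more than halfway toward the outliers) is an assumption, not the study's own criterion:

```python
import numpy as np

rng = np.random.default_rng(2)

def breakdown_rate(estimator, N, p, m, shift, reps=200):
    """Percentage of replicates in which the location estimate for the
    inliers (true mean 0) is dragged past half the outlier shift."""
    broke = 0
    for _ in range(reps):
        s_in = rng.standard_normal((N - m, p))        # inliers, mean 0
        s_out = rng.standard_normal((m, p)) + shift   # outliers, shifted mean
        est = estimator(np.vstack([s_in, s_out]))
        if np.linalg.norm(est) / np.sqrt(p) > shift / 2.0:
            broke += 1
    return 100.0 * broke / reps

# Coordinate-wise median as a stand-in estimator: robust at 10%
# contamination, broken at 60% contamination.
med = lambda S: np.median(S, axis=0)
rate_low = breakdown_rate(med, N=50, p=5, m=5, shift=20.0)
rate_high = breakdown_rate(med, N=50, p=5, m=30, shift=20.0)
```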

APPENDIX B: BISECTION ALGORITHM
The starting values for the bisection method are determined using inverse prediction from a linear model for the type I error rate, α̃, as a function of the factors N, p, h, average correlation (ρ̄), and distribution (normal, lognormal, or t_{d=10}). Empirical type I errors are calculated from the average of 1000 simulations for each of 30 factor combinations generated using an I-optimal design. The factors in the design cover 50 ≤ N ≤ 1000 and 50 ≤ p ≤ 2000, subject to 0.5 ≤ p/N ≤ 2, 2.5 ≤ h ≤ 5.5, and 0.1 ≤ ρ̄ ≤ 0.8, for the normal, t_{d=10}, and lognormal distributions. The starting value for h in Algorithm B1 is determined by supplying the desired type I error rate, N, p, distribution (normal, heavy tailed, or skewed), and correlation (as an estimate of the average correlation in the data) to the linear model to generate an inverse estimate along with a 99% confidence interval. The algorithm was tested using common error levels α̃ ∈ {0.01, 0.05, 0.10}. Numerical issues could arise for α̃ < 0.01. The code for generating a custom threshold value can be found in the supplementary materials.
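The calibration idea behind Algorithm B1 can be sketched generically as a bisection on h, assuming the empirical type I error is decreasing in h. The function `empirical_alpha` is a hypothetical stand-in for simulating the OCP chart at a given threshold parameter:

```python
def calibrate_h(empirical_alpha, target, lo, hi, tol=1e-3, max_iter=60):
    """Bisection: find h such that empirical_alpha(h) is within tol of
    target, assuming alpha decreases as the threshold h increases."""
    mid = (lo + hi) / 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        a = empirical_alpha(mid)
        if abs(a - target) < tol:
            break
        if a > target:       # too many false alarms: raise the threshold
            lo = mid
        else:
            hi = mid
    return mid

# Toy stand-in relationship alpha(h) = exp(-h); the 5% solution is -ln(0.05).
import math
h = calibrate_h(lambda x: math.exp(-x), target=0.05, lo=0.0, hi=10.0)
```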

APPENDIX C: NFL EXAMPLE EVENT IDENTIFICATION TABLES
1. Searched "NFL" on Google News in a custom search for a custom date range by day (based on PDT converted to UTC) during the two-week period.
2. Considered only news stories that appeared on the first page of the search results (approximately 10 search returns).
3. If a news story was new to any period, considered it to be breaking news. If it was classified as breaking news, navigated to the source that first reported the story to find the exact time the story was released and converted it to UTC.
4. If the story dropped during the overnight hours (1 AM to 8 AM), started the period at 8 AM EDT the following morning (times converted to UTC).
5. If the story dropped between 6 AM and 1 AM EDT, considered it a potential signal starting within a 4-hour lag of the time the story hit (times converted to UTC).

Abbreviations: OCP, one-class peeling; RMDP, robust minimum diagonal product.