A practical implementation of weighted kernel density estimation for handling shape constraints

The weighted kernel density estimator is an attractive option for shape‐restricted density estimation, because it is simple, familiar, and potentially applicable to many different shape constraints. Despite this, no reliable software implementation has appeared since the method's original proposal in 2002. We found that serious numerical and practical difficulties arise when attempting to implement the method. We overcame these difficulties and in the process discovered that the weighted method and our own recently proposed method—controlling the shape of a kernel density using an adjustment curve—can be unified in a single computational framework. This article describes our findings and introduces the R package scdensity, which can be used to easily obtain density estimates that are unimodal, bimodal, symmetric, and more. © 2018 The Authors. Stat published by John Wiley & Sons Ltd.

The weighted kernel density estimator is

f_s(x | p) = Σ_{i=1}^{r} (p_i / h) K((x − s_i)/h),

where K (the kernel function) is a probability density symmetric around zero, h is a positive scalar bandwidth, and p = [p_1, …, p_r]^T is a vector of probability weights. The elements of s = [s_1, …, s_r]^T are the kernel centres that determine the placement of the kernel functions. The standard (uniform-weighted) KDE is f_x(x | p_unif), where p_unif ≡ (1/n)1, and 1 is a vector of ones.
We can write f_s(x | p) = k_s(x)^T p, where k_s(x) = [k_{s_1}(x), …, k_{s_r}(x)]^T and k_{s_i}(x) = (1/h) K((x − s_i)/h). Let k'_s(x) and k'_{s_i}(x) be defined similarly, but using the first derivative of K. Then f'_s(x | p) = k'_s(x)^T p. Higher derivatives are defined in the same way.
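The estimator f_s(x | p) = k_s(x)^T p can be made concrete with a short sketch. The snippet below is our own illustration in Python with a Gaussian kernel (scdensity itself is an R package; all names here are ours):

```python
import numpy as np

def gaussian_kernel(u):
    """K: a probability density symmetric around zero."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def weighted_kde(x, s, p, h):
    """Evaluate f_s(x | p) = k_s(x)^T p, with k_{s_i}(x) = (1/h) K((x - s_i)/h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    ks = gaussian_kernel((x[:, None] - s[None, :]) / h) / h  # rows are k_s(x)^T
    return ks @ p

# The standard KDE places one kernel on each data point with uniform weights.
data = np.array([-1.0, 0.0, 1.0])
p_unif = np.full(data.size, 1.0 / data.size)
f0 = weighted_kde(0.0, data, p_unif, h=0.5)
```

Because the weights are probabilities, the estimate integrates to one regardless of p.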

The weighted KDE method
The weighted KDE method involves selecting p to minimize the integrated squared error (ISE) between f_x(x | p) and f_x(x | p_unif), subject to shape constraints on f_x(x | p) (Hall & Huang, 2002, considered several alternatives before concluding that ISE was the best choice). The optimal weights are the solution to a quadratic program (QP).
Despite its promise as a modified KDE estimable using quadratic programming, the method has several significant problems:

Definiteness. The matrix of the QP is often ill conditioned, with many eigenvalues near zero and possibly negative. This makes the QP much more difficult to solve, as strict convexity cannot be assumed. We were not able to reliably solve the problem using any of the freely available QP solvers in R. In their work, Hall & Huang (2002) used a sequential quadratic programming (SQP) routine from the NAG optimization library (The Numerical Algorithms Group, 2018). This is a proprietary, industrial-grade library, and SQP is designed to solve non-convex problems.

Feasibility. Feasible solutions are not guaranteed to exist for a given combination of data set and shape constraint. Infeasible problems can easily arise owing to lack of data in a particular location. For example, requiring a unimodal estimate when the data are actually bimodal could lead to such a situation.

Efficiency. As the sample size grows, the method becomes computationally inefficient. The dimension of the QP is n, because there is one weight per data point. Ill conditioning also becomes more likely as n grows and x begins to contain more points that are very close together.

Realism. The constraints may force the optimal estimate to directly contradict the observed data, by assigning a probability mass of zero to an interval that contains a data point. For example, this can happen in the tails of a unimodal estimate: the only way to avoid introducing a second mode may be to give a weight of zero to points in the tail.
The definiteness and feasibility problems are critical and must be addressed to provide a robust implementation. The efficiency and realism problems impact the usability of the method.

The adjusted KDE method
The adjustment curve approach involves modifying f_x(x | p_unif) not by changing its weights but rather by adding a function to it. The estimator is

f(x | a) = f_x(x | p_unif) + φ(x)^T a,

where φ(x) is a vector of m "adjustment densities" evaluated at x. The coefficients a can be negative but are constrained to sum to zero to ensure the resulting estimate is a density. The estimator itself is also constrained to be non-negative. The optimal coefficients are again the solution to a QP.
The structure of φ(x) is specified independently of the data. For example, m ≠ n adjustment densities can be placed on a grid covering the support of the estimate. This strategy avoids the feasibility and efficiency problems experienced by the weighted KDE, by decoupling the adjustment curve from n. It also allows the shape of f_x to change even in regions where data are scarce, which resolves the realism problem. The trade-off for these improvements is some additional complexity, because the structure of φ(x) needs to be specified.
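The construction can be sketched in a few lines. This is our own minimal Python illustration (not the package's code): the adjustment densities are Gaussians on a fixed grid, and the coefficients a are assumed given rather than optimized.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def adjusted_kde(x, data, h, adj_centres, adj_sd, a):
    """f(x | a) = f_x(x | p_unif) + phi(x)^T a.  Because sum(a) = 0 and each
    adjustment density integrates to 1, the adjustment adds no net mass."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    base = normal_pdf(x[:, None], data[None, :], h).mean(axis=1)  # uniform KDE
    phi = normal_pdf(x[:, None], adj_centres[None, :], adj_sd)    # m densities
    return base + phi @ a

data = np.array([0.0, 0.2, 1.1])
centres = np.linspace(-1.0, 2.0, 5)              # m = 5, decoupled from n = 3
a = np.array([0.01, -0.02, 0.02, -0.02, 0.01])   # coefficients sum to zero
grid = np.linspace(-4.0, 5.0, 3001)
f = adjusted_kde(grid, data, 0.4, centres, 0.4, a)
```

In practice a is chosen by the QP, with the added non-negativity restriction on the estimate itself.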
When the adjustment densities are made equal to the kernel functions (φ(x) = k_x(x)), the adjustment curve approach becomes equivalent to the weighted KDE method. The adaptive binning procedure, discussed later, automatically sets up adjustment densities that can exploit this connection, thereby eliminating the complexity trade-off. In the scdensity package, the adjustment curve is the default estimation method.

Illustration: chondrite data
The overall approach is illustrated in Figure 1, using the chondrite data. The data are measurements of the silica percentage in 22 meteorites. Minnotte (1997) and Hall & Huang (2002) applied Gaussian kernel density estimation to these data, with a bandwidth of 0.7, producing an estimate with three modes (shown in Figure 1(a)). Hall & Ooi (2004) argued that the leftmost mode is spurious, and the true distribution should have only two modes. So we take bimodality as our constraint.
Instead of attempting to enforce constraints on f_x directly, we begin by finding s and w̃ such that f_s(x | w̃) closely approximates f_x(x | p_unif). This is performed by first letting s be a sparse grid of kernel centres and then iteratively subdividing the grid in an adaptive way, until the best approximation to f_x(x | p_unif) is sufficiently close. At each stage of grid refinement, the best approximation is found by solving a QP without shape constraints. We refer to this as the binning step, because f_s(x | w̃) is essentially a binned KDE with non-uniform bins.
The binned approximation and its kernel centres are shown in Figure 1(b). There are 65 centres, 43 of which received a weight of zero. The 22 non-zero-weighted centres are the ones closest to the data points. In this case, the total number of centres exceeds the sample size. This does not occur for larger n, where binning effects a significant data reduction (see Section 4).
Let w̃₊ be the non-zero weights and s₊ the corresponding kernel centres. As f_s(x | w̃) = f_{s₊}(x | w̃₊) ≈ f_x(x | p_unif), we can use either (s, w̃) or (s₊, w̃₊) as a surrogate for (x, p_unif) when we enforce shape constraints. If we use (s₊, w̃₊), we obtain the weighted KDE, as shown in Figure 1(c). The method of Hall and Huang can be used directly, choosing new weights to minimize the ISE between f_{s₊}(x | p) and f_{s₊}(x | w̃₊).
If we use (s, w̃) and set our adjustment densities to be k_s(x), we obtain the special case of the adjusted KDE method where the kernel functions and the adjustment densities coincide. Therefore, we can again treat the estimation problem as a weighted KDE. The result is shown in Figure 1(d). The adaptive binning procedure ensures that k_s(x) is a suitable arrangement of adjustment densities.
In both cases, the optimal weights are found by quadratic programming. The QP to be solved is the same as that of the binning step, with the addition of shape constraints. This is the estimation step.
In the estimation step, the optimization problem is only a QP if certain constraint-dependent values, such as the locations of modes or inflection points, are prespecified. We use the term important points to refer to such fixed values. In the usual situation where the important points are unknown, the QP solver must be run inside an outer optimization loop to search for them. For the chondrite data, locations of the two modes and their intervening antimode were found in this way. Further comments are given in the next section.
The chondrite example also illustrates qualitative differences between the weighted and adjusted KDE approaches to constraint handling. The weighted KDE can only ensure bimodality by assigning near-zero probability mass to the four leftmost data points. The remainder of the estimate is shifted up to compensate. The adjusted KDE turns the left mode into a shoulder and retains fidelity to the original estimate in regions distant from the constraint violation. It can do this because of the extra kernel centres that were introduced in data-poor regions.

Our implementation
Most of the shape constraints we implement assume continuity of the density's first, second, or third derivative. Of the commonly used kernel functions, only the Gaussian has well-defined derivatives up to the third order over the whole real line. In particular, none of the usual compactly supported kernels have three derivatives smoothly tending to zero at the boundaries of their support. Consequently, we employ the Gaussian kernel exclusively in our implementation.
The next section lays out the general QP that is used in both the binning and estimation steps. The following sections provide further remarks. Detailed pseudocode for the estimation procedure is given in Algorithm 1.

A QP for ISE minimization
We now develop a QP for choosing the weights to minimize the ISE between a shape-constrained KDE and any other weighted KDE. Suppose we have two vectors of kernel centres, y ∈ R^q and s ∈ R^r, and corresponding weighted KDEs, f_y(x | v) and f_s(x | w). It can be shown that the ISE between the two curves is

ISE = v^T L_yy v − 2 v^T L_ys w + w^T L_ss w,    (3)

where L_ys is a q × r matrix with (i, j)th element

(L_ys)_{ij} = ∫ k_{y_i}(x) k_{s_j}(x) dx = (1/h) (K∗K)((y_i − s_j)/h).    (4)

Matrices L_yy and L_ss are defined in the same way and are q × q and r × r, respectively. The rightmost equality in (4) follows from the symmetry of K and allows the ISE to be computed using (K∗K)(·), the convolution of the kernel with itself. The convolution is symmetric around zero, so L_yy and L_ss are symmetric, and L_sy = L_ys^T. If (K∗K)(t) is known analytically, the ISE can be computed without any need for numerical integration. Note that the ISE is quadratic in v and w.
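For the Gaussian kernel, (K∗K)(t) = exp(−t²/4)/(2√π), so the L matrices and the ISE are available in closed form. A short Python sketch of this computation (our own naming, not the package's):

```python
import numpy as np

def conv_matrix(y, s, h):
    """L_ys: (i, j)th element (1/h)(K*K)((y_i - s_j)/h), Gaussian kernel."""
    t = (y[:, None] - s[None, :]) / h
    return np.exp(-0.25 * t**2) / (2.0 * h * np.sqrt(np.pi))

def ise(v, w, y, s, h):
    """v^T L_yy v - 2 v^T L_ys w + w^T L_ss w, with no numerical integration."""
    return (v @ conv_matrix(y, y, h) @ v
            - 2.0 * v @ conv_matrix(y, s, h) @ w
            + w @ conv_matrix(s, s, h) @ w)

y, v = np.array([0.0, 1.0]), np.array([0.5, 0.5])
s, w = np.array([0.5]), np.array([1.0])
err = ise(v, w, y, s, 0.7)
```

Setting the two KDEs equal gives an ISE of exactly zero, and the closed form agrees with brute-force quadrature.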
Shape constraints can be enforced on f_y(x | v) by placing inequality restrictions on the estimate or its derivatives at a set of constraint-checking points. Let g = [g_1, g_2, …, g_N]^T be a grid of such points, where l = g_1 < g_2 < ⋯ < g_N = u, and [l, u] is chosen to cover the region where constraints are to be satisfied. With sufficiently many grid points spread over the interval, constraint violations can be made negligibly small.
Shape constraints expressed this way become linear restrictions on the weights. For example, suppose we wish our KDE to be unimodal, with mode at θ, and let g_b be the largest grid point not exceeding θ. To satisfy this constraint, restrict the derivative of the estimate to be non-negative at the checking points up to g_b and non-positive thereafter:

k'_y(g_i)^T v ≥ 0,  i = 1, …, b,
k'_y(g_i)^T v ≤ 0,  i = b + 1, …, N,    (5)

which can be collected into a single matrix inequality Av ≤ 0, with the rows of A equal to −k'_y(g_i)^T for i ≤ b and k'_y(g_i)^T for i > b. The constraints are satisfied by any v that produces a mode between g_b and g_{b+1}. While this is sufficient for most purposes, the mode can be made to occur exactly at θ by adding an additional equality constraint, k'_y(θ)^T v = 0. Other derivative-based shape constraints can be expressed using systems of inequalities similar to (5). In addition to unimodality with mode at θ, the scdensity package allows the following constraints: bimodality with stationary points at (θ_1, θ_2, θ_3); tail monotonicity to the left of θ_1 and/or to the right of θ_2; two inflection points of f, at (θ_1, θ_2); or three inflection points of f′, at (θ_1, θ_2, θ_3). Any of these may be combined with constraints that the support of the estimate be bounded to the left or to the right, or that the estimate be symmetric around a point μ.
The values θ_j or μ are the important points, which are assumed known when constructing the QP. If they are actually unknown, the QP is wrapped inside a search routine, as described in Section 3.3. The symmetry constraint requires special handling, which is reviewed in Section 5. To use multiple shape restrictions in combination, the individual constraint matrices are constructed and then stacked to form a final system of inequalities.
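To illustrate how such a constraint matrix can be assembled, the following Python sketch builds A for the unimodality constraint Av ≤ 0, using the Gaussian kernel's derivative for k'. The function names are ours, and the code is a sketch rather than the package's implementation:

```python
import numpy as np

def kprime_matrix(g, y, h):
    """Rows k'_y(g_i)^T: derivative of (1/h)K((x - y_j)/h) at each g_i."""
    u = (g[:, None] - y[None, :]) / h
    return -u * np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h**2)

def unimodal_A(g, y, h, theta):
    """A v <= 0 forces f_y' >= 0 at grid points <= theta and <= 0 afterwards."""
    D = kprime_matrix(g, y, h)
    left = g <= theta
    return np.vstack([-D[left], D[~left]])

# A single standard Gaussian bump is unimodal with mode 0, so it must
# satisfy the constraint system with theta = 0.
g = np.linspace(-3.0, 3.0, 25)
A = unimodal_A(g, np.array([0.0]), 1.0, 0.0)
ok = bool((A @ np.array([1.0]) <= 1e-12).all())
```

A well-separated two-component mixture, by contrast, violates at least one row of the system.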
With all of the aforementioned data put together, the optimal weight vector for minimizing the ISE between f_y and f_s(x | w), subject to constraints, is the solution in v to the QP

minimize_v   v^T L_yy v − 2 v^T L_ys w
subject to   A v ≤ 0    (shape constraints)
             1^T v = 1  (sum to one)
             −I v ≤ 0   (non-negativity),    (6)

where the final two constraints are included to ensure that the KDE is a density.

The binning and estimation steps
During binning, a new set of kernel centres, s, is chosen in a data-adaptive manner. The centres must cover the entire range of the data (to accommodate the adjustment curve method), while not being too close together (for efficiency and numerical stability). At the same time, there must exist weights Q w such that f s .x j Q w/ closely approximates f x .x j p unif /.
The centres are established using a sequential partitioning approach. Initially, s is set to have M elements, covering the range [l, u] with regular spacing. The optimal weights w̃ are those that minimize the ISE between f_x(x | p_unif) and f_s(x | w), that is, the solution to

minimize_w   w^T (n L_ss) w − 2 w^T L_sx 1
subject to   1^T w = 1   (sum to one)
             −I w ≤ 0    (non-negativity).    (7)

This QP is a special case of (6), with no shape constraints.
The centres s divide the range [l, u] into M − 1 intervals. Each interval is marked for subdivision if the function values of f_s and f_x differ by more than a tolerance ε at the midpoint of the interval, and the interval is wider than some minimum value δ. In Algorithm 1, the test function τ_i(s, w̃) indicates whether the ith interval becomes marked. An extra kernel centre is inserted at the midpoint of every marked interval, increasing the length of s. Problem (7) is solved again, and the subdivision process continues until no intervals satisfy the test function. The lower bound δ on interval width guarantees that the process will terminate.
The value of M is chosen so that the initial grid spacing is as large as possible without being larger than h, and we set δ = h/10. These choices ensure that no data point is more than h/2 away from a kernel centre, and that no two kernel centres are closer than h/10. The tolerance on function agreement, ε, serves mainly to discourage unnecessary bisection of intervals. If it is set too large, the initial grid with spacing approximately h/2 will be used. If it is set too small, the densest grid with spacing approximately h/10 will be used. By default, we set ε to 10^{-3} times the maximum height of the unconstrained estimate.
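The subdivision loop can be sketched as follows. This Python sketch is a simplified stand-in for the binning step of Algorithm 1: it fits the weights by non-negative least squares at the centres instead of solving the ISE QP (7), and all names are ours:

```python
import numpy as np
from scipy.optimize import nnls

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, centres, wts, h):
    return gauss((x[:, None] - centres[None, :]) / h) / h @ wts

def bin_centres(data, h, eps, delta):
    """Refine a regular grid until f_s matches the uniform KDE at midpoints."""
    lo, hi = data.min() - 3.0 * h, data.max() + 3.0 * h
    M = int(np.ceil((hi - lo) / h)) + 1            # initial spacing <= h
    s = np.linspace(lo, hi, M)
    p_unif = np.full(data.size, 1.0 / data.size)
    f_target = lambda x: kde(x, data, p_unif, h)
    while True:
        design = gauss((s[:, None] - s[None, :]) / h) / h
        w, _ = nnls(design, f_target(s))           # stand-in for QP (7)
        mids = 0.5 * (s[:-1] + s[1:])
        mark = (np.abs(kde(mids, s, w, h) - f_target(mids)) > eps) \
               & (np.diff(s) > delta)              # test function tau_i
        if not mark.any():
            return s, w
        s = np.sort(np.concatenate([s, mids[mark]]))

s, w = bin_centres(np.array([-1.0, 0.0, 1.0]), 0.5, eps=1e-3, delta=0.05)
```

Because each round at most halves the marked interval widths, the δ floor guarantees termination.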
As seen in the chondrite example, it is typical that many elements of w̃ are zero after binning completes. They correspond to kernel centres that are distant from any point in x. Continuing to denote the non-zero weights and their centres by w̃₊ and s₊, we can keep a unified notation in the estimation step by defining (y, ṽ) = (s, w̃) for the adjusted KDE method, or (y, ṽ) = (s₊, w̃₊) for the weighted KDE method. In either case, f_y(x | ṽ) can act as a surrogate for f_x(x | p_unif).
The final, shape-restricted estimate is f_y(x | v̂), where v̂ is the solution to another QP. It minimizes the distance between f_y(x | v) and f_y(x | ṽ):

minimize_v   v^T L_yy v − 2 v^T L_yy ṽ
subject to   A v ≤ 0      (shape constraints)
             A_eq v = 0   (equality constraints)
             1^T v = 1    (sum to one)
             −I v ≤ 0     (non-negativity),    (9)

where A is the appropriate constraint matrix as described previously and A_eq is a matrix with one or more rows, used to fix the value of the estimate's derivatives at the important points. This program is another special case of (6), where the two KDEs' centres coincide, and there are active shape constraints.
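A QP of this form can be sketched in Python using SciPy's SLSQP in place of a dedicated QP solver (the substitution and the names are ours; for brevity the equality rows A_eq are omitted):

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp(L, v_tilde, A):
    """minimize v^T L v - 2 v^T L v_tilde
       s.t. A v <= 0, 1^T v = 1, v >= 0 (A_eq rows omitted for brevity)."""
    n = v_tilde.size
    cons = [{"type": "eq", "fun": lambda v: v.sum() - 1.0},
            {"type": "ineq", "fun": lambda v: -(A @ v)}]   # "ineq" means >= 0
    res = minimize(lambda v: v @ L @ v - 2.0 * v @ L @ v_tilde, v_tilde,
                   jac=lambda v: 2.0 * (L @ (v - v_tilde)),
                   bounds=[(0.0, None)] * n, constraints=cons, method="SLSQP")
    return res.x

# With L = I the QP projects v_tilde onto the constraint set; forcing
# v[0] <= v[1] turns [0.7, 0.3] into [0.5, 0.5].
v_hat = solve_qp(np.eye(2), np.array([0.7, 0.3]), np.array([[1.0, -1.0]]))
```

With L positive definite the problem is strictly convex, so any such solver reaches the same unique v̂.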

Additional details
It was previously observed that the matrix in the weighted KDE QP (L_yy in Equation (6)) can be ill conditioned, leaving the convexity of the optimization problem in doubt. When K is the Gaussian kernel, the (i, j)th element of L_yy is

(L_yy)_{ij} = (1/(2h√π)) exp(−(y_i − y_j)²/(4h²)).

From this, we see that the diagonal elements of L_yy are 1/(2h√π), and the elements of each row decrease toward zero away from the diagonal.
If y_i and y_j are nearly equal, then the entire ith and jth rows/columns of L_yy will be nearly equal. Thus, the ill conditioning can be likened to a (multi)collinearity problem in regression. In the extreme case when y_i = y_j, the two rows/columns will be exactly equal, and weights v_i and v_j will influence the ISE only through their sum. In this case, the optimal weights are clearly not unique.
Fortunately, it can be shown that matrices like L_yy are positive definite whenever K has finite variance and y contains no duplicate values (see Section 5 for a sketch of the proof). The ISE (3) is, then, a strictly convex function of either set of weights. This justifies a simple remedy, which Kung (2014) calls the "spectral shift approach": if the matrix is not numerically positive definite, add a small constant to its diagonal elements. In our implementation, we let λ_min be the smallest eigenvalue of L_yy and define the operation

SpectralShift(L_yy) = L_yy + (|λ_min| + 10^{-10}) I   if λ_min < 10^{-10},
SpectralShift(L_yy) = L_yy                            otherwise.
We have found this remedy to resolve the definiteness problem with no negative side effects. In Algorithm 1, it is used before solving the QP in both the binning and estimation stages.
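In Python with NumPy (our naming), the operation amounts to:

```python
import numpy as np

def spectral_shift(L, tol=1e-10):
    """Ridge the diagonal when L is not numerically positive definite."""
    lam_min = np.linalg.eigvalsh(L).min()
    if lam_min < tol:
        return L + (abs(lam_min) + tol) * np.eye(L.shape[0])
    return L

# Duplicate centres make L exactly singular; the shift restores definiteness.
L_bad = np.array([[1.0, 1.0], [1.0, 1.0]])
L_fixed = spectral_shift(L_bad)
```

The shift perturbs only the diagonal, so the quadratic objective changes by at most (|λ_min| + 10⁻¹⁰)‖v‖², which is negligible at this magnitude.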
When it is necessary to search for the best locations of the important points, we use the same approach we used with the adjustment curve method (Wolters & Braun, 2018, sec. 2.2). If there is only one important point, golden-section search is used to find an optimum. When there are two or more points, the search is performed iteratively, one component at a time. This method has been found to provide good solutions and fast run times.
Whether or not the important points are known, the binning step only needs to be performed once (symmetry is an exception to this; see Section 5). After binning, the number of kernel centres is usually small, making it feasible to repeatedly solve QP (9) in searching for the optimal locations of the important points.
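The search loop can be sketched generically as follows (our own Python sketch, assuming the objective is well behaved on each coordinate interval; in the package the inner objective is the optimal value of QP (9)):

```python
import math

def golden_section(f, a, b, tol=1e-6):
    """Minimize a unimodal f on [a, b] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

def coordinate_search(f, bounds, sweeps=5):
    """One-at-a-time search over several important points."""
    x = [0.5 * (lo + hi) for lo, hi in bounds]
    for _ in range(sweeps):
        for i, (lo, hi) in enumerate(bounds):
            x[i] = golden_section(lambda t: f(x[:i] + [t] + x[i + 1:]), lo, hi)
    return x

mode = golden_section(lambda t: (t - 2.0) ** 2, 0.0, 5.0)
```

Each golden-section step shrinks the interval by a constant factor, so the cost per important point is logarithmic in the desired precision.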
The optimal weights must satisfy a small number of equality constraints and a moderate number of inequality constraints. It is not possible to ensure that constraints will be numerically feasible for all possible choices of the important points. To obtain a robust implementation, it is necessary to detect and handle infeasible constraint specifications when they arise.
The problem of finding a feasible point of a given set of linear constraints is not trivial, but it can be expressed as a linear program (LP) and is routinely tackled as the first solution stage ("phase 1 LP") in LP solvers (Antoniou & Lu, 2007). We make use of this by putting the constraints into an LP before running the QP. If the phase 1 LP does not return a solution, the QP in the estimation step is skipped and an error code is returned. This code can be handled by outer routines to give a robust user experience. During a search for the important points, detection of infeasibility allows the search to continue undisturbed.
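In Python with SciPy (a sketch with our own naming; it stands in for the LP facilities our implementation uses), the feasibility check looks like:

```python
import numpy as np
from scipy.optimize import linprog

def feasible(A_ub, b_ub, A_eq=None, b_eq=None):
    """Phase-1 style check: solve an LP with a zero objective and report
    whether the solver found any point satisfying the constraints."""
    n = A_ub.shape[1]
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * n)
    return res.status == 0   # status 2 signals an infeasible system

# Weights summing to one with v >= 0: feasible.
ok = feasible(-np.eye(2), np.zeros(2), np.array([[1.0, 1.0]]), np.array([1.0]))
# x <= -1 and x >= 1 simultaneously: infeasible.
bad = feasible(np.array([[1.0], [-1.0]]), np.array([-1.0, -1.0]))
```

When the check fails, the QP is skipped and the error code propagates to the search loop, exactly as described above.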
It was mentioned previously that estimates can be constrained to have support that is bounded on the left or right. It is obviously not possible to achieve this exactly with the Gaussian kernel. Instead, our bounded-support constraint requires that f_y be less than some small number, such as 0.001 times the maximum height of the unconstrained estimate, beyond the specified boundary.

Demonstrations
This section demonstrates the capabilities of the scdensity package on two more real data examples ( Figure 2) and summarizes the effect of sample size on the number of kernel centres and the run time ( Figure 3).

Additional examples
Both of the following examples make use of the twoInflections+ shape constraint (which requires the first derivative of the density to have exactly three inflection points). This constraint produces density estimates with two inflection points and a high degree of qualitative smoothness.
The first data set is the 756 daily log-return values of the S&P 500 stock index, previously used for unimodal density estimation by Turnbull & Ghosh (2014). It has been argued (Fergusson & Platen, 2006) that such data are well modelled by a density symmetric around zero. Figure 2(a) shows a histogram of the data set, along with an unconstrained KDE and the corresponding KDE constrained to satisfy twoInflections+ while also being symmetric around zero. The constrained estimate was obtained with a single function call: scdensity(returns, bw="SJ", constraint = c("twoInflections+","symmetric"), opts=list(pointOfSymmetry=0, ncheck=200)) This example shows how constraints can be combined and how an opts list is used to provide optional extra information (in this case, the location of the point of symmetry, and ncheck, the number of constraint-checking points). The shape-constrained estimate eliminates spurious waves in the tails of the KDE and gives a smooth and symmetric estimate without making other assumptions, for example, about the shape of the tails.
The second data set consists of 289 axon diameter measurements from a single electron microscopy image in the collection analysed by Sepehrband et al. (2016). In the original analysis, the authors compared the fit of 16 different distribution families, searching for the most suitable parametric form. Clearly, the authors believed the true distribution to be qualitatively similar to common parametric forms but did not feel there was a priori justification to prefer one form over another. Situations like this are perfectly suited to shape-restricted estimation, particularly the twoInflections+ constraint, which can guarantee "parametric-like" smoothness while allowing the density's shape to be data driven.
The scdensity estimate for the axon data set was obtained with the code scdensity(diameters, bw="SJ", constraint="twoInflections+") and the results are shown in Figure 2(b). The constrained estimate does indeed have a shape suggestive of parametric forms, while still agreeing closely with the histogram.
Also shown on the plot are two other shape-restricted estimates obtained from other R packages that one might consider in this application: log-concave estimates (unsmoothed and smoothed) from the package logcondens (Dümbgen & Rufibach, 2011), and a unimodal and continuously differentiable estimate from the package episplineDensity (Buttrey et al., 2014). The estimation functions in these packages were used with their default settings. Of these alternative estimates, the unsmoothed log-concave estimate is closest to the scdensity estimate, but it has a nondifferentiable peak and has difficulty fitting both the peak and the long right tail. The smoothed log-concave and epi-spline estimates are both inadequate, suggesting that manual adjustment of their default settings would be required to obtain a reasonable estimate.

Figure 3 shows the results of a small simulation designed to study the performance of the method as the sample size increases. At each of ten sample sizes ranging from 10 to 10^5, 250 samples were drawn from both a standard normal distribution and a t distribution with five degrees of freedom. For each sample, scdensity was used to obtain a unimodal estimate using the adjustment curve approach.

Figure 3(a) shows the median number of kernel centres after the binning step. For normally distributed data, even sample sizes of 10^5 require only about 100 centres. Only a small proportion of the centres are zero weighted. For the t-distributed data, the number of centres required is much higher, and a much higher fraction of the centres are zero weighted. This can be explained by the heavy tails of the t_5 distribution, which cause samples to cover a wider range, with larger gaps between data points.
The number of centres determines the dimension of the QP and is therefore one factor determining the run time. As shown in Figure 3(b), run times are very reasonable: less than 1 second for both distributions up to sample sizes of about 10^4. At the largest sample sizes, run times increase faster for the t_5 data than for the normal data. This is due partly to the larger number of centres, but also to the larger search range for determining the mode location.

We now provide a sketch of the proof that the matrix L_yy is positive definite and then give more details about how the symmetry constraint is implemented.

Positive definiteness of the convolution matrix
The Gram matrix of a set of vectors r_1, …, r_t in an inner product space is the t × t symmetric matrix G with G_ij = ⟨r_i, r_j⟩, where ⟨·,·⟩ denotes the inner product. Such a matrix is positive semidefinite, and it is positive definite if and only if the vectors that generate the matrix are linearly independent (Horn & Johnson, 2013, thm. 7.2.10).
The set of continuous, real-valued functions with the usual addition and scalar multiplication operations is a real vector space. The set of kernel functions {k_ν(x)}_{ν ∈ R} lies in this vector space, and the integral ∫_{−∞}^{∞} k_u(x) k_v(x) dx, used to compute the elements of L_yy, is a valid inner product on the space. Hence, our matrices of the form L_yy are Gram matrices and are at least positive semidefinite.
Positive definiteness follows if the kernel functions k_{y_1}(x), …, k_{y_q}(x) are linearly independent, that is, if ∑_{i=1}^{q} a_i k_{y_i}(x) = 0 for all x implies a = 0. This is trivially true when q = 1, and we may use mathematical induction to prove that it must be true for all q. Consider j + 1 kernel functions, of which the first j are linearly independent, and for which ∑_{i=1}^{j+1} a_i k_{y_i}(x) = 0. If a_{j+1} = 0, then by the independence of k_{y_1}(x), …, k_{y_j}(x), a = 0 is the only solution. If a_{j+1} ≠ 0, then k_{y_{j+1}}(x) = ∑_{i=1}^{j} b_i k_{y_i}(x), where b_i = −a_i / a_{j+1}. For a compactly supported K, this produces an obvious contradiction as long as y_1, …, y_{j+1} are distinct. When the support is R, we may add the requirement that K has finite variance; then a contradiction is found by comparing the means and variances of k_{y_{j+1}}(x) and ∑_{i=1}^{j} b_i k_{y_i}(x).

Handling symmetry
Fundamentally, symmetry around a point μ is an equality constraint on the density at pairs of points equidistant from μ. It could be implemented as equality constraints on pairs of function values, or on pairs of weights. If the kernel centres are not positioned to be symmetric around μ, this approach will lead to infeasible optimization problems. Even if the kernel centres are appropriately laid out, adding a large number of equality constraints needlessly increases the size and difficulty of the QP problem.
A much better approach is to make the KDE symmetric around μ by construction, so that the symmetry constraints do not need to be enforced at all in the QP. To do this, let y = [y_L^T y_R^T]^T, where y_L is a decreasing vector of d kernel centres moving away from μ to the left and y_R is an increasing vector of d centres moving away from μ to the right. Their corresponding elements are equidistant from the point of symmetry: (y_L)_i + (y_R)_i = 2μ.
The symmetric centres y can be generated by performing the binning step exactly as described in Section 3, with the initial grid laid out so that it is symmetric around μ. During binning, any interval subdivided on one side of μ is subdivided on the other side as well. The binned estimator f_y(x | ṽ) uses the symmetric bins but has 2d unique weights.
During estimation, symmetry is assured by letting v = [η^T η^T]^T, where η is a vector of d non-negative weights that sum to 1/2. This guarantees equality of the weights on pairs of points in y_L and y_R. The objective function for optimizing the ISE between f_y(x | v) and f_y(x | ṽ) can be expressed in terms of η as

η^T (L_{y_R y_R} + L_{y_L y_R}) η − η^T (L_{y_L y} + L_{y_R y}) ṽ.
The shape constraints can similarly be expressed in terms of η, and the resulting QP solved to obtain η̂.
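The construction can be sketched as follows in Python (our own naming): mirroring the centres and reusing one weight per mirrored pair makes every estimate in the feasible set symmetric, with no symmetry rows in the constraint system.

```python
import numpy as np

def symmetric_kde(x, mu, yR, eta, h):
    """Centres yR (right of mu) are mirrored to yL = 2*mu - yR; each mirrored
    pair shares one weight from eta, with eta >= 0 and sum(eta) = 1/2."""
    yL = 2.0 * mu - yR
    centres = np.concatenate([yL, yR])
    v = np.concatenate([eta, eta])        # v = [eta^T eta^T]^T
    u = (np.atleast_1d(x)[:, None] - centres[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi) / h) @ v

mu, h = 1.0, 0.6
yR = np.array([1.3, 2.0, 2.8])
eta = np.array([0.25, 0.15, 0.10])        # non-negative, sums to 1/2
t = np.linspace(0.0, 4.0, 50)
left = symmetric_kde(mu - t, mu, yR, eta, h)
right = symmetric_kde(mu + t, mu, yR, eta, h)
```

Because the d weights appear twice and sum to 1/2, the full weight vector still sums to one, so the estimate remains a density.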