## 1. Introduction

Modern non-parametric density estimation began with the introduction of a kernel density estimator in the pioneering work of Fix and Hodges (1951), which was later republished as Fix and Hodges (1989). For independent and identically distributed real-valued observations, the appealing asymptotic theory of the mean integrated squared error (MISE) was provided by Rosenblatt (1956) and Parzen (1962). This theory leads to an asymptotically optimal choice of the smoothing parameter, or bandwidth. Unfortunately, however, it depends on the unknown density *f* through the integral of the square of the second derivative of *f*. Considerable effort has therefore been focused on finding methods of automatic bandwidth selection (see Wand and Jones (1995), chapter 3, and the references therein). Although this has resulted in algorithms, e.g. Chiu (1992), that achieve the optimal rate of convergence of the relative error, namely *O*_{p}(*n*^{−1/2}), where *n* is the sample size, good finite sample performance is by no means guaranteed.
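
As a concrete illustration of the bandwidth problem discussed above, here is a minimal univariate sketch. It assumes a Gaussian kernel and Silverman's normal-reference rule of thumb, which is only one of the many selectors surveyed in Wand and Jones (1995); nothing here is specific to the present paper's method.

```python
import numpy as np

def gaussian_kde(data, grid, h):
    """Evaluate a univariate Gaussian kernel density estimate on `grid`
    with bandwidth h: each observation contributes a N(X_i, h^2) bump."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.standard_normal(500)

# Silverman's normal-reference bandwidth: h = 1.06 * sigma * n^(-1/5).
# It is derived from the unknown density's curvature, so it is only
# optimal when the data really are (close to) normal.
h = 1.06 * x.std(ddof=1) * len(x) ** (-0.2)

grid = np.linspace(-4, 4, 801)
fhat = gaussian_kde(x, grid, h)

# Sanity check: the estimate is a density, so it integrates to about 1.
dx = grid[1] - grid[0]
print(f"bandwidth h = {h:.3f}, total mass = {fhat.sum() * dx:.3f}")
```

The point of the sketch is that `h` must be supplied by the user; the estimator itself gives no guidance, which is what motivates the tuning-free approach of this paper.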

This problem is compounded when the observations take values in ℝ^{d}, where the general kernel estimator (Deheuvels, 1977) requires the specification of a symmetric, positive definite *d*×*d* bandwidth matrix. The difficulties that are involved in making the *d*(*d*+1)/2 choices for its entries mean that attention is often restricted either to bandwidth matrices that are diagonal, or even to those that are scalar multiples of the identity matrix. Despite recent progress (e.g. Duong and Hazelton (2003, 2005), Zhang *et al.* (2006), Chacón *et al.* (2010) and Chacón (2009)), significant practical challenges remain.
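
To make the bandwidth-matrix burden concrete, the following sketch counts the free entries of a symmetric *d*×*d* bandwidth matrix and evaluates the general multivariate kernel estimator at a single point. It assumes a Gaussian kernel, and the matrix `H` below is an arbitrary illustrative choice, not a recommended one.

```python
import numpy as np

def mvn_kde(data, x, H):
    """General d-dimensional Gaussian kernel density estimate at the point x,
    with symmetric positive definite bandwidth matrix H."""
    d = data.shape[1]
    Hinv = np.linalg.inv(H)
    diff = x - data                              # (n, d) differences x - X_i
    quad = np.einsum('ni,ij,nj->n', diff, Hinv, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(H))
    return np.exp(-0.5 * quad).mean() / norm

rng = np.random.default_rng(1)
d = 2
X = rng.standard_normal((1000, d))               # bivariate standard normal data

# A symmetric d x d bandwidth matrix has d*(d+1)/2 free entries to choose.
print(d * (d + 1) // 2)                          # 3 choices for d = 2

H = np.array([[0.09, 0.03],
              [0.03, 0.16]])                     # one such (illustrative) choice
print(round(mvn_kde(X, np.zeros(d), H), 3))
```

Even for *d* = 2 the user must pick three interdependent numbers; for *d* = 10 it is already 55, which is why diagonal or scalar simplifications are so common.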

Extensions that adapt to local smoothness began with Breiman *et al.* (1977) and Abramson (1982). A review of several adaptive kernel methods for univariate data may be found in Sain and Scott (1996). Multivariate adaptive techniques are presented in Sain (2002), Scott and Sain (2004) and Duong (2004). There are many other smoothing methods for density estimation, e.g. methods based on wavelets (Donoho *et al.*, 1996), splines (Eubank, 1988; Wahba, 1990), penalized likelihood (Eggermont and LaRiccia, 2001) and support vector methods (Vapnik and Mukherjee, 2000). For a review, see Ćwik and Koronacki (1997). However, all suffer from the drawback that some smoothing parameter must be chosen, the optimal value of which depends on the unknown density, so achieving an appropriate level of smoothing is difficult.

In this paper, we propose a fully automatic non-parametric estimator of *f*, with no tuning parameters to be chosen, under the condition that *f* is log-concave—i.e. log (*f*) is a concave function. The class of log-concave densities has many attractive properties and has been well studied, particularly in the economics, sampling and reliability theory literature. See Section 2 for further discussion of examples, applications and properties of log-concave densities.

In Section 3, we show that, if *X*_{1},…,*X*_{n} are independent and identically distributed random vectors, then with probability 1 there is a unique log-concave density *f̂*_{n} that maximizes the likelihood function

*L*(*f*) = ∏_{i=1}^{n} *f*(*X*_{i}).

Before continuing, it is worth noting that, without any shape constraints on the densities under consideration, the likelihood function is unbounded. To see this, we could define a sequence (*f*_{k}) of densities that represent successively closer approximations to a mixture of *n* ‘spikes’ (one on each *X*_{i}), such as

*f*_{k}(*x*) = (1/*n*) ∑_{i=1}^{n} *φ*_{d,k^{−1}I}(*x*−*X*_{i}),

where *φ*_{d,Σ} denotes the *N*_{d}(0,Σ) density. This sequence satisfies *L*(*f*_{k})→∞ as *k*→∞. In fact, a modification of this argument may be used to show that the likelihood function remains unbounded even if we restrict attention to unimodal densities.
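
The unboundedness is easy to observe numerically. The sketch below is an illustration under simplifying assumptions (dimension *d* = 1, spike variance 1/*k*): it evaluates the log-likelihood of the spike mixture at the data points and shows it growing as the spikes narrow.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(20)            # n = 20 univariate observations (d = 1)

def log_likelihood_spikes(X, k):
    """Log-likelihood of the spike mixture f_k: an equal-weight mixture of
    N(X_i, 1/k) densities, evaluated at the data points themselves."""
    var = 1.0 / k
    diff = X[:, None] - X[None, :]     # pairwise differences X_i - X_j
    dens = np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)
    f_at_X = dens.mean(axis=1)         # f_k evaluated at each X_i
    return np.log(f_at_X).sum()

# The log-likelihood grows without bound as the spikes narrow (k -> infinity):
# each X_i sits on its own spike of height of order sqrt(k).
for k in [1, 10, 100, 10000]:
    print(k, round(log_likelihood_spikes(X, k), 1))
```

This is exactly why some restriction, here log-concavity, is needed before maximum likelihood becomes a sensible estimation principle.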

There has been considerable recent interest in shape-restricted non-parametric density estimation, but most of it has been confined to the case of univariate densities, where the computational algorithms are more straightforward. Nevertheless, as was discussed above, it is in multivariate situations that the automatic nature of the maximum likelihood estimator is particularly valuable. Walther (2002), Dümbgen and Rufibach (2009) and Pal *et al.* (2007) have proved the existence and uniqueness of the log-concave maximum likelihood estimator in one dimension and Dümbgen and Rufibach (2009), Pal *et al.* (2007) and Balabdaoui *et al.* (2009) have studied its theoretical properties. Rufibach (2007) compared different algorithms for computing the univariate estimator, including the iterative convex minorant algorithm (Groeneboom and Wellner, 1992; Jongbloed, 1998), and three others. Dümbgen *et al.* (2007) also presented an active set algorithm, which has similarities with the vertex direction and vertex reduction algorithms that were described in Groeneboom *et al.* (2008). Walther (2009) provides a nice recent review of inference and modelling with log-concave densities. Other recent related work includes Seregin and Wellner (2010), Schuhmacher *et al.* (2009), Schuhmacher and Dümbgen (2010) and Koenker and Mizera (2010). For univariate data, it is also well known that there are maximum likelihood estimators of a non-increasing density supported on [0,∞) (Grenander, 1956) and of a convex, decreasing density (Groeneboom *et al.*, 2001).

Fig. 1 gives a diagram illustrating the structure of the maximum likelihood estimator on the logarithmic scale. This structure is most easily visualized for two-dimensional data, where we can imagine associating a ‘tent pole’ with each observation, extending vertically out of the plane. For certain tent pole heights, the graph of the logarithm of the maximum likelihood estimator can be thought of as the roof of a taut tent stretched over the tent poles. The fact that the logarithm of the maximum likelihood estimator is of this ‘tent function’ form constitutes part of the proof of its existence and uniqueness.
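
For one-dimensional data, the tent roof over poles of heights *y*_{i} at locations *X*_{i} is the least concave majorant of the points (*X*_{i}, *y*_{i}), which an upper-convex-hull scan computes directly. The sketch below is purely illustrative: the paper's estimator works in *d* dimensions via triangulations, which this one-dimensional scan does not attempt.

```python
import numpy as np

def tent(xs, ys):
    """Least concave majorant ('tent') of poles at (xs[i], ys[i]) in 1-D:
    returns the indices of the poles that actually touch the tent roof.
    Assumes xs is sorted and strictly increasing."""
    hull = []                      # indices of poles on the upper hull
    for i in range(len(xs)):
        # Pop while the last hull point lies on or below the chord from the
        # second-to-last hull point to pole i (it is slack, not a contact).
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (xs[b] - xs[a]) * (ys[i] - ys[a]) - (ys[b] - ys[a]) * (xs[i] - xs[a])
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.0, 2.0, 1.0, 2.5, 0.0])
print(tent(xs, ys))                # -> [0, 1, 3, 4]: pole 2 is under the roof
```

Pole 2 at (2, 1) lies strictly below the roof segment joining poles 1 and 3, so raising or lowering it slightly would not change the tent — the analogue, in the estimation problem, of an observation whose tent pole does not determine the fitted log-density locally.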

In Sections 3.1 and 3.2, we discuss the computational problem of how to adjust the *n* tent pole heights so that the corresponding tent functions converge to the logarithm of the maximum likelihood estimator. One reason that this computational problem is so challenging in more than one dimension is the fact that it is difficult to describe the set of tent pole heights that correspond to concave functions. The key observation, which is discussed in Section 3.1, is that it is possible to minimize a modified objective function that is convex (though non-differentiable). This allows us to apply the powerful non-differentiable convex optimization methodology of the subgradient method (Shor, 1985) and a variant called Shor's *r*-algorithm, which has been implemented by Kappel and Kuntsevich (2000).
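
Shor's *r*-algorithm itself (with its space-dilation updates) is involved, but the underlying idea can be seen in the plain subgradient method. The sketch below minimizes a simple non-differentiable convex function chosen for illustration; it is not the paper's objective function.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps=500):
    """Minimize a (possibly non-differentiable) convex function by the
    subgradient method with diminishing step sizes t_k = 1/(k+1)."""
    x = np.asarray(x0, dtype=float)
    best = x.copy()
    for k in range(steps):
        x = x - subgrad(x) / (k + 1)
        # Subgradient steps are not monotone, so keep the best iterate seen.
        if f(x) < f(best):
            best = x.copy()
    return best

# Example: f(x) = |x_1 - 1| + 2|x_2 + 3| is convex, non-differentiable
# along its kinks, and minimized at (1, -3).
f = lambda x: abs(x[0] - 1) + 2 * abs(x[1] + 3)
# np.sign(0) = 0 is a valid subgradient of |.| at its kink.
subgrad = lambda x: np.array([np.sign(x[0] - 1), 2 * np.sign(x[1] + 3)])

x_star = subgradient_method(f, subgrad, np.zeros(2))
print(np.round(x_star, 3))   # close to the minimizer (1, -3)
```

The key property exploited in Section 3.1 is the same as here: convexity guarantees that subgradient steps with diminishing step sizes approach the minimizer even though no gradient exists at the solution.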

As an illustration of the estimates obtained, Fig. 2 presents plots of the maximum likelihood estimator, and its logarithm, for 1000 observations from a standard bivariate normal distribution. These plots were created using the LogConcDEAD package (Cule *et al.*, 2007, 2009) in R (R Development Core Team, 2009).

Theoretical properties of the estimator are presented in Section 4. We describe the asymptotic behaviour of the estimator both in the case where the true density is log-concave, and where this model is misspecified. In the former case, we show that the estimator converges in certain strong norms to the true density. The nature of the norm that is chosen gives reassurance about the behaviour of the estimator in the tails of the density. In the misspecified case, the estimator converges to the log-concave density that is closest to the true underlying density (in the sense of minimizing the Kullback–Leibler divergence). This latter result amounts to a desirable robustness property.

In Section 5 we present simulations to compare the finite sample performance of the maximum likelihood estimator with kernel-based methods with respect to the MISE criterion. The results are striking: even when we use the theoretical, optimal bandwidth for the kernel estimator (or an asymptotic approximation to this when it is not available), we find that the maximum likelihood estimator has a rather smaller MISE for moderate or large sample sizes, despite the fact that this optimal bandwidth depends on properties of the density that would be unknown in practice.
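
A small-scale version of such a comparison is easy to set up for the kernel side. The sketch below estimates the MISE of a Gaussian kernel estimator of a standard normal density by averaging the integrated squared error over independent replications; it uses the normal-reference bandwidth and covers only the kernel half of the comparison (computing the log-concave maximum likelihood estimator would require the LogConcDEAD machinery).

```python
import numpy as np

def ise_kde_vs_normal(n, h, seed):
    """Integrated squared error between a Gaussian KDE of n standard normal
    observations (bandwidth h) and the true N(0,1) density, via a fine grid."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    grid = np.linspace(-5, 5, 2001)
    dx = grid[1] - grid[0]
    u = (grid[:, None] - x[None, :]) / h
    fhat = np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    ftrue = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
    return ((fhat - ftrue) ** 2).sum() * dx

# Monte Carlo MISE estimate: average the ISE over independent replications,
# using the normal-reference bandwidth 1.06 * n^(-1/5).
n = 500
h = 1.06 * n ** (-0.2)
mise = np.mean([ise_kde_vs_normal(n, h, seed) for seed in range(20)])
print(f"estimated MISE at n = {n}: {mise:.5f}")
```

Note that this bandwidth is near-optimal precisely because the simulated data are normal; the striking point made in Section 5 is that the maximum likelihood estimator beats the kernel estimator even under these favourable conditions.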

Non-parametric density estimation is a fundamental tool for the visualization of structure in exploratory data analysis. Our proposed method may certainly be used for this purpose; however, it may also be used as an intermediary stage in more involved statistical procedures, for instance as follows.

- (a) In classification problems, we have *p*≥2 populations of interest, and we assume in this discussion that these have densities *f*_{1},…,*f*_{p} on ℝ^{d}. We observe training data of the form {(*X*_{i},*Y*_{i}):*i*=1,…,*n*} where, if *Y*_{i}=*j*, then *X*_{i} has density *f*_{j}. The aim is to classify a new observation *z*∈ℝ^{d} as coming from one of the populations. Problems of this type occur in a huge variety of applications, including medical diagnosis, archaeology and ecology—see Gordon (1981), Hand (1981) or Devroye *et al.* (1996) for further details and examples. A natural approach to classification problems is to construct density estimates *f̂*_{1},…,*f̂*_{p}, where *f̂*_{j} is based on the *n*_{j} observations, say, from the *j*th population, namely {*X*_{i}:*Y*_{i}=*j*}. We may then assign *z* to the *j*th population if *f̂*_{j}(*z*)=max_{k} *f̂*_{k}(*z*). In this context, the use of kernel-based estimators in general requires the choice of *p* separate *d*×*d* bandwidth matrices, and the corresponding procedure based on the log-concave maximum likelihood estimates is again fully automatic.
- (b) Clustering problems are closely related to the classification problems that were described above. The difference is that, in the above notation, we do not observe *Y*_{1},…,*Y*_{n}, and must assign each of *X*_{1},…,*X*_{n} to one of the *p* populations. A common technique is based on fitting a mixture density of the form *f*(*x*)=∑_{j=1}^{p} *π*_{j}*f*_{j}(*x*), where the mixture proportions *π*_{1},…,*π*_{p} are positive and sum to 1. We show in Section 6 that our methodology can be extended to fit a finite mixture of log-concave densities, which need not itself be log-concave—see Section 2. A simple plug-in Bayes rule may then be used to classify the points. We also illustrate this clustering algorithm on a Wisconsin breast cancer data set in Section 6, where the aim is to separate observations into benign and malignant component populations.
- (c) A functional of the true underlying density may be estimated by the corresponding functional of a density estimator, such as the log-concave maximum likelihood estimator. Examples of functionals of interest include probabilities, such as ∫_{‖*x*‖≤1} *f*(*x*) d*x*, moments, e.g. ∫‖*x*‖^{2}*f*(*x*) d*x*, and the differential entropy, −∫*f*(*x*) log {*f*(*x*)} d*x*. It may be possible to compute the plug-in estimator based on the log-concave maximum likelihood estimator analytically, but in Section 7 we show that, even if this is not possible, we can sample from the log-concave maximum likelihood estimator, and hence in many cases of interest obtain a Monte Carlo estimate of the functional. This nice feature also means that the log-concave maximum likelihood estimator can be used in a Monte Carlo bootstrap procedure for assessing uncertainty in functional estimates.
- (d) The fitting of a non-parametric density estimate may give an indication of the validity of a particular smaller model (often parametric). Thus, a contour plot of the log-concave maximum likelihood estimator may provide evidence that the underlying density has elliptical contours, and thus suggests a model that exploits this elliptical symmetry.
- (e) In the univariate case, Walther (2002) described methodology based on log-concave density estimation for addressing the problem of detecting the presence of mixing in a distribution. As an application, he cited the Pickering–Platt debate (Swales, 1985) on the issue of whether high blood pressure is a disease (in which case observed blood pressure measurements should follow a mixture distribution), or simply a label that is attached to people in the right-hand tail of the blood pressure distribution. As a result of our algorithm for computing the multi-dimensional log-concave maximum likelihood estimator, a similar test may be devised for multivariate data—see Section 8.
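
The plug-in classification rule of item (a) can be sketched as follows. For brevity this illustration substitutes univariate Gaussian kernel estimates with a fixed, arbitrarily chosen bandwidth for the log-concave maximum likelihood estimates (which would require the LogConcDEAD machinery), but the decision rule — assign to the population whose estimated density is largest — is the same.

```python
import numpy as np

def kde_at(points, data, h):
    """Univariate Gaussian KDE of `data`, evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
# Two training populations (p = 2): N(-2, 1) and N(+2, 1).
X1 = rng.normal(-2, 1, 300)
X2 = rng.normal(+2, 1, 300)

h = 0.4                                # a fixed illustrative bandwidth
z = np.array([-1.5, 0.1, 2.5])         # new observations to classify

# Plug-in rule: assign each z to the population with the larger density estimate.
dens = np.vstack([kde_at(z, X1, h), kde_at(z, X2, h)])
labels = dens.argmax(axis=0) + 1       # population index, 1 or 2
print(labels)
```

The kernel version needs a bandwidth per population (and in ℝ^{d}, a *d*×*d* matrix per population); the log-concave analogue replaces each `kde_at` call with the corresponding maximum likelihood estimate and needs no tuning at all.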

In Section 9, we give a brief concluding discussion and suggest some directions for future research. We defer the proofs to Appendix A and discuss structural and computational issues in Appendix B. Finally, we present in Appendix C a glossary of terms and results from convex analysis and computational geometry that appear in italics at their first occurrence in the main body of the paper.